Tarantool development patches archive
 help / color / mirror / Atom feed
From: Serge Petrenko via Tarantool-patches <tarantool-patches@dev.tarantool.org>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>,
	tarantool-patches@dev.tarantool.org, vdavydov@tarantool.org
Subject: Re: [Tarantool-patches] [PATCH 9/9] box: enrich ER_READONLY with new details
Date: Mon, 8 Nov 2021 18:18:33 +0300	[thread overview]
Message-ID: <2af7c607-01a1-7060-4ecd-fe9f526c8c07@tarantool.org> (raw)
In-Reply-To: <99799340f06652929897e363e42a5227d12277a1.1636156453.git.v.shpilevoy@tarantool.org>



06.11.2021 02:56, Vladislav Shpilevoy пишет:
> ER_READONLY used not to have any details about the exact reason
> why the instance is read-only. The patch changes that by adding
> new fields into the error which explain why the error happened and
> even help to avoid it for next requests.
>
> Now from the error object can tell whether it was raised because
> of box.cfg.read_only = true, or the instance is an orphan, or it
> has election enabled but is not a leader, or the transaction limbo
> belongs to another instance.
>
> The alternative to ClientError alteration via error_payload was
> not to touch struct error and introduce a new error type
> specifically for ER_READONLY via a new C++ class like
> ReadOnlyError. But it had drawbacks:
>
> - There may be clients who expect ER_READONLY to have ClientError
>    type. For instance, they check err.code == ER_READONLY only for
>    err.type == 'ClientError';
>
> - Having to introduce a new C++ class each time when want to add a
>    new field into an error has to end. Rather sooner than later.
>
> Closes #5568

Thanks for the patch!
Looks nice, generally, with 3 minor comments.

>
> @TarantoolBot document
> Title: box.error.READONLY new attributes
>
> Users could see the error code as `box.error.READONLY` in `.code`
> field of an error object. The error didn't have any other
> attributes except common ones like 'type'.
>
> Now from the `box.error.READONLY` error users can see why it
> happened. The reasons can be the following:
>
> * The instance has `box.cfg.read_only = true`. Then the `READONLY`
> error has at least these fields:
> ```
> tarantool> err:unpack()
> ---
> - state: read_only
>    reason: state
>    code: 7
>    type: ClientError
> ...
> ```
>
> * The instance is an orphan. It enters that state if number of
> connected replicas is < `box.cfg.replication_connect_quorum`. Then
> `READONLY` error has at least these fields:
> ```
> tarantool> err:unpack()
> ---
> - state: orphan
>    reason: state
>    code: 7
>    type: ClientError
> ...
> ```
>
> * The synchro queue has an owner which is not the given instance.
> It usually happens if synchronous replication is used and there is
> another instance who called `box.ctl.promote()`. Then `READONLY`
> error has at least these fields:
> ```
> tarantool> err:unpack()
> ---
> - queue_owner_id: <box.info.id of the queue owner>
>    queue_owner_uuid: <box.info.uuid of the queue owner>
>    reason: synchro
>    term: <box.info.election.term of this instance>
>    code: 7
>    type: ClientError
> ...
> ```
> Note than `queue_owner_uuid` sometimes might be not present.
>
> * The instance has `box.cfg.election_mode` not `off` and it is not
> a leader. Then `READONLY` error has at least these fields:
> ```
> tarantool> err:unpack()
> ---
> - state: <box.info.election.state of this instance>
>    leader_id: <box.info.id of the leader>
>    leader_uuid: <box.info.uuid of the leader>
>    reason: election
>    term: <box.info.election.term of this instance>
>    code: 7
>    type: ClientError
> ...
> ```
> `leader_id` and `leader_uuid` might be absent if the leader is not
> known. For example, an election is still in progress. Note, than
> `leader_uuid` sometimes might be not present even if `leader_id`
> is.
>
> If multiple reasons are true at the same time, then only one is
> returned in order reversed from how they are listed above. IOW,
> election, synchro, box.cfg.read_only, orphan.
> ---
> There are 2 test groups because the 'unsafe' one sometimes crashes if try to
> make a cleanup. Instances of the second test group can only be recreated.
>
>   .../unreleased/gh-5568-readonly-reason.md     |  24 ++
>   src/box/box.cc                                |  56 +++-
>   .../gh_5568_read_only_reason_test.lua         | 287 ++++++++++++++++++
>   3 files changed, 359 insertions(+), 8 deletions(-)
>   create mode 100644 changelogs/unreleased/gh-5568-readonly-reason.md
>   create mode 100644 test/replication-luatest/gh_5568_read_only_reason_test.lua
>
> diff --git a/changelogs/unreleased/gh-5568-readonly-reason.md b/changelogs/unreleased/gh-5568-readonly-reason.md
> new file mode 100644
> index 000000000..f90b53bdf
> --- /dev/null
> +++ b/changelogs/unreleased/gh-5568-readonly-reason.md
> @@ -0,0 +1,24 @@
> +## feature/core
> +
> +* Error objects with the code `box.error.READONLY` now have additional fields
> +  explaining why the error happened (gh-5568).
> +
> +  For instance, if the error was due to `box.cfg{read_only = true}`, then the
> +  error will have fields:
> +  ```
> +  {
> +    reason = 'state',
> +    state = 'read_only'
> +  }
> +  ```
> +  Or if it was due to not being a leader with `box.cfg{election_mode = ...}` set
> +  not to `off`, then:
> +  ```
> +  {
> +    state: <box.info.election.state of this instance>,
> +    leader_id: <box.info.id of the leader>,
> +    leader_uuid: <box.info.uuid of the leader>,
> +    reason: election,
> +    term: <box.info.election.term of this instance>
> +  }
> +  ```

1. Let's duplicate the additional error information in the error message.

    If it's hard to log every field, let's at least encode reason and state,
    so that the user knows what's wrong by simply looking at the error 
message.

> \ No newline at end of file
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 1ed1ce3f8..236c02533 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -177,16 +177,56 @@ box_update_ro_summary(void)
>   static int
>   box_check_writable(void)
>   {
> -	if (is_ro_summary) {
> +	if (!is_ro_summary)
> +		return 0;
> +	struct error *e = diag_set(ClientError, ER_READONLY);
> +	struct raft *raft = box_raft();
> +	/*
> +	 * In case of multiple reasons at the same time only one is reported.
> +	 * But the order is important. For example, if the instance has election
> +	 * enabled, for the client it is better to see that it is a 'follower'
> +	 * and who is the leader than just see cfg 'read_only' is true.
> +	 */
> +	if (raft_is_ro(raft)) {
> +		error_field_set_str(e, "reason", "election");
> +		error_field_set_str(e, "state", raft_state_str(raft->state));
> +		error_field_set_uint(e, "term", raft->volatile_term);
> +		uint32_t id = raft->leader;
> +		if (id != REPLICA_ID_NIL) {
> +			error_field_set_uint(e, "leader_id", id);
> +			struct replica *r = replica_by_id(id);
> +			/*
> +			 * XXX: when the leader is dropped from _cluster, it
> +			 * is not reported to Raft.
> +			 */
> +			if (r != NULL) {
> +				error_field_set_uuid(e, "leader_uuid",
> +						     &r->uuid);
> +			}
> +		}
> +	} else if (txn_limbo_is_ro(&txn_limbo)) {
> +		error_field_set_str(e, "reason", "synchro");
> +		uint32_t id = txn_limbo.owner_id;
> +		error_field_set_uint(e, "queue_owner_id", id);
> +		error_field_set_uint(e, "term", raft->volatile_term);
> +		struct replica *r = replica_by_id(id);
>   		/*
> -		 * XXX: return a special error when the node is not a leader to
> -		 * reroute to the leader node.
> +		 * XXX: when an instance is deleted from _cluster, its limbo's
> +		 * ownership is not cleared.
>   		 */
> -		diag_set(ClientError, ER_READONLY);
> -		diag_log();
> -		return -1;
> -	}
> -	return 0;
> +		if (r != NULL)
> +			error_field_set_uuid(e, "queue_owner_uuid", &r->uuid);
> +	} else {
> +		error_field_set_str(e, "reason", "state");
> +		if (is_ro)
> +			error_field_set_str(e, "state", "read_only");
> +		else if (is_orphan)
> +			error_field_set_str(e, "state", "orphan");
> +		else
> +			assert(false);
> +	}
> +	diag_log();
> +	return -1;
>   }
>   
>   static void
> diff --git a/test/replication-luatest/gh_5568_read_only_reason_test.lua b/test/replication-luatest/gh_5568_read_only_reason_test.lua
> new file mode 100644
> index 000000000..d4f7ff957
> --- /dev/null
> +++ b/test/replication-luatest/gh_5568_read_only_reason_test.lua
> @@ -0,0 +1,287 @@
> +local t = require('luatest')
> +local cluster = require('test.luatest_helpers.cluster')
> +local helpers = require('test.luatest_helpers')
> +
> +--
> +-- gh-5568: ER_READONLY needs to carry additional info explaining the reason
> +-- why the error happened. It should help the clients to do better error logging
> +-- and to use the provided info to find the actual leader if the reason is in
> +-- being not a leader.
> +--
> +
> +--
> +-- Make functions which will create and destroy a cluster. Usage of the closures
> +-- is a workaround for luatest hooks not accepting their group as a parameter.
> +--
> +local function make_create_cluster(g) return function(param)
> +    g.cluster = cluster:new({})
> +    local master_uri = helpers.instance_uri('master')
> +    local replica_uri = helpers.instance_uri('replica')
> +    local replication = {master_uri, replica_uri}
> +    local box_cfg = {
> +        listen = master_uri,
> +        replication = replication,
> +        -- To speed up new term when try to elect a new leader.
> +        replication_timeout = 0.1,
> +    }
> +    g.master = g.cluster:build_server({alias = 'master', box_cfg = box_cfg})
> +
> +    box_cfg.listen = replica_uri
> +    g.replica = g.cluster:build_server({alias = 'replica', box_cfg = box_cfg})
> +
> +    g.cluster:add_server(g.master)
> +    g.cluster:add_server(g.replica)
> +    g.cluster:start()
> +end end
> +
> +local function make_destroy_cluster(g) return function()
> +    g.cluster:drop()
> +end end
> +
> +--
> +-- This group's test cases leave instances in a valid state after which they can
> +-- be reused.
> +--
> +local g = t.group('gh-5568-read-only-reason1')
> +
> +g.before_all(make_create_cluster(g))
> +g.after_all(make_destroy_cluster(g))
> +
> +--
> +-- Read-only because of box.cfg{read_only = true}.
> +--
> +g.test_read_only_reason_cfg = function(g)
> +    local ok, err = g.master:exec(function()
> +        box.cfg{read_only = true}
> +        local ok, err = pcall(box.schema.create_space, 'test')
> +        return ok, err:unpack()
> +    end)
> +    t.assert(not ok, 'fail ddl')
> +    t.assert_covers(err, {
> +        reason = 'state',
> +        state = 'read_only',
> +        code = box.error.READONLY,
> +        type = 'ClientError'
> +    }, 'reason is read_only')
> +
> +    -- Cleanup.
> +    g.master:exec(function()
> +        box.cfg{read_only = false}
> +    end)
> +end
> +
> +--
> +-- Read-only because is an orphan.
> +--
> +g.test_read_only_reason_orphan = function(g)
> +    local fake_uri = helpers.instance_uri('fake')
> +    local old_timeout, ok, err = g.master:exec(function(fake_uri)
> +        -- Make connect-quorum impossible to satisfy using a fake instance.
> +        local old_timeout = box.cfg.replication_connect_timeout
> +        local repl = table.copy(box.cfg.replication)
> +        table.insert(repl, fake_uri)
> +        box.cfg{
> +            replication = repl,
> +            replication_connect_timeout = 0.001,
> +        }
> +        local ok, err = pcall(box.schema.create_space, 'test')
> +        return old_timeout, ok, err:unpack()
> +    end, {fake_uri})
> +    t.assert(not ok, 'fail ddl')
> +    t.assert_covers(err, {
> +        reason = 'state',
> +        state = 'orphan',
> +        code = box.error.READONLY,
> +        type = 'ClientError'
> +    }, 'reason is orphan')
> +
> +    -- Cleanup.
> +    g.master:exec(function(old_timeout)
> +        local repl = table.copy(box.cfg.replication)
> +        table.remove(repl)
> +        box.cfg{
> +            replication = repl,
> +            replication_connect_timeout = old_timeout,
> +        }
> +    end, {old_timeout})
> +end
> +
> +--
> +-- Read-only because is not a leader. Does not know the leader.
> +--
> +g.test_read_only_reason_election_no_leader = function(g)
> +    local ok, err = g.master:exec(function()
> +        box.cfg{election_mode = 'voter'}
> +        local ok, err = pcall(box.schema.create_space, 'test')
> +        return ok, err:unpack()
> +    end)
> +    t.assert(not ok, 'fail ddl')
> +    t.assert(err.term, 'has term')
> +    t.assert_covers(err, {
> +        reason = 'election',
> +        state = 'follower',
> +        code = box.error.READONLY,
> +        type = 'ClientError'
> +    }, 'reason is election, no leader')
> +    t.assert(not err.leader_id, 'no leader id')
> +    t.assert(not err.leader_uuid, 'no leader uuid')
> +
> +    -- Cleanup.
> +    g.master:exec(function()
> +        box.cfg{election_mode = 'off'}
> +    end)
> +end
> +
> +--
> +-- Read-only because is not a leader. Knows the leader.
> +--
> +g.test_read_only_reason_election_has_leader = function(g)
> +    g.master:exec(function()
> +        box.cfg{election_mode = 'candidate'}
> +    end)
> +    g.master:wait_election_leader()
> +    g.replica:wait_election_leader_found()
> +    local ok, err = g.replica:exec(function()
> +        box.cfg{election_mode = 'voter'}
> +        local ok, err = pcall(box.schema.create_space, 'test')
> +        return ok, err:unpack()
> +    end)
> +    t.assert(not ok, 'fail ddl')
> +    t.assert(err.term, 'has term')
> +    t.assert_covers(err, {
> +        reason = 'election',
> +        state = 'follower',
> +        leader_id = g.master:instance_id(),
> +        leader_uuid = g.master:instance_uuid(),
> +        code = box.error.READONLY,
> +        type = 'ClientError'
> +    }, 'reason is election, has leader')
> +
> +    -- Cleanup.
> +    g.master:exec(function()
> +        box.cfg{election_mode = 'off'}
> +        box.ctl.demote()
> +    end)
> +    g.replica:exec(function()
> +        box.cfg{election_mode = 'off'}
> +    end)
> +end
> +
> +--
> +-- Read-only because does not own the limbo. Knows the owner.
> +--
> +g.test_read_only_reason_synchro = function(g)
> +    g.master:exec(function()
> +        box.schema.create_space('test', {is_sync = true}):create_index('pk')

2. The space creation isn't necessary for this test.

> +        box.ctl.promote()
> +    end)
> +
> +    t.helpers.retrying({}, function()
> +        assert(g.replica:exec(function()
> +            return box.info.synchro.queue.owner ~= 0
> +        end))
> +    end)
> +
> +    local ok, err = g.replica:exec(function()
> +        local ok, err = pcall(box.schema.create_space, 'test2')
> +        return ok, err:unpack()
> +    end)
> +    t.assert(not ok, 'fail ddl')
> +    t.assert(err.term, 'has term')
> +    t.assert_covers(err, {
> +        reason = 'synchro',
> +        queue_owner_id = g.master:instance_id(),
> +        queue_owner_uuid = g.master:instance_uuid(),
> +        code = box.error.READONLY,
> +        type = 'ClientError'
> +    }, 'reason is synchro, has owner')
> +
> +    -- Cleanup.
> +    g.master:exec(function()
> +        box.space.test:drop()
> +        box.ctl.demote()
> +    end)
> +end
> +
> +--
> +-- This group's test cases leave instances in an invalid state after which they
> +-- should be re-created.
> +--
> +g = t.group('gh-5568-read-only-reason2')
> +
> +g.before_each(make_create_cluster(g))
> +g.after_each(make_destroy_cluster(g))
> +
> +--
> +-- Read-only because is not a leader. Knows the leader, but not its UUID.
> +--
> +g.test_read_only_reason_election_has_leader_no_uuid = function(g)
> +    g.replica:exec(function()
> +        box.cfg{election_mode = 'voter'}
> +    end)
> +    g.master:exec(function()
> +        box.cfg{election_mode = 'candidate'}
> +    end)
> +    g.master:wait_election_leader()
> +    g.replica:wait_election_leader_found()
> +    local leader_id = g.master:instance_id()
> +
> +    g.master:exec(function()
> +        box.space._cluster:run_triggers(false)
> +        box.space._cluster:delete{box.info.id}
> +    end)
> +
> +    t.helpers.retrying({}, function()
> +        assert(g.replica:exec(function(leader_id)
> +            return box.space._cluster:get{leader_id} == nil
> +        end, {leader_id}))
> +    end)
> +
> +    local ok, err = g.replica:exec(function()
> +        local ok, err = pcall(box.schema.create_space, 'test')
> +        return ok, err:unpack()
> +    end)
> +    t.assert(not ok, 'fail ddl')
> +    t.assert(err.term, 'has term')
> +    t.assert(not err.leader_uuid, 'has no leader uuid')
> +    t.assert_covers(err, {
> +        reason = 'election',
> +        state = 'follower',
> +        leader_id = leader_id,
> +        code = box.error.READONLY,
> +        type = 'ClientError'
> +    }, 'reason is election, has leader but no uuid')
> +end
> +
> +--
> +-- Read-only because does not own the limbo. Knows the owner, but now its UUID.
> +--

3. Typo: now -> not.

> +g.test_read_only_reason_synchro_no_uuid = function(g)
> +    g.master:exec(function()
> +        box.ctl.promote()
> +        box.space._cluster:run_triggers(false)
> +        box.space._cluster:delete{box.info.id}
> +    end)
> +
> +    local leader_id = g.master:instance_id()
> +    t.helpers.retrying({}, function()
> +        assert(g.replica:exec(function(leader_id)
> +            return box.info.synchro.queue.owner ~= 0 and
> +                   box.space._cluster:get{leader_id} == nil
> +        end, {leader_id}))
> +    end)
> +
> +    local ok, err = g.replica:exec(function()
> +        local ok, err = pcall(box.schema.create_space, 'test')
> +        return ok, err:unpack()
> +    end)
> +    t.assert(not ok, 'fail ddl')
> +    t.assert(err.term, 'has term')
> +    t.assert(not err.queue_owner_uuid)
> +    t.assert_covers(err, {
> +        reason = 'synchro',
> +        queue_owner_id = leader_id,
> +        code = box.error.READONLY,
> +        type = 'ClientError'
> +    }, 'reason is synchro, has owner but no uuid')
> +end

-- 
Serge Petrenko


  parent reply	other threads:[~2021-11-08 15:18 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-05 23:56 [Tarantool-patches] [PATCH 0/9] ER_READONLY reason Vladislav Shpilevoy via Tarantool-patches
2021-11-05 23:56 ` [Tarantool-patches] [PATCH 1/9] diag: return created error from diag_set() Vladislav Shpilevoy via Tarantool-patches
2021-11-05 23:56 ` [Tarantool-patches] [PATCH 2/9] error: introduce error_payload Vladislav Shpilevoy via Tarantool-patches
2021-11-08 15:14   ` Serge Petrenko via Tarantool-patches
2021-11-11 23:50     ` Vladislav Shpilevoy via Tarantool-patches
2021-11-12  6:29       ` Serge Petrenko via Tarantool-patches
2021-11-05 23:56 ` [Tarantool-patches] [PATCH 3/9] error: move code to struct error from ClientError Vladislav Shpilevoy via Tarantool-patches
2021-11-08 15:15   ` Serge Petrenko via Tarantool-patches
2021-11-11 23:50     ` Vladislav Shpilevoy via Tarantool-patches
2021-11-12  6:31       ` Serge Petrenko via Tarantool-patches
2021-11-05 23:56 ` [Tarantool-patches] [PATCH 4/9] error: use error_payload to store optional members Vladislav Shpilevoy via Tarantool-patches
2021-11-05 23:56 ` [Tarantool-patches] [PATCH 5/9] error: use error_payload in MessagePack codecs Vladislav Shpilevoy via Tarantool-patches
2021-11-05 23:56 ` [Tarantool-patches] [PATCH 6/9] error: use error_payload in Lua Vladislav Shpilevoy via Tarantool-patches
2021-11-05 23:56 ` [Tarantool-patches] [PATCH 7/9] luatest: copy config in cluster:build_server() Vladislav Shpilevoy via Tarantool-patches
2021-11-05 23:56 ` [Tarantool-patches] [PATCH 8/9] luatest: add new helpers for 'server' object Vladislav Shpilevoy via Tarantool-patches
2021-11-08 15:16   ` Serge Petrenko via Tarantool-patches
2021-11-11 23:51     ` Vladislav Shpilevoy via Tarantool-patches
2021-11-05 23:56 ` [Tarantool-patches] [PATCH 9/9] box: enrich ER_READONLY with new details Vladislav Shpilevoy via Tarantool-patches
2021-11-06 19:30   ` Cyrill Gorcunov via Tarantool-patches
2021-11-07 16:45     ` Vladislav Shpilevoy via Tarantool-patches
2021-11-07 20:19       ` Cyrill Gorcunov via Tarantool-patches
2021-11-08 15:18   ` Serge Petrenko via Tarantool-patches [this message]
2021-11-11 23:52     ` Vladislav Shpilevoy via Tarantool-patches
2021-11-08 14:25 ` [Tarantool-patches] [PATCH 0/9] ER_READONLY reason Vladimir Davydov via Tarantool-patches

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2af7c607-01a1-7060-4ecd-fe9f526c8c07@tarantool.org \
    --to=tarantool-patches@dev.tarantool.org \
    --cc=sergepetrenko@tarantool.org \
    --cc=v.shpilevoy@tarantool.org \
    --cc=vdavydov@tarantool.org \
    --subject='Re: [Tarantool-patches] [PATCH 9/9] box: enrich ER_READONLY with new details' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox