[Tarantool-patches] [PATCH 1/3] raft: send state to new subscribers if Raft worked
Serge Petrenko
sergepetrenko at tarantool.org
Wed Oct 21 14:41:56 MSK 2020
20.10.2020 23:43, Vladislav Shpilevoy пишет:
> The first patch was reworked to look at the subscriber's version.
> And send Raft state if it is supported. Even when mode is 'off'.
Hi! Thanks for the patch!
Please see my comments below.
>
> ====================
> raft: send state always if version is >= 2.6.0
>
> Before the patch a node sent its complete Raft state to new
> replication subscribers only when Raft was enabled: voter or
> candidate. When the mode was off, it didn't send anything.
>
> This led to a bug (or maybe just inconvenience), that if the node
> was a leader, was turned off, and couldn't deliver the 'off'
> message to all the followers, some of them would still think the
> node is a leader until their restart or a new leader election.
>
> If the followers were just voters or also had Raft disabled, they
> would treat the old node as a leader infinitely until a new one
> was elected somewhere.
>
> The state wasn't sent not to break connections with the old
> versions, which won't understand the Raft messages.
>
> The patch makes Raft state being sent even if Raft is disabled on
> the node, but the connected node supports Raft (its version is >=
> 2.6.0). In that way the old leader will keep sending resignation
> messages even in case of restart or any other issue. And it won't
> break the old versions when Raft is not used and was not used
> ever.
>
> The issue was found while working on #5433, but is not strictly
> related.
>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 18568df3b..86daaf626 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -80,6 +80,7 @@
> #include "msgpack.h"
> #include "raft.h"
> #include "trivia/util.h"
> +#include "version.h"
>
> static char status[64] = "unknown";
>
> @@ -2154,13 +2155,14 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
> tt_uuid_str(&replica_uuid), sio_socketname(io->fd));
> say_info("remote vclock %s local vclock %s",
> vclock_to_string(&replica_clock), vclock_to_string(&vclock));
> - if (raft_is_enabled()) {
> + if (replica_version_id >= version_id(2, 6, 0)) {
Why don't you use 'raft_was_used()` here together with version check?
I've just understood we should also check for relay version in
relay_push_raft(), shouldn't we?
> /*
> - * Send out the current raft state of the instance. Don't do
> - * that if Raft is disabled. It can be that a part of the
> - * cluster still contains old versions, which can't handle Raft
> - * messages. So when it is disabled, its network footprint
> - * should be 0.
> + * Send Raft state always to all instances supporting it.
> + * Regardless of its mode, even when disabled. That is because
> + * if it was ever enabled, then this node could be a leader in
> + * the past. And there may be nodes, who still consider it a
> + * leader. So even if it is disabled now, but was probably
> + * enabled in the past, the state must be sent.
> */
> struct raft_request req;
> raft_serialize_for_network(&req, &vclock);
> diff --git a/test/replication/gh-5426-election-on-off.result b/test/replication/gh-5426-election-on-off.result
> index 1abfb9154..08253443c 100644
> --- a/test/replication/gh-5426-election-on-off.result
> +++ b/test/replication/gh-5426-election-on-off.result
> @@ -6,13 +6,6 @@ box.schema.user.grant('guest', 'super')
> | ---
> | ...
>
> -old_election_mode = box.cfg.election_mode
> - | ---
> - | ...
> -old_replication_timeout = box.cfg.replication_timeout
> - | ---
> - | ...
> -
> test_run:cmd('create server replica with rpl_master=default,\
> script="replication/replica.lua"')
> | ---
> @@ -111,23 +104,63 @@ test_run:wait_cond(function() return box.info.election.leader == 0 end)
> | - true
> | ...
>
> +-- When turned off, the former leader could stop sending its state to new
> +-- subscribers. That could make them think the leader is still alive.
> test_run:switch('default')
> | ---
> | - true
> | ...
> -test_run:cmd('stop server replica')
> +box.cfg{election_mode = 'candidate'}
> + | ---
> + | ...
> +test_run:wait_cond(function() return box.info.election.state == 'leader' end)
> | ---
> | - true
> | ...
> -test_run:cmd('delete server replica')
> +
> +test_run:switch('replica')
> | ---
> | - true
> | ...
> -box.cfg{ \
> - election_mode = old_election_mode, \
> - replication_timeout = old_replication_timeout, \
> -}
> +test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('restart server default')
> + |
> +-- All settings are reset, the election is 'off'. The replica re-subscribes,
> +-- and the former leader should send its state. Even though the election is
> +-- disabled on it.
> +assert(box.cfg.election_mode == 'off')
> | ---
> + | - true
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +test_run:wait_cond(function() return box.info.election.leader == 0 end)
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server replica')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server replica')
> + | ---
> + | - true
> | ...
> box.schema.user.revoke('guest', 'super')
> | ---
> diff --git a/test/replication/gh-5426-election-on-off.test.lua b/test/replication/gh-5426-election-on-off.test.lua
> index d6b980d0a..adbfcd97d 100644
> --- a/test/replication/gh-5426-election-on-off.test.lua
> +++ b/test/replication/gh-5426-election-on-off.test.lua
> @@ -1,9 +1,6 @@
> test_run = require('test_run').new()
> box.schema.user.grant('guest', 'super')
>
> -old_election_mode = box.cfg.election_mode
> -old_replication_timeout = box.cfg.replication_timeout
> -
> test_run:cmd('create server replica with rpl_master=default,\
> script="replication/replica.lua"')
> test_run:cmd('start server replica with wait=True, wait_load=True')
> @@ -47,11 +44,26 @@ box.cfg{election_mode = 'off'}
> test_run:switch('replica')
> test_run:wait_cond(function() return box.info.election.leader == 0 end)
>
> +-- When turned off, the former leader could stop sending its state to new
> +-- subscribers. That could make them think the leader is still alive.
> +test_run:switch('default')
> +box.cfg{election_mode = 'candidate'}
> +test_run:wait_cond(function() return box.info.election.state == 'leader' end)
> +
> +test_run:switch('replica')
> +test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
> +
> +test_run:switch('default')
> +test_run:cmd('restart server default')
> +-- All settings are reset, the election is 'off'. The replica re-subscribes,
> +-- and the former leader should send its state. Even though the election is
> +-- disabled on it.
> +assert(box.cfg.election_mode == 'off')
> +
> +test_run:switch('replica')
> +test_run:wait_cond(function() return box.info.election.leader == 0 end)
> +
> test_run:switch('default')
> test_run:cmd('stop server replica')
> test_run:cmd('delete server replica')
> -box.cfg{ \
> - election_mode = old_election_mode, \
> - replication_timeout = old_replication_timeout, \
> -}
> box.schema.user.revoke('guest', 'super')
--
Serge Petrenko
More information about the Tarantool-patches
mailing list