From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp16.mail.ru (smtp16.mail.ru [94.100.176.153]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 017FE469719 for ; Wed, 21 Oct 2020 14:41:57 +0300 (MSK) References: <15c343a9d2140a6c7ba31cb8761b2bca0ca6e09a.1602954690.git.v.shpilevoy@tarantool.org> <47356d83-4464-070b-0df8-c7fcbbdb1e02@tarantool.org> From: Serge Petrenko Message-ID: Date: Wed, 21 Oct 2020 14:41:56 +0300 MIME-Version: 1.0 In-Reply-To: <47356d83-4464-070b-0df8-c7fcbbdb1e02@tarantool.org> Content-Type: text/plain; charset="utf-8"; format="flowed" Content-Transfer-Encoding: 8bit Content-Language: en-GB Subject: Re: [Tarantool-patches] [PATCH 1/3] raft: send state to new subscribers if Raft worked List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Vladislav Shpilevoy , tarantool-patches@dev.tarantool.org 20.10.2020 23:43, Vladislav Shpilevoy пишет: > The first patch was reworked to look at the subscriber's version. > And send Raft state if it is supported. Even when mode is 'off'. Hi! Thanks for the patch! Please see my comments below. > > ==================== > raft: send state always if version is >= 2.6.0 > > Before the patch a node sent its complete Raft state to new > replication subscribers only when Raft was enabled: voter or > candidate. When the mode was off, it didn't send anything. > > This led to a bug (or maybe just inconvenience), that if the node > was a leader, was turned off, and couldn't deliver the 'off' > message to all the followers, some of them would still think the > node is a leader until their restart or a new leader election. > > If the followers were just voters or also had Raft disabled, they > would treat the old node as a leader infinitely until a new one > was elected somewhere. > > The state wasn't sent not to break connections with the old > versions, which won't understand the Raft messages. > > The patch makes Raft state being sent even if Raft is disabled on > the node, but the connected node supports Raft (its version is >= > 2.6.0). In that way the old leader will keep sending resignation > messages even in case of restart or any other issue. And it won't > break the old versions when Raft is not used and was not used > ever. > > The issue was found while working on #5433, but is not strictly > related. > > diff --git a/src/box/box.cc b/src/box/box.cc > index 18568df3b..86daaf626 100644 > --- a/src/box/box.cc > +++ b/src/box/box.cc > @@ -80,6 +80,7 @@ > #include "msgpack.h" > #include "raft.h" > #include "trivia/util.h" > +#include "version.h" > > static char status[64] = "unknown"; > > @@ -2154,13 +2155,14 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header) > tt_uuid_str(&replica_uuid), sio_socketname(io->fd)); > say_info("remote vclock %s local vclock %s", > vclock_to_string(&replica_clock), vclock_to_string(&vclock)); > - if (raft_is_enabled()) { > + if (replica_version_id >= version_id(2, 6, 0)) { Why don't you use 'raft_was_used()` here together with version check? I've just understood we should also check for relay version in relay_push_raft(), shouldn't we? > /* > - * Send out the current raft state of the instance. Don't do > - * that if Raft is disabled. It can be that a part of the > - * cluster still contains old versions, which can't handle Raft > - * messages. So when it is disabled, its network footprint > - * should be 0. > + * Send Raft state always to all instances supporting it. > + * Regardless of its mode, even when disabled. That is because > + * if it was ever enabled, then this node could be a leader in > + * the past. And there may be nodes, who still consider it a > + * leader. So even if it is disabled now, but was probably > + * enabled in the past, the state must be sent. > */ > struct raft_request req; > raft_serialize_for_network(&req, &vclock); > diff --git a/test/replication/gh-5426-election-on-off.result b/test/replication/gh-5426-election-on-off.result > index 1abfb9154..08253443c 100644 > --- a/test/replication/gh-5426-election-on-off.result > +++ b/test/replication/gh-5426-election-on-off.result > @@ -6,13 +6,6 @@ box.schema.user.grant('guest', 'super') > | --- > | ... > > -old_election_mode = box.cfg.election_mode > - | --- > - | ... > -old_replication_timeout = box.cfg.replication_timeout > - | --- > - | ... > - > test_run:cmd('create server replica with rpl_master=default,\ > script="replication/replica.lua"') > | --- > @@ -111,23 +104,63 @@ test_run:wait_cond(function() return box.info.election.leader == 0 end) > | - true > | ... > > +-- When turned off, the former leader could stop sending its state to new > +-- subscribers. That could make them think the leader is still alive. > test_run:switch('default') > | --- > | - true > | ... > -test_run:cmd('stop server replica') > +box.cfg{election_mode = 'candidate'} > + | --- > + | ... > +test_run:wait_cond(function() return box.info.election.state == 'leader' end) > | --- > | - true > | ... > -test_run:cmd('delete server replica') > + > +test_run:switch('replica') > | --- > | - true > | ... > -box.cfg{ \ > - election_mode = old_election_mode, \ > - replication_timeout = old_replication_timeout, \ > -} > +test_run:wait_cond(function() return box.info.election.leader ~= 0 end) > + | --- > + | - true > + | ... > + > +test_run:switch('default') > + | --- > + | - true > + | ... > +test_run:cmd('restart server default') > + | > +-- All settings are reset, the election is 'off'. The replica re-subscribes, > +-- and the former leader should send its state. Even though the election is > +-- disabled on it. > +assert(box.cfg.election_mode == 'off') > | --- > + | - true > + | ... > + > +test_run:switch('replica') > + | --- > + | - true > + | ... > +test_run:wait_cond(function() return box.info.election.leader == 0 end) > + | --- > + | - true > + | ... > + > +test_run:switch('default') > + | --- > + | - true > + | ... > +test_run:cmd('stop server replica') > + | --- > + | - true > + | ... > +test_run:cmd('delete server replica') > + | --- > + | - true > | ... > box.schema.user.revoke('guest', 'super') > | --- > diff --git a/test/replication/gh-5426-election-on-off.test.lua b/test/replication/gh-5426-election-on-off.test.lua > index d6b980d0a..adbfcd97d 100644 > --- a/test/replication/gh-5426-election-on-off.test.lua > +++ b/test/replication/gh-5426-election-on-off.test.lua > @@ -1,9 +1,6 @@ > test_run = require('test_run').new() > box.schema.user.grant('guest', 'super') > > -old_election_mode = box.cfg.election_mode > -old_replication_timeout = box.cfg.replication_timeout > - > test_run:cmd('create server replica with rpl_master=default,\ > script="replication/replica.lua"') > test_run:cmd('start server replica with wait=True, wait_load=True') > @@ -47,11 +44,26 @@ box.cfg{election_mode = 'off'} > test_run:switch('replica') > test_run:wait_cond(function() return box.info.election.leader == 0 end) > > +-- When turned off, the former leader could stop sending its state to new > +-- subscribers. That could make them think the leader is still alive. > +test_run:switch('default') > +box.cfg{election_mode = 'candidate'} > +test_run:wait_cond(function() return box.info.election.state == 'leader' end) > + > +test_run:switch('replica') > +test_run:wait_cond(function() return box.info.election.leader ~= 0 end) > + > +test_run:switch('default') > +test_run:cmd('restart server default') > +-- All settings are reset, the election is 'off'. The replica re-subscribes, > +-- and the former leader should send its state. Even though the election is > +-- disabled on it. > +assert(box.cfg.election_mode == 'off') > + > +test_run:switch('replica') > +test_run:wait_cond(function() return box.info.election.leader == 0 end) > + > test_run:switch('default') > test_run:cmd('stop server replica') > test_run:cmd('delete server replica') > -box.cfg{ \ > - election_mode = old_election_mode, \ > - replication_timeout = old_replication_timeout, \ > -} > box.schema.user.revoke('guest', 'super') -- Serge Petrenko