From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from smtp60.i.mail.ru (smtp60.i.mail.ru [217.69.128.40])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by dev.tarantool.org (Postfix) with ESMTPS id E840C469719
	for ; Tue, 10 Nov 2020 10:44:32 +0300 (MSK)
References: <7bc1dd8d-2b2d-2808-5c98-db649937ef3f@tarantool.org>
 <70029618-854d-ddec-95f7-b7de0adcc057@tarantool.org>
From: Serge Petrenko
Message-ID: <3612ded2-c2d4-aac8-6b2b-877e2084bb23@tarantool.org>
Date: Tue, 10 Nov 2020 10:44:31 +0300
MIME-Version: 1.0
In-Reply-To: <70029618-854d-ddec-95f7-b7de0adcc057@tarantool.org>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Content-Language: en-GB
Subject: Re: [Tarantool-patches] [PATCH 2/2] test: speed up election_qsync
List-Id: Tarantool development patches
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: Vladislav Shpilevoy , tarantool-patches@dev.tarantool.org

10.11.2020 01:36, Vladislav Shpilevoy wrote:
> Hi! Thanks for the review!
>
>>> @@ -62,6 +65,7 @@ fiber = require('fiber')
>>>   -- Replication timeout is small to speed up a first election start.
>> I checked, and the test speeds up a lot with this patch. But I don't understand
>> why. We have only two instances, only one of them is candidate.
>> Thanks to the small replication_timeout, the election starts shortly after the
>> old leader dies. election_timeout isn't involved here, AFAICS.
>> Am I missing something?
>> Even when the test is restarted, there shouldn't be any 're-elections'.
> That is a very good question! Honestly, I didn't think of it. Strangely,
> when I did the patch, I just "believed" this is what I should do.
>
> I did some digging now, and there is a simple explanation. And a bug. So
> it is really good you looked at this patch skeptically.
>
> Here is what is happening in the test, in short:
>
>     master: create_replica()
>     master: set_mode('voter')
>
>     replica: set_mode('candidate')
>     replica: wait_leader()
>
> When the instance is clear, master's term is 1 and this term does not have
> votes yet. When replica is started, it votes for self, and gets elected
> quite fast.
>
> This is happening fast even without this patch, when you run this test
> separately.
>
> But when the test is running after some other Raft tests, on the leader the
> term is not 1. It can be 3, 4, or more. When replica is attached, the
> leader does not send its state to the replica, because election is disabled.
>
> So the replica starts from term 1, tries to elect itself and is ignored
> because its term is too old. It waits for election timeout and tries again.
> This is repeated as many times as the term value.
>
> This is why the tests running in the end were always slower.
>
> There is a bug. The first node should have sent its state when election mode
> was set to the voter on it. But it didn't.
>
> I reworked the patch so now the timeout, on the contrary, is set to 1000000
> seconds. And the test hangs infinitely unless we send Raft state when election
> is enabled.

Thanks for all the digging and the explanation!
The new patch LGTM.

>
> The new patch:
>
> ====================
> raft: send state when state machine is started
>
> Raft didn't broadcast its state when the state machine was
> started. It could lead to the state being never sent until some
> other node would generate a term number bigger than the local one.
>
> That happened when a node participated in some elections,
> accumulated a big term number, then the election was turned off,
> and a new replica was connected in a 'candidate' state. Then the
> first node was configured to be a 'voter'.
>
> The first node didn't send anything to the replica, because at
> the moment of its connection the election was off.
>
> So the replica started from term 1, tried to start elections in
> this term, but was ignored by the first node. It waited for
> election timeout, bumped the term to 2, and the process was
> repeated until the replica reached the first node's term + 1. It
> could take a very long time.
>
> The patch fixes it so now Raft broadcasts its state when it is
> enabled. To cover the replicas connected while it was disabled.
>
> Closes #5499
>
> diff --git a/src/box/raft.c b/src/box/raft.c
> index 914b0d68f..28ca74cb5 100644
> --- a/src/box/raft.c
> +++ b/src/box/raft.c
> @@ -877,6 +877,14 @@ raft_sm_start(void)
>  		raft_sm_wait_leader_found();
>  	}
>  	box_update_ro_summary();
> +	/*
> +	 * Nothing changed. But when raft was stopped, its state wasn't sent to
> +	 * replicas. At least this was happening at the moment of this being
> +	 * written. On the other hand, this instance may have a term bigger than
> +	 * any other term in the cluster. And if it wouldn't share the term, it
> +	 * would ignore all the messages, including vote requests.
> +	 */
> +	raft_schedule_broadcast();
>  }
>
>  static void
> diff --git a/test/replication/election_qsync.result b/test/replication/election_qsync.result
> index 086b17686..cb349efcc 100644
> --- a/test/replication/election_qsync.result
> +++ b/test/replication/election_qsync.result
> @@ -9,6 +9,9 @@ box.schema.user.grant('guest', 'super')
>  old_election_mode = box.cfg.election_mode
>   | ---
>   | ...
> +old_election_timeout = box.cfg.election_timeout
> + | ---
> + | ...
>  old_replication_synchro_timeout = box.cfg.replication_synchro_timeout
>   | ---
>   | ...
> @@ -60,8 +63,11 @@ fiber = require('fiber')
>   | ---
>   | ...
>  -- Replication timeout is small to speed up a first election start.
> +-- Election timeout is set to a huge value to ensure the election does not hang
> +-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
>  box.cfg{ \
>      election_mode = 'candidate', \
> +    election_timeout = 1000000, \
>      replication_synchro_quorum = 3, \
>      replication_synchro_timeout = 1000000, \
>      replication_timeout = 0.1, \
> @@ -114,8 +120,11 @@ box.cfg{replication_synchro_timeout = 1000000}
>  -- Configure separately from synchro timeout not to depend on the order of
>  -- synchro and election options appliance. Replication timeout is tiny to speed
>  -- up notice of the old leader death.
> +-- Election timeout is set to a huge value to ensure the election does not hang
> +-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
>  box.cfg{ \
>      election_mode = 'candidate', \
> +    election_timeout = 1000000, \
>      replication_timeout = 0.01, \
>  }
>   | ---
> @@ -143,6 +152,7 @@ test_run:cmd('delete server replica')
>   | ...
>  box.cfg{ \
>      election_mode = old_election_mode, \
> +    election_timeout = old_election_timeout, \
>      replication_timeout = old_replication_timeout, \
>      replication = old_replication, \
>      replication_synchro_timeout = old_replication_synchro_timeout, \
> diff --git a/test/replication/election_qsync.test.lua b/test/replication/election_qsync.test.lua
> index 6a80f4859..eb89e5b79 100644
> --- a/test/replication/election_qsync.test.lua
> +++ b/test/replication/election_qsync.test.lua
> @@ -2,6 +2,7 @@ test_run = require('test_run').new()
>  box.schema.user.grant('guest', 'super')
>
>  old_election_mode = box.cfg.election_mode
> +old_election_timeout = box.cfg.election_timeout
>  old_replication_synchro_timeout = box.cfg.replication_synchro_timeout
>  old_replication_timeout = box.cfg.replication_timeout
>  old_replication = box.cfg.replication
> @@ -28,8 +29,11 @@ box.cfg{election_mode = 'voter'}
>  test_run:switch('replica')
>  fiber = require('fiber')
>  -- Replication timeout is small to speed up a first election start.
> +-- Election timeout is set to a huge value to ensure the election does not hang
> +-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
>  box.cfg{ \
>      election_mode = 'candidate', \
> +    election_timeout = 1000000, \
>      replication_synchro_quorum = 3, \
>      replication_synchro_timeout = 1000000, \
>      replication_timeout = 0.1, \
> @@ -57,8 +61,11 @@ box.cfg{replication_synchro_timeout = 1000000}
>  -- Configure separately from synchro timeout not to depend on the order of
>  -- synchro and election options appliance. Replication timeout is tiny to speed
>  -- up notice of the old leader death.
> +-- Election timeout is set to a huge value to ensure the election does not hang
> +-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
>  box.cfg{ \
>      election_mode = 'candidate', \
> +    election_timeout = 1000000, \
>      replication_timeout = 0.01, \
>  }
>
> @@ -70,6 +77,7 @@ box.space.test:drop()
>  test_run:cmd('delete server replica')
>  box.cfg{ \
>      election_mode = old_election_mode, \
> +    election_timeout = old_election_timeout, \
>      replication_timeout = old_replication_timeout, \
>      replication = old_replication, \
>      replication_synchro_timeout = old_replication_synchro_timeout, \

-- 
Serge Petrenko
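
P.S. A rough model of the slowdown described in the quoted explanation, as a sketch only. This is not Tarantool code: the function names are hypothetical, and it assumes the voter has already used its vote in its current term, so the candidate can only win once its term is strictly bigger, losing one full election timeout per failed attempt.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model, not Tarantool code. Counts the election attempts a
 * candidate loses before its term becomes bigger than the voter's.
 * Assumption: a vote request with term <= the voter's term is ignored.
 */
static uint64_t
failed_election_rounds(uint64_t candidate_term, uint64_t voter_term)
{
	uint64_t rounds = 0;
	while (candidate_term <= voter_term) {
		/* The request is ignored, the candidate times out ... */
		rounds++;
		/* ... bumps its term and tries again. */
		candidate_term++;
	}
	return rounds;
}

/*
 * Time spent waiting in vain, assuming every failed attempt costs
 * exactly one election timeout (in seconds).
 */
static double
wasted_seconds(uint64_t candidate_term, uint64_t voter_term,
	       double election_timeout)
{
	assert(election_timeout >= 0);
	return failed_election_rounds(candidate_term, voter_term) *
	       election_timeout;
}
```

With the voter at term 4, a candidate starting from term 1 and an assumed 5 second timeout, wasted_seconds(1, 4, 5.0) gives 20 seconds. Under the same model, a timeout of 1000000 seconds turns even a single lost round into an apparent hang, which is why the reworked test exposes the missing broadcast.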