From: Vladislav Shpilevoy
To: Serge Petrenko, tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH 2/2] test: speed up election_qsync
Date: Mon, 9 Nov 2020 23:36:42 +0100
Message-ID: <70029618-854d-ddec-95f7-b7de0adcc057@tarantool.org>
In-Reply-To: <7bc1dd8d-2b2d-2808-5c98-db649937ef3f@tarantool.org>
References: <7bc1dd8d-2b2d-2808-5c98-db649937ef3f@tarantool.org>
List-Id: Tarantool development patches

Hi! Thanks for the review!

>> @@ -62,6 +65,7 @@ fiber = require('fiber')
>>   -- Replication timeout is small to speed up a first election start.
>
> I checked, and the test speeds up a lot with this patch. But I don't understand
> why. We have only two instances, only one of them is candidate.
> Thanks to the small replication_timeout, the election starts shortly after the
> old leader dies. election_timeout isn't involved here, AFAICS.
> Am I missing something?
> Even when the test is restarted, there shouldn't be any 're-elections'.

That is a very good question! Honestly, I didn't think of it. Strangely,
when I did the patch, I just "believed" this is what I should do.

I did some digging now, and there is a simple explanation. And a bug.
So this is really good you looked at this patch skeptically.

Here is what is happening in the test, in short:

    master: create_replica()
    master: set_mode('voter')
    replica: set_mode('candidate')
    replica: wait_leader()

When the instance is clear, master's term is 1 and this term does not have
votes yet. When the replica is started, it votes for itself, and gets elected
quite fast. This happens fast even without this patch, when you run this
test separately.

But when the test is running after some other Raft tests, the term on the
leader is not 1. It can be 3, 4, or more. When the replica is attached, the
leader does not send its state to the replica, because election is disabled.
So the replica starts from term 1, tries to elect itself and is ignored
because its term is too old. It waits for the election timeout and tries
again. This is repeated as many times as the term value. This is why the
tests running at the end were always slower.

There is a bug. The first node should have sent its state when election
mode was set to 'voter' on it. But it didn't.

I reworked the patch so now the timeout, on the contrary, is set to
1000000 seconds. And the test hangs infinitely unless we send Raft state
when election is enabled.

The new patch:

====================
raft: send state when state machine is started

Raft didn't broadcast its state when the state machine was
started. It could lead to the state never being sent until some
other node would generate a term number bigger than the local one.

That happened when a node participated in some elections,
accumulated a big term number, then the election was turned off,
and a new replica was connected in a 'candidate' state. Then the
first node was configured to be a 'voter'.

The first node didn't send anything to the replica, because at
the moment of its connection the election was off.

So the replica started from term 1, tried to start elections in
this term, but was ignored by the first node.
It waited for the election timeout, bumped the term to 2, and the
process was repeated until the replica reached the first node's
term + 1. It could take a very long time.

The patch fixes it so now Raft broadcasts its state when it is
enabled. To cover the replicas connected while it was disabled.

Closes #5499

diff --git a/src/box/raft.c b/src/box/raft.c
index 914b0d68f..28ca74cb5 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -877,6 +877,14 @@ raft_sm_start(void)
 		raft_sm_wait_leader_found();
 	}
 	box_update_ro_summary();
+	/*
+	 * Nothing changed. But when raft was stopped, its state wasn't sent to
+	 * replicas. At least this was happening at the moment of this being
+	 * written. On the other hand, this instance may have a term bigger than
+	 * any other term in the cluster. And if it wouldn't share the term, it
+	 * would ignore all the messages, including vote requests.
+	 */
+	raft_schedule_broadcast();
 }
 
 static void
diff --git a/test/replication/election_qsync.result b/test/replication/election_qsync.result
index 086b17686..cb349efcc 100644
--- a/test/replication/election_qsync.result
+++ b/test/replication/election_qsync.result
@@ -9,6 +9,9 @@ box.schema.user.grant('guest', 'super')
 old_election_mode = box.cfg.election_mode
  | ---
  | ...
+old_election_timeout = box.cfg.election_timeout
+ | ---
+ | ...
 old_replication_synchro_timeout = box.cfg.replication_synchro_timeout
  | ---
  | ...
@@ -60,8 +63,11 @@ fiber = require('fiber')
  | ---
  | ...
 -- Replication timeout is small to speed up a first election start.
+-- Election timeout is set to a huge value to ensure the election does not hang
+-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
 box.cfg{ \
     election_mode = 'candidate', \
+    election_timeout = 1000000, \
     replication_synchro_quorum = 3, \
     replication_synchro_timeout = 1000000, \
     replication_timeout = 0.1, \
@@ -114,8 +120,11 @@ box.cfg{replication_synchro_timeout = 1000000}
 -- Configure separately from synchro timeout not to depend on the order of
 -- synchro and election options appliance. Replication timeout is tiny to speed
 -- up notice of the old leader death.
+-- Election timeout is set to a huge value to ensure the election does not hang
+-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
 box.cfg{ \
     election_mode = 'candidate', \
+    election_timeout = 1000000, \
     replication_timeout = 0.01, \
 }
  | ---
@@ -143,6 +152,7 @@ test_run:cmd('delete server replica')
  | ...
 box.cfg{ \
     election_mode = old_election_mode, \
+    election_timeout = old_election_timeout, \
     replication_timeout = old_replication_timeout, \
     replication = old_replication, \
     replication_synchro_timeout = old_replication_synchro_timeout, \
diff --git a/test/replication/election_qsync.test.lua b/test/replication/election_qsync.test.lua
index 6a80f4859..eb89e5b79 100644
--- a/test/replication/election_qsync.test.lua
+++ b/test/replication/election_qsync.test.lua
@@ -2,6 +2,7 @@ test_run = require('test_run').new()
 box.schema.user.grant('guest', 'super')
 
 old_election_mode = box.cfg.election_mode
+old_election_timeout = box.cfg.election_timeout
 old_replication_synchro_timeout = box.cfg.replication_synchro_timeout
 old_replication_timeout = box.cfg.replication_timeout
 old_replication = box.cfg.replication
@@ -28,8 +29,11 @@ box.cfg{election_mode = 'voter'}
 test_run:switch('replica')
 fiber = require('fiber')
 -- Replication timeout is small to speed up a first election start.
+-- Election timeout is set to a huge value to ensure the election does not hang
+-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
 box.cfg{ \
     election_mode = 'candidate', \
+    election_timeout = 1000000, \
     replication_synchro_quorum = 3, \
     replication_synchro_timeout = 1000000, \
     replication_timeout = 0.1, \
@@ -57,8 +61,11 @@ box.cfg{replication_synchro_timeout = 1000000}
 -- Configure separately from synchro timeout not to depend on the order of
 -- synchro and election options appliance. Replication timeout is tiny to speed
 -- up notice of the old leader death.
+-- Election timeout is set to a huge value to ensure the election does not hang
+-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
 box.cfg{ \
     election_mode = 'candidate', \
+    election_timeout = 1000000, \
     replication_timeout = 0.01, \
 }
 
@@ -70,6 +77,7 @@ box.space.test:drop()
 test_run:cmd('delete server replica')
 box.cfg{ \
     election_mode = old_election_mode, \
+    election_timeout = old_election_timeout, \
     replication_timeout = old_replication_timeout, \
     replication = old_replication, \
     replication_synchro_timeout = old_replication_synchro_timeout, \
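
For illustration only, the term catch-up described in the commit message can
be modelled in a few lines of C. This is a hypothetical sketch, not code from
src/box/raft.c: vote_is_granted() and timeouts_before_elected() are made-up
names, and the rule that a voter refuses a request in a term where it has
already voted is an assumption, added to explain why a clean master at term 1
elects the replica at once while a master with an old term makes the replica
climb to that term + 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the voter's decision (assumption, not the real
 * Tarantool logic): a request from an older term is ignored, and a request
 * for the voter's current term is ignored if the vote is already spent.
 */
bool
vote_is_granted(uint64_t candidate_term, uint64_t voter_term,
		bool voter_has_voted)
{
	if (candidate_term > voter_term)
		return true;
	return candidate_term == voter_term && !voter_has_voted;
}

/*
 * Count how many election timeouts a lone candidate waits before the voter
 * accepts its request. After each ignored request the candidate waits for
 * one election timeout and bumps its term by one.
 */
uint64_t
timeouts_before_elected(uint64_t candidate_term, uint64_t voter_term,
			bool voter_has_voted)
{
	/* Terms start from 1 in the scenario described in the email. */
	assert(candidate_term >= 1 && voter_term >= 1);
	uint64_t waits = 0;
	while (!vote_is_granted(candidate_term, voter_term, voter_has_voted)) {
		waits++;
		candidate_term++;
	}
	return waits;
}
```

Under these assumptions a voter that is at term 4 and has already voted costs
a candidate starting from term 1 four election timeouts, which with
election_timeout = 1000000 is effectively a hang. A candidate that has
received the voter's state and starts from term 5 does not wait at all.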