[Tarantool-patches] [PATCH 2/2] test: speed up election_qsync

Serge Petrenko sergepetrenko at tarantool.org
Tue Nov 10 10:44:31 MSK 2020


On 10.11.2020 01:36, Vladislav Shpilevoy wrote:
> Hi! Thanks for the review!
>
>>> @@ -62,6 +65,7 @@ fiber = require('fiber')
>>>    -- Replication timeout is small to speed up a first election start.
>> I checked, and the test speeds up a lot with this patch. But I don't understand
>> why. We have only two instances, only one of them is candidate.
>> Thanks to the small replication_timeout, the election starts shortly after the
>> old leader dies. election_timeout isn't involved here, AFAICS.
>> Am I missing something?
>> Even when the test is restarted, there shouldn't be any 're-elections'.
> That is a very good question! Honestly, I didn't think of it. Strangely,
> when I did the patch, I just "believed" this is what I should do.
>
> I did some digging now, and there is a simple explanation. And a bug. So
> it is really good that you looked at this patch skeptically.
>
> Here is what is happening in the test, in short:
>
> 	master:  create_replica()
> 	master:  set_mode('voter')
>
> 	replica: set_mode('candidate')
> 	replica: wait_leader()
>
> When the instance is clean, master's term is 1 and this term does not have
> votes yet. When replica is started, it votes for itself, and gets elected
> quite fast.
>
> This is happening fast even without this patch, when you run this test
> separately.
>
> But when the test is running after some other Raft tests, on the leader the
> term is not 1. It can be 3, 4, or more. When replica is attached, the
> leader does not send its state to the replica, because election is disabled.
>
> So the replica starts from term 1, tries to elect itself and is ignored
> because its term is too old. It waits for the election timeout and tries
> again. This is repeated as many times as the leader's term value.
>
> This is why the tests running in the end were always slower.
>
> There is a bug. The first node should have sent its state when election mode
> was set to the voter on it. But it didn't.
>
> I reworked the patch so now the timeout, on the contrary, is set to 1000000
> seconds. And the test hangs indefinitely unless we send Raft state when election
> is enabled.
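
For illustration, the catch-up loop described above can be modeled with a
toy function. This is only a sketch, not Tarantool code: the name
toy_timeouts_until_elected and the acceptance rule (a vote request is
accepted once the candidate's term is not older than the voter's) are
assumptions made for the example. The real state machine has more cases.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model (hypothetical, not part of src/box/raft.c): count how many
 * election timeouts a fresh candidate has to wait before a voter with
 * term voter_term stops ignoring its vote requests.
 *
 * Assumptions of the model:
 *  - terms start at 1, and the candidate starts from term 1;
 *  - a request with a term older than the voter's term is ignored;
 *  - each ignored attempt costs one election timeout and bumps the
 *    candidate's term by 1;
 *  - if the voter broadcasts its state, the candidate adopts the
 *    voter's term before the first attempt.
 */
static uint64_t
toy_timeouts_until_elected(uint64_t voter_term, bool voter_broadcasts_state)
{
	assert(voter_term >= 1);
	uint64_t candidate_term = 1;
	uint64_t timeouts = 0;
	if (voter_broadcasts_state && voter_term > candidate_term)
		candidate_term = voter_term;
	while (candidate_term < voter_term) {
		/* The request is ignored: wait a timeout, bump the term. */
		++timeouts;
		++candidate_term;
	}
	return timeouts;
}
```

In this model a voter at term 4 costs the candidate 3 timeouts without the
broadcast and 0 with it, and a clean voter at term 1 costs 0 in both cases.
With election_timeout = 1000000, any nonzero count is effectively a hang,
which is what makes the test catch the missing broadcast.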

Thanks for all the digging and the explanation!

The new patch LGTM.

>
> The new patch:
>
> ====================
>      raft: send state when state machine is started
>      
>      Raft didn't broadcast its state when the state machine was
>      started. It could lead to the state never being sent until some
>      other node would generate a term number bigger than the local one.
>      
>      That happened when a node participated in some elections,
>      accumulated a big term number, then the election was turned off,
>      and a new replica was connected in a 'candidate' state. Then the
>      first node was configured to be a 'voter'.
>      
>      The first node didn't send anything to the replica, because at
>      the moment of its connection the election was off.
>      
>      So the replica started from term 1, tried to start elections in
>      this term, but was ignored by the first node. It waited for
>      election timeout, bumped the term to 2, and the process was
>      repeated until the replica reached the first node's term + 1. It
>      could take a very long time.
>      
>      The patch fixes it so that Raft now broadcasts its state when it is
>      enabled, to cover the replicas connected while it was disabled.
>      
>      Closes #5499
>
> diff --git a/src/box/raft.c b/src/box/raft.c
> index 914b0d68f..28ca74cb5 100644
> --- a/src/box/raft.c
> +++ b/src/box/raft.c
> @@ -877,6 +877,14 @@ raft_sm_start(void)
>   		raft_sm_wait_leader_found();
>   	}
>   	box_update_ro_summary();
> +	/*
> +	 * Nothing changed. But when raft was stopped, its state wasn't sent to
> +	 * replicas. At least this was happening at the moment of this being
> +	 * written. On the other hand, this instance may have a term bigger than
> +	 * any other term in the cluster. And if it wouldn't share the term, it
> +	 * would ignore all the messages, including vote requests.
> +	 */
> +	raft_schedule_broadcast();
>   }
>   
>   static void
> diff --git a/test/replication/election_qsync.result b/test/replication/election_qsync.result
> index 086b17686..cb349efcc 100644
> --- a/test/replication/election_qsync.result
> +++ b/test/replication/election_qsync.result
> @@ -9,6 +9,9 @@ box.schema.user.grant('guest', 'super')
>   old_election_mode = box.cfg.election_mode
>    | ---
>    | ...
> +old_election_timeout = box.cfg.election_timeout
> + | ---
> + | ...
>   old_replication_synchro_timeout = box.cfg.replication_synchro_timeout
>    | ---
>    | ...
> @@ -60,8 +63,11 @@ fiber = require('fiber')
>    | ---
>    | ...
>   -- Replication timeout is small to speed up a first election start.
> +-- Election timeout is set to a huge value to ensure the election does not hang
> +-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
>   box.cfg{                                                                        \
>       election_mode = 'candidate',                                                \
> +    election_timeout = 1000000,                                                 \
>       replication_synchro_quorum = 3,                                             \
>       replication_synchro_timeout = 1000000,                                      \
>       replication_timeout = 0.1,                                                  \
> @@ -114,8 +120,11 @@ box.cfg{replication_synchro_timeout = 1000000}
>   -- Configure separately from synchro timeout not to depend on the order of
>   -- synchro and election options appliance. Replication timeout is tiny to speed
>   -- up notice of the old leader death.
> +-- Election timeout is set to a huge value to ensure the election does not hang
> +-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
>   box.cfg{                                                                        \
>       election_mode = 'candidate',                                                \
> +    election_timeout = 1000000,                                                 \
>       replication_timeout = 0.01,                                                 \
>   }
>    | ---
> @@ -143,6 +152,7 @@ test_run:cmd('delete server replica')
>    | ...
>   box.cfg{                                                                        \
>       election_mode = old_election_mode,                                          \
> +    election_timeout = old_election_timeout,                                    \
>       replication_timeout = old_replication_timeout,                              \
>       replication = old_replication,                                              \
>       replication_synchro_timeout = old_replication_synchro_timeout,              \
> diff --git a/test/replication/election_qsync.test.lua b/test/replication/election_qsync.test.lua
> index 6a80f4859..eb89e5b79 100644
> --- a/test/replication/election_qsync.test.lua
> +++ b/test/replication/election_qsync.test.lua
> @@ -2,6 +2,7 @@ test_run = require('test_run').new()
>   box.schema.user.grant('guest', 'super')
>   
>   old_election_mode = box.cfg.election_mode
> +old_election_timeout = box.cfg.election_timeout
>   old_replication_synchro_timeout = box.cfg.replication_synchro_timeout
>   old_replication_timeout = box.cfg.replication_timeout
>   old_replication = box.cfg.replication
> @@ -28,8 +29,11 @@ box.cfg{election_mode = 'voter'}
>   test_run:switch('replica')
>   fiber = require('fiber')
>   -- Replication timeout is small to speed up a first election start.
> +-- Election timeout is set to a huge value to ensure the election does not hang
> +-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
>   box.cfg{                                                                        \
>       election_mode = 'candidate',                                                \
> +    election_timeout = 1000000,                                                 \
>       replication_synchro_quorum = 3,                                             \
>       replication_synchro_timeout = 1000000,                                      \
>       replication_timeout = 0.1,                                                  \
> @@ -57,8 +61,11 @@ box.cfg{replication_synchro_timeout = 1000000}
>   -- Configure separately from synchro timeout not to depend on the order of
>   -- synchro and election options appliance. Replication timeout is tiny to speed
>   -- up notice of the old leader death.
> +-- Election timeout is set to a huge value to ensure the election does not hang
> +-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
>   box.cfg{                                                                        \
>       election_mode = 'candidate',                                                \
> +    election_timeout = 1000000,                                                 \
>       replication_timeout = 0.01,                                                 \
>   }
>   
> @@ -70,6 +77,7 @@ box.space.test:drop()
>   test_run:cmd('delete server replica')
>   box.cfg{                                                                        \
>       election_mode = old_election_mode,                                          \
> +    election_timeout = old_election_timeout,                                    \
>       replication_timeout = old_replication_timeout,                              \
>       replication = old_replication,                                              \
>       replication_synchro_timeout = old_replication_synchro_timeout,              \

-- 
Serge Petrenko


