[Tarantool-patches] [PATCH 2/2] test: speed up election_qsync
Serge Petrenko
sergepetrenko at tarantool.org
Tue Nov 10 10:44:31 MSK 2020
10.11.2020 01:36, Vladislav Shpilevoy wrote:
> Hi! Thanks for the review!
>
>>> @@ -62,6 +65,7 @@ fiber = require('fiber')
>>> -- Replication timeout is small to speed up a first election start.
>> I checked, and the test speeds up a lot with this patch. But I don't understand
>> why. We have only two instances, only one of them is candidate.
>> Thanks to the small replication_timeout, the election starts shortly after the
>> old leader dies. election_timeout isn't involved here, AFAICS.
>> Am I missing something?
>> Even when the test is restarted, there shouldn't be any 're-elections'.
> That is a very good question! Honestly, I didn't think of it. Strangely,
> when I wrote the patch, I just "believed" this was what I should do.
>
> I did some digging, and there is a simple explanation. And a bug. So
> it is really good that you looked at this patch skeptically.
>
> Here is what is happening in the test, in short:
>
> master: create_replica()
> master: set_mode('voter')
>
> replica: set_mode('candidate')
> replica: wait_leader()
>
> When the instance is clean, the master's term is 1 and this term does not
> have votes yet. When the replica is started, it votes for itself and gets
> elected quite fast.
>
> This happens fast even without this patch when you run this test
> separately.
>
> But when the test runs after other Raft tests, the leader's term is not 1.
> It can be 3, 4, or more. When the replica is attached, the leader does not
> send its state to the replica, because election is disabled.
>
> So the replica starts from term 1, tries to elect itself, and is ignored
> because its term is too old. It waits for the election timeout and tries
> again. This is repeated as many times as the term value.
>
> This is why the test was always slower when it ran after other tests.
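A toy model of the slowdown described above may help. It is a hypothetical sketch, not code from raft.c: the function names are made up, and it assumes a simplified voting rule where the voter grants a vote to a request with a newer term, or with its own term if it has not voted in that term yet, and ignores older terms. It counts how many election timeouts a lone candidate waits before being elected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical toy model, not Tarantool's raft.c: the voter grants a vote
 * to a request with a newer term, or with its own term if it has not
 * voted in that term yet. Older terms are ignored.
 */
static bool
voter_grants(uint64_t voter_term, bool voter_has_voted, uint64_t req_term)
{
	if (req_term > voter_term)
		return true;
	return req_term == voter_term && !voter_has_voted;
}

/*
 * Count the election timeouts a lone candidate waits before it wins.
 * Without a broadcast the candidate starts from term 1, as a fresh
 * replica does. With a broadcast it learns the voter's term and
 * campaigns in the next one.
 */
static unsigned
timeouts_until_elected(uint64_t voter_term, bool voter_has_voted,
		       bool voter_broadcasts)
{
	assert(voter_term >= 1);
	uint64_t term = voter_broadcasts ? voter_term + 1 : 1;
	unsigned timeouts = 0;
	while (!voter_grants(voter_term, voter_has_voted, term)) {
		/* Ignored: wait for the election timeout, bump the term. */
		timeouts++;
		term++;
	}
	return timeouts;
}
```

Under these assumptions, timeouts_until_elected(1, false, false) returns 0 (a clean instance), timeouts_until_elected(4, true, false) returns 4 (one wasted timeout per term value), and timeouts_until_elected(4, true, true) returns 0 (the voter shared its term). Each wasted round costs one election_timeout, so with election_timeout = 1000000 a single wasted round makes the test hang, which is how the reworked test catches the missing broadcast.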
>
> There is a bug. The first node should have sent its state when its
> election mode was set to 'voter'. But it didn't.
>
> I reworked the patch so that the election timeout, on the contrary, is set
> to 1000000 seconds. Now the test hangs indefinitely unless Raft sends its
> state when election is enabled.
Thanks for all the digging and the explanation!
The new patch LGTM.
>
> The new patch:
>
> ====================
> raft: send state when state machine is started
>
> Raft didn't broadcast its state when the state machine was
> started. It could lead to the state never being sent until some
> other node generated a term number bigger than the local one.
>
> That happened when a node participated in some elections,
> accumulated a big term number, then the election was turned off,
> and a new replica was connected in a 'candidate' state. Then the
> first node was configured to be a 'voter'.
>
> The first node didn't send anything to the replica, because at
> the moment of its connection the election was off.
>
> So the replica started from term 1, tried to start elections in
> this term, but was ignored by the first node. It waited for
> election timeout, bumped the term to 2, and the process was
> repeated until the replica reached the first node's term + 1. It
> could take a very long time.
>
> The patch fixes it so now Raft broadcasts its state when it is
> enabled, to cover the replicas connected while it was disabled.
>
> Closes #5499
>
> diff --git a/src/box/raft.c b/src/box/raft.c
> index 914b0d68f..28ca74cb5 100644
> --- a/src/box/raft.c
> +++ b/src/box/raft.c
> @@ -877,6 +877,14 @@ raft_sm_start(void)
> raft_sm_wait_leader_found();
> }
> box_update_ro_summary();
> + /*
> + * Nothing changed. But when raft was stopped, its state wasn't sent to
> + * replicas. At least this was happening at the moment of this being
> + * written. On the other hand, this instance may have a term bigger than
> + * any other term in the cluster. And if it wouldn't share the term, it
> + * would ignore all the messages, including vote requests.
> + */
> + raft_schedule_broadcast();
> }
>
> static void
> diff --git a/test/replication/election_qsync.result b/test/replication/election_qsync.result
> index 086b17686..cb349efcc 100644
> --- a/test/replication/election_qsync.result
> +++ b/test/replication/election_qsync.result
> @@ -9,6 +9,9 @@ box.schema.user.grant('guest', 'super')
> old_election_mode = box.cfg.election_mode
> | ---
> | ...
> +old_election_timeout = box.cfg.election_timeout
> + | ---
> + | ...
> old_replication_synchro_timeout = box.cfg.replication_synchro_timeout
> | ---
> | ...
> @@ -60,8 +63,11 @@ fiber = require('fiber')
> | ---
> | ...
> -- Replication timeout is small to speed up a first election start.
> +-- Election timeout is set to a huge value to ensure the election does not hang
> +-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
> box.cfg{ \
> election_mode = 'candidate', \
> + election_timeout = 1000000, \
> replication_synchro_quorum = 3, \
> replication_synchro_timeout = 1000000, \
> replication_timeout = 0.1, \
> @@ -114,8 +120,11 @@ box.cfg{replication_synchro_timeout = 1000000}
> -- Configure separately from synchro timeout not to depend on the order of
> -- synchro and election options appliance. Replication timeout is tiny to speed
> -- up notice of the old leader death.
> +-- Election timeout is set to a huge value to ensure the election does not hang
> +-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
> box.cfg{ \
> election_mode = 'candidate', \
> + election_timeout = 1000000, \
> replication_timeout = 0.01, \
> }
> | ---
> @@ -143,6 +152,7 @@ test_run:cmd('delete server replica')
> | ...
> box.cfg{ \
> election_mode = old_election_mode, \
> + election_timeout = old_election_timeout, \
> replication_timeout = old_replication_timeout, \
> replication = old_replication, \
> replication_synchro_timeout = old_replication_synchro_timeout, \
> diff --git a/test/replication/election_qsync.test.lua b/test/replication/election_qsync.test.lua
> index 6a80f4859..eb89e5b79 100644
> --- a/test/replication/election_qsync.test.lua
> +++ b/test/replication/election_qsync.test.lua
> @@ -2,6 +2,7 @@ test_run = require('test_run').new()
> box.schema.user.grant('guest', 'super')
>
> old_election_mode = box.cfg.election_mode
> +old_election_timeout = box.cfg.election_timeout
> old_replication_synchro_timeout = box.cfg.replication_synchro_timeout
> old_replication_timeout = box.cfg.replication_timeout
> old_replication = box.cfg.replication
> @@ -28,8 +29,11 @@ box.cfg{election_mode = 'voter'}
> test_run:switch('replica')
> fiber = require('fiber')
> -- Replication timeout is small to speed up a first election start.
> +-- Election timeout is set to a huge value to ensure the election does not hang
> +-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
> box.cfg{ \
> election_mode = 'candidate', \
> + election_timeout = 1000000, \
> replication_synchro_quorum = 3, \
> replication_synchro_timeout = 1000000, \
> replication_timeout = 0.1, \
> @@ -57,8 +61,11 @@ box.cfg{replication_synchro_timeout = 1000000}
> -- Configure separately from synchro timeout not to depend on the order of
> -- synchro and election options appliance. Replication timeout is tiny to speed
> -- up notice of the old leader death.
> +-- Election timeout is set to a huge value to ensure the election does not hang
> +-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
> box.cfg{ \
> election_mode = 'candidate', \
> + election_timeout = 1000000, \
> replication_timeout = 0.01, \
> }
>
> @@ -70,6 +77,7 @@ box.space.test:drop()
> test_run:cmd('delete server replica')
> box.cfg{ \
> election_mode = old_election_mode, \
> + election_timeout = old_election_timeout, \
> replication_timeout = old_replication_timeout, \
> replication = old_replication, \
> replication_synchro_timeout = old_replication_synchro_timeout, \
--
Serge Petrenko