[Tarantool-patches] [PATCH 2/2] test: speed up election_qsync
Vladislav Shpilevoy
v.shpilevoy at tarantool.org
Tue Nov 10 01:36:42 MSK 2020
Hi! Thanks for the review!
>> @@ -62,6 +65,7 @@ fiber = require('fiber')
>> -- Replication timeout is small to speed up a first election start.
>
> I checked, and the test speeds up a lot with this patch. But I don't understand
> why. We have only two instances, only one of them is candidate.
> Thanks to the small replication_timeout, the election starts shortly after the
> old leader dies. election_timeout isn't involved here, AFAICS.
> Am I missing something?
> Even when the test is restarted, there shouldn't be any 're-elections'.
That is a very good question! Honestly, I didn't think of it. Strangely,
when I did the patch, I just "believed" this is what I should do.

I did some digging now, and there is a simple explanation. And a bug. So
it is really good that you looked at this patch skeptically.

Here is what is happening in the test, in short:

    master: create_replica()
    master: set_mode('voter')
    replica: set_mode('candidate')
    replica: wait_leader()

When the instance is clean, the master's term is 1 and nobody has voted in
this term yet. When the replica is started, it votes for itself and gets
elected quite fast.

This happens fast even without this patch, when you run this test
separately.

But when the test runs after some other Raft tests, the term on the master
is not 1. It can be 3, 4, or more. When the replica is attached, the master
does not send its state to the replica, because election is disabled.
So the replica starts from term 1, tries to elect itself, and is ignored
because its term is too old. It waits for the election timeout and tries
again. This repeats about as many times as the master's term value.

This is why the tests running at the end were always slower.

There is a bug. The first node should have sent its state when its election
mode was set to 'voter'. But it didn't.

I reworked the patch so that now the timeout, on the contrary, is set to
1000000 seconds. And the test hangs forever unless we send the Raft state
when election is enabled.

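To make the catch-up loop described above concrete, here is a toy model in C. It is only a sketch, not Tarantool code: the function name `election_timeouts_needed` and its parameters are hypothetical, and it assumes that every election attempt bumps the candidate's term by one and that the voter ignores any vote request whose term is not greater than its own.

```c
#include <assert.h>

/*
 * Toy model, NOT Tarantool code. Returns how many election timeouts a
 * fresh candidate (term 1) sits through before a voter whose term is
 * `voter_term` accepts its vote request.
 *
 * Assumptions of this sketch: each election attempt bumps the candidate's
 * term by one, and the voter ignores a request unless its term is greater
 * than the voter's own term.
 */
int
election_timeouts_needed(int voter_term, int voter_sends_state)
{
	assert(voter_term >= 1);
	int candidate_term = 1;
	int timeouts = 0;
	/* With the fix, the voter shares its term once election is enabled. */
	if (voter_sends_state && voter_term > candidate_term)
		candidate_term = voter_term;
	for (;;) {
		/* Start an election: bump the term and request a vote. */
		candidate_term++;
		if (candidate_term > voter_term)
			return timeouts;
		/* The request is ignored as stale: wait one election timeout. */
		timeouts++;
	}
}
```

In this model `election_timeouts_needed(4, 0)` gives 3, while `election_timeouts_needed(4, 1)` gives 0, which matches why a huge election timeout makes the test hang unless the state is sent.
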
The new patch:
====================
raft: send state when state machine is started

Raft didn't broadcast its state when the state machine was
started. It could lead to the state never being sent until some
other node generated a term number bigger than the local one.

That happened when a node participated in some elections,
accumulated a big term number, then the election was turned off,
and a new replica was connected in a 'candidate' state. Then the
first node was configured to be a 'voter'.

The first node didn't send anything to the replica, because at
the moment of its connection the election was off.

So the replica started from term 1, tried to start elections in
this term, but was ignored by the first node. It waited for
election timeout, bumped the term to 2, and the process was
repeated until the replica reached the first node's term + 1. It
could take a very long time.

The patch fixes it so now Raft broadcasts its state when it is
enabled, to cover the replicas connected while it was disabled.

Closes #5499

diff --git a/src/box/raft.c b/src/box/raft.c
index 914b0d68f..28ca74cb5 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -877,6 +877,14 @@ raft_sm_start(void)
 		raft_sm_wait_leader_found();
 	}
 	box_update_ro_summary();
+	/*
+	 * Nothing changed. But when raft was stopped, its state wasn't sent to
+	 * replicas. At least this was happening at the moment of this being
+	 * written. On the other hand, this instance may have a term bigger than
+	 * any other term in the cluster. And if it wouldn't share the term, it
+	 * would ignore all the messages, including vote requests.
+	 */
+	raft_schedule_broadcast();
 }
 
 static void
diff --git a/test/replication/election_qsync.result b/test/replication/election_qsync.result
index 086b17686..cb349efcc 100644
--- a/test/replication/election_qsync.result
+++ b/test/replication/election_qsync.result
@@ -9,6 +9,9 @@ box.schema.user.grant('guest', 'super')
old_election_mode = box.cfg.election_mode
| ---
| ...
+old_election_timeout = box.cfg.election_timeout
+ | ---
+ | ...
old_replication_synchro_timeout = box.cfg.replication_synchro_timeout
| ---
| ...
@@ -60,8 +63,11 @@ fiber = require('fiber')
| ---
| ...
-- Replication timeout is small to speed up a first election start.
+-- Election timeout is set to a huge value to ensure the election does not hang
+-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
box.cfg{ \
election_mode = 'candidate', \
+ election_timeout = 1000000, \
replication_synchro_quorum = 3, \
replication_synchro_timeout = 1000000, \
replication_timeout = 0.1, \
@@ -114,8 +120,11 @@ box.cfg{replication_synchro_timeout = 1000000}
-- Configure separately from synchro timeout not to depend on the order of
-- synchro and election options appliance. Replication timeout is tiny to speed
-- up notice of the old leader death.
+-- Election timeout is set to a huge value to ensure the election does not hang
+-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
box.cfg{ \
election_mode = 'candidate', \
+ election_timeout = 1000000, \
replication_timeout = 0.01, \
}
| ---
@@ -143,6 +152,7 @@ test_run:cmd('delete server replica')
| ...
box.cfg{ \
election_mode = old_election_mode, \
+ election_timeout = old_election_timeout, \
replication_timeout = old_replication_timeout, \
replication = old_replication, \
replication_synchro_timeout = old_replication_synchro_timeout, \
diff --git a/test/replication/election_qsync.test.lua b/test/replication/election_qsync.test.lua
index 6a80f4859..eb89e5b79 100644
--- a/test/replication/election_qsync.test.lua
+++ b/test/replication/election_qsync.test.lua
@@ -2,6 +2,7 @@ test_run = require('test_run').new()
box.schema.user.grant('guest', 'super')
old_election_mode = box.cfg.election_mode
+old_election_timeout = box.cfg.election_timeout
old_replication_synchro_timeout = box.cfg.replication_synchro_timeout
old_replication_timeout = box.cfg.replication_timeout
old_replication = box.cfg.replication
@@ -28,8 +29,11 @@ box.cfg{election_mode = 'voter'}
test_run:switch('replica')
fiber = require('fiber')
-- Replication timeout is small to speed up a first election start.
+-- Election timeout is set to a huge value to ensure the election does not hang
+-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
box.cfg{ \
election_mode = 'candidate', \
+ election_timeout = 1000000, \
replication_synchro_quorum = 3, \
replication_synchro_timeout = 1000000, \
replication_timeout = 0.1, \
@@ -57,8 +61,11 @@ box.cfg{replication_synchro_timeout = 1000000}
-- Configure separately from synchro timeout not to depend on the order of
-- synchro and election options appliance. Replication timeout is tiny to speed
-- up notice of the old leader death.
+-- Election timeout is set to a huge value to ensure the election does not hang
+-- anywhere. Indeed, there can't be a split-vote when candidate is only one.
box.cfg{ \
election_mode = 'candidate', \
+ election_timeout = 1000000, \
replication_timeout = 0.01, \
}
@@ -70,6 +77,7 @@ box.space.test:drop()
test_run:cmd('delete server replica')
box.cfg{ \
election_mode = old_election_mode, \
+ election_timeout = old_election_timeout, \
replication_timeout = old_replication_timeout, \
replication = old_replication, \
replication_synchro_timeout = old_replication_synchro_timeout, \