[Tarantool-patches] [PATCH 3/3] replication: add test with random leaders promotion and demotion
Serge Petrenko
sergepetrenko at tarantool.org
Thu Aug 27 13:49:42 MSK 2020
Hi! Thanks for the fixes!
I’m pasting parts of the patch below to comment on.
+box.cfg({
+ listen = instance_uri(INSTANCE_ID);
+ replication_connect_quorum = 3;
+ replication = {
+ instance_uri(1);
+ instance_uri(2);
+ instance_uri(3);
+ instance_uri(4);
+ instance_uri(5);
+ };
+})
* You should either omit `replication_connect_quorum` at all, or set it to 5.
Omitting it will have the same effect.
I think you meant `replication_synchro_quorum` here, then it makes sense
to set it to 3. Also `replication_synchro_timeout` should be set here,
I’ll mention it again below.
+
+NUM_INSTANCES = 5
+BROKEN_QUORUM = NUM_INSTANCES + 1
+
* BROKEN_QUORUM assigned but never used.
+
+test_run:create_cluster(SERVERS, "replication", {args="0.1"})
+test_run:wait_fullmesh(SERVERS)
* You’re passing some argument to qsync1, … qsync5 instances, but you never use it.
+current_leader_id = 1
+test_run:switch(SERVERS[current_leader_id])
+box.cfg{replication_synchro_timeout=1}
* You should set `replication_synchro_timeout` on every instance, not only
on qsync1 so you better move this box.cfg call to the instance file.
Besides, the timeout should be bigger (much bigger), like Vlad said.
We typically use 30 seconds for various replication timeouts.
It’s fairly common when a test is stable on your machine, but is flaky on
testing machines.
+test_run:switch('default')
+
+-- Testcase body.
+for i=1,10 do \
+ new_leader_id = random(current_leader_id, 1, #SERVERS) \
+ test_run:switch(SERVERS[new_leader_id]) \
+ box.cfg{read_only=false} \
+ f1 = fiber.create(function() \
+ pcall(box.space.sync:truncate{}) \
+ end) \
* Why put truncate call in a separate fiber?
Why use truncate at all? You may just replace all your `insert` calls below
with `replace`, and then truncate won’t be needed. This is up to you
though.
+ f2 = fiber.create(function() \
+ for i=1,10000 do box.space.sync:insert{i} end \
+ end) \
* So you’re testing a case when a leader has some unconfirmed transactions in
limbo and then a leader change happens. Then you need to call
`clear_synchro_queue` on a new leader to wait for confirmation of old txns. Otherwise
the new leader fails to insert its data, but the test doesn’t show this, because you
don’t check fiber state or `insert()` return values.
+ test_run:switch('default') \
+ test_run:switch(SERVERS[current_leader_id]) \
* The 2 lines above are useless.
+ box.cfg{read_only=true} \
+ test_run:switch('default') \
+ current_leader_id = new_leader_id \
+end
+
+-- Teardown.
--
Serge Petrenko
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.tarantool.org/pipermail/tarantool-patches/attachments/20200827/ca5f2e55/attachment.html>
More information about the Tarantool-patches
mailing list