[Tarantool-patches] [PATCH 3/3] replication: add test with random leaders promotion and demotion
Serge Petrenko
sergepetrenko at tarantool.org
Tue Sep 1 16:03:16 MSK 2020
31.08.2020 13:05, Sergey Bronnikov пишет:
> Sergey, thanks for review!
> See my comments inline, test updated in a branch.
Hi! Thanks for the fixes.
See my comments below.
>
> On 13:49 Thu 27 Aug , Serge Petrenko wrote:
>> Hi! Thanks for the fixes!
>> I’m pasting parts of the patch below to comment on.
>>
>> +box.cfg({
>> + listen = instance_uri(INSTANCE_ID);
>> + replication_connect_quorum = 3;
>> + replication = {
>> + instance_uri(1);
>> + instance_uri(2);
>> + instance_uri(3);
>> + instance_uri(4);
>> + instance_uri(5);
>> + };
>> +})
>>
>>
>> * You should either omit `replication_connect_quorum` at all, or set it to 5.
>> Omitting it will have the same effect.
>> I think you meant `replication_synchro_quorum` here, then it makes sense
>> to set it to 3. Also `replication_synchro_timeout` should be set here,
>> I’ll mention it again below.
> removed replication_connect_quorum at all
+
+box.once("bootstrap", function()
+ local test_run = require('test_run').new()
+ box.cfg{replication_synchro_timeout = 1000,
replication_synchro_quorum = 5}
+ box.cfg{read_only = false}
+ box.schema.user.grant("guest", 'replication')
+end)
Since you moved space creation out of `box.once` call, you don't need
test_run here.
>> +
>> +NUM_INSTANCES = 5
>> +BROKEN_QUORUM = NUM_INSTANCES + 1
>> +
>>
>>
>> * BROKEN_QUORUM assigned but never used.
> removed
>
>>
>> +
>> +test_run:create_cluster(SERVERS, "replication", {args="0.1"})
>> +test_run:wait_fullmesh(SERVERS)
>>
>>
>> * You’re passing some argument to qsync1, … qsync5 instances, but you never use it.
> removed argument
>
>>
>> +current_leader_id = 1
>> +test_run:switch(SERVERS[current_leader_id])
>> +box.cfg{replication_synchro_timeout=1}
>>
>>
>> * You should set `replication_synchro_timeout` on every instance, not only
>> on qsync1 so you better move this box.cfg call to the instance file.
>> Besides, the timeout should be bigger (much bigger), like Vlad said.
>> We typically use 30 seconds for various replication timeouts.
>> It’s fairly common when a test is stable on your machine, but is flaky on
>> testing machines.
> increased timeout up to 1000 and set synchro settings in a qsync.lua file
>
>> +test_run:switch('default')
>> +
>> +-- Testcase body.
>> +for i=1,10 do \
>> + new_leader_id = random(current_leader_id, 1, #SERVERS) \
>> + test_run:switch(SERVERS[new_leader_id]) \
>> + box.cfg{read_only=false} \
>> + f1 = fiber.create(function() \
>> + pcall(box.space.sync:truncate{}) \
>> + end) \
>>
>>
>> * Why put truncate call in a separate fiber?
>>
>> Why use truncate at all? You may just replace all your `insert` calls below
>> with `replace`, and then truncate won’t be needed. This is up to you
>> though.
> As far as I remember original idea was to cleanup before next insert.
> I rewrote a body of test and keep all inserted values in a space and
> after loop check amount of inserted values.
>
>>
>> + f2 = fiber.create(function() \
>> + for i=1,10000 do box.space.sync:insert{i} end \
>> + end) \
>>
>> * So you’re testing a case when a leader has some unconfirmed transactions in
>> limbo and then a leader change happens. Then you need to call
>> `clear_synchro_queue` on a new leader to wait for confirmation of old txns. Otherwise
>> the new leader fails to insert its data, but the test doesn’t show this, because you
>> don’t check fiber state or `insert()` return values.
> added clear_synchro_queue() call to updated test
>
>> + test_run:switch('default') \
>> + test_run:switch(SERVERS[current_leader_id]) \
>>
>>
>> * The 2 lines above are useless.
> removed in updated test:
>
> -- Testcase body.
> for i=1,100 do \
> test_run:eval(SERVERS[current_leader_id], \
> "box.cfg{replication_synchro_quorum=6}") \
> test_run:eval(SERVERS[current_leader_id], \
> string.format("box.space.sync:insert{%d}", i)) \
> new_leader_id = random(current_leader_id, 1, #SERVERS) \
> test_run:eval(SERVERS[new_leader_id], \
> "box.cfg{replication_synchro_quorum=3}") \
> test_run:eval(SERVERS[new_leader_id], "box.ctl.clear_synchro_queue()") \
> replica = random(new_leader_id, 1, #SERVERS) \
> test_run:eval(SERVERS[replica], \
> string.format("box.space.sync:get{%d}", i)) \
It is not checked whether get returns any data or not.
You only make a call but never check the return value.
It'd also be nice to check whether the insert on the old leader (the one
with
quorum = 6) succeeds after the new leader executes `clear_synchro_queue`.
> test_run:switch('default') \
> current_leader_id = new_leader_id \
> end
>
> test_run:switch('qsync1')
> box.space.sync:count() -- 100
>
>>
>> + box.cfg{read_only=true} \
>> + test_run:switch('default') \
>> + current_leader_id = new_leader_id \
>> +end
>> +
>> +-- Teardown.
>> --
>> Serge Petrenko
--
Serge Petrenko
More information about the Tarantool-patches
mailing list