[Tarantool-patches] [PATCH 3/3] replication: add test with random leaders promotion and demotion

Serge Petrenko sergepetrenko at tarantool.org
Tue Sep 1 16:03:16 MSK 2020


31.08.2020 13:05, Sergey Bronnikov пишет:
> Sergey, thanks for review!
> See my comments inline, test updated in a branch.

Hi! Thanks for the fixes.

See my comments below.


>
> On 13:49 Thu 27 Aug , Serge Petrenko wrote:
>> Hi! Thanks for the fixes!
>> I’m pasting parts of the patch below to comment on.
>>
>> +box.cfg({
>> +    listen = instance_uri(INSTANCE_ID);
>> +    replication_connect_quorum = 3;
>> +    replication = {
>> +        instance_uri(1);
>> +        instance_uri(2);
>> +        instance_uri(3);
>> +        instance_uri(4);
>> +        instance_uri(5);
>> +    };
>> +})
>>   
>>   
>> *  You should either omit `replication_connect_quorum` at all, or set it to 5.
>> Omitting it will have the same effect.
>> I think you meant `replication_synchro_quorum` here, then it makes sense
>> to set it to 3. Also   `replication_synchro_timeout` should be set here,
>> I’ll mention it  ​​​​​​​ again below.
> removed replication_connect_quorum at all

+
+box.once("bootstrap", function()
+    local test_run = require('test_run').new()
+    box.cfg{replication_synchro_timeout = 1000, 
replication_synchro_quorum = 5}
+    box.cfg{read_only = false}
+    box.schema.user.grant("guest", 'replication')
+end)

Since you moved space creation out of `box.once` call, you don't need 
test_run here.


>> +
>> +NUM_INSTANCES = 5
>> +BROKEN_QUORUM = NUM_INSTANCES + 1
>> +
>>   
>>   
>> *  BROKEN_QUORUM assigned but never used.
> removed
>
>>   
>> +
>> +test_run:create_cluster(SERVERS, "replication", {args="0.1"})
>> +test_run:wait_fullmesh(SERVERS)
>>   
>>   
>> *  You’re passing some argument to qsync1, … qsync5 instances, but you never use  it.
> removed argument
>
>>   
>> +current_leader_id = 1
>> +test_run:switch(SERVERS[current_leader_id])
>> +box.cfg{replication_synchro_timeout=1}
>>   
>>   
>> *  You should set `replication_synchro_timeout` on every instance, not only
>> on qsync1 so   you better move this box.cfg call to the instance file.
>> Besides, the timeout should be bigger (much bigger), like Vlad said.
>> We typically use 30 seconds for various replication timeouts.
>> It’s fairly common when a test is stable on your machine, but is flaky on
>> testing machines.
> increased timeout up to 1000 and set synchro settings in a qsync.lua file
>
>> +test_run:switch('default')
>> +
>> +-- Testcase body.
>> +for i=1,10 do                                                                  \
>> +    new_leader_id = random(current_leader_id, 1, #SERVERS)                     \
>> +    test_run:switch(SERVERS[new_leader_id])                                    \
>> +    box.cfg{read_only=false}                                                   \
>> +    f1 = fiber.create(function()                                               \
>> +        pcall(box.space.sync:truncate{})                                       \
>> +    end)                                                                       \
>>   
>>   
>> *  Why put truncate call in a separate fiber?
>>
>> Why use truncate at all? You may just replace all your `insert` calls below
>> with `replace`, and then truncate won’t be needed. This is up to you
>> though.
> As far as I remember original idea was to cleanup before next insert.
> I rewrote a body of test and keep all inserted values in a space and
> after loop check amount of inserted values.
>
>>   
>> +    f2 = fiber.create(function()                                               \
>> +        for i=1,10000 do box.space.sync:insert{i} end                          \
>> +    end)                                                                       \
>>   
>> *  So you’re testing a case when a leader has some unconfirmed transactions in
>> limbo and then a leader change happens. Then you need to call
>> `clear_synchro_queue` on a new leader to wait for confirmation of old txns. Otherwise
>> the new leader fails to insert its data, but the test doesn’t show this, because you
>> don’t check fiber state or `insert()` return values.
> added clear_synchro_queue() call to updated test
>
>> +    test_run:switch('default')                                                 \
>> +    test_run:switch(SERVERS[current_leader_id])                                \
>>   
>>   
>> *  The 2 lines above are useless.
> removed in updated test:
>
> -- Testcase body.
> for i=1,100 do                                                                 \
>      test_run:eval(SERVERS[current_leader_id],                                  \
>          "box.cfg{replication_synchro_quorum=6}")                               \
>      test_run:eval(SERVERS[current_leader_id],                                  \
>          string.format("box.space.sync:insert{%d}", i))                         \
>      new_leader_id = random(current_leader_id, 1, #SERVERS)                     \
>      test_run:eval(SERVERS[new_leader_id],                                      \
>          "box.cfg{replication_synchro_quorum=3}")                               \
>      test_run:eval(SERVERS[new_leader_id], "box.ctl.clear_synchro_queue()")     \
>      replica = random(new_leader_id, 1, #SERVERS)                               \
>      test_run:eval(SERVERS[replica],                                            \
>          string.format("box.space.sync:get{%d}", i))                            \

It  is not checked whether get returns any data or not.
You only make a call but never check the return value.

It'd also be nice to check whether the insert on the old leader (the one 
with
quorum =  6) succeeds after the new leader executes `clear_synchro_queue`.


>      test_run:switch('default')                                                 \
>      current_leader_id = new_leader_id                                          \
> end
>
> test_run:switch('qsync1')
> box.space.sync:count() -- 100
>
>>   
>> +    box.cfg{read_only=true}                                                    \
>> +    test_run:switch('default')                                                 \
>> +    current_leader_id = new_leader_id                                          \
>> +end
>> +
>> +-- Teardown.
>> --
>> Serge Petrenko

-- 
Serge Petrenko



More information about the Tarantool-patches mailing list