[Tarantool-patches] [PATCH 3/3] replication: add test with random leaders promotion and demotion
Sergey Bronnikov
sergeyb at tarantool.org
Mon Aug 31 13:05:39 MSK 2020
Sergey, thanks for the review!
See my comments inline; the test has been updated on the branch.
On 13:49 Thu 27 Aug, Serge Petrenko wrote:
>
> Hi! Thanks for the fixes!
> I’m pasting parts of the patch below to comment on.
>
> +box.cfg({
> +    listen = instance_uri(INSTANCE_ID);
> +    replication_connect_quorum = 3;
> +    replication = {
> +        instance_uri(1);
> +        instance_uri(2);
> +        instance_uri(3);
> +        instance_uri(4);
> +        instance_uri(5);
> +    };
> +})
>
>
> * You should either omit `replication_connect_quorum` altogether, or set it to 5.
> Omitting it will have the same effect.
> I think you meant `replication_synchro_quorum` here, then it makes sense
> to set it to 3. Also `replication_synchro_timeout` should be set here,
> I’ll mention it again below.
Removed replication_connect_quorum entirely.
> +
> +NUM_INSTANCES = 5
> +BROKEN_QUORUM = NUM_INSTANCES + 1
> +
>
>
> * BROKEN_QUORUM assigned but never used.
removed
>
> +
> +test_run:create_cluster(SERVERS, "replication", {args="0.1"})
> +test_run:wait_fullmesh(SERVERS)
>
>
> * You’re passing some argument to qsync1, … qsync5 instances, but you never use it.
Removed the argument.
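For reference, the bring-up now shrinks to the usual two calls (a sketch, assuming
the stock test_run helpers):

-- No per-instance argument is passed any more.
test_run:create_cluster(SERVERS, "replication")
test_run:wait_fullmesh(SERVERS)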
>
> +current_leader_id = 1
> +test_run:switch(SERVERS[current_leader_id])
> +box.cfg{replication_synchro_timeout=1}
>
>
> * You should set `replication_synchro_timeout` on every instance, not only
> on qsync1 so you better move this box.cfg call to the instance file.
> Besides, the timeout should be bigger (much bigger), like Vlad said.
> We typically use 30 seconds for various replication timeouts.
> It’s fairly common when a test is stable on your machine, but is flaky on
> testing machines.
Increased the timeout to 1000 and moved the synchro settings into the qsync.lua instance file.
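Roughly, the instance file now carries the whole configuration, something like this
(a sketch of qsync.lua with the values discussed above, not the literal file):

-- Every instance gets the same synchro settings, so a leader switch
-- does not depend on per-testcase box.cfg calls.
box.cfg({
    listen = instance_uri(INSTANCE_ID);
    replication = {
        instance_uri(1);
        instance_uri(2);
        instance_uri(3);
        instance_uri(4);
        instance_uri(5);
    };
    replication_synchro_quorum = 3;
    replication_synchro_timeout = 1000;
})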
> +test_run:switch('default')
> +
> +-- Testcase body.
> +for i=1,10 do \
> +    new_leader_id = random(current_leader_id, 1, #SERVERS) \
> +    test_run:switch(SERVERS[new_leader_id]) \
> +    box.cfg{read_only=false} \
> +    f1 = fiber.create(function() \
> +        pcall(box.space.sync:truncate{}) \
> +    end) \
>
>
> * Why put truncate call in a separate fiber?
>
> Why use truncate at all? You may just replace all your `insert` calls below
> with `replace`, and then truncate won’t be needed. This is up to you
> though.
As far as I remember, the original idea was to clean up the space before the next insert.
I have rewritten the test body: all inserted values are now kept in the space,
and after the loop the test checks the number of inserted values.
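For the record, the replace-based variant you suggest would look roughly like this
(a sketch, not what the updated test does):

-- replace overwrites a tuple with the same primary key instead of
-- raising an error, so no preliminary truncate would be needed:
for i = 1, 10000 do box.space.sync:replace{i} end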
>
> +    f2 = fiber.create(function() \
> +        for i=1,10000 do box.space.sync:insert{i} end \
> +    end) \
>
> * So you’re testing a case when a leader has some unconfirmed transactions in
> limbo and then a leader change happens. Then you need to call
> `clear_synchro_queue` on a new leader to wait for confirmation of old txns. Otherwise
> the new leader fails to insert its data, but the test doesn’t show this, because you
> don’t check fiber state or `insert()` return values.
Added a clear_synchro_queue() call to the updated test.
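The hand-off on the new leader now boils down to roughly this (a sketch; the real
test drives it through test_run:eval, see the pasted body below):

-- On the instance being promoted:
box.cfg{replication_synchro_quorum = 3}  -- restore a reachable quorum
box.ctl.clear_synchro_queue()            -- wait for the old leader's pending
                                         -- synchro transactions to be confirmed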
> +    test_run:switch('default') \
> +    test_run:switch(SERVERS[current_leader_id]) \
>
>
> * The 2 lines above are useless.
Removed; the testcase body now looks like this in the updated test:
-- Testcase body.
for i=1,100 do \
    test_run:eval(SERVERS[current_leader_id], \
                  "box.cfg{replication_synchro_quorum=6}") \
    test_run:eval(SERVERS[current_leader_id], \
                  string.format("box.space.sync:insert{%d}", i)) \
    new_leader_id = random(current_leader_id, 1, #SERVERS) \
    test_run:eval(SERVERS[new_leader_id], \
                  "box.cfg{replication_synchro_quorum=3}") \
    test_run:eval(SERVERS[new_leader_id], "box.ctl.clear_synchro_queue()") \
    replica = random(new_leader_id, 1, #SERVERS) \
    test_run:eval(SERVERS[replica], \
                  string.format("box.space.sync:get{%d}", i)) \
    test_run:switch('default') \
    current_leader_id = new_leader_id \
end
test_run:switch('qsync1')
box.space.sync:count() -- 100
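The teardown stays the standard one (a sketch, assuming the usual test_run helpers):

-- Teardown.
test_run:switch('default')
test_run:drop_cluster(SERVERS)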
>
> +    box.cfg{read_only=true} \
> +    test_run:switch('default') \
> +    current_leader_id = new_leader_id \
> +end
> +
> +-- Teardown.
> --
> Serge Petrenko
--
sergeyb@