Hi! Thanks for the fixes!
I’m pasting parts of the patch below to comment on.
 

 

+box.cfg({

+    listen = instance_uri(INSTANCE_ID);

+    replication_connect_quorum = 3;

+    replication = {

+        instance_uri(1);

+        instance_uri(2);

+        instance_uri(3);

+        instance_uri(4);

+        instance_uri(5);

+    };

+})

 
 
  1. You should either omit `replication_connect_quorum` at all, or set it to 5.
    Omitting it will have the same effect.
    I think you meant `replication_synchro_quorum` here, then it makes sense
    to set it to 3. Also
     `replication_synchro_timeout` should be set here,
    I’ll mention it
     ​​​​​​​again below.
 
 

+

+NUM_INSTANCES = 5

+BROKEN_QUORUM = NUM_INSTANCES + 1

+

 
 
  1. BROKEN_QUORUM assigned but never used.
 
 

+

+test_run:create_cluster(SERVERS, "replication", {args="0.1"})

+test_run:wait_fullmesh(SERVERS)

 
 
  1. You’re passing some argument to qsync1, … qsync5 instances, but you never use it.
 
 

+current_leader_id = 1

+test_run:switch(SERVERS[current_leader_id])

+box.cfg{replication_synchro_timeout=1}

 
 
  1. You should set `replication_synchro_timeout` on every instance, not only
    on qsync1 so
     you better move this box.cfg call to the instance file.
    Besides, the timeout should be bigger (much bigger), like Vlad said.
    We typically use 30 seconds for various replication timeouts.
    It’s fairly common when a test is stable on your machine, but is flaky on
    testing machines.
 
 

+test_run:switch('default')

+

+-- Testcase body.

+for i=1,10 do                                                                  \

+    new_leader_id = random(current_leader_id, 1, #SERVERS)                     \

+    test_run:switch(SERVERS[new_leader_id])                                    \

+    box.cfg{read_only=false}                                                   \

+    f1 = fiber.create(function()                                               \

+        pcall(box.space.sync:truncate{})                                       \

+    end)                                                                       \

 
 
  1. Why put truncate call in a separate fiber?

    Why use truncate at all? You may just replace all your `insert` calls below
    with `replace`, and then truncate won’t be needed. This is up to you
    though.
 
 

+    f2 = fiber.create(function()                                               \

+        for i=1,10000 do box.space.sync:insert{i} end                          \

+    end)                                                                       \

 
  1. So you’re testing a case when a leader has some unconfirmed transactions in
    limbo and then a leader change happens. Then you need to call
    `clear_synchro_queue` on a new leader to wait for confirmation of old txns. Otherwise
    the new leader fails to insert its data, but the test doesn’t show this, because you
    don’t check fiber state or `insert()` return values.
 

+    test_run:switch('default')                                                 \

+    test_run:switch(SERVERS[current_leader_id])                                \

 
 
  1. The 2 lines above are useless.
 
 

+    box.cfg{read_only=true}                                                    \

+    test_run:switch('default')                                                 \

+    current_leader_id = new_leader_id                                          \

+end

+

+-- Teardown.

 

 
 
--
Serge Petrenko