Hi, Sergey! Thanks for the patch and the fixes. New version LGTM. I couldn’t find the other patches in my mailbox though, could you please resend them? If you need my review, of course. >Вторник, 25 августа 2020, 15:49 +03:00 от Sergey Bronnikov : >  >Vlad, thanks for review! >Patches updated in a branch and please see my answers inline. > >On 00:00 Tue 21 Jul , Vladislav Shpilevoy wrote: >> Thanks for the patch! >> >> See 8 comments below. >> >> On 09.07.2020 19:16, sergeyb@tarantool.org wrote: >> > From: Sergey Bronnikov < sergeyb@tarantool.org > >> > >> > Part of #5055 >> > Part of #4849 >> > --- >> > test/replication/qsync_basic.result | 85 +++++++++++++++++++++++++++ >> > test/replication/qsync_basic.test.lua | 31 ++++++++++ >> > 2 files changed, 116 insertions(+) >> > >> > diff --git a/test/replication/qsync_basic.result b/test/replication/qsync_basic.result >> > index ab4be0c7e..464df75a7 100644 >> > --- a/test/replication/qsync_basic.result >> > +++ b/test/replication/qsync_basic.result >> > @@ -32,6 +32,14 @@ s2.is_sync >> > | - false >> > | ... >> > >> > +-- >> > +-- gh-4849: clear synchro queue with unconfigured box >> > +-- >> > +box.ctl.clear_synchro_queue() >> >> 1. Enough to test that it does not crash here. No need to >> check result. > >Well, removed assignment to variable. > >> > + | --- >> > + | - -1 >> > + | ... >> > + >> > -- Net.box takes sync into account. >> > box.schema.user.grant('guest', 'super') >> > | --- >> > @@ -553,6 +561,82 @@ box.space.sync:select{7} >> > | - - [7] >> > | ... >> > >> > +-- >> > +-- gh-4849: clear synchro queue on a replica >> > +-- >> > +test_run:switch('default') >> > + | --- >> > + | - true >> > + | ... >> > +box.cfg{replication_synchro_quorum = 3, replication_synchro_timeout = 2} >> >> 2. If you want this timeout to fail, it is too big. The test will be too >> long. If it is supposed not to fail, then it is way too small. If you want >> it to fail, it should be something like 0.001. If you want it to hold, it >> should be 1000 to be sure. Keep in mind that you can change the timeout on >> fly. That allows to build quite complex reactive tests completely event-based. > >set replication_synchro_timeout value to 1000 > >> > + | --- >> > + | ... >> > +f1 = fiber.create(box.space.sync.replace, box.space.sync, {9}) >> >> 3. You need to extract the exact result value. Use pcall for that, and print >> its result after the fiber is dead. See other examples with fiber.create() in >> qsync test suite. The same for the next test case. > >Done. > >> > + | --- >> > + | ... >> > +f1:status() >> > + | --- >> > + | - suspended >> > + | ... >> > +test_run:switch('replica') >> > + | --- >> > + | - true >> >> 4. Better wait until the value is delivered. Otherwise you can switch to >> replica before master finishes WAL write, and the queue will be empty here. >> Try >> >> test_run:wait_cond(function() return box.space.sync:get{9} ~= nil end) > >Agree, added wait_cond(). > >> > + | ... >> > +box.ctl.clear_synchro_queue() >> > + | --- >> > + | - 0 >> > + | ... >> > +box.space.sync:select{9} >> > + | --- >> > + | - [] >> >> 5. If select returns 9 before queue clean, and returns empty afterwards, >> it means the queue was cleared. So here the queue size is not really needed, >> as you can see. > >removed this and next select{} statements > >> > + | ... >> > +test_run:switch('default') >> > + | --- >> > + | - true >> > + | ... >> > +box.space.sync:select{9} >> > + | --- >> > + | - [] >> > + | ... >> > +f1:status() >> > + | --- >> > + | - dead >> > + | ... >> > + >> > +-- >> > +-- gh-4849: clear synchro queue on a master >> >> 6. Since the previous test the replica's WAL is different from master's. >> I am not sure the replication is still alive. > >Test with clear_synchro_queue() on master keeps master alive at the end >of test so I moved it before the same test for the replica. Also added >note for others that cluster may be in a broken state here. > >> > +-- >> > +test_run:switch('default') >> > + | --- >> > + | - true >> > + | ... >> > +box.cfg{replication_synchro_quorum = 3, replication_synchro_timeout = 2} >> > + | --- >> > + | ... >> > +f1 = fiber.create(box.space.sync.replace, box.space.sync, {10}) >> > + | --- >> > + | ... >> > +f1:status() >> > + | --- >> > + | - suspended >> > + | ... >> > +box.ctl.clear_synchro_queue() >> > + | --- >> > + | - -2 >> > + | ... >> > +box.space.sync:select{10} >> > + | --- >> > + | - - [10] >> >> 7. Why is it 10? The quorum is not reached, it should have been rolled back. > >Added "box.cfg{replication_synchro_timeout = 0.1}" before clear_synchro_queue() >call and now tx is rolled back. Also I've added checks of current value with >wait_cond() that is more reliable than select{} on a master and replica. >Updated output looks like this: > >box.ctl.clear_synchro_queue() > | --- > | ... >test_run:switch('replica') > | --- > | - true > | ... >test_run:wait_cond(function() return box.space.sync:get{10} == nil end) > | --- > | - true > | ... >test_run:switch('default') > | --- > | - true > | ... >test_run:wait_cond(function() return f:status() == 'dead' end) > | --- > | - true > | ... >ok, err > | --- > | - false > | - Quorum collection for a synchronous transaction is timed out > | ... >test_run:wait_cond(function() return box.space.sync:get{10} == nil end) > | --- > | - true > | ... > >> > + | ... >> > +test_run:switch('replica') >> > + | --- >> > + | - true >> > + | ... >> > +box.space.sync:select{10} >> > + | --- >> > + | - - [10] >> > + | ... >> > + >> > -- Cleanup. >> > test_run:cmd('switch default') >> > | --- >> > @@ -576,6 +660,7 @@ test_run:cmd('delete server replica') >> > | ... >> > box.space.test:drop() >> > | --- >> > + | - error: A rollback for a synchronous transaction is received >> >> 8. Why is it changed? > >Perhaps it is because in previous version of test I haven't wait fibers 'dead'. >There is no error now: > >box.space.test:drop() > | --- > | ... >box.space.sync:drop() > | --- > | ... > >> >> > | ... >> > box.space.sync:drop() >> > | ---   -- Serge Petrenko