From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp43.i.mail.ru (smtp43.i.mail.ru [94.100.177.103]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 28BC5430407 for ; Tue, 25 Aug 2020 15:49:41 +0300 (MSK) Date: Tue, 25 Aug 2020 15:49:39 +0300 From: Sergey Bronnikov Message-ID: <20200825124939.GB47610@pony.bronevichok.ru> References: <2e5a2d3948eafebd1488ab120ad0458838661d8a.1594314820.git.sergeyb@tarantool.org> <8c3b7d4c-22e8-108a-b15e-ef3c89510a63@tarantool.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <8c3b7d4c-22e8-108a-b15e-ef3c89510a63@tarantool.org> Subject: Re: [Tarantool-patches] [PATCH 2/3] replication: test clear_synchro_queue function List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Vladislav Shpilevoy Cc: tarantool-patches@dev.tarantool.org Vlad, thanks for review! Patches updated in a branch and please see my answers inline. On 00:00 Tue 21 Jul , Vladislav Shpilevoy wrote: > Thanks for the patch! > > See 8 comments below. > > On 09.07.2020 19:16, sergeyb@tarantool.org wrote: > > From: Sergey Bronnikov > > > > Part of #5055 > > Part of #4849 > > --- > > test/replication/qsync_basic.result | 85 +++++++++++++++++++++++++++ > > test/replication/qsync_basic.test.lua | 31 ++++++++++ > > 2 files changed, 116 insertions(+) > > > > diff --git a/test/replication/qsync_basic.result b/test/replication/qsync_basic.result > > index ab4be0c7e..464df75a7 100644 > > --- a/test/replication/qsync_basic.result > > +++ b/test/replication/qsync_basic.result > > @@ -32,6 +32,14 @@ s2.is_sync > > | - false > > | ... > > > > +-- > > +-- gh-4849: clear synchro queue with unconfigured box > > +-- > > +box.ctl.clear_synchro_queue() > > 1. Enough to test that it does not crash here. No need to > check result. Well, removed assignment to variable. > > + | --- > > + | - -1 > > + | ... > > + > > -- Net.box takes sync into account. > > box.schema.user.grant('guest', 'super') > > | --- > > @@ -553,6 +561,82 @@ box.space.sync:select{7} > > | - - [7] > > | ... > > > > +-- > > +-- gh-4849: clear synchro queue on a replica > > +-- > > +test_run:switch('default') > > + | --- > > + | - true > > + | ... > > +box.cfg{replication_synchro_quorum = 3, replication_synchro_timeout = 2} > > 2. If you want this timeout to fail, it is too big. The test will be too > long. If it is supposed not to fail, then it is way too small. If you want > it to fail, it should be something like 0.001. If you want it to hold, it > should be 1000 to be sure. Keep in mind that you can change the timeout on > fly. That allows to build quite complex reactive tests completely event-based. set replication_synchro_timeout value to 1000 > > + | --- > > + | ... > > +f1 = fiber.create(box.space.sync.replace, box.space.sync, {9}) > > 3. You need to extract the exact result value. Use pcall for that, and print > its result after the fiber is dead. See other examples with fiber.create() in > qsync test suite. The same for the next test case. Done. > > + | --- > > + | ... > > +f1:status() > > + | --- > > + | - suspended > > + | ... > > +test_run:switch('replica') > > + | --- > > + | - true > > 4. Better wait until the value is delivered. Otherwise you can switch to > replica before master finishes WAL write, and the queue will be empty here. > Try > > test_run:wait_cond(function() return box.space.sync:get{9} ~= nil end) Agree, added wait_cond(). > > + | ... > > +box.ctl.clear_synchro_queue() > > + | --- > > + | - 0 > > + | ... > > +box.space.sync:select{9} > > + | --- > > + | - [] > > 5. If select returns 9 before queue clean, and returns empty afterwards, > it means the queue was cleared. So here the queue size is not really needed, > as you can see. removed this and next select{} statements > > + | ... > > +test_run:switch('default') > > + | --- > > + | - true > > + | ... > > +box.space.sync:select{9} > > + | --- > > + | - [] > > + | ... > > +f1:status() > > + | --- > > + | - dead > > + | ... > > + > > +-- > > +-- gh-4849: clear synchro queue on a master > > 6. Since the previous test the replica's WAL is different from master's. > I am not sure the replication is still alive. Test with clear_synchro_queue() on master keeps master alive at the end of test so I moved it before the same test for the replica. Also added note for others that cluster may be in a broken state here. > > +-- > > +test_run:switch('default') > > + | --- > > + | - true > > + | ... > > +box.cfg{replication_synchro_quorum = 3, replication_synchro_timeout = 2} > > + | --- > > + | ... > > +f1 = fiber.create(box.space.sync.replace, box.space.sync, {10}) > > + | --- > > + | ... > > +f1:status() > > + | --- > > + | - suspended > > + | ... > > +box.ctl.clear_synchro_queue() > > + | --- > > + | - -2 > > + | ... > > +box.space.sync:select{10} > > + | --- > > + | - - [10] > > 7. Why is it 10? The quorum is not reached, it should have been rolled back. Added "box.cfg{replication_synchro_timeout = 0.1}" before clear_synchro_queue() call and now tx is rolled back. Also I've added checks of current value with wait_cond() that is more reliable than select{} on a master and replica. Updated output looks like this: box.ctl.clear_synchro_queue() | --- | ... test_run:switch('replica') | --- | - true | ... test_run:wait_cond(function() return box.space.sync:get{10} == nil end) | --- | - true | ... test_run:switch('default') | --- | - true | ... test_run:wait_cond(function() return f:status() == 'dead' end) | --- | - true | ... ok, err | --- | - false | - Quorum collection for a synchronous transaction is timed out | ... test_run:wait_cond(function() return box.space.sync:get{10} == nil end) | --- | - true | ... > > + | ... > > +test_run:switch('replica') > > + | --- > > + | - true > > + | ... > > +box.space.sync:select{10} > > + | --- > > + | - - [10] > > + | ... > > + > > -- Cleanup. > > test_run:cmd('switch default') > > | --- > > @@ -576,6 +660,7 @@ test_run:cmd('delete server replica') > > | ... > > box.space.test:drop() > > | --- > > + | - error: A rollback for a synchronous transaction is received > > 8. Why is it changed? Perhaps it is because in previous version of test I haven't wait fibers 'dead'. There is no error now: box.space.test:drop() | --- | ... box.space.sync:drop() | --- | ... > > > | ... > > box.space.sync:drop() > > | ---