[Tarantool-patches] [PATCH 2/3] replication: test clear_synchro_queue function

Sergey Bronnikov sergeyb at tarantool.org
Tue Aug 25 15:49:39 MSK 2020


Vlad, thanks for review!
Patches updated in a branch and please see my answers inline.

On 00:00 Tue 21 Jul , Vladislav Shpilevoy wrote:
> Thanks for the patch!
> 
> See 8 comments below.
> 
> On 09.07.2020 19:16, sergeyb at tarantool.org wrote:
> > From: Sergey Bronnikov <sergeyb at tarantool.org>
> > 
> > Part of #5055
> > Part of #4849
> > ---
> >  test/replication/qsync_basic.result   | 85 +++++++++++++++++++++++++++
> >  test/replication/qsync_basic.test.lua | 31 ++++++++++
> >  2 files changed, 116 insertions(+)
> > 
> > diff --git a/test/replication/qsync_basic.result b/test/replication/qsync_basic.result
> > index ab4be0c7e..464df75a7 100644
> > --- a/test/replication/qsync_basic.result
> > +++ b/test/replication/qsync_basic.result
> > @@ -32,6 +32,14 @@ s2.is_sync
> >   | - false
> >   | ...
> >  
> > +--
> > +-- gh-4849: clear synchro queue with unconfigured box
> > +--
> > +box.ctl.clear_synchro_queue()
> 
> 1. Enough to test that it does not crash here. No need to
> check result.

Well, removed assignment to variable.

> > + | ---
> > + | - -1
> > + | ...
> > +
> >  -- Net.box takes sync into account.
> >  box.schema.user.grant('guest', 'super')
> >   | ---
> > @@ -553,6 +561,82 @@ box.space.sync:select{7}
> >   | - - [7]
> >   | ...
> >  
> > +--
> > +-- gh-4849: clear synchro queue on a replica
> > +--
> > +test_run:switch('default')
> > + | ---
> > + | - true
> > + | ...
> > +box.cfg{replication_synchro_quorum = 3, replication_synchro_timeout = 2}
> 
> 2. If you want this timeout to fail, it is too big. The test will be too
> long. If it is supposed not to fail, then it is way too small. If you want
> it to fail, it should be something like 0.001. If you want it to hold, it
> should be 1000 to be sure. Keep in mind that you can change the timeout on
> fly. That allows to build quite complex reactive tests completely event-based.

set replication_synchro_timeout value to 1000

> > + | ---
> > + | ...
> > +f1 = fiber.create(box.space.sync.replace, box.space.sync, {9})
> 
> 3. You need to extract the exact result value. Use pcall for that, and print
> its result after the fiber is dead. See other examples with fiber.create() in
> qsync test suite. The same for the next test case.

Done.

> > + | ---
> > + | ...
> > +f1:status()
> > + | ---
> > + | - suspended
> > + | ...
> > +test_run:switch('replica')
> > + | ---
> > + | - true
> 
> 4. Better wait until the value is delivered. Otherwise you can switch to
> replica before master finishes WAL write, and the queue will be empty here.
> Try
> 
>     test_run:wait_cond(function() return box.space.sync:get{9} ~= nil end)

Agree, added wait_cond().

> > + | ...
> > +box.ctl.clear_synchro_queue()
> > + | ---
> > + | - 0
> > + | ...
> > +box.space.sync:select{9}
> > + | ---
> > + | - []
> 
> 5. If select returns 9 before queue clean, and returns empty afterwards,
> it means the queue was cleared. So here the queue size is not really needed,
> as you can see.

removed this and next select{} statements

> > + | ...
> > +test_run:switch('default')
> > + | ---
> > + | - true
> > + | ...
> > +box.space.sync:select{9}
> > + | ---
> > + | - []
> > + | ...
> > +f1:status()
> > + | ---
> > + | - dead
> > + | ...
> > +
> > +--
> > +-- gh-4849: clear synchro queue on a master
> 
> 6. Since the previous test the replica's WAL is different from master's.
> I am not sure the replication is still alive.

Test with clear_synchro_queue() on master keeps master alive at the end
of test so I moved it before the same test for the replica. Also added
note for others that cluster may be in a broken state here.

> > +--
> > +test_run:switch('default')
> > + | ---
> > + | - true
> > + | ...
> > +box.cfg{replication_synchro_quorum = 3, replication_synchro_timeout = 2}
> > + | ---
> > + | ...
> > +f1 = fiber.create(box.space.sync.replace, box.space.sync, {10})
> > + | ---
> > + | ...
> > +f1:status()
> > + | ---
> > + | - suspended
> > + | ...
> > +box.ctl.clear_synchro_queue()
> > + | ---
> > + | - -2
> > + | ...
> > +box.space.sync:select{10}
> > + | ---
> > + | - - [10]
> 
> 7. Why is it 10? The quorum is not reached, it should have been rolled back.

Added "box.cfg{replication_synchro_timeout = 0.1}" before clear_synchro_queue()
call and now tx is rolled back. Also I've added checks of current value with
wait_cond() that is more reliable than select{} on a master and replica.
Updated output looks like this:

box.ctl.clear_synchro_queue()
 | ---
 | ...
test_run:switch('replica')
 | ---
 | - true
 | ...
test_run:wait_cond(function() return box.space.sync:get{10} == nil end)
 | ---
 | - true
 | ...
test_run:switch('default')
 | ---
 | - true
 | ...
test_run:wait_cond(function() return f:status() == 'dead' end)
 | ---
 | - true
 | ...
ok, err
 | ---
 | - false
 | - Quorum collection for a synchronous transaction is timed out
 | ...
test_run:wait_cond(function() return box.space.sync:get{10} == nil end)
 | ---
 | - true
 | ...

> > + | ...
> > +test_run:switch('replica')
> > + | ---
> > + | - true
> > + | ...
> > +box.space.sync:select{10}
> > + | ---
> > + | - - [10]
> > + | ...
> > +
> >  -- Cleanup.
> >  test_run:cmd('switch default')
> >   | ---
> > @@ -576,6 +660,7 @@ test_run:cmd('delete server replica')
> >   | ...
> >  box.space.test:drop()
> >   | ---
> > + | - error: A rollback for a synchronous transaction is received
> 
> 8. Why is it changed?

Perhaps it is because in previous version of test I haven't wait fibers 'dead'.
There is no error now:

box.space.test:drop()
 | ---
 | ...
box.space.sync:drop()
 | ---
 | ...

> 
> >   | ...
> >  box.space.sync:drop()
> >   | ---


More information about the Tarantool-patches mailing list