[Tarantool-patches] [PATCH 2/3] replication: test clear_synchro_queue function

Serge Petrenko sergepetrenko at tarantool.org
Wed Aug 26 10:31:37 MSK 2020


Hi, Sergey!
Thanks for the patch and the fixes. New version LGTM. I couldn’t find the other patches in my mailbox though, could you please resend them? If you need my review, of course. 
>Вторник, 25 августа 2020, 15:49 +03:00 от Sergey Bronnikov <sergeyb at tarantool.org>:
> 
>Vlad, thanks for review!
>Patches updated in a branch and please see my answers inline.
>
>On 00:00 Tue 21 Jul , Vladislav Shpilevoy wrote:
>> Thanks for the patch!
>>
>> See 8 comments below.
>>
>> On 09.07.2020 19:16,  sergeyb at tarantool.org wrote:
>> > From: Sergey Bronnikov < sergeyb at tarantool.org >
>> >
>> > Part of #5055
>> > Part of #4849
>> > ---
>> > test/replication/qsync_basic.result | 85 +++++++++++++++++++++++++++
>> > test/replication/qsync_basic.test.lua | 31 ++++++++++
>> > 2 files changed, 116 insertions(+)
>> >
>> > diff --git a/test/replication/qsync_basic.result b/test/replication/qsync_basic.result
>> > index ab4be0c7e..464df75a7 100644
>> > --- a/test/replication/qsync_basic.result
>> > +++ b/test/replication/qsync_basic.result
>> > @@ -32,6 +32,14 @@ s2.is_sync
>> > | - false
>> > | ...
>> >
>> > +--
>> > +-- gh-4849: clear synchro queue with unconfigured box
>> > +--
>> > +box.ctl.clear_synchro_queue()
>>
>> 1. Enough to test that it does not crash here. No need to
>> check result.
>
>Well, removed assignment to variable.
>
>> > + | ---
>> > + | - -1
>> > + | ...
>> > +
>> > -- Net.box takes sync into account.
>> > box.schema.user.grant('guest', 'super')
>> > | ---
>> > @@ -553,6 +561,82 @@ box.space.sync:select{7}
>> > | - - [7]
>> > | ...
>> >
>> > +--
>> > +-- gh-4849: clear synchro queue on a replica
>> > +--
>> > +test_run:switch('default')
>> > + | ---
>> > + | - true
>> > + | ...
>> > +box.cfg{replication_synchro_quorum = 3, replication_synchro_timeout = 2}
>>
>> 2. If you want this timeout to fail, it is too big. The test will be too
>> long. If it is supposed not to fail, then it is way too small. If you want
>> it to fail, it should be something like 0.001. If you want it to hold, it
>> should be 1000 to be sure. Keep in mind that you can change the timeout on
>> fly. That allows to build quite complex reactive tests completely event-based.
>
>set replication_synchro_timeout value to 1000
>
>> > + | ---
>> > + | ...
>> > +f1 = fiber.create(box.space.sync.replace, box.space.sync, {9})
>>
>> 3. You need to extract the exact result value. Use pcall for that, and print
>> its result after the fiber is dead. See other examples with fiber.create() in
>> qsync test suite. The same for the next test case.
>
>Done.
>
>> > + | ---
>> > + | ...
>> > +f1:status()
>> > + | ---
>> > + | - suspended
>> > + | ...
>> > +test_run:switch('replica')
>> > + | ---
>> > + | - true
>>
>> 4. Better wait until the value is delivered. Otherwise you can switch to
>> replica before master finishes WAL write, and the queue will be empty here.
>> Try
>>
>> test_run:wait_cond(function() return box.space.sync:get{9} ~= nil end)
>
>Agree, added wait_cond().
>
>> > + | ...
>> > +box.ctl.clear_synchro_queue()
>> > + | ---
>> > + | - 0
>> > + | ...
>> > +box.space.sync:select{9}
>> > + | ---
>> > + | - []
>>
>> 5. If select returns 9 before queue clean, and returns empty afterwards,
>> it means the queue was cleared. So here the queue size is not really needed,
>> as you can see.
>
>removed this and next select{} statements
>
>> > + | ...
>> > +test_run:switch('default')
>> > + | ---
>> > + | - true
>> > + | ...
>> > +box.space.sync:select{9}
>> > + | ---
>> > + | - []
>> > + | ...
>> > +f1:status()
>> > + | ---
>> > + | - dead
>> > + | ...
>> > +
>> > +--
>> > +-- gh-4849: clear synchro queue on a master
>>
>> 6. Since the previous test the replica's WAL is different from master's.
>> I am not sure the replication is still alive.
>
>Test with clear_synchro_queue() on master keeps master alive at the end
>of test so I moved it before the same test for the replica. Also added
>note for others that cluster may be in a broken state here.
>
>> > +--
>> > +test_run:switch('default')
>> > + | ---
>> > + | - true
>> > + | ...
>> > +box.cfg{replication_synchro_quorum = 3, replication_synchro_timeout = 2}
>> > + | ---
>> > + | ...
>> > +f1 = fiber.create(box.space.sync.replace, box.space.sync, {10})
>> > + | ---
>> > + | ...
>> > +f1:status()
>> > + | ---
>> > + | - suspended
>> > + | ...
>> > +box.ctl.clear_synchro_queue()
>> > + | ---
>> > + | - -2
>> > + | ...
>> > +box.space.sync:select{10}
>> > + | ---
>> > + | - - [10]
>>
>> 7. Why is it 10? The quorum is not reached, it should have been rolled back.
>
>Added "box.cfg{replication_synchro_timeout = 0.1}" before clear_synchro_queue()
>call and now tx is rolled back. Also I've added checks of current value with
>wait_cond() that is more reliable than select{} on a master and replica.
>Updated output looks like this:
>
>box.ctl.clear_synchro_queue()
> | ---
> | ...
>test_run:switch('replica')
> | ---
> | - true
> | ...
>test_run:wait_cond(function() return box.space.sync:get{10} == nil end)
> | ---
> | - true
> | ...
>test_run:switch('default')
> | ---
> | - true
> | ...
>test_run:wait_cond(function() return f:status() == 'dead' end)
> | ---
> | - true
> | ...
>ok, err
> | ---
> | - false
> | - Quorum collection for a synchronous transaction is timed out
> | ...
>test_run:wait_cond(function() return box.space.sync:get{10} == nil end)
> | ---
> | - true
> | ...
>
>> > + | ...
>> > +test_run:switch('replica')
>> > + | ---
>> > + | - true
>> > + | ...
>> > +box.space.sync:select{10}
>> > + | ---
>> > + | - - [10]
>> > + | ...
>> > +
>> > -- Cleanup.
>> > test_run:cmd('switch default')
>> > | ---
>> > @@ -576,6 +660,7 @@ test_run:cmd('delete server replica')
>> > | ...
>> > box.space.test:drop()
>> > | ---
>> > + | - error: A rollback for a synchronous transaction is received
>>
>> 8. Why is it changed?
>
>Perhaps it is because in previous version of test I haven't wait fibers 'dead'.
>There is no error now:
>
>box.space.test:drop()
> | ---
> | ...
>box.space.sync:drop()
> | ---
> | ...
>
>>
>> > | ...
>> > box.space.sync:drop()
>> > | ---
 
--
Serge Petrenko
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.tarantool.org/pipermail/tarantool-patches/attachments/20200826/757c3467/attachment.html>


More information about the Tarantool-patches mailing list