[Tarantool-patches] [PATCH v1] replication: change space sync mode in a loop
Serge Petrenko
sergepetrenko at tarantool.org
Tue Sep 1 15:25:32 MSK 2020
31.08.2020 14:06, Sergey Bronnikov пишет:
> Sergey, thanks for review!
> See my comments inline. Test has been updated in a branch.
>
> On 14:27 Thu 27 Aug , Sergey Petrenko wrote:
>> Hi! Thanks for the patch!
>>
>> Please see my comments below.
>>
>> 26.08.2020 18:10, sergeyb at tarantool.org пишет:
>>> From: Sergey Bronnikov <sergeyb at tarantool.org>
>>>
>>> New regression tests covers cases when one can change synchronous mode
>>> of space to asynchronous and vice versa.
>>>
>>> Closes #5055
>>> Part of #5144
>>> ---
>>>
>>> Branch: ligurio/gh-4842-qsync-change-mode
>>> CI: https://gitlab.com/tarantool/tarantool/-/pipelines/182271234
>>>
>>> test/replication/qsync_sync_mode.result | 164 ++++++++++++++++++++++
>>> test/replication/qsync_sync_mode.test.lua | 90 ++++++++++++
>>> 2 files changed, 254 insertions(+)
>>> create mode 100644 test/replication/qsync_sync_mode.result
>>> create mode 100644 test/replication/qsync_sync_mode.test.lua
>>>
>>> diff --git a/test/replication/qsync_sync_mode.result b/test/replication/qsync_sync_mode.result
>>> new file mode 100644
>>> index 000000000..f2f95ec0f
>>> --- /dev/null
>>> +++ b/test/replication/qsync_sync_mode.result
>>> @@ -0,0 +1,164 @@
>>> +-- test-run result file version 2
>>> +env = require('test_run')
>>> + | ---
>>> + | ...
>>> +test_run = env.new()
>>> + | ---
>>> + | ...
>>> +engine = test_run:get_cfg('engine')
>>> + | ---
>>> + | ...
>>> +fiber = require('fiber')
>>> + | ---
>>> + | ...
>>> +math = require('math')
>>> + | ---
>>> + | ...
>>> +math.randomseed(os.time())
>>> + | ---
>>> + | ...
>>> +
>>> +orig_synchro_quorum = box.cfg.replication_synchro_quorum
>>> + | ---
>>> + | ...
>>> +orig_synchro_timeout = box.cfg.replication_synchro_timeout
>>> + | ---
>>> + | ...
>>> +
>>> +disable_sync_mode = function() \
>>> + local s = box.space._space:get(box.space.sync.id) \
>>> + local new_s = s:update({{'=', 6, {is_sync=false}}}) \
>>> + box.space._space:replace(new_s) \
>>> +end
>>> + | ---
>>> + | ...
>>> +
>>> +enable_sync_mode = function() \
>>> + local s = box.space._space:get(box.space.sync.id) \
>>> + local new_s = s:update({{'=', 6, {is_sync=true}}}) \
>>> + box.space._space:replace(new_s) \
>>> +end
>>> + | ---
>>> + | ...
>>> +
>>> +set_random_sync_mode = function() \
>>> + if (math.random(1, 10) > 5) then \
>>> + enable_sync_mode() \
>>> + else \
>>> + disable_sync_mode() \
>>> + end \
>>> +end
>>> + | ---
>>> + | ...
>>> +
>>> +set_random_quorum = function(n) \
>>> + box.cfg{replication_synchro_quorum=math.random(1, n)} \
>>> +end
>>> + | ---
>>> + | ...
>>> +
>>> +box.schema.user.grant('guest', 'replication')
>>> + | ---
>>> + | ...
>>> +
>>> +-- Setup an async cluster with two instances.
>>> +test_run:cmd('create server replica with rpl_master=default,\
>>> + script="replication/replica.lua"')
>>> + | ---
>>> + | - true
>>> + | ...
>>> +test_run:cmd('start server replica with wait=True, wait_load=True')
>>> + | ---
>>> + | - true
>>> + | ...
>>> +
>>> +-- Write data to a leader, enable and disable sync mode in background in a
>>> +-- loop. Expected no data loss.
>>> +-- Testcase setup.
>>> +_ = box.schema.space.create('sync', {is_sync=true, engine=engine})
>>> + | ---
>>> + | ...
>>> +_ = box.space.sync:create_index('pk')
>>> + | ---
>>> + | ...
>>> +box.cfg{replication_synchro_quorum=2, replication_synchro_timeout=0.001}
>>> + | ---
>>> + | ...
>>> +-- Testcase body.
>>> +for i = 1,10 do \
>>> + set_random_sync_mode() \
>>> + if pcall(box.space.sync.insert, box.space.sync, {i}) then \
>>> + test_run:switch('replica') \
>>> + test_run:wait_cond(function() return box.space.sync:get{i} ~= nil end) \
>>> + end \
>>> + test_run:switch('default') \
>>> +end
>>> + | ---
>>> + | ...
>>> +-- Testcase cleanup.
>>> +test_run:switch('default')
>>> + | ---
>>> + | - true
>>> + | ...
>>> +box.space.sync:drop()
>>> + | ---
>>> + | ...
>>> +
>>> +-- Write data to a leader, enable and disable sync mode and change quorum value
>>> +-- in background in a loop.
>>> +-- Testcase setup.
>>> +_ = box.schema.space.create('sync', {is_sync=true, engine=engine})
>>> + | ---
>>> + | ...
>>> +_ = box.space.sync:create_index('pk')
>>> + | ---
>>> + | ...
>>> +box.cfg{replication_synchro_quorum=2, replication_synchro_timeout=0.001}
>>> + | ---
>>> + | ...
>>> +-- Testcase body.
>>> +for i = 1,10 do \
>>> + set_random_sync_mode() \
>>> + set_random_quorum(5) \
>>> + if pcall(box.space.sync.insert, box.space.sync, {i}) then \
>>> + test_run:switch('replica') \
>>> + test_run:wait_cond(function() return box.space.sync:get{i} ~= nil end) \
>>> + end \
>>> + test_run:switch('default') \
>>> +end
>>> + | ---
>>> + | ...
>>> +-- Testcase cleanup.
>>> +test_run:switch('default')
>>> + | ---
>>> + | - true
>>> + | ...
>>> +box.space.sync:drop()
>>> + | ---
>>> + | ...
>>> +
>>> +-- Teardown.
>>> +test_run:cmd('switch default')
>>> + | ---
>>> + | - true
>>> + | ...
>>> +test_run:cmd('stop server replica')
>>> + | ---
>>> + | - true
>>> + | ...
>>> +test_run:cmd('delete server replica')
>>> + | ---
>>> + | - true
>>> + | ...
>>> +test_run:cleanup_cluster()
>>> + | ---
>>> + | ...
>>> +box.schema.user.revoke('guest', 'replication')
>>> + | ---
>>> + | ...
>>> +box.cfg{ \
>>> + replication_synchro_quorum = orig_synchro_quorum, \
>>> + replication_synchro_timeout = orig_synchro_timeout, \
>>> +}
>>> + | ---
>>> + | ...
>>> diff --git a/test/replication/qsync_sync_mode.test.lua b/test/replication/qsync_sync_mode.test.lua
>>> new file mode 100644
>>> index 000000000..706261c10
>>> --- /dev/null
>>> +++ b/test/replication/qsync_sync_mode.test.lua
>>> @@ -0,0 +1,90 @@
>>> +env = require('test_run')
>>> +test_run = env.new()
>>> +engine = test_run:get_cfg('engine')
>>> +fiber = require('fiber')
>>> +math = require('math')
>>> +math.randomseed(os.time())
>>> +
>>> +orig_synchro_quorum = box.cfg.replication_synchro_quorum
>>> +orig_synchro_timeout = box.cfg.replication_synchro_timeout
>>> +
>>> +disable_sync_mode = function() \
>>> + local s = box.space._space:get(box.space.sync.id) \
>>> + local new_s = s:update({{'=', 6, {is_sync=false}}}) \
>>> + box.space._space:replace(new_s) \
>>> +end
>>> +
>>> +enable_sync_mode = function() \
>>> + local s = box.space._space:get(box.space.sync.id) \
>>> + local new_s = s:update({{'=', 6, {is_sync=true}}}) \
>>> + box.space._space:replace(new_s) \
>>> +end
>>> +
>> Vlad has pushed a patch with `space:alter` a couple of days ago.
>> So now you may say `space:alter{is_sync=true}`, `space:alter{is_sync=false}`
>> It does the same work you do, but looks much simpler.
> used alter in updated patch:
>
> -- Testcase body.
> for i = 1,100 do \
> box.space.sync:alter{is_sync=random_boolean()} \
> box.space.sync:insert{i} \
> test_run:switch('replica') \
> test_run:wait_cond(function() return box.space.sync:get{i} ~= nil end) \
> test_run:switch('default') \
> end
>
>>> +set_random_sync_mode = function() \
>>> + if (math.random(1, 10) > 5) then \
>>> + enable_sync_mode() \
>>> + else \
>>> + disable_sync_mode() \
>>> + end \
>>> +end
>>> +
>>> +set_random_quorum = function(n) \
>>> + box.cfg{replication_synchro_quorum=math.random(1, n)} \
>>> +end
>>> +
>>> +box.schema.user.grant('guest', 'replication')
>>> +
>>> +-- Setup an async cluster with two instances.
>>> +test_run:cmd('create server replica with rpl_master=default,\
>>> + script="replication/replica.lua"')
>>> +test_run:cmd('start server replica with wait=True, wait_load=True')
>>> +
>>> +-- Write data to a leader, enable and disable sync mode in background in a
>>> +-- loop. Expected no data loss.
>> But sync mode is not set in background, it's set in the same loop where
>> insertions happen.
> in updated version I set sync mode just before an insert, description in comment
> has been updated
>
>>> +-- Testcase setup.
>>> +_ = box.schema.space.create('sync', {is_sync=true, engine=engine})
>>> +_ = box.space.sync:create_index('pk')
>>> +box.cfg{replication_synchro_quorum=2, replication_synchro_timeout=0.001}
>> Is such a tiny timeout intended? I see you wrap the errors in a pcall, but
>> still,
>> it may happen that you get a timeout 10 out of 10 times and then you'll
>> test nothing. We usually use 30 second replication timeouts.
> Fixed by setting it to 1000.
>
>>> +-- Testcase body.
>>> +for i = 1,10 do \
>>> + set_random_sync_mode() \
>>> + if pcall(box.space.sync.insert, box.space.sync, {i}) then \
>>> + test_run:switch('replica') \
>>> + test_run:wait_cond(function() return box.space.sync:get{i} ~= nil end) \
>>> + end \
>>> + test_run:switch('default') \
>>> +end
>>> +-- Testcase cleanup.
>>> +test_run:switch('default')
>>> +box.space.sync:drop()
>>> +
>>> +-- Write data to a leader, enable and disable sync mode and change quorum value
>>> +-- in background in a loop.
>>> +-- Testcase setup.
>>> +_ = box.schema.space.create('sync', {is_sync=true, engine=engine})
>>> +_ = box.space.sync:create_index('pk')
>>> +box.cfg{replication_synchro_quorum=2, replication_synchro_timeout=0.001}
>> Ok, I see why you need a small timeout in this testcase. Maybe set a small
>> timeout
>> only when the insertion is expected to fail(quorum = 3 or 4)?. And use a 30
>> sec timeout
>> otherwise.
> I have rewrote tests to consider to cases:
> - positive, when we have achievable quorum, set mode to sync or async on a master
> and then make an insert. Expected successful writes always.
> - negative, when we change mode to sync or async with tx's in a limbo.
>
> So comment above is not relevant now.
Hi! Thanks for the fixes.
The negative case seems pointless, since the change won't happen until
the limbo
gets emptied out. This is because alter{is_sync=...} is a DML operation
which
replaces an entry in _space system space, so it gets stuck on a
non-empty limbo,
just like any DML op.
I suggest you remove it altogether.
Other than that LGTM.
>
>>> +-- Testcase body.
>>> +for i = 1,10 do \
>>> + set_random_sync_mode() \
>>> + set_random_quorum(5) \
>> Math.random() range is inclusive, so it'd be more fair, to choose quorum
>> from 1 to 4.
>> Then there'd be 2 cases of successful write, quorum 1 and 2, and 2 cases of
>> failed write,
>> quorum 3 and 4.
> removed set_random_quorum() in updated tests
>
>>> + if pcall(box.space.sync.insert, box.space.sync, {i}) then \
>>> + test_run:switch('replica') \
>>> + test_run:wait_cond(function() return box.space.sync:get{i} ~= nil end) \
>>> + end \
>>> + test_run:switch('default') \
>>> +end
>>> +-- Testcase cleanup.
>>> +test_run:switch('default')
>>> +box.space.sync:drop()
>>> +
>>> +-- Teardown.
>>> +test_run:cmd('switch default')
>>> +test_run:cmd('stop server replica')
>>> +test_run:cmd('delete server replica')
>>> +test_run:cleanup_cluster()
>>> +box.schema.user.revoke('guest', 'replication')
>>> +box.cfg{ \
>>> + replication_synchro_quorum = orig_synchro_quorum, \
>>> + replication_synchro_timeout = orig_synchro_timeout, \
>>> +}
--
Serge Petrenko
More information about the Tarantool-patches
mailing list