Tarantool development patches archive
From: Cyrill Gorcunov via Tarantool-patches <tarantool-patches@dev.tarantool.org>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Cc: tml <tarantool-patches@dev.tarantool.org>
Subject: Re: [Tarantool-patches] [PATCH v6 2/3] test: add a test for wal_cleanup_delay option
Date: Tue, 30 Mar 2021 14:55:36 +0300
Message-ID: <YGMRuMVK+HEvMA0f@grain>
In-Reply-To: <YGLRHBzxs1P2Zlsh@grain>

On Tue, Mar 30, 2021 at 10:19:56AM +0300, Cyrill Gorcunov wrote:
> 
> Vlad, let's try another way. I think there are 3 ways to write the check
> 
> 1) assert(is_paused == false)
> 2) assert(not is_paused)
> 3) is_running = not is_paused
>    assert(is_running)
> 
> so which of them should I use?

Here is the final update, which I've pushed out on top. Hopefully it is
the version you prefer. I've included the complete result file to make
review easier.
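
For reference, the three forms quoted above can be compared in plain Lua. This is only a minimal sketch: `fake_gc_info` is a hypothetical stand-in for `box.info.gc()`, which needs a running Tarantool instance, and the `is_paused` field is taken from the test below. The one practical difference is that the explicit comparison fails when the field is nil, while the negated forms pass.

```lua
-- Hypothetical stand-in for box.info.gc(); not part of the patch.
function fake_gc_info(paused)
    return {is_paused = paused}
end

is_paused = fake_gc_info(false).is_paused

-- 1) Explicit comparison: passes only when the field is exactly false.
assert(is_paused == false)
-- 2) Negation: also passes when the field is nil (for example, missing).
assert(not is_paused)
-- 3) Named intermediate: same check as 2), spelled as a positive.
is_running = not is_paused
assert(is_running)
```

Since `assert` returns its argument, each of these yields `true` in a result file when the check holds.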
---
-- test-run result file version 2
--
-- gh-5806: defer xlog cleanup to keep xlogs until
-- replicas present in "_cluster" are connected.
-- Otherwise we get XlogGapError, since the master
-- might run far ahead of the replica, and the
-- replica won't be able to connect without a full
-- rebootstrap.
--

fiber = require('fiber')
 | ---
 | ...
test_run = require('test_run').new()
 | ---
 | ...
engine = test_run:get_cfg('engine')
 | ---
 | ...

--
-- Case 1.
--
-- First let's make sure we get XlogGapError when
-- wal_cleanup_delay is not used.
--

test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
 | ---
 | - true
 | ...
test_run:cmd('start server master')
 | ---
 | - true
 | ...

test_run:switch('master')
 | ---
 | - true
 | ...
box.schema.user.grant('guest', 'replication')
 | ---
 | ...

--
-- Keep a small number of snapshots to make the
-- cleanup procedure more intensive.
box.cfg{checkpoint_count = 1}
 | ---
 | ...

engine = test_run:get_cfg('engine')
 | ---
 | ...
s = box.schema.space.create('test', {engine = engine})
 | ---
 | ...
_ = s:create_index('pk')
 | ---
 | ...

test_run:switch('default')
 | ---
 | - true
 | ...
test_run:cmd('create server replica with rpl_master=master,\
              script="replication/replica.lua"')
 | ---
 | - true
 | ...
test_run:cmd('start server replica')
 | ---
 | - true
 | ...

--
-- On the replica we create its own space, which allows us
-- to use a more complex scenario and prevents the replica
-- from rejoining automatically (a replica can't auto-rejoin
-- if that would lose its own data). This allows us to
-- trigger XlogGapError in the log.
test_run:switch('replica')
 | ---
 | - true
 | ...
box.cfg{checkpoint_count = 1}
 | ---
 | ...
s = box.schema.space.create('testreplica')
 | ---
 | ...
_ = s:create_index('pk')
 | ---
 | ...
box.space.testreplica:insert({1})
 | ---
 | - [1]
 | ...
box.snapshot()
 | ---
 | - ok
 | ...

--
-- Stop the replica node and generate
-- xlogs on the master.
test_run:switch('master')
 | ---
 | - true
 | ...
test_run:cmd('stop server replica')
 | ---
 | - true
 | ...

box.space.test:insert({1})
 | ---
 | - [1]
 | ...
box.snapshot()
 | ---
 | - ok
 | ...

--
-- We need to restart the master node, since otherwise
-- the replica would prevent us from removing the old
-- xlog: it is tracked by a gc consumer, which is kept
-- in memory while the master node is running.
--
-- Once restarted, we write a new record into the master's
-- space and run a snapshot, which removes the old xlog
-- required by the replica to subscribe. This leads to the
-- XlogGapError that we need to test.
test_run:cmd('restart server master')
 | 
box.space.test:insert({2})
 | ---
 | - [2]
 | ...
box.snapshot()
 | ---
 | - ok
 | ...
assert(not box.info.gc().is_paused)
 | ---
 | - true
 | ...

--
-- Start replica and wait for error.
test_run:cmd('start server replica with wait=False, wait_load=False')
 | ---
 | - true
 | ...

--
-- Wait for the error to appear. 60 seconds should be more
-- than enough; usually it happens in a couple of seconds.
test_run:switch('default')
 | ---
 | - true
 | ...
test_run:wait_log('master', 'XlogGapError', nil, 60) ~= nil
 | ---
 | - true
 | ...

--
-- Cleanup.
test_run:cmd('stop server master')
 | ---
 | - true
 | ...
test_run:cmd('cleanup server master')
 | ---
 | - true
 | ...
test_run:cmd('delete server master')
 | ---
 | - true
 | ...
test_run:cmd('stop server replica')
 | ---
 | - true
 | ...
test_run:cmd('cleanup server replica')
 | ---
 | - true
 | ...
test_run:cmd('delete server replica')
 | ---
 | - true
 | ...

--
-- Case 2.
--
-- Let's make sure we don't get XlogGapError when
-- wal_cleanup_delay is used. The code is almost the
-- same as for Case 1, except that we don't disable the
-- cleanup fiber but delay it for up to an hour until
-- the replica is up and running.
--

test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
 | ---
 | - true
 | ...
test_run:cmd('start server master with args="3600"')
 | ---
 | - true
 | ...

test_run:switch('master')
 | ---
 | - true
 | ...
box.schema.user.grant('guest', 'replication')
 | ---
 | ...

box.cfg{checkpoint_count = 1}
 | ---
 | ...

engine = test_run:get_cfg('engine')
 | ---
 | ...
s = box.schema.space.create('test', {engine = engine})
 | ---
 | ...
_ = s:create_index('pk')
 | ---
 | ...

test_run:switch('default')
 | ---
 | - true
 | ...
test_run:cmd('create server replica with rpl_master=master,\
              script="replication/replica.lua"')
 | ---
 | - true
 | ...
test_run:cmd('start server replica')
 | ---
 | - true
 | ...

test_run:switch('replica')
 | ---
 | - true
 | ...
box.cfg{checkpoint_count = 1}
 | ---
 | ...
s = box.schema.space.create('testreplica')
 | ---
 | ...
_ = s:create_index('pk')
 | ---
 | ...
box.space.testreplica:insert({1})
 | ---
 | - [1]
 | ...
box.snapshot()
 | ---
 | - ok
 | ...

test_run:switch('master')
 | ---
 | - true
 | ...
test_run:cmd('stop server replica')
 | ---
 | - true
 | ...

box.space.test:insert({1})
 | ---
 | - [1]
 | ...
box.snapshot()
 | ---
 | - ok
 | ...

test_run:cmd('restart server master with args="3600"')
 | 
box.space.test:insert({2})
 | ---
 | - [2]
 | ...
box.snapshot()
 | ---
 | - ok
 | ...
assert(box.info.gc().is_paused)
 | ---
 | - true
 | ...

test_run:cmd('start server replica')
 | ---
 | - true
 | ...

--
-- Make sure no error happened.
test_run:switch('default')
 | ---
 | - true
 | ...
assert(test_run:grep_log("master", "XlogGapError") == nil)
 | ---
 | - true
 | ...

test_run:cmd('stop server master')
 | ---
 | - true
 | ...
test_run:cmd('cleanup server master')
 | ---
 | - true
 | ...
test_run:cmd('delete server master')
 | ---
 | - true
 | ...
test_run:cmd('stop server replica')
 | ---
 | - true
 | ...
test_run:cmd('cleanup server replica')
 | ---
 | - true
 | ...
test_run:cmd('delete server replica')
 | ---
 | - true
 | ...
--
-- Case 3: Fill _cluster with a replica, then delete
-- the replica so that the master's cleanup stays in the
-- "paused" state, and then simply decrease the timeout
-- to make the cleanup fiber work again.
--
test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
 | ---
 | - true
 | ...
test_run:cmd('start server master with args="3600"')
 | ---
 | - true
 | ...

test_run:switch('master')
 | ---
 | - true
 | ...
box.schema.user.grant('guest', 'replication')
 | ---
 | ...

test_run:switch('default')
 | ---
 | - true
 | ...
test_run:cmd('create server replica with rpl_master=master,\
              script="replication/replica.lua"')
 | ---
 | - true
 | ...
test_run:cmd('start server replica')
 | ---
 | - true
 | ...

test_run:switch('master')
 | ---
 | - true
 | ...
test_run:cmd('stop server replica')
 | ---
 | - true
 | ...
test_run:cmd('cleanup server replica')
 | ---
 | - true
 | ...
test_run:cmd('delete server replica')
 | ---
 | - true
 | ...

test_run:cmd('restart server master with args="3600"')
 | 
assert(box.info.gc().is_paused)
 | ---
 | - true
 | ...

test_run:switch('master')
 | ---
 | - true
 | ...
box.cfg{wal_cleanup_delay = 0.01}
 | ---
 | ...
test_run:wait_cond(function() return not box.info.gc().is_paused end)
 | ---
 | - true
 | ...

test_run:switch('default')
 | ---
 | - true
 | ...
test_run:cmd('stop server master')
 | ---
 | - true
 | ...
test_run:cmd('cleanup server master')
 | ---
 | - true
 | ...
test_run:cmd('delete server master')
 | ---
 | - true
 | ...

--
-- Case 4: Fill _cluster with a replica, then delete
-- the replica so that the master's cleanup stays in the
-- "paused" state, and finally clean up _cluster to kick
-- the cleanup fiber.
--
test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
 | ---
 | - true
 | ...
test_run:cmd('start server master')
 | ---
 | - true
 | ...

test_run:switch('master')
 | ---
 | - true
 | ...
box.schema.user.grant('guest', 'replication')
 | ---
 | ...

test_run:switch('default')
 | ---
 | - true
 | ...
test_run:cmd('create server replica with rpl_master=master,\
              script="replication/replica.lua"')
 | ---
 | - true
 | ...
test_run:cmd('start server replica')
 | ---
 | - true
 | ...

test_run:switch('default')
 | ---
 | - true
 | ...
master_uuid = test_run:eval('master', 'return box.info.uuid')[1]
 | ---
 | ...
replica_uuid = test_run:eval('replica', 'return box.info.uuid')[1]
 | ---
 | ...
master_cluster = test_run:eval('master', 'return box.space._cluster:select()')[1]
 | ---
 | ...
assert(master_cluster[1][2] == master_uuid)
 | ---
 | - true
 | ...
assert(master_cluster[2][2] == replica_uuid)
 | ---
 | - true
 | ...

test_run:cmd('stop server replica')
 | ---
 | - true
 | ...
test_run:cmd('cleanup server replica')
 | ---
 | - true
 | ...
test_run:cmd('delete server replica')
 | ---
 | - true
 | ...

test_run:switch('master')
 | ---
 | - true
 | ...
test_run:cmd('restart server master with args="3600"')
 | 
assert(box.info.gc().is_paused)
 | ---
 | - true
 | ...

--
-- Drop the replica from _cluster and make sure the
-- cleanup fiber is not paused anymore.
test_run:switch('default')
 | ---
 | - true
 | ...
deleted_uuid = test_run:eval('master', 'return box.space._cluster:delete(2)')[1][2]
 | ---
 | ...
assert(replica_uuid == deleted_uuid)
 | ---
 | - true
 | ...

test_run:switch('master')
 | ---
 | - true
 | ...
test_run:wait_cond(function() return not box.info.gc().is_paused end)
 | ---
 | - true
 | ...

test_run:switch('default')
 | ---
 | - true
 | ...
test_run:cmd('stop server master')
 | ---
 | - true
 | ...
test_run:cmd('cleanup server master')
 | ---
 | - true
 | ...
test_run:cmd('delete server master')
 | ---
 | - true
 | ...
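
The test starts the master both with and without `args="3600"`, and the cleanup fiber is paused only in the former case. The contents of `replication/gh-5806-master.lua` are not shown in this message, so the following is only a guess at how such an instance script might map its first argument onto `wal_cleanup_delay`. The helper name `build_master_cfg` and the listen address are hypothetical, and the fallback to 0 (no delay) is an assumption inferred from Case 1.

```lua
-- Hypothetical helper: build the box.cfg table from script arguments.
function build_master_cfg(argv, listen)
    return {
        listen = listen,
        -- Assumed fallback: no argument means no delay, so the
        -- cleanup fiber would not be paused (as Case 1 expects).
        wal_cleanup_delay = tonumber(argv[1]) or 0,
    }
end

cfg = build_master_cfg({'3600'}, 'localhost:3301')
assert(cfg.wal_cleanup_delay == 3600)
-- A real instance script would then pass the table on: box.cfg(cfg)
```

Keeping the argument parsing in a pure function lets it be checked without a running instance.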
