Tarantool development patches archive
* [Tarantool-patches] [PATCH v6 0/3] gc/xlog: delay xlog cleanup until relays are subscribed
@ 2021-03-27 11:13 Cyrill Gorcunov via Tarantool-patches
  2021-03-27 11:13 ` [Tarantool-patches] [PATCH v6 1/3] " Cyrill Gorcunov via Tarantool-patches
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-03-27 11:13 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

Take a look please.

v2:
 - rebase code to the fresh master branch
 - keep wal_cleanup_delay option name
 - pass wal_cleanup_delay as an option to gc_init, so it
   won't be dependent on cfg engine
 - add comment about gc_delay_unref in plain bootstrap mode
 - allow setting wal_cleanup_delay dynamically
 - update comment in gc_wait_cleanup and call it conditionally
 - declare wal_cleanup_delay as a double
 - rename gc.cleanup_is_paused to gc.is_paused and update output
 - do not show ref counter in box.info.gc() output
 - update documentation
 - move gc_delay_unref inside relay_subscribe call which runs
   in tx context (instead of relay's context)
 - update tests:
   - add a comment why we need a temp space on replica node
   - use explicit insert/snapshot operations
    - shrink the number of insert/snapshot operations to speed up testing
   - use "restart" instead of stop/start pair
   - use wait_log helper instead of own function
   - add is_paused test
v3:
 - fix changelog
 - rework box_check_wal_cleanup_delay: the replication_anon
   setting is considered only in box_set_wal_cleanup_delay,
   i.e. when the config is checked and parsed; moreover, the
   option is now processed after the "replication_anon" option
 - the delay cycle now uses a deadline instead of a per-cycle
   calculation
 - use `double` type for timestamp
 - test update
   - verify `.is_paused` value
   - minimize number of inserts
   - no need to use temporary space, regular space works as well
   - add comments on why we should restart the master node
v4:
 - drop the argument from gc_init(): since the delay value is
   configured from the load_cfg.lua script, there is no need to
   read it early; simply start gc paused and unpause it on demand
 - move unpause message to main wait cycle
 - test update:
   - verify tests and fix replication/replica_rejoin since it waits
     for xlogs to be cleaned up too early
   - use 10 seconds for XlogGapError instead of 0.1 second, this is
     a common deadline value
v5:
 - define limits for `wal_cleanup_delay`: it should be either 0,
   or in range [0.001; TIMEOUT_INFINITY]. This is done so that a
   floating-point epsilon is not treated as a meaningful value
 - fix comment about why anon replica is not using delay
 - rework the delayed cleanup cycle
 - test update:
   - update vinyl/replica_rejoin -- we need to disable cleanup
     delay explicitly
   - update replication/replica_rejoin for same reason
    - drop unneeded test_run:switch() calls
   - add a testcase where timeout is decreased and cleanup
     fiber is kicked to run even with stuck replica

v6:
  - test update:
   - replica_rejoin.lua simplified to drop unneeded data
   - update the main test to check whether a _cluster cleanup
     triggers the fiber to run

issue https://github.com/tarantool/tarantool/issues/5806
branch gorcunov/gh-5806-xlog-gc-6

Cyrill Gorcunov (3):
  gc/xlog: delay xlog cleanup until relays are subscribed
  test: add a test for wal_cleanup_delay option
  test: box-tap/gc -- add test for is_paused field

 .../unreleased/add-wal_cleanup_delay.md       |   5 +
 src/box/box.cc                                |  41 ++
 src/box/box.h                                 |   1 +
 src/box/gc.c                                  |  95 ++-
 src/box/gc.h                                  |  36 ++
 src/box/lua/cfg.cc                            |   9 +
 src/box/lua/info.c                            |   4 +
 src/box/lua/load_cfg.lua                      |   5 +
 src/box/relay.cc                              |   1 +
 src/box/replication.cc                        |   2 +
 test/app-tap/init_script.result               |   1 +
 test/box-tap/gc.test.lua                      |   3 +-
 test/box/admin.result                         |   2 +
 test/box/cfg.result                           |   4 +
 test/replication/gh-5806-master.lua           |   8 +
 test/replication/gh-5806-xlog-cleanup.result  | 558 ++++++++++++++++++
 .../replication/gh-5806-xlog-cleanup.test.lua | 234 ++++++++
 test/replication/replica_rejoin.lua           |  11 +
 test/replication/replica_rejoin.result        |  26 +-
 test/replication/replica_rejoin.test.lua      |  19 +-
 test/vinyl/replica_rejoin.lua                 |   5 +-
 test/vinyl/replica_rejoin.result              |  13 +
 test/vinyl/replica_rejoin.test.lua            |   8 +
 23 files changed, 1074 insertions(+), 17 deletions(-)
 create mode 100644 changelogs/unreleased/add-wal_cleanup_delay.md
 create mode 100644 test/replication/gh-5806-master.lua
 create mode 100644 test/replication/gh-5806-xlog-cleanup.result
 create mode 100644 test/replication/gh-5806-xlog-cleanup.test.lua
 create mode 100644 test/replication/replica_rejoin.lua


base-commit: 234472522a924ecf62e27c27e1e29b8803a677cc
-- 
Here is a summary diff relative to v5:

diff --git a/test/replication/gh-5806-xlog-cleanup.result b/test/replication/gh-5806-xlog-cleanup.result
index 523d400a7..da09daf17 100644
--- a/test/replication/gh-5806-xlog-cleanup.result
+++ b/test/replication/gh-5806-xlog-cleanup.result
@@ -29,7 +29,7 @@ test_run:cmd('create server master with script="replication/gh-5806-master.lua"'
  | ---
  | - true
  | ...
-test_run:cmd('start server master with wait=True, wait_load=True')
+test_run:cmd('start server master')
  | ---
  | - true
  | ...
@@ -68,7 +68,7 @@ test_run:cmd('create server replica with rpl_master=master,\
  | ---
  | - true
  | ...
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
  | ---
  | - true
  | ...
@@ -132,7 +132,7 @@ box.snapshot()
 -- space and run snapshot which removes old xlog required
 -- by replica to subscribe leading to XlogGapError which
 -- we need to test.
-test_run:cmd('restart server master with wait_load=True')
+test_run:cmd('restart server master')
  | 
 box.space.test:insert({2})
  | ---
@@ -207,7 +207,7 @@ test_run:cmd('create server master with script="replication/gh-5806-master.lua"'
  | ---
  | - true
  | ...
-test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('start server master with args="3600"')
  | ---
  | - true
  | ...
@@ -243,7 +243,7 @@ test_run:cmd('create server replica with rpl_master=master,\
  | ---
  | - true
  | ...
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
  | ---
  | - true
  | ...
@@ -288,7 +288,7 @@ box.snapshot()
  | - ok
  | ...
 
-test_run:cmd('restart server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('restart server master with args="3600"')
  | 
 box.space.test:insert({2})
  | ---
@@ -303,7 +303,7 @@ assert(box.info.gc().is_paused == true)
  | - true
  | ...
 
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
  | ---
  | - true
  | ...
@@ -354,7 +354,7 @@ test_run:cmd('create server master with script="replication/gh-5806-master.lua"'
  | ---
  | - true
  | ...
-test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('start server master with args="3600"')
  | ---
  | - true
  | ...
@@ -376,7 +376,7 @@ test_run:cmd('create server replica with rpl_master=master,\
  | ---
  | - true
  | ...
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
  | ---
  | - true
  | ...
@@ -398,7 +398,7 @@ test_run:cmd('delete server replica')
  | - true
  | ...
 
-test_run:cmd('restart server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('restart server master with args="3600"')
  | 
 assert(box.info.gc().is_paused == true)
  | ---
@@ -433,3 +433,126 @@ test_run:cmd('delete server master')
  | ---
  | - true
  | ...
+
+--
+-- Case 4: Fill _cluster with replica but then delete
+-- the replica so that master's cleanup leave in "paused"
+-- state, and finally cleanup the _cluster to kick cleanup.
+--
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server replica with rpl_master=master,\
+              script="replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+master_uuid = test_run:eval('master', 'return box.info.uuid')[1]
+ | ---
+ | ...
+replica_uuid = test_run:eval('replica', 'return box.info.uuid')[1]
+ | ---
+ | ...
+master_custer = test_run:eval('master', 'return box.space._cluster:select()')[1]
+ | ---
+ | ...
+assert(master_custer[1][2] == master_uuid)
+ | ---
+ | - true
+ | ...
+assert(master_custer[2][2] == replica_uuid)
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('cleanup server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('restart server master with args="3600"')
+ | 
+assert(box.info.gc().is_paused == true)
+ | ---
+ | - true
+ | ...
+
+--
+-- Drop the replica from _cluster and make sure
+-- cleanup fiber is not paused anymore.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+deleted_uuid = test_run:eval('master', 'return box.space._cluster:delete(2)')[1][2]
+ | ---
+ | ...
+assert(replica_uuid == deleted_uuid)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.gc().is_paused == false end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('cleanup server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server master')
+ | ---
+ | - true
+ | ...
diff --git a/test/replication/gh-5806-xlog-cleanup.test.lua b/test/replication/gh-5806-xlog-cleanup.test.lua
index f16be758a..b65563e7f 100644
--- a/test/replication/gh-5806-xlog-cleanup.test.lua
+++ b/test/replication/gh-5806-xlog-cleanup.test.lua
@@ -19,7 +19,7 @@ engine = test_run:get_cfg('engine')
 --
 
 test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
-test_run:cmd('start server master with wait=True, wait_load=True')
+test_run:cmd('start server master')
 
 test_run:switch('master')
 box.schema.user.grant('guest', 'replication')
@@ -36,7 +36,7 @@ _ = s:create_index('pk')
 test_run:switch('default')
 test_run:cmd('create server replica with rpl_master=master,\
               script="replication/replica.lua"')
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
 
 --
 -- On replica we create an own space which allows us to
@@ -70,7 +70,7 @@ box.snapshot()
 -- space and run snapshot which removes old xlog required
 -- by replica to subscribe leading to XlogGapError which
 -- we need to test.
-test_run:cmd('restart server master with wait_load=True')
+test_run:cmd('restart server master')
 box.space.test:insert({2})
 box.snapshot()
 assert(box.info.gc().is_paused == false)
@@ -105,7 +105,7 @@ test_run:cmd('delete server replica')
 --
 
 test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
-test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('start server master with args="3600"')
 
 test_run:switch('master')
 box.schema.user.grant('guest', 'replication')
@@ -119,7 +119,7 @@ _ = s:create_index('pk')
 test_run:switch('default')
 test_run:cmd('create server replica with rpl_master=master,\
               script="replication/replica.lua"')
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
 
 test_run:switch('replica')
 box.cfg{checkpoint_count = 1}
@@ -134,12 +134,12 @@ test_run:cmd('stop server replica')
 box.space.test:insert({1})
 box.snapshot()
 
-test_run:cmd('restart server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('restart server master with args="3600"')
 box.space.test:insert({2})
 box.snapshot()
 assert(box.info.gc().is_paused == true)
 
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
 
 --
 -- Make sure no error happened.
@@ -160,7 +160,7 @@ test_run:cmd('delete server replica')
 -- cleanup fiber work again.
 --
 test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
-test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('start server master with args="3600"')
 
 test_run:switch('master')
 box.schema.user.grant('guest', 'replication')
@@ -168,14 +168,14 @@ box.schema.user.grant('guest', 'replication')
 test_run:switch('default')
 test_run:cmd('create server replica with rpl_master=master,\
               script="replication/replica.lua"')
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
 
 test_run:switch('master')
 test_run:cmd('stop server replica')
 test_run:cmd('cleanup server replica')
 test_run:cmd('delete server replica')
 
-test_run:cmd('restart server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('restart server master with args="3600"')
 assert(box.info.gc().is_paused == true)
 
 test_run:switch('master')
@@ -186,3 +186,49 @@ test_run:switch('default')
 test_run:cmd('stop server master')
 test_run:cmd('cleanup server master')
 test_run:cmd('delete server master')
+
+--
+-- Case 4: Fill _cluster with replica but then delete
+-- the replica so that master's cleanup leave in "paused"
+-- state, and finally cleanup the _cluster to kick cleanup.
+--
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+test_run:cmd('start server master')
+
+test_run:switch('master')
+box.schema.user.grant('guest', 'replication')
+
+test_run:switch('default')
+test_run:cmd('create server replica with rpl_master=master,\
+              script="replication/replica.lua"')
+test_run:cmd('start server replica')
+
+test_run:switch('default')
+master_uuid = test_run:eval('master', 'return box.info.uuid')[1]
+replica_uuid = test_run:eval('replica', 'return box.info.uuid')[1]
+master_custer = test_run:eval('master', 'return box.space._cluster:select()')[1]
+assert(master_custer[1][2] == master_uuid)
+assert(master_custer[2][2] == replica_uuid)
+
+test_run:cmd('stop server replica')
+test_run:cmd('cleanup server replica')
+test_run:cmd('delete server replica')
+
+test_run:switch('master')
+test_run:cmd('restart server master with args="3600"')
+assert(box.info.gc().is_paused == true)
+
+--
+-- Drop the replica from _cluster and make sure
+-- cleanup fiber is not paused anymore.
+test_run:switch('default')
+deleted_uuid = test_run:eval('master', 'return box.space._cluster:delete(2)')[1][2]
+assert(replica_uuid == deleted_uuid)
+
+test_run:switch('master')
+test_run:wait_cond(function() return box.info.gc().is_paused == false end)
+
+test_run:switch('default')
+test_run:cmd('stop server master')
+test_run:cmd('cleanup server master')
+test_run:cmd('delete server master')
diff --git a/test/replication/replica_rejoin.lua b/test/replication/replica_rejoin.lua
index 76f6e5b75..9c743c52b 100644
--- a/test/replication/replica_rejoin.lua
+++ b/test/replication/replica_rejoin.lua
@@ -1,22 +1,11 @@
 #!/usr/bin/env tarantool
 
-local repl_include_self = arg[1] and arg[1] == 'true' or false
-local repl_list
-
-if repl_include_self then
-    repl_list = {os.getenv("MASTER"), os.getenv("LISTEN")}
-else
-    repl_list = os.getenv("MASTER")
-end
-
 -- Start the console first to allow test-run to attach even before
 -- box.cfg is finished.
 require('console').listen(os.getenv('ADMIN'))
 
 box.cfg({
     listen              = os.getenv("LISTEN"),
-    replication         = repl_list,
-    memtx_memory        = 107374182,
-    replication_timeout = 0.1,
+    replication         = {os.getenv("MASTER"), os.getenv("LISTEN")},
     wal_cleanup_delay   = 0,
 })
diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
index 074cc3e67..843333a19 100644
--- a/test/replication/replica_rejoin.result
+++ b/test/replication/replica_rejoin.result
@@ -47,7 +47,7 @@ test_run:cmd("create server replica with rpl_master=default, script='replication
 ---
 - true
 ...
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 ---
 - true
 ...
@@ -124,7 +124,7 @@ box.cfg{checkpoint_count = checkpoint_count}
 ...
 -- Restart the replica. Since xlogs have been removed,
 -- it is supposed to rejoin without changing id.
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 ---
 - true
 ...
@@ -229,7 +229,7 @@ test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.
 box.cfg{checkpoint_count = checkpoint_count}
 ---
 ...
-test_run:cmd("start server replica with args='true', wait=False")
+test_run:cmd("start server replica with wait=False")
 ---
 - true
 ...
@@ -271,7 +271,7 @@ test_run:cleanup_cluster()
 box.space.test:truncate()
 ---
 ...
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 ---
 - true
 ...
diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
index 223316d86..c3ba9bf3f 100644
--- a/test/replication/replica_rejoin.test.lua
+++ b/test/replication/replica_rejoin.test.lua
@@ -24,7 +24,7 @@ _ = box.space.test:insert{3}
 
 -- Join a replica, then stop it.
 test_run:cmd("create server replica with rpl_master=default, script='replication/replica_rejoin.lua'")
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 test_run:cmd("switch replica")
 box.info.replication[1].upstream.status == 'follow' or log.error(box.info)
 box.space.test:select()
@@ -53,7 +53,7 @@ box.cfg{checkpoint_count = checkpoint_count}
 
 -- Restart the replica. Since xlogs have been removed,
 -- it is supposed to rejoin without changing id.
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 box.info.replication[2].downstream.vclock ~= nil or log.error(box.info)
 test_run:cmd("switch replica")
 box.info.replication[1].upstream.status == 'follow' or log.error(box.info)
@@ -88,7 +88,7 @@ for i = 1, 3 do box.space.test:insert{i * 100} end
 fio = require('fio')
 test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
 box.cfg{checkpoint_count = checkpoint_count}
-test_run:cmd("start server replica with args='true', wait=False")
+test_run:cmd("start server replica with wait=False")
 test_run:cmd("switch replica")
 test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
 box.space.test:select()
@@ -104,7 +104,7 @@ test_run:cmd("stop server replica")
 test_run:cmd("cleanup server replica")
 test_run:cleanup_cluster()
 box.space.test:truncate()
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 -- Subscribe the master to the replica.
 replica_listen = test_run:cmd("eval replica 'return box.cfg.listen'")
 replica_listen ~= nil

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH v6 1/3] gc/xlog: delay xlog cleanup until relays are subscribed
  2021-03-27 11:13 [Tarantool-patches] [PATCH v6 0/3] gc/xlog: delay xlog cleanup until relays are subscribed Cyrill Gorcunov via Tarantool-patches
@ 2021-03-27 11:13 ` Cyrill Gorcunov via Tarantool-patches
  2021-03-27 11:13 ` [Tarantool-patches] [PATCH v6 2/3] test: add a test for wal_cleanup_delay option Cyrill Gorcunov via Tarantool-patches
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-03-27 11:13 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

If a replica happens to be far behind the master node (so that
a number of xlog files remain after the master's last snapshot),
then once the master node is restarted it may clean up the xlogs
needed by the replica to subscribe quickly, and the replica will
instead have to rejoin, re-reading a large amount of data.

Let's try to address this by delaying xlog file cleanup until
the replicas have subscribed and the relays are up and running.
For this purpose the cleanup fiber starts out spinning in a nop
cycle ("paused" mode) and waits on a delay counter which each
relay decrements as its replica subscribes.

This implies that if the `_cluster` system space is not empty
upon restart and a registered replica has vanished completely
and will never come back, then the node administrator has to
drop this replica from `_cluster` manually.

Note that this delayed cleanup start doesn't prevent the WAL
engine from removing old files if there is no space left on the
storage device: the WAL will simply drop old data without asking.

Since some administrators might not need this functionality at
all, we introduce the "wal_cleanup_delay" configuration option
which allows enabling or disabling the delay.
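
For example, a minimal usage sketch (the values here are purely
illustrative, not recommendations):

``` Lua
-- Wait up to one hour for the replicas registered in _cluster
-- to subscribe before resuming xlog cleanup after a restart.
box.cfg{wal_cleanup_delay = 3600}

-- The option is dynamic: dropping it to 0 at runtime resumes
-- the cleanup fiber without waiting for the remaining replicas.
box.cfg{wal_cleanup_delay = 0}
```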

Closes #5806

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

@TarantoolBot document
Title: Add wal_cleanup_delay configuration parameter

The `wal_cleanup_delay` option defines a delay in seconds before
write-ahead log files (`*.xlog`) start being pruned after a node
restart.

This option is ignored if a node is running as an anonymous
replica (`replication_anon = true`). Similarly, if replication
is not used or there are no plans to use it at all, then this
option is irrelevant.

The original problem to solve is the case where a node operates
so fast that its replicas cannot keep up with its state, and if
the node is restarted at this moment (for various reasons, for
example due to a power outage), `*.xlog` files might be pruned
during restart. As a result replicas will not find these files
on the main node and will have to re-read all the data back,
which is a very expensive procedure.

Since replicas are tracked via the `_cluster` system space, we
use its content to count subscribed replicas, and once all of
them are up and running the cleanup procedure is enabled
automatically even if `wal_cleanup_delay` has not expired yet.

The `wal_cleanup_delay` option should be set to:

 - `0` to disable the cleanup delay;
 - `> 0` to wait for the specified number of seconds.

By default it is set to `14400` seconds (i.e. `4` hours).

If a registered replica is lost forever and the timeout is set
to infinity, then the preferred way to enable the cleanup
procedure is not to set a small timeout value but rather to
delete this replica from the `_cluster` space manually.
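
A sketch of that manual step, assuming the stale replica got id
`2` in `_cluster` (check the actual id first):

``` Lua
-- Each _cluster tuple is {id, uuid}; find the stale replica.
box.space._cluster:select()

-- Dropping it lets the paused cleanup fiber resume, since it
-- no longer has to wait for this replica to subscribe.
box.space._cluster:delete(2)
```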

Note that the option does *not* prevent the WAL engine from
removing old `*.xlog` files if there is no space left on the
storage device; in that case the WAL engine removes them forcibly.

The current state of the `*.xlog` garbage collector can be found
in the `box.info.gc()` output. For example:

``` Lua
 tarantool> box.info.gc()
 ---
   ...
   is_paused: false
```

The `is_paused` field shows whether the cleanup fiber is paused.
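
A small sketch for watching it from the console, e.g. to block
until the cleanup fiber has been resumed (illustrative only):

``` Lua
fiber = require('fiber')
-- Poll once a second until the delayed cleanup is resumed.
while box.info.gc().is_paused do
    fiber.sleep(1)
end
```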
---
 .../unreleased/add-wal_cleanup_delay.md       |  5 +
 src/box/box.cc                                | 41 ++++++++
 src/box/box.h                                 |  1 +
 src/box/gc.c                                  | 95 ++++++++++++++++++-
 src/box/gc.h                                  | 36 +++++++
 src/box/lua/cfg.cc                            |  9 ++
 src/box/lua/info.c                            |  4 +
 src/box/lua/load_cfg.lua                      |  5 +
 src/box/relay.cc                              |  1 +
 src/box/replication.cc                        |  2 +
 test/app-tap/init_script.result               |  1 +
 test/box/admin.result                         |  2 +
 test/box/cfg.result                           |  4 +
 test/replication/replica_rejoin.lua           | 11 +++
 test/replication/replica_rejoin.result        | 26 ++++-
 test/replication/replica_rejoin.test.lua      | 19 +++-
 test/vinyl/replica_rejoin.lua                 |  5 +-
 test/vinyl/replica_rejoin.result              | 13 +++
 test/vinyl/replica_rejoin.test.lua            |  8 ++
 19 files changed, 272 insertions(+), 16 deletions(-)
 create mode 100644 changelogs/unreleased/add-wal_cleanup_delay.md
 create mode 100644 test/replication/replica_rejoin.lua

diff --git a/changelogs/unreleased/add-wal_cleanup_delay.md b/changelogs/unreleased/add-wal_cleanup_delay.md
new file mode 100644
index 000000000..2e67f0a4e
--- /dev/null
+++ b/changelogs/unreleased/add-wal_cleanup_delay.md
@@ -0,0 +1,5 @@
+## bugfix/core
+
+* Introduce `wal_cleanup_delay` option to prevent early cleanup
+  of `*.xlog` files which are needed by replicas and lead to
+  `XlogGapError` (gh-5806).
diff --git a/src/box/box.cc b/src/box/box.cc
index cc59564e1..e69b7b2ff 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -771,6 +771,19 @@ box_check_wal_queue_max_size(void)
 	return size;
 }
 
+static double
+box_check_wal_cleanup_delay(void)
+{
+	double value = cfg_getd("wal_cleanup_delay");
+	if (value < 0) {
+		diag_set(ClientError, ER_CFG, "wal_cleanup_delay",
+			 "value must be >= 0");
+		return -1;
+	}
+
+	return value;
+}
+
 static void
 box_check_readahead(int readahead)
 {
@@ -918,6 +931,8 @@ box_check_config(void)
 	box_check_wal_mode(cfg_gets("wal_mode"));
 	if (box_check_wal_queue_max_size() < 0)
 		diag_raise();
+	if (box_check_wal_cleanup_delay() < 0)
+		diag_raise();
 	if (box_check_memory_quota("memtx_memory") < 0)
 		diag_raise();
 	box_check_memtx_min_tuple_size(cfg_geti64("memtx_min_tuple_size"));
@@ -1465,6 +1480,23 @@ box_set_wal_queue_max_size(void)
 	return 0;
 }
 
+int
+box_set_wal_cleanup_delay(void)
+{
+	double delay = box_check_wal_cleanup_delay();
+	if (delay < 0)
+		return -1;
+	/*
+	 * Anonymous replicas do not require
+	 * delay since they can't be a source
+	 * of replication.
+	 */
+	if (replication_anon)
+		delay = 0;
+	gc_set_wal_cleanup_delay(delay);
+	return 0;
+}
+
 void
 box_set_vinyl_memory(void)
 {
@@ -3076,6 +3108,15 @@ box_cfg_xc(void)
 	}
 	fiber_gc();
 
+	/*
+	 * Exclude self from GC delay because we care
+	 * about remote replicas only, still for ref/unref
+	 * balance we do reference self node initially and
+	 * downgrade it to zero when there is no replication
+	 * set at all.
+	 */
+	gc_delay_unref();
+
 	bootstrap_journal_guard.is_active = false;
 	assert(current_journal != &bootstrap_journal);
 
diff --git a/src/box/box.h b/src/box/box.h
index 65215b087..e2321b9b0 100644
--- a/src/box/box.h
+++ b/src/box/box.h
@@ -243,6 +243,7 @@ void box_set_checkpoint_count(void);
 void box_set_checkpoint_interval(void);
 void box_set_checkpoint_wal_threshold(void);
 int box_set_wal_queue_max_size(void);
+int box_set_wal_cleanup_delay(void);
 void box_set_memtx_memory(void);
 void box_set_memtx_max_tuple_size(void);
 void box_set_vinyl_memory(void);
diff --git a/src/box/gc.c b/src/box/gc.c
index 9af4ef958..10f899923 100644
--- a/src/box/gc.c
+++ b/src/box/gc.c
@@ -107,6 +107,11 @@ gc_init(void)
 	/* Don't delete any files until recovery is complete. */
 	gc.min_checkpoint_count = INT_MAX;
 
+	gc.wal_cleanup_delay = TIMEOUT_INFINITY;
+	gc.delay_ref = 0;
+	gc.is_paused = true;
+	say_info("wal/engine cleanup is paused");
+
 	vclock_create(&gc.vclock);
 	rlist_create(&gc.checkpoints);
 	gc_tree_new(&gc.consumers);
@@ -238,6 +243,51 @@ static int
 gc_cleanup_fiber_f(va_list ap)
 {
 	(void)ap;
+
+	/*
+	 * Stage 1 (optional): in case if we're booting
+	 * up with cleanup disabled lets do wait in a
+	 * separate cycle to minimize branching on stage 2.
+	 */
+	if (gc.is_paused) {
+		double start_time = fiber_clock();
+		double timeout = gc.wal_cleanup_delay;
+		while (!fiber_is_cancelled()) {
+			if (fiber_yield_timeout(timeout)) {
+				say_info("wal/engine cleanup is resumed "
+					 "due to timeout expiration");
+				gc.is_paused = false;
+				gc.delay_ref = 0;
+				break;
+			}
+
+			/*
+			 * If a last reference is dropped
+			 * we can exit out early.
+			 */
+			if (!gc.is_paused) {
+				say_info("wal/engine cleanup is resumed");
+				break;
+			}
+
+			/*
+			 * Woken up to update the timeout.
+			 */
+			double elapsed = fiber_clock() - start_time;
+			if (elapsed >= gc.wal_cleanup_delay) {
+				say_info("wal/engine cleanup is resumed "
+					 "due to timeout manual update");
+				gc.is_paused = false;
+				gc.delay_ref = 0;
+				break;
+			}
+			timeout = gc.wal_cleanup_delay - elapsed;
+		}
+	}
+
+	/*
+	 * Stage 2: a regular cleanup cycle.
+	 */
 	while (!fiber_is_cancelled()) {
 		int64_t delta = gc.cleanup_scheduled - gc.cleanup_completed;
 		if (delta == 0) {
@@ -253,6 +303,42 @@ gc_cleanup_fiber_f(va_list ap)
 	return 0;
 }
 
+void
+gc_set_wal_cleanup_delay(double wal_cleanup_delay)
+{
+	gc.wal_cleanup_delay = wal_cleanup_delay;
+	/*
+	 * This routine may be called at arbitrary
+	 * moment thus we must be sure the cleanup
+	 * fiber is paused to not wake up it when
+	 * it is already in a regular cleanup stage.
+	 */
+	if (gc.is_paused)
+		fiber_wakeup(gc.cleanup_fiber);
+}
+
+void
+gc_delay_ref(void)
+{
+	if (gc.is_paused) {
+		assert(gc.delay_ref >= 0);
+		gc.delay_ref++;
+	}
+}
+
+void
+gc_delay_unref(void)
+{
+	if (gc.is_paused) {
+		assert(gc.delay_ref > 0);
+		gc.delay_ref--;
+		if (gc.delay_ref == 0) {
+			gc.is_paused = false;
+			fiber_wakeup(gc.cleanup_fiber);
+		}
+	}
+}
+
 /**
  * Trigger asynchronous garbage collection.
  */
@@ -462,11 +548,12 @@ gc_checkpoint(void)
 	 * Wait for background garbage collection that might
 	 * have been triggered by this checkpoint to complete.
 	 * Strictly speaking, it isn't necessary, but it
-	 * simplifies testing as it guarantees that by the
-	 * time box.snapshot() returns, all outdated checkpoint
-	 * files have been removed.
+	 * simplifies testing. Same time if GC is paused and
+	 * waiting for old XLOGs to be read by replicas the
+	 * cleanup won't happen immediately after the checkpoint.
 	 */
-	gc_wait_cleanup();
+	if (!gc.is_paused)
+		gc_wait_cleanup();
 	return 0;
 }
 
diff --git a/src/box/gc.h b/src/box/gc.h
index 2a568c5f9..f5bc26e87 100644
--- a/src/box/gc.h
+++ b/src/box/gc.h
@@ -147,6 +147,24 @@ struct gc_state {
 	 * taken at that moment of time.
 	 */
 	int64_t cleanup_completed, cleanup_scheduled;
+	/**
+	 * A counter to wait until all replicas are managed to
+	 * subscribe so that we can enable cleanup fiber to
+	 * remove old XLOGs. Otherwise some replicas might be
+	 * far behind the master node and after the master
+	 * node been restarted they will have to reread all
+	 * data back due to XlogGapError, ie too early deleted
+	 * XLOGs.
+	 */
+	int64_t delay_ref;
+	/**
+	 * Delay timeout in seconds.
+	 */
+	double wal_cleanup_delay;
+	/**
+	 * When set the cleanup fiber is paused.
+	 */
+	bool is_paused;
 	/**
 	 * Set if there's a fiber making a checkpoint right now.
 	 */
@@ -206,6 +224,24 @@ gc_init(void);
 void
 gc_free(void);
 
+/**
+ * Set a new delay value.
+ */
+void
+gc_set_wal_cleanup_delay(double wal_cleanup_delay);
+
+/**
+ * Increment a reference to delay counter.
+ */
+void
+gc_delay_ref(void);
+
+/**
+ * Decrement a reference from the delay counter.
+ */
+void
+gc_delay_unref(void);
+
 /**
  * Advance the garbage collector vclock to the given position.
  * Deactivate WAL consumers that need older data.
diff --git a/src/box/lua/cfg.cc b/src/box/lua/cfg.cc
index b37a93ed8..1142e2726 100644
--- a/src/box/lua/cfg.cc
+++ b/src/box/lua/cfg.cc
@@ -172,6 +172,14 @@ lbox_cfg_set_wal_queue_max_size(struct lua_State *L)
 	return 0;
 }
 
+static int
+lbox_cfg_set_wal_cleanup_delay(struct lua_State *L)
+{
+	if (box_set_wal_cleanup_delay() < 0)
+		luaT_error(L);
+	return 0;
+}
+
 static int
 lbox_cfg_set_read_only(struct lua_State *L)
 {
@@ -408,6 +416,7 @@ box_lua_cfg_init(struct lua_State *L)
 		{"cfg_set_checkpoint_interval", lbox_cfg_set_checkpoint_interval},
 		{"cfg_set_checkpoint_wal_threshold", lbox_cfg_set_checkpoint_wal_threshold},
 		{"cfg_set_wal_queue_max_size", lbox_cfg_set_wal_queue_max_size},
+		{"cfg_set_wal_cleanup_delay", lbox_cfg_set_wal_cleanup_delay},
 		{"cfg_set_read_only", lbox_cfg_set_read_only},
 		{"cfg_set_memtx_memory", lbox_cfg_set_memtx_memory},
 		{"cfg_set_memtx_max_tuple_size", lbox_cfg_set_memtx_max_tuple_size},
diff --git a/src/box/lua/info.c b/src/box/lua/info.c
index 1e0c0148a..8cd379756 100644
--- a/src/box/lua/info.c
+++ b/src/box/lua/info.c
@@ -445,6 +445,10 @@ lbox_info_gc_call(struct lua_State *L)
 	lua_pushboolean(L, gc.checkpoint_is_in_progress);
 	lua_settable(L, -3);
 
+	lua_pushstring(L, "is_paused");
+	lua_pushboolean(L, gc.is_paused);
+	lua_settable(L, -3);
+
 	lua_pushstring(L, "checkpoints");
 	lua_newtable(L);
 
diff --git a/src/box/lua/load_cfg.lua b/src/box/lua/load_cfg.lua
index 9967f992e..44bb95ed1 100644
--- a/src/box/lua/load_cfg.lua
+++ b/src/box/lua/load_cfg.lua
@@ -73,6 +73,7 @@ local default_cfg = {
     wal_max_size        = 256 * 1024 * 1024,
     wal_dir_rescan_delay= 2,
     wal_queue_max_size  = 16 * 1024 * 1024,
+    wal_cleanup_delay   = 4 * 3600,
     force_recovery      = false,
     replication         = nil,
     instance_uuid       = nil,
@@ -155,6 +156,7 @@ local template_cfg = {
     wal_mode            = 'string',
     wal_max_size        = 'number',
     wal_dir_rescan_delay= 'number',
+    wal_cleanup_delay   = 'number',
     force_recovery      = 'boolean',
     replication         = 'string, number, table',
     instance_uuid       = 'string',
@@ -289,6 +291,7 @@ local dynamic_cfg = {
     feedback_interval       = ifdef_feedback_set_params,
     -- do nothing, affects new replicas, which query this value on start
     wal_dir_rescan_delay    = function() end,
+    wal_cleanup_delay       = private.cfg_set_wal_cleanup_delay,
     custom_proc_title       = function()
         require('title').update(box.cfg.custom_proc_title)
     end,
@@ -349,6 +352,8 @@ local dynamic_cfg_order = {
     -- the new one. This should be fixed when box.cfg is able to
     -- apply some parameters together and atomically.
     replication_anon        = 250,
+    -- Cleanup delay should be ignored if replication_anon is set.
+    wal_cleanup_delay       = 260,
     election_mode           = 300,
     election_timeout        = 320,
 }
diff --git a/src/box/relay.cc b/src/box/relay.cc
index 41f949e8e..6edee86bf 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -824,6 +824,7 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync,
 						   tt_uuid_str(&replica->uuid));
 		if (replica->gc == NULL)
 			diag_raise();
+		gc_delay_unref();
 	}
 
 	relay_start(relay, fd, sync, relay_send_row);
diff --git a/src/box/replication.cc b/src/box/replication.cc
index 1fa8843e7..aefb812b3 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -250,6 +250,7 @@ replica_set_id(struct replica *replica, uint32_t replica_id)
 						   tt_uuid_str(&replica->uuid));
 	}
 	replicaset.replica_by_id[replica_id] = replica;
+	gc_delay_ref();
 	++replicaset.registered_count;
 	say_info("assigned id %d to replica %s",
 		 replica->id, tt_uuid_str(&replica->uuid));
@@ -273,6 +274,7 @@ replica_clear_id(struct replica *replica)
 	replicaset.replica_by_id[replica->id] = NULL;
 	assert(replicaset.registered_count > 0);
 	--replicaset.registered_count;
+	gc_delay_unref();
 	if (replica->id == instance_id) {
 		/* See replica_check_id(). */
 		assert(replicaset.is_joining);
diff --git a/test/app-tap/init_script.result b/test/app-tap/init_script.result
index 74cdb3efb..1fdd9a227 100644
--- a/test/app-tap/init_script.result
+++ b/test/app-tap/init_script.result
@@ -53,6 +53,7 @@ vinyl_run_count_per_level:2
 vinyl_run_size_ratio:3.5
 vinyl_timeout:60
 vinyl_write_threads:4
+wal_cleanup_delay:14400
 wal_dir:.
 wal_dir_rescan_delay:2
 wal_max_size:268435456
diff --git a/test/box/admin.result b/test/box/admin.result
index b619ae6cb..f8e8808e3 100644
--- a/test/box/admin.result
+++ b/test/box/admin.result
@@ -127,6 +127,8 @@ cfg_filter(box.cfg)
     - 60
   - - vinyl_write_threads
     - 4
+  - - wal_cleanup_delay
+    - 14400
   - - wal_dir
     - <hidden>
   - - wal_dir_rescan_delay
diff --git a/test/box/cfg.result b/test/box/cfg.result
index 2c3d5a981..693c1b521 100644
--- a/test/box/cfg.result
+++ b/test/box/cfg.result
@@ -115,6 +115,8 @@ cfg_filter(box.cfg)
  |     - 60
  |   - - vinyl_write_threads
  |     - 4
+ |   - - wal_cleanup_delay
+ |     - 14400
  |   - - wal_dir
  |     - <hidden>
  |   - - wal_dir_rescan_delay
@@ -234,6 +236,8 @@ cfg_filter(box.cfg)
  |     - 60
  |   - - vinyl_write_threads
  |     - 4
+ |   - - wal_cleanup_delay
+ |     - 14400
  |   - - wal_dir
  |     - <hidden>
  |   - - wal_dir_rescan_delay
diff --git a/test/replication/replica_rejoin.lua b/test/replication/replica_rejoin.lua
new file mode 100644
index 000000000..9c743c52b
--- /dev/null
+++ b/test/replication/replica_rejoin.lua
@@ -0,0 +1,11 @@
+#!/usr/bin/env tarantool
+
+-- Start the console first to allow test-run to attach even before
+-- box.cfg is finished.
+require('console').listen(os.getenv('ADMIN'))
+
+box.cfg({
+    listen              = os.getenv("LISTEN"),
+    replication         = {os.getenv("MASTER"), os.getenv("LISTEN")},
+    wal_cleanup_delay   = 0,
+})
diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
index 4bc3df2ca..843333a19 100644
--- a/test/replication/replica_rejoin.result
+++ b/test/replication/replica_rejoin.result
@@ -14,6 +14,13 @@ test_run:cleanup_cluster()
 ---
 ...
 --
+-- gh-5806: this replica_rejoin test relies on the wal cleanup fiber
+-- been disabled thus lets turn it off explicitly every time we restart
+-- the main node.
+box.cfg{wal_cleanup_delay = 0}
+---
+...
+--
 -- gh-461: check that a replica refetches the last checkpoint
 -- in case it fell behind the master.
 --
@@ -36,11 +43,11 @@ _ = box.space.test:insert{3}
 ---
 ...
 -- Join a replica, then stop it.
-test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
+test_run:cmd("create server replica with rpl_master=default, script='replication/replica_rejoin.lua'")
 ---
 - true
 ...
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 ---
 - true
 ...
@@ -69,6 +76,9 @@ test_run:cmd("stop server replica")
 -- Restart the server to purge the replica from
 -- the garbage collection state.
 test_run:cmd("restart server default")
+box.cfg{wal_cleanup_delay = 0}
+---
+...
 -- Make some checkpoints to remove old xlogs.
 checkpoint_count = box.cfg.checkpoint_count
 ---
@@ -114,7 +124,7 @@ box.cfg{checkpoint_count = checkpoint_count}
 ...
 -- Restart the replica. Since xlogs have been removed,
 -- it is supposed to rejoin without changing id.
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 ---
 - true
 ...
@@ -190,6 +200,9 @@ test_run:cmd("stop server replica")
 - true
 ...
 test_run:cmd("restart server default")
+box.cfg{wal_cleanup_delay = 0}
+---
+...
 checkpoint_count = box.cfg.checkpoint_count
 ---
 ...
@@ -216,7 +229,7 @@ test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.
 box.cfg{checkpoint_count = checkpoint_count}
 ---
 ...
-test_run:cmd("start server replica with args='true', wait=False")
+test_run:cmd("start server replica with wait=False")
 ---
 - true
 ...
@@ -258,7 +271,7 @@ test_run:cleanup_cluster()
 box.space.test:truncate()
 ---
 ...
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 ---
 - true
 ...
@@ -313,6 +326,9 @@ test_run:cmd("switch default")
 - true
 ...
 test_run:cmd("restart server default")
+box.cfg{wal_cleanup_delay = 0}
+---
+...
 replica_listen = test_run:cmd("eval replica 'return box.cfg.listen'")
 ---
 ...
diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
index 494cbbcb7..c3ba9bf3f 100644
--- a/test/replication/replica_rejoin.test.lua
+++ b/test/replication/replica_rejoin.test.lua
@@ -5,6 +5,12 @@ engine = test_run:get_cfg('engine')
 
 test_run:cleanup_cluster()
 
+--
+-- gh-5806: this replica_rejoin test relies on the wal cleanup fiber
+-- been disabled thus lets turn it off explicitly every time we restart
+-- the main node.
+box.cfg{wal_cleanup_delay = 0}
+
 --
 -- gh-461: check that a replica refetches the last checkpoint
 -- in case it fell behind the master.
@@ -17,8 +23,8 @@ _ = box.space.test:insert{2}
 _ = box.space.test:insert{3}
 
 -- Join a replica, then stop it.
-test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("create server replica with rpl_master=default, script='replication/replica_rejoin.lua'")
+test_run:cmd("start server replica")
 test_run:cmd("switch replica")
 box.info.replication[1].upstream.status == 'follow' or log.error(box.info)
 box.space.test:select()
@@ -28,6 +34,7 @@ test_run:cmd("stop server replica")
 -- Restart the server to purge the replica from
 -- the garbage collection state.
 test_run:cmd("restart server default")
+box.cfg{wal_cleanup_delay = 0}
 
 -- Make some checkpoints to remove old xlogs.
 checkpoint_count = box.cfg.checkpoint_count
@@ -46,7 +53,7 @@ box.cfg{checkpoint_count = checkpoint_count}
 
 -- Restart the replica. Since xlogs have been removed,
 -- it is supposed to rejoin without changing id.
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 box.info.replication[2].downstream.vclock ~= nil or log.error(box.info)
 test_run:cmd("switch replica")
 box.info.replication[1].upstream.status == 'follow' or log.error(box.info)
@@ -72,6 +79,7 @@ box.space.test:replace{1, 2, 3} -- bumps LSN on the replica
 test_run:cmd("switch default")
 test_run:cmd("stop server replica")
 test_run:cmd("restart server default")
+box.cfg{wal_cleanup_delay = 0}
 checkpoint_count = box.cfg.checkpoint_count
 box.cfg{checkpoint_count = 1}
 for i = 1, 3 do box.space.test:delete{i * 10} end
@@ -80,7 +88,7 @@ for i = 1, 3 do box.space.test:insert{i * 100} end
 fio = require('fio')
 test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
 box.cfg{checkpoint_count = checkpoint_count}
-test_run:cmd("start server replica with args='true', wait=False")
+test_run:cmd("start server replica with wait=False")
 test_run:cmd("switch replica")
 test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
 box.space.test:select()
@@ -96,7 +104,7 @@ test_run:cmd("stop server replica")
 test_run:cmd("cleanup server replica")
 test_run:cleanup_cluster()
 box.space.test:truncate()
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 -- Subscribe the master to the replica.
 replica_listen = test_run:cmd("eval replica 'return box.cfg.listen'")
 replica_listen ~= nil
@@ -116,6 +124,7 @@ _ = test_run:wait_vclock('default', vclock)
 -- Restart the master and force garbage collection.
 test_run:cmd("switch default")
 test_run:cmd("restart server default")
+box.cfg{wal_cleanup_delay = 0}
 replica_listen = test_run:cmd("eval replica 'return box.cfg.listen'")
 replica_listen ~= nil
 box.cfg{replication = replica_listen}
diff --git a/test/vinyl/replica_rejoin.lua b/test/vinyl/replica_rejoin.lua
index 7cb7e09a4..f97209c44 100644
--- a/test/vinyl/replica_rejoin.lua
+++ b/test/vinyl/replica_rejoin.lua
@@ -6,8 +6,9 @@ if arg[1] == 'disable_replication' then
 end
 
 box.cfg({
-    replication     = replication,
-    vinyl_memory    = 1024 * 1024,
+    replication         = replication,
+    vinyl_memory        = 1024 * 1024,
+    wal_cleanup_delay   = 0,
 })
 
 require('console').listen(os.getenv('ADMIN'))
diff --git a/test/vinyl/replica_rejoin.result b/test/vinyl/replica_rejoin.result
index 6e8156fee..cb38eaee7 100644
--- a/test/vinyl/replica_rejoin.result
+++ b/test/vinyl/replica_rejoin.result
@@ -5,6 +5,13 @@ test_run = env.new()
 ---
 ...
 --
+-- gh-5806: this replica_rejoin test relies on the wal cleanup fiber
+-- been disabled thus lets turn it off explicitly every time we restart
+-- the main node.
+box.cfg{wal_cleanup_delay = 0}
+---
+...
+--
 -- gh-461: check that garbage collection works as expected
 -- after rebootstrap.
 --
@@ -54,6 +61,9 @@ test_run:cmd("stop server replica")
 ...
 -- Invoke garbage collector on the master.
 test_run:cmd("restart server default")
+box.cfg{wal_cleanup_delay = 0}
+---
+...
 checkpoint_count = box.cfg.checkpoint_count
 ---
 ...
@@ -123,6 +133,9 @@ test_run:cmd("stop server replica")
 ...
 -- Invoke garbage collector on the master.
 test_run:cmd("restart server default")
+box.cfg{wal_cleanup_delay = 0}
+---
+...
 checkpoint_count = box.cfg.checkpoint_count
 ---
 ...
diff --git a/test/vinyl/replica_rejoin.test.lua b/test/vinyl/replica_rejoin.test.lua
index cbb17ef21..f52859121 100644
--- a/test/vinyl/replica_rejoin.test.lua
+++ b/test/vinyl/replica_rejoin.test.lua
@@ -1,6 +1,12 @@
 env = require('test_run')
 test_run = env.new()
 
+--
+-- gh-5806: this replica_rejoin test relies on the wal cleanup fiber
+-- been disabled thus lets turn it off explicitly every time we restart
+-- the main node.
+box.cfg{wal_cleanup_delay = 0}
+
 --
 -- gh-461: check that garbage collection works as expected
 -- after rebootstrap.
@@ -23,6 +29,7 @@ test_run:cmd("stop server replica")
 
 -- Invoke garbage collector on the master.
 test_run:cmd("restart server default")
+box.cfg{wal_cleanup_delay = 0}
 checkpoint_count = box.cfg.checkpoint_count
 box.cfg{checkpoint_count = 1}
 box.space.test:delete(1)
@@ -48,6 +55,7 @@ test_run:cmd("stop server replica")
 
 -- Invoke garbage collector on the master.
 test_run:cmd("restart server default")
+box.cfg{wal_cleanup_delay = 0}
 checkpoint_count = box.cfg.checkpoint_count
 box.cfg{checkpoint_count = 1}
 box.space.test:delete(2)
-- 
2.30.2


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH v6 2/3] test: add a test for wal_cleanup_delay option
  2021-03-27 11:13 [Tarantool-patches] [PATCH v6 0/3] gc/xlog: delay xlog cleanup until relays are subscribed Cyrill Gorcunov via Tarantool-patches
  2021-03-27 11:13 ` [Tarantool-patches] [PATCH v6 1/3] " Cyrill Gorcunov via Tarantool-patches
@ 2021-03-27 11:13 ` Cyrill Gorcunov via Tarantool-patches
  2021-03-29 21:07   ` Vladislav Shpilevoy via Tarantool-patches
  2021-03-27 11:13 ` [Tarantool-patches] [PATCH v6 3/3] test: box-tap/gc -- add test for is_paused field Cyrill Gorcunov via Tarantool-patches
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-03-27 11:13 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

Part-of #5806

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 test/replication/gh-5806-master.lua           |   8 +
 test/replication/gh-5806-xlog-cleanup.result  | 558 ++++++++++++++++++
 .../replication/gh-5806-xlog-cleanup.test.lua | 234 ++++++++
 3 files changed, 800 insertions(+)
 create mode 100644 test/replication/gh-5806-master.lua
 create mode 100644 test/replication/gh-5806-xlog-cleanup.result
 create mode 100644 test/replication/gh-5806-xlog-cleanup.test.lua

diff --git a/test/replication/gh-5806-master.lua b/test/replication/gh-5806-master.lua
new file mode 100644
index 000000000..bc15dab67
--- /dev/null
+++ b/test/replication/gh-5806-master.lua
@@ -0,0 +1,8 @@
+#!/usr/bin/env tarantool
+
+require('console').listen(os.getenv('ADMIN'))
+
+box.cfg({
+    listen              = os.getenv("LISTEN"),
+    wal_cleanup_delay   = tonumber(arg[1]) or 0,
+})
diff --git a/test/replication/gh-5806-xlog-cleanup.result b/test/replication/gh-5806-xlog-cleanup.result
new file mode 100644
index 000000000..da09daf17
--- /dev/null
+++ b/test/replication/gh-5806-xlog-cleanup.result
@@ -0,0 +1,558 @@
+-- test-run result file version 2
+--
+-- gh-5806: defer xlog cleanup to keep xlogs until
+-- replicas present in "_cluster" are connected.
+-- Otherwise we are getting XlogGapError since
+-- master might go far forward from replica and
+-- replica won't be able to connect without full
+-- rebootstrap.
+--
+
+fiber = require('fiber')
+ | ---
+ | ...
+test_run = require('test_run').new()
+ | ---
+ | ...
+engine = test_run:get_cfg('engine')
+ | ---
+ | ...
+
+--
+-- Case 1.
+--
+-- First lets make sure we're getting XlogGapError in
+-- case if wal_cleanup_delay is not used.
+--
+
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+
+--
+-- Keep small number of snaps to force cleanup
+-- procedure be more intensive.
+box.cfg{checkpoint_count = 1}
+ | ---
+ | ...
+
+engine = test_run:get_cfg('engine')
+ | ---
+ | ...
+s = box.schema.space.create('test', {engine = engine})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server replica with rpl_master=master,\
+              script="replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica')
+ | ---
+ | - true
+ | ...
+
+--
+-- On replica we create an own space which allows us to
+-- use more complex scenario and disables replica from
+-- automatic rejoin (since replica can't do auto-rejoin if
+-- there gonna be an own data loss). This allows us to
+-- trigger XlogGapError in the log.
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+box.cfg{checkpoint_count = 1}
+ | ---
+ | ...
+s = box.schema.space.create('testreplica')
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+box.space.testreplica:insert({1})
+ | ---
+ | - [1]
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+
+--
+-- Stop the replica node and generate
+-- xlogs on the master.
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+
+box.space.test:insert({1})
+ | ---
+ | - [1]
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+
+--
+-- We need to restart the master node since otherwise
+-- the replica will be preventing us from removing old
+-- xlog because it will be tracked by gc consumer which
+-- kept in memory while master node is running.
+--
+-- Once restarted we write a new record into master's
+-- space and run snapshot which removes old xlog required
+-- by replica to subscribe leading to XlogGapError which
+-- we need to test.
+test_run:cmd('restart server master')
+ | 
+box.space.test:insert({2})
+ | ---
+ | - [2]
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+assert(box.info.gc().is_paused == false)
+ | ---
+ | - true
+ | ...
+
+--
+-- Start replica and wait for error.
+test_run:cmd('start server replica with wait=False, wait_load=False')
+ | ---
+ | - true
+ | ...
+
+--
+-- Wait error to appear, 60 seconds should be more than enough,
+-- usually it happens in a couple of seconds.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:wait_log('master', 'XlogGapError', nil, 60) ~= nil
+ | ---
+ | - true
+ | ...
+
+--
+-- Cleanup.
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('cleanup server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('cleanup server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+
+--
+-- Case 2.
+--
+-- Lets make sure we're not getting XlogGapError in
+-- case if wal_cleanup_delay is used the code is almost
+-- the same as for Case 1 except we don't disable cleanup
+-- fiber but delay it up to a hour until replica is up
+-- and running.
+--
+
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master with args="3600"')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+
+box.cfg{checkpoint_count = 1}
+ | ---
+ | ...
+
+engine = test_run:get_cfg('engine')
+ | ---
+ | ...
+s = box.schema.space.create('test', {engine = engine})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server replica with rpl_master=master,\
+              script="replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+box.cfg{checkpoint_count = 1}
+ | ---
+ | ...
+s = box.schema.space.create('testreplica')
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+box.space.testreplica:insert({1})
+ | ---
+ | - [1]
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+
+box.space.test:insert({1})
+ | ---
+ | - [1]
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+
+test_run:cmd('restart server master with args="3600"')
+ | 
+box.space.test:insert({2})
+ | ---
+ | - [2]
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+assert(box.info.gc().is_paused == true)
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('start server replica')
+ | ---
+ | - true
+ | ...
+
+--
+-- Make sure no error happened.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+assert(test_run:grep_log("master", "XlogGapError") == nil)
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('cleanup server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('cleanup server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+--
+--
+-- Case 3: Fill _cluster with replica but then delete
+-- the replica so that master's cleanup leave in "paused"
+-- state, and then simply decrease the timeout to make
+-- cleanup fiber work again.
+--
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master with args="3600"')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server replica with rpl_master=master,\
+              script="replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('cleanup server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('restart server master with args="3600"')
+ | 
+assert(box.info.gc().is_paused == true)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+box.cfg{wal_cleanup_delay = 0.01}
+ | ---
+ | ...
+test_run:wait_cond(function() return box.info.gc().is_paused == false end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('cleanup server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server master')
+ | ---
+ | - true
+ | ...
+
+--
+-- Case 4: Fill _cluster with replica but then delete
+-- the replica so that master's cleanup leave in "paused"
+-- state, and finally cleanup the _cluster to kick cleanup.
+--
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server replica with rpl_master=master,\
+              script="replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+master_uuid = test_run:eval('master', 'return box.info.uuid')[1]
+ | ---
+ | ...
+replica_uuid = test_run:eval('replica', 'return box.info.uuid')[1]
+ | ---
+ | ...
+master_custer = test_run:eval('master', 'return box.space._cluster:select()')[1]
+ | ---
+ | ...
+assert(master_custer[1][2] == master_uuid)
+ | ---
+ | - true
+ | ...
+assert(master_custer[2][2] == replica_uuid)
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('cleanup server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('restart server master with args="3600"')
+ | 
+assert(box.info.gc().is_paused == true)
+ | ---
+ | - true
+ | ...
+
+--
+-- Drop the replica from _cluster and make sure
+-- cleanup fiber is not paused anymore.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+deleted_uuid = test_run:eval('master', 'return box.space._cluster:delete(2)')[1][2]
+ | ---
+ | ...
+assert(replica_uuid == deleted_uuid)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.gc().is_paused == false end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('cleanup server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server master')
+ | ---
+ | - true
+ | ...
diff --git a/test/replication/gh-5806-xlog-cleanup.test.lua b/test/replication/gh-5806-xlog-cleanup.test.lua
new file mode 100644
index 000000000..b65563e7f
--- /dev/null
+++ b/test/replication/gh-5806-xlog-cleanup.test.lua
@@ -0,0 +1,234 @@
+--
+-- gh-5806: defer xlog cleanup to keep xlogs until
+-- replicas present in "_cluster" are connected.
+-- Otherwise we are getting XlogGapError since
+-- master might go far forward from replica and
+-- replica won't be able to connect without full
+-- rebootstrap.
+--
+
+fiber = require('fiber')
+test_run = require('test_run').new()
+engine = test_run:get_cfg('engine')
+
+--
+-- Case 1.
+--
+-- First lets make sure we're getting XlogGapError in
+-- case if wal_cleanup_delay is not used.
+--
+
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+test_run:cmd('start server master')
+
+test_run:switch('master')
+box.schema.user.grant('guest', 'replication')
+
+--
+-- Keep small number of snaps to force cleanup
+-- procedure be more intensive.
+box.cfg{checkpoint_count = 1}
+
+engine = test_run:get_cfg('engine')
+s = box.schema.space.create('test', {engine = engine})
+_ = s:create_index('pk')
+
+test_run:switch('default')
+test_run:cmd('create server replica with rpl_master=master,\
+              script="replication/replica.lua"')
+test_run:cmd('start server replica')
+
+--
+-- On replica we create an own space which allows us to
+-- use more complex scenario and disables replica from
+-- automatic rejoin (since replica can't do auto-rejoin if
+-- there gonna be an own data loss). This allows us to
+-- trigger XlogGapError in the log.
+test_run:switch('replica')
+box.cfg{checkpoint_count = 1}
+s = box.schema.space.create('testreplica')
+_ = s:create_index('pk')
+box.space.testreplica:insert({1})
+box.snapshot()
+
+--
+-- Stop the replica node and generate
+-- xlogs on the master.
+test_run:switch('master')
+test_run:cmd('stop server replica')
+
+box.space.test:insert({1})
+box.snapshot()
+
+--
+-- We need to restart the master node since otherwise
+-- the replica will be preventing us from removing old
+-- xlog because it will be tracked by gc consumer which
+-- kept in memory while master node is running.
+--
+-- Once restarted we write a new record into master's
+-- space and run snapshot which removes old xlog required
+-- by replica to subscribe leading to XlogGapError which
+-- we need to test.
+test_run:cmd('restart server master')
+box.space.test:insert({2})
+box.snapshot()
+assert(box.info.gc().is_paused == false)
+
+--
+-- Start replica and wait for error.
+test_run:cmd('start server replica with wait=False, wait_load=False')
+
+--
+-- Wait error to appear, 60 seconds should be more than enough,
+-- usually it happens in a couple of seconds.
+test_run:switch('default')
+test_run:wait_log('master', 'XlogGapError', nil, 60) ~= nil
+
+--
+-- Cleanup.
+test_run:cmd('stop server master')
+test_run:cmd('cleanup server master')
+test_run:cmd('delete server master')
+test_run:cmd('stop server replica')
+test_run:cmd('cleanup server replica')
+test_run:cmd('delete server replica')
+
+--
+-- Case 2.
+--
+-- Lets make sure we're not getting XlogGapError in
+-- case if wal_cleanup_delay is used the code is almost
+-- the same as for Case 1 except we don't disable cleanup
+-- fiber but delay it up to a hour until replica is up
+-- and running.
+--
+
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+test_run:cmd('start server master with args="3600"')
+
+test_run:switch('master')
+box.schema.user.grant('guest', 'replication')
+
+box.cfg{checkpoint_count = 1}
+
+engine = test_run:get_cfg('engine')
+s = box.schema.space.create('test', {engine = engine})
+_ = s:create_index('pk')
+
+test_run:switch('default')
+test_run:cmd('create server replica with rpl_master=master,\
+              script="replication/replica.lua"')
+test_run:cmd('start server replica')
+
+test_run:switch('replica')
+box.cfg{checkpoint_count = 1}
+s = box.schema.space.create('testreplica')
+_ = s:create_index('pk')
+box.space.testreplica:insert({1})
+box.snapshot()
+
+test_run:switch('master')
+test_run:cmd('stop server replica')
+
+box.space.test:insert({1})
+box.snapshot()
+
+test_run:cmd('restart server master with args="3600"')
+box.space.test:insert({2})
+box.snapshot()
+assert(box.info.gc().is_paused == true)
+
+test_run:cmd('start server replica')
+
+--
+-- Make sure no error happened.
+test_run:switch('default')
+assert(test_run:grep_log("master", "XlogGapError") == nil)
+
+test_run:cmd('stop server master')
+test_run:cmd('cleanup server master')
+test_run:cmd('delete server master')
+test_run:cmd('stop server replica')
+test_run:cmd('cleanup server replica')
+test_run:cmd('delete server replica')
+--
+--
+-- Case 3: Fill _cluster with a replica, then delete the
+-- replica so that the master's cleanup is left in the "paused"
+-- state, and then simply decrease the timeout to make the
+-- cleanup fiber work again.
+--
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+test_run:cmd('start server master with args="3600"')
+
+test_run:switch('master')
+box.schema.user.grant('guest', 'replication')
+
+test_run:switch('default')
+test_run:cmd('create server replica with rpl_master=master,\
+              script="replication/replica.lua"')
+test_run:cmd('start server replica')
+
+test_run:switch('master')
+test_run:cmd('stop server replica')
+test_run:cmd('cleanup server replica')
+test_run:cmd('delete server replica')
+
+test_run:cmd('restart server master with args="3600"')
+assert(box.info.gc().is_paused == true)
+
+test_run:switch('master')
+box.cfg{wal_cleanup_delay = 0.01}
+test_run:wait_cond(function() return box.info.gc().is_paused == false end)
+
+test_run:switch('default')
+test_run:cmd('stop server master')
+test_run:cmd('cleanup server master')
+test_run:cmd('delete server master')
+
+--
+-- Case 4: Fill _cluster with a replica, then delete the
+-- replica so that the master's cleanup is left in the "paused"
+-- state, and finally clean up _cluster to kick the cleanup.
+--
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+test_run:cmd('start server master')
+
+test_run:switch('master')
+box.schema.user.grant('guest', 'replication')
+
+test_run:switch('default')
+test_run:cmd('create server replica with rpl_master=master,\
+              script="replication/replica.lua"')
+test_run:cmd('start server replica')
+
+test_run:switch('default')
+master_uuid = test_run:eval('master', 'return box.info.uuid')[1]
+replica_uuid = test_run:eval('replica', 'return box.info.uuid')[1]
+master_custer = test_run:eval('master', 'return box.space._cluster:select()')[1]
+assert(master_custer[1][2] == master_uuid)
+assert(master_custer[2][2] == replica_uuid)
+
+test_run:cmd('stop server replica')
+test_run:cmd('cleanup server replica')
+test_run:cmd('delete server replica')
+
+test_run:switch('master')
+test_run:cmd('restart server master with args="3600"')
+assert(box.info.gc().is_paused == true)
+
+--
+-- Drop the replica from _cluster and make sure
+-- cleanup fiber is not paused anymore.
+test_run:switch('default')
+deleted_uuid = test_run:eval('master', 'return box.space._cluster:delete(2)')[1][2]
+assert(replica_uuid == deleted_uuid)
+
+test_run:switch('master')
+test_run:wait_cond(function() return box.info.gc().is_paused == false end)
+
+test_run:switch('default')
+test_run:cmd('stop server master')
+test_run:cmd('cleanup server master')
+test_run:cmd('delete server master')
-- 
2.30.2


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH v6 3/3] test: box-tap/gc -- add test for is_paused field
  2021-03-27 11:13 [Tarantool-patches] [PATCH v6 0/3] gc/xlog: delay xlog cleanup until relays are subscribed Cyrill Gorcunov via Tarantool-patches
  2021-03-27 11:13 ` [Tarantool-patches] [PATCH v6 1/3] " Cyrill Gorcunov via Tarantool-patches
  2021-03-27 11:13 ` [Tarantool-patches] [PATCH v6 2/3] test: add a test for wal_cleanup_delay option Cyrill Gorcunov via Tarantool-patches
@ 2021-03-27 11:13 ` Cyrill Gorcunov via Tarantool-patches
  2021-03-30 19:57 ` [Tarantool-patches] [PATCH v6 0/3] gc/xlog: delay xlog cleanup until relays are subscribed Vladislav Shpilevoy via Tarantool-patches
  2021-03-31  8:28 ` Kirill Yukhin via Tarantool-patches
  4 siblings, 0 replies; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-03-27 11:13 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

Once a simple bootstrap is complete and no replicas
are in use, we should run with gc unpaused.

Part-of #5806

Acked-by: Serge Petrenko <sergepetrenko@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 test/box-tap/gc.test.lua | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/test/box-tap/gc.test.lua b/test/box-tap/gc.test.lua
index f0155779c..ced87547e 100755
--- a/test/box-tap/gc.test.lua
+++ b/test/box-tap/gc.test.lua
@@ -8,11 +8,12 @@ local debug = type(box.error.injection) == "table"
 
 -- check box.info.gc() is false if snapshot is not in progress
 local test = tap.test('box.info.gc')
-test:plan(1 + (debug and 1 or 0))
+test:plan(2 + (debug and 1 or 0))
 
 
 local gc = box.info.gc()
 test:is(gc.checkpoint_is_in_progress, false, "checkpoint is not in progress")
+test:is(gc.is_paused, false, "GC is not paused")
 
 -- check box.info.gc() is true if snapshot is in progress
 --
-- 
2.30.2


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 2/3] test: add a test for wal_cleanup_delay option
  2021-03-27 11:13 ` [Tarantool-patches] [PATCH v6 2/3] test: add a test for wal_cleanup_delay option Cyrill Gorcunov via Tarantool-patches
@ 2021-03-29 21:07   ` Vladislav Shpilevoy via Tarantool-patches
  2021-03-29 21:46     ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-03-29 21:07 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml

Hi! Thanks for the patch!

> diff --git a/test/replication/gh-5806-xlog-cleanup.result b/test/replication/gh-5806-xlog-cleanup.result
> new file mode 100644
> index 000000000..da09daf17
> --- /dev/null
> +++ b/test/replication/gh-5806-xlog-cleanup.result
> @@ -0,0 +1,558 @@
> +master_uuid = test_run:eval('master', 'return box.info.uuid')[1]
> + | ---
> + | ...
> +replica_uuid = test_run:eval('replica', 'return box.info.uuid')[1]
> + | ---
> + | ...
> +master_custer = test_run:eval('master', 'return box.space._cluster:select()')[1]

1. master_custer -> master_cluster.

> +
> +test_run:switch('master')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('restart server master with args="3600"')
> + | 
> +assert(box.info.gc().is_paused == true)

2. Do you know that boolean expressions don't need to be compared
with the boolean constants explicitly? You could write

	assert(box.info.gc().is_paused).

The same in all the other similar places in this file. Especially
where you do '== false' which is super confusing.
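
For example, a minimal sketch of how those places could look,
taking the checks from this test:

	-- instead of: assert(box.info.gc().is_paused == true)
	assert(box.info.gc().is_paused)
	-- instead of: assert(box.info.gc().is_paused == false)
	assert(not box.info.gc().is_paused)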

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 2/3] test: add a test for wal_cleanup_delay option
  2021-03-29 21:07   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-03-29 21:46     ` Cyrill Gorcunov via Tarantool-patches
  2021-03-29 21:54       ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-03-29 21:46 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Mon, Mar 29, 2021 at 11:07:29PM +0200, Vladislav Shpilevoy wrote:
> 
> 2. Do you know that boolean expressions don't need to be compared
> with the boolean constants explicitly? You could write
> 
> 	assert(box.info.gc().is_paused).
> 
> The same in all the other similar places in this file. Especially
> where you do '== false' which is super confusing.

Force pushed an update, is that what you've in mind?
---
 test/replication/gh-5806-xlog-cleanup.result   | 12 ++++++------
 test/replication/gh-5806-xlog-cleanup.test.lua | 12 ++++++------
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/test/replication/gh-5806-xlog-cleanup.result b/test/replication/gh-5806-xlog-cleanup.result
index da09daf17..412e8f02d 100644
--- a/test/replication/gh-5806-xlog-cleanup.result
+++ b/test/replication/gh-5806-xlog-cleanup.result
@@ -298,7 +298,7 @@ box.snapshot()
  | ---
  | - ok
  | ...
-assert(box.info.gc().is_paused == true)
+assert(box.info.gc().is_paused)
  | ---
  | - true
  | ...
@@ -400,7 +400,7 @@ test_run:cmd('delete server replica')
 
 test_run:cmd('restart server master with args="3600"')
  | 
-assert(box.info.gc().is_paused == true)
+assert(box.info.gc().is_paused)
  | ---
  | - true
  | ...
@@ -480,14 +480,14 @@ master_uuid = test_run:eval('master', 'return box.info.uuid')[1]
 replica_uuid = test_run:eval('replica', 'return box.info.uuid')[1]
  | ---
  | ...
-master_custer = test_run:eval('master', 'return box.space._cluster:select()')[1]
+master_cluster = test_run:eval('master', 'return box.space._cluster:select()')[1]
  | ---
  | ...
-assert(master_custer[1][2] == master_uuid)
+assert(master_cluster[1][2] == master_uuid)
  | ---
  | - true
  | ...
-assert(master_custer[2][2] == replica_uuid)
+assert(master_cluster[2][2] == replica_uuid)
  | ---
  | - true
  | ...
@@ -511,7 +511,7 @@ test_run:switch('master')
  | ...
 test_run:cmd('restart server master with args="3600"')
  | 
-assert(box.info.gc().is_paused == true)
+assert(box.info.gc().is_paused)
  | ---
  | - true
  | ...
diff --git a/test/replication/gh-5806-xlog-cleanup.test.lua b/test/replication/gh-5806-xlog-cleanup.test.lua
index b65563e7f..5af98d362 100644
--- a/test/replication/gh-5806-xlog-cleanup.test.lua
+++ b/test/replication/gh-5806-xlog-cleanup.test.lua
@@ -137,7 +137,7 @@ box.snapshot()
 test_run:cmd('restart server master with args="3600"')
 box.space.test:insert({2})
 box.snapshot()
-assert(box.info.gc().is_paused == true)
+assert(box.info.gc().is_paused)
 
 test_run:cmd('start server replica')
 
@@ -176,7 +176,7 @@ test_run:cmd('cleanup server replica')
 test_run:cmd('delete server replica')
 
 test_run:cmd('restart server master with args="3600"')
-assert(box.info.gc().is_paused == true)
+assert(box.info.gc().is_paused)
 
 test_run:switch('master')
 box.cfg{wal_cleanup_delay = 0.01}
@@ -206,9 +206,9 @@ test_run:cmd('start server replica')
 test_run:switch('default')
 master_uuid = test_run:eval('master', 'return box.info.uuid')[1]
 replica_uuid = test_run:eval('replica', 'return box.info.uuid')[1]
-master_custer = test_run:eval('master', 'return box.space._cluster:select()')[1]
-assert(master_custer[1][2] == master_uuid)
-assert(master_custer[2][2] == replica_uuid)
+master_cluster = test_run:eval('master', 'return box.space._cluster:select()')[1]
+assert(master_cluster[1][2] == master_uuid)
+assert(master_cluster[2][2] == replica_uuid)
 
 test_run:cmd('stop server replica')
 test_run:cmd('cleanup server replica')
@@ -216,7 +216,7 @@ test_run:cmd('delete server replica')
 
 test_run:switch('master')
 test_run:cmd('restart server master with args="3600"')
-assert(box.info.gc().is_paused == true)
+assert(box.info.gc().is_paused)
 
 --
 -- Drop the replica from _cluster and make sure
-- 
2.30.2


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 2/3] test: add a test for wal_cleanup_delay option
  2021-03-29 21:46     ` Cyrill Gorcunov via Tarantool-patches
@ 2021-03-29 21:54       ` Vladislav Shpilevoy via Tarantool-patches
  2021-03-29 21:57         ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-03-29 21:54 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

Hi! Thanks for the fixes!

Please, don't try to rush and answer my comments immediately after
I send them. Even if they seem trivial. It won't speed the things
up, and you will simply miss something, as I told many times.

You still left 3 usages of 'is_paused == false', even though
I specifically said this is the most confusing one.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 2/3] test: add a test for wal_cleanup_delay option
  2021-03-29 21:54       ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-03-29 21:57         ` Cyrill Gorcunov via Tarantool-patches
  2021-03-29 22:19           ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-03-29 21:57 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Mon, Mar 29, 2021 at 11:54:45PM +0200, Vladislav Shpilevoy wrote:
> Hi! Thanks for the fixes!
> 
> Please, don't try to rush and answer my comments immediately after
> I send them. Even if they seem trivial. It won't speed the things
> up, and you will simply miss something, as I told many times.
> 
> You still left 3 usages of 'is_paused == false', even though
> I specifically said this is the most confusing one.

I need to make sure .is_paused is false, and assert(x == false)
looks like exactly what is needed. We have similar code in our other
tests; do you prefer assert(not x.is_paused) or what?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 2/3] test: add a test for wal_cleanup_delay option
  2021-03-29 21:57         ` Cyrill Gorcunov via Tarantool-patches
@ 2021-03-29 22:19           ` Vladislav Shpilevoy via Tarantool-patches
  2021-03-29 22:40             ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-03-29 22:19 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

On 29.03.2021 23:57, Cyrill Gorcunov wrote:
> On Mon, Mar 29, 2021 at 11:54:45PM +0200, Vladislav Shpilevoy wrote:
>> Hi! Thanks for the fixes!
>>
>> Please, don't try to rush and answer my comments immediately after
>> I send them. Even if they seem trivial. It won't speed the things
>> up, and you will simply miss something, as I told many times.
>>
>> You still left 3 usages of 'is_paused == false', even though
>> I specifically said this is the most confusing one.
> 
> I need to make sure .is_paused is false, and assert(x == false)
> looks like exactly what is needed. We have similar code in our other
> tests; do you prefer assert(not x.is_paused) or what?

And there are many, many more tests using booleans as they
are supposed to be used. Why do you need exactly 'false' as a literal?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 2/3] test: add a test for wal_cleanup_delay option
  2021-03-29 22:19           ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-03-29 22:40             ` Cyrill Gorcunov via Tarantool-patches
  2021-03-29 22:56               ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-03-29 22:40 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Tue, Mar 30, 2021 at 12:19:10AM +0200, Vladislav Shpilevoy wrote:
> > 
> > I need to make sure .is_paused is false, and assert(x == false)
> > looks like exactly what is needed. We have similar code in our other
> > tests; do you prefer assert(not x.is_paused) or what?
> 
> And there are many, many more tests using booleans as they
> are supposed to be used. Why do you need exactly 'false' as a literal?

Wait, I just don't understand. is_paused is a boolean type and
I compare it with a boolean value. Could you simply point me to
the preferred way to check that some particular value is false?
I used assert() because I found similar code; if there is a more
suitable way to test the value, sure thing, I can use whatever
you prefer, just point me to an example.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 2/3] test: add a test for wal_cleanup_delay option
  2021-03-29 22:40             ` Cyrill Gorcunov via Tarantool-patches
@ 2021-03-29 22:56               ` Vladislav Shpilevoy via Tarantool-patches
  2021-03-30  7:19                 ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-03-29 22:56 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

On 30.03.2021 00:40, Cyrill Gorcunov wrote:
> On Tue, Mar 30, 2021 at 12:19:10AM +0200, Vladislav Shpilevoy wrote:
>>>
>>> I need to make sure .is_paused is false, and assert(x == false)
>>> looks like exactly what is needed. We have similar code in our other
>>> tests; do you prefer assert(not x.is_paused) or what?
>>
>> And there is much much much more tests using the booleans as they
>> are supposed to be used. Why do you need exactly 'false' as a literal?
> 
> Wait, I just don't understand. is_paused is a boolean type and
> I compare it with a boolean value. Could you simply point me to
> the preferred way to check that some particular value is false?
> I used assert() because I found similar code; if there is a more
> suitable way to test the value, sure thing, I can use whatever
> you prefer, just point me to an example.

Ok, I open the latest replication test we added (gh-5536), and
I see the line

	assert(box.space.test:count() == 10)

It does not look like this:

	assert((box.space.test:count() == 10) == true)

does it? Because it is pointless to have 'boolean' type in a
language if you do the manual comparison with the boolean
constants anyway.

The same as in C you don't code like this:

	bool ok = func();
	if (ok == false)
		error();

You do:

	bool ok = func();
	if (!ok)
		error();
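
Applied to the test, the wait_cond() case would then read as
something like (just a sketch):

	test_run:wait_cond(function() return not box.info.gc().is_paused end)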

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 2/3] test: add a test for wal_cleanup_delay option
  2021-03-29 22:56               ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-03-30  7:19                 ` Cyrill Gorcunov via Tarantool-patches
  2021-03-30 11:55                   ` Cyrill Gorcunov via Tarantool-patches
  2021-03-30 19:59                   ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 2 replies; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-03-30  7:19 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Tue, Mar 30, 2021 at 12:56:47AM +0200, Vladislav Shpilevoy wrote:
> > 
> > Wait, I just don't understand. is_paused is a boolean type and
> > I compare it with a boolean value. Could you simply point me
> > the preferred way to compare if some particular value is false.
> > I used assert() 'cause I found similar code, if there some more
> > suiatble way to test the value, sure thing I can use whatever
> > you prefer, just point me an example.
> 
> Ok, I open the latest replication test we added (gh-5536), and
> I see the line
> 
> 	assert(box.space.test:count() == 10)
> 
> It does not look like this:
> 
> 	assert((box.space.test:count() == 10) == true)
> 
> does it? Because it is pointless to have 'boolean' type in a

This is because count() is a numeric value. What would it have
looked like if 'count()' returned a boolean?

> language if you do the manual comparison with the boolean
> constants anyway.
> 
> The same as in C you don't code like this:
> 
> 	bool ok = func();
> 	if (ok == false)
> 		error();
> 
> You do:
> 
> 	bool ok = func();
> 	if (!ok)
> 		error();
> 

I still don't get it. The examples I see are the following:

test/sql/checks.result:assert(box.space.TEST.ck_constraint.CK.is_enabled == false)
test/sql/checks.result:assert(box.space.TEST.ck_constraint.CK.is_enabled == true)

Vlad, let's try another way. I suspect there are 3 ways to compare:

1) assert(is_paused == false)
2) assert(not is_paused)
3) is_running = not is_paused
   assert(is_running)

so which of them should I use?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 2/3] test: add a test for wal_cleanup_delay option
  2021-03-30  7:19                 ` Cyrill Gorcunov via Tarantool-patches
@ 2021-03-30 11:55                   ` Cyrill Gorcunov via Tarantool-patches
  2021-03-30 19:59                   ` Vladislav Shpilevoy via Tarantool-patches
  1 sibling, 0 replies; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-03-30 11:55 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Tue, Mar 30, 2021 at 10:19:56AM +0300, Cyrill Gorcunov wrote:
> 
> Vlad, let's try another way. I suspect there are 3 ways to compare:
> 
> 1) assert(is_paused == false)
> 2) assert(not is_paused)
> 3) is_running = not is_paused
>    assert(is_running)
> 
> so which of them should I use?

Here is a final update on top I've pushed out. Hopefully it is the one
you prefer to see. I put a complete result file for better review.
---
-- test-run result file version 2
--
-- gh-5806: defer xlog cleanup to keep xlogs until
-- replicas present in "_cluster" are connected.
-- Otherwise we are getting XlogGapError since
-- master might go far forward from replica and
-- replica won't be able to connect without full
-- rebootstrap.
--

fiber = require('fiber')
 | ---
 | ...
test_run = require('test_run').new()
 | ---
 | ...
engine = test_run:get_cfg('engine')
 | ---
 | ...

--
-- Case 1.
--
-- First lets make sure we're getting XlogGapError in
-- case if wal_cleanup_delay is not used.
--

test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
 | ---
 | - true
 | ...
test_run:cmd('start server master')
 | ---
 | - true
 | ...

test_run:switch('master')
 | ---
 | - true
 | ...
box.schema.user.grant('guest', 'replication')
 | ---
 | ...

--
-- Keep small number of snaps to force cleanup
-- procedure be more intensive.
box.cfg{checkpoint_count = 1}
 | ---
 | ...

engine = test_run:get_cfg('engine')
 | ---
 | ...
s = box.schema.space.create('test', {engine = engine})
 | ---
 | ...
_ = s:create_index('pk')
 | ---
 | ...

test_run:switch('default')
 | ---
 | - true
 | ...
test_run:cmd('create server replica with rpl_master=master,\
              script="replication/replica.lua"')
 | ---
 | - true
 | ...
test_run:cmd('start server replica')
 | ---
 | - true
 | ...

--
-- On replica we create an own space which allows us to
-- use more complex scenario and disables replica from
-- automatic rejoin (since replica can't do auto-rejoin if
-- there gonna be an own data loss). This allows us to
-- trigger XlogGapError in the log.
test_run:switch('replica')
 | ---
 | - true
 | ...
box.cfg{checkpoint_count = 1}
 | ---
 | ...
s = box.schema.space.create('testreplica')
 | ---
 | ...
_ = s:create_index('pk')
 | ---
 | ...
box.space.testreplica:insert({1})
 | ---
 | - [1]
 | ...
box.snapshot()
 | ---
 | - ok
 | ...

--
-- Stop the replica node and generate
-- xlogs on the master.
test_run:switch('master')
 | ---
 | - true
 | ...
test_run:cmd('stop server replica')
 | ---
 | - true
 | ...

box.space.test:insert({1})
 | ---
 | - [1]
 | ...
box.snapshot()
 | ---
 | - ok
 | ...

--
-- We need to restart the master node since otherwise
-- the replica will be preventing us from removing old
-- xlog because it will be tracked by gc consumer which
-- kept in memory while master node is running.
--
-- Once restarted we write a new record into master's
-- space and run snapshot which removes old xlog required
-- by replica to subscribe leading to XlogGapError which
-- we need to test.
test_run:cmd('restart server master')
 | 
box.space.test:insert({2})
 | ---
 | - [2]
 | ...
box.snapshot()
 | ---
 | - ok
 | ...
assert(not box.info.gc().is_paused)
 | ---
 | - true
 | ...

--
-- Start replica and wait for error.
test_run:cmd('start server replica with wait=False, wait_load=False')
 | ---
 | - true
 | ...

--
-- Wait error to appear, 60 seconds should be more than enough,
-- usually it happens in a couple of seconds.
test_run:switch('default')
 | ---
 | - true
 | ...
test_run:wait_log('master', 'XlogGapError', nil, 60) ~= nil
 | ---
 | - true
 | ...

--
-- Cleanup.
test_run:cmd('stop server master')
 | ---
 | - true
 | ...
test_run:cmd('cleanup server master')
 | ---
 | - true
 | ...
test_run:cmd('delete server master')
 | ---
 | - true
 | ...
test_run:cmd('stop server replica')
 | ---
 | - true
 | ...
test_run:cmd('cleanup server replica')
 | ---
 | - true
 | ...
test_run:cmd('delete server replica')
 | ---
 | - true
 | ...

--
-- Case 2.
--
-- Lets make sure we're not getting XlogGapError in
-- case if wal_cleanup_delay is used the code is almost
-- the same as for Case 1 except we don't disable cleanup
-- fiber but delay it up to a hour until replica is up
-- and running.
--

test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
 | ---
 | - true
 | ...
test_run:cmd('start server master with args="3600"')
 | ---
 | - true
 | ...

test_run:switch('master')
 | ---
 | - true
 | ...
box.schema.user.grant('guest', 'replication')
 | ---
 | ...

box.cfg{checkpoint_count = 1}
 | ---
 | ...

engine = test_run:get_cfg('engine')
 | ---
 | ...
s = box.schema.space.create('test', {engine = engine})
 | ---
 | ...
_ = s:create_index('pk')
 | ---
 | ...

test_run:switch('default')
 | ---
 | - true
 | ...
test_run:cmd('create server replica with rpl_master=master,\
              script="replication/replica.lua"')
 | ---
 | - true
 | ...
test_run:cmd('start server replica')
 | ---
 | - true
 | ...

test_run:switch('replica')
 | ---
 | - true
 | ...
box.cfg{checkpoint_count = 1}
 | ---
 | ...
s = box.schema.space.create('testreplica')
 | ---
 | ...
_ = s:create_index('pk')
 | ---
 | ...
box.space.testreplica:insert({1})
 | ---
 | - [1]
 | ...
box.snapshot()
 | ---
 | - ok
 | ...

test_run:switch('master')
 | ---
 | - true
 | ...
test_run:cmd('stop server replica')
 | ---
 | - true
 | ...

box.space.test:insert({1})
 | ---
 | - [1]
 | ...
box.snapshot()
 | ---
 | - ok
 | ...

test_run:cmd('restart server master with args="3600"')
 | 
box.space.test:insert({2})
 | ---
 | - [2]
 | ...
box.snapshot()
 | ---
 | - ok
 | ...
assert(box.info.gc().is_paused)
 | ---
 | - true
 | ...

test_run:cmd('start server replica')
 | ---
 | - true
 | ...

--
-- Make sure no error happened.
test_run:switch('default')
 | ---
 | - true
 | ...
assert(test_run:grep_log("master", "XlogGapError") == nil)
 | ---
 | - true
 | ...

test_run:cmd('stop server master')
 | ---
 | - true
 | ...
test_run:cmd('cleanup server master')
 | ---
 | - true
 | ...
test_run:cmd('delete server master')
 | ---
 | - true
 | ...
test_run:cmd('stop server replica')
 | ---
 | - true
 | ...
test_run:cmd('cleanup server replica')
 | ---
 | - true
 | ...
test_run:cmd('delete server replica')
 | ---
 | - true
 | ...
--
--
-- Case 3: Fill _cluster with a replica, then delete the
-- replica so that the master's cleanup is left in the "paused"
-- state, and then simply decrease the timeout to make the
-- cleanup fiber work again.
--
test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
 | ---
 | - true
 | ...
test_run:cmd('start server master with args="3600"')
 | ---
 | - true
 | ...

test_run:switch('master')
 | ---
 | - true
 | ...
box.schema.user.grant('guest', 'replication')
 | ---
 | ...

test_run:switch('default')
 | ---
 | - true
 | ...
test_run:cmd('create server replica with rpl_master=master,\
              script="replication/replica.lua"')
 | ---
 | - true
 | ...
test_run:cmd('start server replica')
 | ---
 | - true
 | ...

test_run:switch('master')
 | ---
 | - true
 | ...
test_run:cmd('stop server replica')
 | ---
 | - true
 | ...
test_run:cmd('cleanup server replica')
 | ---
 | - true
 | ...
test_run:cmd('delete server replica')
 | ---
 | - true
 | ...

test_run:cmd('restart server master with args="3600"')
 | 
assert(box.info.gc().is_paused)
 | ---
 | - true
 | ...

test_run:switch('master')
 | ---
 | - true
 | ...
box.cfg{wal_cleanup_delay = 0.01}
 | ---
 | ...
test_run:wait_cond(function() return not box.info.gc().is_paused end)
 | ---
 | - true
 | ...

test_run:switch('default')
 | ---
 | - true
 | ...
test_run:cmd('stop server master')
 | ---
 | - true
 | ...
test_run:cmd('cleanup server master')
 | ---
 | - true
 | ...
test_run:cmd('delete server master')
 | ---
 | - true
 | ...

--
-- Case 4: Fill _cluster with a replica, then delete the
-- replica so that the master's cleanup is left in the "paused"
-- state, and finally clean up _cluster to kick the cleanup.
--
test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
 | ---
 | - true
 | ...
test_run:cmd('start server master')
 | ---
 | - true
 | ...

test_run:switch('master')
 | ---
 | - true
 | ...
box.schema.user.grant('guest', 'replication')
 | ---
 | ...

test_run:switch('default')
 | ---
 | - true
 | ...
test_run:cmd('create server replica with rpl_master=master,\
              script="replication/replica.lua"')
 | ---
 | - true
 | ...
test_run:cmd('start server replica')
 | ---
 | - true
 | ...

test_run:switch('default')
 | ---
 | - true
 | ...
master_uuid = test_run:eval('master', 'return box.info.uuid')[1]
 | ---
 | ...
replica_uuid = test_run:eval('replica', 'return box.info.uuid')[1]
 | ---
 | ...
master_cluster = test_run:eval('master', 'return box.space._cluster:select()')[1]
 | ---
 | ...
assert(master_cluster[1][2] == master_uuid)
 | ---
 | - true
 | ...
assert(master_cluster[2][2] == replica_uuid)
 | ---
 | - true
 | ...

test_run:cmd('stop server replica')
 | ---
 | - true
 | ...
test_run:cmd('cleanup server replica')
 | ---
 | - true
 | ...
test_run:cmd('delete server replica')
 | ---
 | - true
 | ...

test_run:switch('master')
 | ---
 | - true
 | ...
test_run:cmd('restart server master with args="3600"')
 | 
assert(box.info.gc().is_paused)
 | ---
 | - true
 | ...

--
-- Drop the replica from _cluster and make sure
-- cleanup fiber is not paused anymore.
test_run:switch('default')
 | ---
 | - true
 | ...
deleted_uuid = test_run:eval('master', 'return box.space._cluster:delete(2)')[1][2]
 | ---
 | ...
assert(replica_uuid == deleted_uuid)
 | ---
 | - true
 | ...

test_run:switch('master')
 | ---
 | - true
 | ...
test_run:wait_cond(function() return not box.info.gc().is_paused end)
 | ---
 | - true
 | ...

test_run:switch('default')
 | ---
 | - true
 | ...
test_run:cmd('stop server master')
 | ---
 | - true
 | ...
test_run:cmd('cleanup server master')
 | ---
 | - true
 | ...
test_run:cmd('delete server master')
 | ---
 | - true
 | ...

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 0/3] gc/xlog: delay xlog cleanup until relays are subscribed
  2021-03-27 11:13 [Tarantool-patches] [PATCH v6 0/3] gc/xlog: delay xlog cleanup until relays are subscribed Cyrill Gorcunov via Tarantool-patches
                   ` (2 preceding siblings ...)
  2021-03-27 11:13 ` [Tarantool-patches] [PATCH v6 3/3] test: box-tap/gc -- add test for is_paused field Cyrill Gorcunov via Tarantool-patches
@ 2021-03-30 19:57 ` Vladislav Shpilevoy via Tarantool-patches
  2021-03-31  8:28 ` Kirill Yukhin via Tarantool-patches
  4 siblings, 0 replies; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-03-30 19:57 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml

Hi! Thanks for the patchset!

LGTM.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 2/3] test: add a test for wal_cleanup_delay option
  2021-03-30  7:19                 ` Cyrill Gorcunov via Tarantool-patches
  2021-03-30 11:55                   ` Cyrill Gorcunov via Tarantool-patches
@ 2021-03-30 19:59                   ` Vladislav Shpilevoy via Tarantool-patches
  1 sibling, 0 replies; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-03-30 19:59 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

On 30.03.2021 09:19, Cyrill Gorcunov via Tarantool-patches wrote:
> On Tue, Mar 30, 2021 at 12:56:47AM +0200, Vladislav Shpilevoy wrote:
>>>
>>> Wait, I just don't understand. is_paused is a boolean type and
>>> I compare it with a boolean value. Could you simply point me to
>>> the preferred way to check that some particular value is false?
>>> I used assert() because I found similar code; if there is a more
>>> suitable way to test the value, sure thing, I can use whatever
>>> you prefer, just point me to an example.
>>
>> Ok, I open the latest replication test we added (gh-5536), and
>> I see the line
>>
>> 	assert(box.space.test:count() == 10)
>>
>> It does not look like this:
>>
>> 	assert((box.space.test:count() == 10) == true)
>>
>> does it? Because it is pointless to have 'boolean' type in a
> 
> This is because count() is a numeric value. What would it have
> looked like if 'count()' returned a boolean?

I don't see how it is related to count() being numeric. I
am talking about 'count() == 10', not about the 'count()'
itself. That expression is boolean, and therefore is supposed
to be used with boolean operators.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 0/3] gc/xlog: delay xlog cleanup until relays are subscribed
  2021-03-27 11:13 [Tarantool-patches] [PATCH v6 0/3] gc/xlog: delay xlog cleanup until relays are subscribed Cyrill Gorcunov via Tarantool-patches
                   ` (3 preceding siblings ...)
  2021-03-30 19:57 ` [Tarantool-patches] [PATCH v6 0/3] gc/xlog: delay xlog cleanup until relays are subscribed Vladislav Shpilevoy via Tarantool-patches
@ 2021-03-31  8:28 ` Kirill Yukhin via Tarantool-patches
  4 siblings, 0 replies; 16+ messages in thread
From: Kirill Yukhin via Tarantool-patches @ 2021-03-31  8:28 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml, Vladislav Shpilevoy

Hello,

On 27 Mar 14:13, Cyrill Gorcunov wrote:
> Take a look please.
> 
> v2:
>  - rebase code to the fresh master branch
>  - keep wal_cleanup_delay option name
>  - pass wal_cleanup_delay as an option to gc_init, so it
>    won't be dependent on cfg engine
>  - add comment about gc_delay_unref in plain bootstrap mode
>  - allow to setup wal_cleanup_delay dynamically
>  - update comment in gc_wait_cleanup and call it conditionally
>  - declare wal_cleanup_delay as a double
>  - rename gc.cleanup_is_paused to gc.is_paused and update output
>  - do not show ref counter in box.info.gc() output
>  - update documentation
>  - move gc_delay_unref inside relay_subscribe call which runs
>    in tx context (instead of relay's context)
>  - update tests:
>    - add a comment why we need a temp space on replica node
>    - use explicit insert/snapshot operations
>    - shrkink the number of insert/snapshot to speedup testing
>    - use "restart" instead of stop/start pair
>    - use wait_log helper instead of own function
>    - add is_paused test
> v3:
>  - fix changelog
>  - rework box_check_wal_cleanup_delay, the replication_anon
>    setting is considered only in box_set_wal_cleanup_delay,
>    ie when config is checked and parsed, moreover the order
>    of setup is set to be behind "replication_anon" option
>    processing
>  - delay cycle now considers deadline instead of per cycle
>    calculation
>  - use `double` type for timestamp
>  - test update
>    - verify `.is_paused` value
>    - minimize number of inserts
>    - no need to use temporary space, regular space works as well
>    - add comments on why we should restart the master node
> v4:
>  - drop argument from gc_init(), since we're configuring delay
>    value from load_cfg.lua script there is no need to read the
>    delay early, simply start gc paused and unpause it on demand
>  - move unpause message to main wait cycle
>  - test update:
>    - verify tests and fix replication/replica_rejoin since it waits
>      for xlogs to be cleaned up too early
>    - use 10 seconds for XlogGapError instead of 0.1 second, this is
>      a common deadline value
> v5:
>  - define limits for `wal_cleanup_delay`: it should be either 0,
>    or in range [0.001; TIMEOUT_INFINITY]. This is done to not consider
>    fp epsilon as a meaningul value
>  - fix comment about why anon replica is not using delay
>  - rework cleanup delay'ed cycle
>  - test update:
>    - update vinyl/replica_rejoin -- we need to disable cleanup
>      delay explicitly
>    - update replication/replica_rejoin for same reason
>    - drop unneded test_run:switch() calls
>    - add a testcase where timeout is decreased and cleanup
>      fiber is kicked to run even with stuck replica
> 
> v6:
>   - test update:
>    - replica_rejoin.lua simplified to drop not needed data
>    - update main test to check if _cluster sleanup triggers
>      the fiber to run
> 
> issue https://github.com/tarantool/tarantool/issues/5806
> branch gorcunov/gh-5806-xlog-gc-6

I've checked your patch into 2.6, 2.7 and master.


--
Regards, Kirill Yukhin

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-03-31  8:28 UTC | newest]

Thread overview: 16+ messages
2021-03-27 11:13 [Tarantool-patches] [PATCH v6 0/3] gc/xlog: delay xlog cleanup until relays are subscribed Cyrill Gorcunov via Tarantool-patches
2021-03-27 11:13 ` [Tarantool-patches] [PATCH v6 1/3] " Cyrill Gorcunov via Tarantool-patches
2021-03-27 11:13 ` [Tarantool-patches] [PATCH v6 2/3] test: add a test for wal_cleanup_delay option Cyrill Gorcunov via Tarantool-patches
2021-03-29 21:07   ` Vladislav Shpilevoy via Tarantool-patches
2021-03-29 21:46     ` Cyrill Gorcunov via Tarantool-patches
2021-03-29 21:54       ` Vladislav Shpilevoy via Tarantool-patches
2021-03-29 21:57         ` Cyrill Gorcunov via Tarantool-patches
2021-03-29 22:19           ` Vladislav Shpilevoy via Tarantool-patches
2021-03-29 22:40             ` Cyrill Gorcunov via Tarantool-patches
2021-03-29 22:56               ` Vladislav Shpilevoy via Tarantool-patches
2021-03-30  7:19                 ` Cyrill Gorcunov via Tarantool-patches
2021-03-30 11:55                   ` Cyrill Gorcunov via Tarantool-patches
2021-03-30 19:59                   ` Vladislav Shpilevoy via Tarantool-patches
2021-03-27 11:13 ` [Tarantool-patches] [PATCH v6 3/3] test: box-tap/gc -- add test for is_paused field Cyrill Gorcunov via Tarantool-patches
2021-03-30 19:57 ` [Tarantool-patches] [PATCH v6 0/3] gc/xlog: delay xlog cleanup until relays are subscribed Vladislav Shpilevoy via Tarantool-patches
2021-03-31  8:28 ` Kirill Yukhin via Tarantool-patches
