[Tarantool-patches] [PATCH v6 0/3] gc/xlog: delay xlog cleanup until relays are subscribed

Cyrill Gorcunov gorcunov at gmail.com
Sat Mar 27 14:13:07 MSK 2021


Take a look please.

v2:
 - rebase code to the fresh master branch
 - keep wal_cleanup_delay option name
 - pass wal_cleanup_delay as an option to gc_init, so it
   won't be dependent on cfg engine
 - add comment about gc_delay_unref in plain bootstrap mode
 - allow wal_cleanup_delay to be set dynamically (a sketch
   follows this list)
 - update comment in gc_wait_cleanup and call it conditionally
 - declare wal_cleanup_delay as a double
 - rename gc.cleanup_is_paused to gc.is_paused and update output
 - do not show ref counter in box.info.gc() output
 - update documentation
 - move gc_delay_unref inside relay_subscribe call which runs
   in tx context (instead of relay's context)
 - update tests:
   - add a comment why we need a temp space on replica node
   - use explicit insert/snapshot operations
   - shrink the number of insert/snapshot operations to speed up testing
   - use "restart" instead of stop/start pair
   - use wait_log helper instead of own function
   - add is_paused test
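
A minimal sketch of how the dynamic knob and the renamed field are
meant to look from Lua (illustrative only, the exact behaviour is
defined in the first patch):

    box.cfg{wal_cleanup_delay = 3600}  -- can also be changed at runtime
    box.info.gc().is_paused            -- true while the cleanup fiber
                                       -- waits for relays to subscribe
    box.cfg{wal_cleanup_delay = 0}     -- shrinking the delay kicks the
                                       -- cleanup fiber (see the test
                                       -- case added in v5)
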
v3:
 - fix changelog
 - rework box_check_wal_cleanup_delay: the replication_anon
   setting is considered only in box_set_wal_cleanup_delay,
   i.e. when the config is checked and parsed; moreover, the
   setup is ordered after the "replication_anon" option
   processing (a sketch follows this list)
 - the delay cycle now uses a deadline instead of a per-cycle
   calculation
 - use `double` type for timestamp
 - test update
   - verify `.is_paused` value
   - minimize number of inserts
   - no need to use temporary space, regular space works as well
   - add comments on why we should restart the master node
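
A rough illustration of the anonymous replica case (master_uri below
is just a placeholder):

    -- wal_cleanup_delay is ignored on an anonymous replica
    box.cfg{replication_anon = true, read_only = true,
            replication = master_uri, wal_cleanup_delay = 3600}
    box.info.gc().is_paused            -- expected to stay false
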
v4:
 - drop the argument from gc_init(): since we're configuring the
   delay value from the load_cfg.lua script, there is no need to
   read the delay early; simply start gc paused and unpause it
   on demand
 - move unpause message to main wait cycle
 - test update:
   - verify tests and fix replication/replica_rejoin since it waits
     for xlogs to be cleaned up too early
   - use 10 seconds for XlogGapError instead of 0.1 second, since
     this is a common deadline value
v5:
 - define limits for `wal_cleanup_delay`: it should be either 0
   or in the range [0.001; TIMEOUT_INFINITY], so that a floating-point
   epsilon is not treated as a meaningful value (see the sketch
   after this list)
 - fix comment about why anon replica is not using delay
 - rework the delayed cleanup cycle
 - test update:
   - update vinyl/replica_rejoin -- we need to disable cleanup
     delay explicitly
   - update replication/replica_rejoin for same reason
    - drop unneeded test_run:switch() calls
   - add a testcase where timeout is decreased and cleanup
     fiber is kicked to run even with stuck replica
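
A tiny sketch of the accepted range (the exact error reporting is
whatever box.cfg does for an incorrect option value):

    box.cfg{wal_cleanup_delay = 0}      -- ok: delay disabled
    box.cfg{wal_cleanup_delay = 0.001}  -- ok: lower bound of the range
    ok = pcall(box.cfg, {wal_cleanup_delay = 0.0001})
    assert(not ok)                      -- rejected: neither 0 nor >= 0.001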

v6:
 - test update:
   - simplify replica_rejoin.lua to drop unneeded data
   - update the main test to check that _cluster cleanup triggers
     the cleanup fiber to run

issue https://github.com/tarantool/tarantool/issues/5806
branch gorcunov/gh-5806-xlog-gc-6
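
In short, a master restarted with a non-zero wal_cleanup_delay keeps
its xlogs until every replica registered in _cluster has its relay
subscribed (or the delay expires); the tests below observe it roughly
like this:

    -- on the master, right after restart with wal_cleanup_delay = 3600
    assert(box.info.gc().is_paused == true)
    -- ...once the replica reconnects and its relay subscribes...
    test_run:wait_cond(function() return box.info.gc().is_paused == false end)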

Cyrill Gorcunov (3):
  gc/xlog: delay xlog cleanup until relays are subscribed
  test: add a test for wal_cleanup_delay option
  test: box-tap/gc -- add test for is_paused field

 .../unreleased/add-wal_cleanup_delay.md       |   5 +
 src/box/box.cc                                |  41 ++
 src/box/box.h                                 |   1 +
 src/box/gc.c                                  |  95 ++-
 src/box/gc.h                                  |  36 ++
 src/box/lua/cfg.cc                            |   9 +
 src/box/lua/info.c                            |   4 +
 src/box/lua/load_cfg.lua                      |   5 +
 src/box/relay.cc                              |   1 +
 src/box/replication.cc                        |   2 +
 test/app-tap/init_script.result               |   1 +
 test/box-tap/gc.test.lua                      |   3 +-
 test/box/admin.result                         |   2 +
 test/box/cfg.result                           |   4 +
 test/replication/gh-5806-master.lua           |   8 +
 test/replication/gh-5806-xlog-cleanup.result  | 558 ++++++++++++++++++
 .../replication/gh-5806-xlog-cleanup.test.lua | 234 ++++++++
 test/replication/replica_rejoin.lua           |  11 +
 test/replication/replica_rejoin.result        |  26 +-
 test/replication/replica_rejoin.test.lua      |  19 +-
 test/vinyl/replica_rejoin.lua                 |   5 +-
 test/vinyl/replica_rejoin.result              |  13 +
 test/vinyl/replica_rejoin.test.lua            |   8 +
 23 files changed, 1074 insertions(+), 17 deletions(-)
 create mode 100644 changelogs/unreleased/add-wal_cleanup_delay.md
 create mode 100644 test/replication/gh-5806-master.lua
 create mode 100644 test/replication/gh-5806-xlog-cleanup.result
 create mode 100644 test/replication/gh-5806-xlog-cleanup.test.lua
 create mode 100644 test/replication/replica_rejoin.lua


base-commit: 234472522a924ecf62e27c27e1e29b8803a677cc
-- 
Here is a summary diff against v5

diff --git a/test/replication/gh-5806-xlog-cleanup.result b/test/replication/gh-5806-xlog-cleanup.result
index 523d400a7..da09daf17 100644
--- a/test/replication/gh-5806-xlog-cleanup.result
+++ b/test/replication/gh-5806-xlog-cleanup.result
@@ -29,7 +29,7 @@ test_run:cmd('create server master with script="replication/gh-5806-master.lua"'
  | ---
  | - true
  | ...
-test_run:cmd('start server master with wait=True, wait_load=True')
+test_run:cmd('start server master')
  | ---
  | - true
  | ...
@@ -68,7 +68,7 @@ test_run:cmd('create server replica with rpl_master=master,\
  | ---
  | - true
  | ...
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
  | ---
  | - true
  | ...
@@ -132,7 +132,7 @@ box.snapshot()
 -- space and run snapshot which removes old xlog required
 -- by replica to subscribe leading to XlogGapError which
 -- we need to test.
-test_run:cmd('restart server master with wait_load=True')
+test_run:cmd('restart server master')
  | 
 box.space.test:insert({2})
  | ---
@@ -207,7 +207,7 @@ test_run:cmd('create server master with script="replication/gh-5806-master.lua"'
  | ---
  | - true
  | ...
-test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('start server master with args="3600"')
  | ---
  | - true
  | ...
@@ -243,7 +243,7 @@ test_run:cmd('create server replica with rpl_master=master,\
  | ---
  | - true
  | ...
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
  | ---
  | - true
  | ...
@@ -288,7 +288,7 @@ box.snapshot()
  | - ok
  | ...
 
-test_run:cmd('restart server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('restart server master with args="3600"')
  | 
 box.space.test:insert({2})
  | ---
@@ -303,7 +303,7 @@ assert(box.info.gc().is_paused == true)
  | - true
  | ...
 
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
  | ---
  | - true
  | ...
@@ -354,7 +354,7 @@ test_run:cmd('create server master with script="replication/gh-5806-master.lua"'
  | ---
  | - true
  | ...
-test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('start server master with args="3600"')
  | ---
  | - true
  | ...
@@ -376,7 +376,7 @@ test_run:cmd('create server replica with rpl_master=master,\
  | ---
  | - true
  | ...
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
  | ---
  | - true
  | ...
@@ -398,7 +398,7 @@ test_run:cmd('delete server replica')
  | - true
  | ...
 
-test_run:cmd('restart server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('restart server master with args="3600"')
  | 
 assert(box.info.gc().is_paused == true)
  | ---
@@ -433,3 +433,126 @@ test_run:cmd('delete server master')
  | ---
  | - true
  | ...
+
+--
+-- Case 4: Fill _cluster with a replica, then delete the
+-- replica so that the master's cleanup is left in the "paused"
+-- state, and finally clean up _cluster to kick the cleanup.
+--
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server replica with rpl_master=master,\
+              script="replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+master_uuid = test_run:eval('master', 'return box.info.uuid')[1]
+ | ---
+ | ...
+replica_uuid = test_run:eval('replica', 'return box.info.uuid')[1]
+ | ---
+ | ...
+master_custer = test_run:eval('master', 'return box.space._cluster:select()')[1]
+ | ---
+ | ...
+assert(master_custer[1][2] == master_uuid)
+ | ---
+ | - true
+ | ...
+assert(master_custer[2][2] == replica_uuid)
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('cleanup server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('restart server master with args="3600"')
+ | 
+assert(box.info.gc().is_paused == true)
+ | ---
+ | - true
+ | ...
+
+--
+-- Drop the replica from _cluster and make sure
+-- cleanup fiber is not paused anymore.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+deleted_uuid = test_run:eval('master', 'return box.space._cluster:delete(2)')[1][2]
+ | ---
+ | ...
+assert(replica_uuid == deleted_uuid)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.gc().is_paused == false end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('cleanup server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server master')
+ | ---
+ | - true
+ | ...
diff --git a/test/replication/gh-5806-xlog-cleanup.test.lua b/test/replication/gh-5806-xlog-cleanup.test.lua
index f16be758a..b65563e7f 100644
--- a/test/replication/gh-5806-xlog-cleanup.test.lua
+++ b/test/replication/gh-5806-xlog-cleanup.test.lua
@@ -19,7 +19,7 @@ engine = test_run:get_cfg('engine')
 --
 
 test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
-test_run:cmd('start server master with wait=True, wait_load=True')
+test_run:cmd('start server master')
 
 test_run:switch('master')
 box.schema.user.grant('guest', 'replication')
@@ -36,7 +36,7 @@ _ = s:create_index('pk')
 test_run:switch('default')
 test_run:cmd('create server replica with rpl_master=master,\
               script="replication/replica.lua"')
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
 
 --
 -- On replica we create an own space which allows us to
@@ -70,7 +70,7 @@ box.snapshot()
 -- space and run snapshot which removes old xlog required
 -- by replica to subscribe leading to XlogGapError which
 -- we need to test.
-test_run:cmd('restart server master with wait_load=True')
+test_run:cmd('restart server master')
 box.space.test:insert({2})
 box.snapshot()
 assert(box.info.gc().is_paused == false)
@@ -105,7 +105,7 @@ test_run:cmd('delete server replica')
 --
 
 test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
-test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('start server master with args="3600"')
 
 test_run:switch('master')
 box.schema.user.grant('guest', 'replication')
@@ -119,7 +119,7 @@ _ = s:create_index('pk')
 test_run:switch('default')
 test_run:cmd('create server replica with rpl_master=master,\
               script="replication/replica.lua"')
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
 
 test_run:switch('replica')
 box.cfg{checkpoint_count = 1}
@@ -134,12 +134,12 @@ test_run:cmd('stop server replica')
 box.space.test:insert({1})
 box.snapshot()
 
-test_run:cmd('restart server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('restart server master with args="3600"')
 box.space.test:insert({2})
 box.snapshot()
 assert(box.info.gc().is_paused == true)
 
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
 
 --
 -- Make sure no error happened.
@@ -160,7 +160,7 @@ test_run:cmd('delete server replica')
 -- cleanup fiber work again.
 --
 test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
-test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('start server master with args="3600"')
 
 test_run:switch('master')
 box.schema.user.grant('guest', 'replication')
@@ -168,14 +168,14 @@ box.schema.user.grant('guest', 'replication')
 test_run:switch('default')
 test_run:cmd('create server replica with rpl_master=master,\
               script="replication/replica.lua"')
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
 
 test_run:switch('master')
 test_run:cmd('stop server replica')
 test_run:cmd('cleanup server replica')
 test_run:cmd('delete server replica')
 
-test_run:cmd('restart server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('restart server master with args="3600"')
 assert(box.info.gc().is_paused == true)
 
 test_run:switch('master')
@@ -186,3 +186,49 @@ test_run:switch('default')
 test_run:cmd('stop server master')
 test_run:cmd('cleanup server master')
 test_run:cmd('delete server master')
+
+--
+-- Case 4: Fill _cluster with a replica, then delete the
+-- replica so that the master's cleanup is left in the "paused"
+-- state, and finally clean up _cluster to kick the cleanup.
+--
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+test_run:cmd('start server master')
+
+test_run:switch('master')
+box.schema.user.grant('guest', 'replication')
+
+test_run:switch('default')
+test_run:cmd('create server replica with rpl_master=master,\
+              script="replication/replica.lua"')
+test_run:cmd('start server replica')
+
+test_run:switch('default')
+master_uuid = test_run:eval('master', 'return box.info.uuid')[1]
+replica_uuid = test_run:eval('replica', 'return box.info.uuid')[1]
+master_custer = test_run:eval('master', 'return box.space._cluster:select()')[1]
+assert(master_custer[1][2] == master_uuid)
+assert(master_custer[2][2] == replica_uuid)
+
+test_run:cmd('stop server replica')
+test_run:cmd('cleanup server replica')
+test_run:cmd('delete server replica')
+
+test_run:switch('master')
+test_run:cmd('restart server master with args="3600"')
+assert(box.info.gc().is_paused == true)
+
+--
+-- Drop the replica from _cluster and make sure
+-- cleanup fiber is not paused anymore.
+test_run:switch('default')
+deleted_uuid = test_run:eval('master', 'return box.space._cluster:delete(2)')[1][2]
+assert(replica_uuid == deleted_uuid)
+
+test_run:switch('master')
+test_run:wait_cond(function() return box.info.gc().is_paused == false end)
+
+test_run:switch('default')
+test_run:cmd('stop server master')
+test_run:cmd('cleanup server master')
+test_run:cmd('delete server master')
diff --git a/test/replication/replica_rejoin.lua b/test/replication/replica_rejoin.lua
index 76f6e5b75..9c743c52b 100644
--- a/test/replication/replica_rejoin.lua
+++ b/test/replication/replica_rejoin.lua
@@ -1,22 +1,11 @@
 #!/usr/bin/env tarantool
 
-local repl_include_self = arg[1] and arg[1] == 'true' or false
-local repl_list
-
-if repl_include_self then
-    repl_list = {os.getenv("MASTER"), os.getenv("LISTEN")}
-else
-    repl_list = os.getenv("MASTER")
-end
-
 -- Start the console first to allow test-run to attach even before
 -- box.cfg is finished.
 require('console').listen(os.getenv('ADMIN'))
 
 box.cfg({
     listen              = os.getenv("LISTEN"),
-    replication         = repl_list,
-    memtx_memory        = 107374182,
-    replication_timeout = 0.1,
+    replication         = {os.getenv("MASTER"), os.getenv("LISTEN")},
     wal_cleanup_delay   = 0,
 })
diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
index 074cc3e67..843333a19 100644
--- a/test/replication/replica_rejoin.result
+++ b/test/replication/replica_rejoin.result
@@ -47,7 +47,7 @@ test_run:cmd("create server replica with rpl_master=default, script='replication
 ---
 - true
 ...
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 ---
 - true
 ...
@@ -124,7 +124,7 @@ box.cfg{checkpoint_count = checkpoint_count}
 ...
 -- Restart the replica. Since xlogs have been removed,
 -- it is supposed to rejoin without changing id.
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 ---
 - true
 ...
@@ -229,7 +229,7 @@ test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.
 box.cfg{checkpoint_count = checkpoint_count}
 ---
 ...
-test_run:cmd("start server replica with args='true', wait=False")
+test_run:cmd("start server replica with wait=False")
 ---
 - true
 ...
@@ -271,7 +271,7 @@ test_run:cleanup_cluster()
 box.space.test:truncate()
 ---
 ...
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 ---
 - true
 ...
diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
index 223316d86..c3ba9bf3f 100644
--- a/test/replication/replica_rejoin.test.lua
+++ b/test/replication/replica_rejoin.test.lua
@@ -24,7 +24,7 @@ _ = box.space.test:insert{3}
 
 -- Join a replica, then stop it.
 test_run:cmd("create server replica with rpl_master=default, script='replication/replica_rejoin.lua'")
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 test_run:cmd("switch replica")
 box.info.replication[1].upstream.status == 'follow' or log.error(box.info)
 box.space.test:select()
@@ -53,7 +53,7 @@ box.cfg{checkpoint_count = checkpoint_count}
 
 -- Restart the replica. Since xlogs have been removed,
 -- it is supposed to rejoin without changing id.
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 box.info.replication[2].downstream.vclock ~= nil or log.error(box.info)
 test_run:cmd("switch replica")
 box.info.replication[1].upstream.status == 'follow' or log.error(box.info)
@@ -88,7 +88,7 @@ for i = 1, 3 do box.space.test:insert{i * 100} end
 fio = require('fio')
 test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
 box.cfg{checkpoint_count = checkpoint_count}
-test_run:cmd("start server replica with args='true', wait=False")
+test_run:cmd("start server replica with wait=False")
 test_run:cmd("switch replica")
 test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
 box.space.test:select()
@@ -104,7 +104,7 @@ test_run:cmd("stop server replica")
 test_run:cmd("cleanup server replica")
 test_run:cleanup_cluster()
 box.space.test:truncate()
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
 -- Subscribe the master to the replica.
 replica_listen = test_run:cmd("eval replica 'return box.cfg.listen'")
 replica_listen ~= nil

