[Tarantool-patches] [PATCH v6 0/3] gc/xlog: delay xlog cleanup until relays are subscribed
Cyrill Gorcunov
gorcunov at gmail.com
Sat Mar 27 14:13:07 MSK 2021
Please take a look.
v2:
- rebase code to the fresh master branch
- keep wal_cleanup_delay option name
- pass wal_cleanup_delay as an option to gc_init, so it
won't be dependent on cfg engine
- add comment about gc_delay_unref in plain bootstrap mode
- allow setting wal_cleanup_delay dynamically (see the sketch after this list)
- update comment in gc_wait_cleanup and call it conditionally
- declare wal_cleanup_delay as a double
- rename gc.cleanup_is_paused to gc.is_paused and update output
- do not show ref counter in box.info.gc() output
- update documentation
- move gc_delay_unref inside relay_subscribe call which runs
in tx context (instead of relay's context)
- update tests:
- add a comment why we need a temp space on replica node
- use explicit insert/snapshot operations
  - shrink the number of insert/snapshot operations to speed up testing
- use "restart" instead of stop/start pair
- use wait_log helper instead of own function
- add is_paused test
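
A minimal Lua sketch of the dynamic setup mentioned above (the
values are arbitrary and for illustration only):

  -- Set the delay at bootstrap time (seconds).
  box.cfg{wal_cleanup_delay = 3600}
  -- Shrink it later at runtime; the paused cleanup fiber picks up
  -- the new deadline, and 0 disables the delay completely.
  box.cfg{wal_cleanup_delay = 10}
  box.cfg{wal_cleanup_delay = 0}
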
v3:
- fix changelog
- rework box_check_wal_cleanup_delay: the replication_anon
  setting is now considered only in box_set_wal_cleanup_delay,
  i.e. when the config is checked and parsed; moreover, the
  option is now processed after the "replication_anon" option
- the delay cycle now tracks a deadline instead of recalculating
  the timeout on every iteration (see the sketch after this list)
- use `double` type for timestamp
- test update
- verify `.is_paused` value
- minimize number of inserts
- no need to use temporary space, regular space works as well
- add comments on why we should restart the master node
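
The deadline-based wait can be sketched in Lua roughly as follows
(a simplified analogue of the C code in gc.c, not the actual
implementation; the function and argument names are made up):

  local fiber = require('fiber')

  -- Instead of recomputing a fresh timeout on every wakeup (which
  -- lets spurious wakeups stretch the total wait), compute one
  -- absolute deadline and sleep only for whatever is left of it.
  local function wait_cleanup(delay, is_resumed)
      local deadline = fiber.clock() + delay
      while not is_resumed() and fiber.clock() < deadline do
          fiber.sleep(math.min(1, deadline - fiber.clock()))
      end
  end
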
v4:
- drop the argument from gc_init(): since the delay value is
  configured from the load_cfg.lua script, there is no need to
  read it early; simply start gc paused and unpause it on demand
  (see the sketch after this list)
- move unpause message to main wait cycle
- test update:
  - verify tests and fix replication/replica_rejoin, which waits
    for xlogs to be cleaned up too early
  - use 10 seconds for XlogGapError instead of 0.1 second, since
    10 seconds is a common deadline value
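
The paused/unpaused state is visible from Lua; a minimal sketch
(assuming a freshly restarted master with a registered but not yet
subscribed replica, as in the tests below):

  -- gc starts paused and reports it in box.info.gc().
  assert(box.info.gc().is_paused == true)
  -- Once the replica subscribes (or the delay expires), the
  -- cleanup fiber resumes and is_paused becomes false.
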
v5:
- define limits for `wal_cleanup_delay`: it must be either 0 or
  in the range [0.001; TIMEOUT_INFINITY], so that a floating-point
  epsilon is not treated as a meaningful value (see the sketch
  after this list)
- fix comment about why anon replica is not using delay
- rework the delayed cleanup cycle
- test update:
- update vinyl/replica_rejoin -- we need to disable cleanup
delay explicitly
  - update replication/replica_rejoin for the same reason
  - drop unneeded test_run:switch() calls
  - add a test case where the timeout is decreased and the cleanup
    fiber is kicked to run even with a stuck replica
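
The limits can be sketched as a Lua-style check (a simplified
analogue of the validation in box_check_wal_cleanup_delay; the
TIMEOUT_INFINITY constant below is illustrative, not the real one):

  local TIMEOUT_INFINITY = 100 * 365 * 86400 -- illustrative only

  -- Accept 0 (delay disabled) or anything in [0.001; TIMEOUT_INFINITY];
  -- values in (0; 0.001) are rejected as floating-point noise.
  local function check_wal_cleanup_delay(v)
      if v ~= 0 and (v < 0.001 or v > TIMEOUT_INFINITY) then
          error("wal_cleanup_delay must be 0 or in [0.001; TIMEOUT_INFINITY]")
      end
  end
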
v6:
- test update:
  - simplify replica_rejoin.lua to drop unneeded data
  - update the main test to check that a _cluster cleanup triggers
    the cleanup fiber to run
issue https://github.com/tarantool/tarantool/issues/5806
branch gorcunov/gh-5806-xlog-gc-6
Cyrill Gorcunov (3):
gc/xlog: delay xlog cleanup until relays are subscribed
test: add a test for wal_cleanup_delay option
test: box-tap/gc -- add test for is_paused field
.../unreleased/add-wal_cleanup_delay.md | 5 +
src/box/box.cc | 41 ++
src/box/box.h | 1 +
src/box/gc.c | 95 ++-
src/box/gc.h | 36 ++
src/box/lua/cfg.cc | 9 +
src/box/lua/info.c | 4 +
src/box/lua/load_cfg.lua | 5 +
src/box/relay.cc | 1 +
src/box/replication.cc | 2 +
test/app-tap/init_script.result | 1 +
test/box-tap/gc.test.lua | 3 +-
test/box/admin.result | 2 +
test/box/cfg.result | 4 +
test/replication/gh-5806-master.lua | 8 +
test/replication/gh-5806-xlog-cleanup.result | 558 ++++++++++++++++++
.../replication/gh-5806-xlog-cleanup.test.lua | 234 ++++++++
test/replication/replica_rejoin.lua | 11 +
test/replication/replica_rejoin.result | 26 +-
test/replication/replica_rejoin.test.lua | 19 +-
test/vinyl/replica_rejoin.lua | 5 +-
test/vinyl/replica_rejoin.result | 13 +
test/vinyl/replica_rejoin.test.lua | 8 +
23 files changed, 1074 insertions(+), 17 deletions(-)
create mode 100644 changelogs/unreleased/add-wal_cleanup_delay.md
create mode 100644 test/replication/gh-5806-master.lua
create mode 100644 test/replication/gh-5806-xlog-cleanup.result
create mode 100644 test/replication/gh-5806-xlog-cleanup.test.lua
create mode 100644 test/replication/replica_rejoin.lua
base-commit: 234472522a924ecf62e27c27e1e29b8803a677cc
--
Here is a summary diff against v5:
diff --git a/test/replication/gh-5806-xlog-cleanup.result b/test/replication/gh-5806-xlog-cleanup.result
index 523d400a7..da09daf17 100644
--- a/test/replication/gh-5806-xlog-cleanup.result
+++ b/test/replication/gh-5806-xlog-cleanup.result
@@ -29,7 +29,7 @@ test_run:cmd('create server master with script="replication/gh-5806-master.lua"'
| ---
| - true
| ...
-test_run:cmd('start server master with wait=True, wait_load=True')
+test_run:cmd('start server master')
| ---
| - true
| ...
@@ -68,7 +68,7 @@ test_run:cmd('create server replica with rpl_master=master,\
| ---
| - true
| ...
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
| ---
| - true
| ...
@@ -132,7 +132,7 @@ box.snapshot()
-- space and run snapshot which removes old xlog required
-- by replica to subscribe leading to XlogGapError which
-- we need to test.
-test_run:cmd('restart server master with wait_load=True')
+test_run:cmd('restart server master')
|
box.space.test:insert({2})
| ---
@@ -207,7 +207,7 @@ test_run:cmd('create server master with script="replication/gh-5806-master.lua"'
| ---
| - true
| ...
-test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('start server master with args="3600"')
| ---
| - true
| ...
@@ -243,7 +243,7 @@ test_run:cmd('create server replica with rpl_master=master,\
| ---
| - true
| ...
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
| ---
| - true
| ...
@@ -288,7 +288,7 @@ box.snapshot()
| - ok
| ...
-test_run:cmd('restart server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('restart server master with args="3600"')
|
box.space.test:insert({2})
| ---
@@ -303,7 +303,7 @@ assert(box.info.gc().is_paused == true)
| - true
| ...
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
| ---
| - true
| ...
@@ -354,7 +354,7 @@ test_run:cmd('create server master with script="replication/gh-5806-master.lua"'
| ---
| - true
| ...
-test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('start server master with args="3600"')
| ---
| - true
| ...
@@ -376,7 +376,7 @@ test_run:cmd('create server replica with rpl_master=master,\
| ---
| - true
| ...
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
| ---
| - true
| ...
@@ -398,7 +398,7 @@ test_run:cmd('delete server replica')
| - true
| ...
-test_run:cmd('restart server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('restart server master with args="3600"')
|
assert(box.info.gc().is_paused == true)
| ---
@@ -433,3 +433,126 @@ test_run:cmd('delete server master')
| ---
| - true
| ...
+
+--
+-- Case 4: Fill _cluster with replica but then delete
+-- the replica so that master's cleanup leave in "paused"
+-- state, and finally cleanup the _cluster to kick cleanup.
+--
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server replica with rpl_master=master,\
+ script="replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+master_uuid = test_run:eval('master', 'return box.info.uuid')[1]
+ | ---
+ | ...
+replica_uuid = test_run:eval('replica', 'return box.info.uuid')[1]
+ | ---
+ | ...
+master_custer = test_run:eval('master', 'return box.space._cluster:select()')[1]
+ | ---
+ | ...
+assert(master_custer[1][2] == master_uuid)
+ | ---
+ | - true
+ | ...
+assert(master_custer[2][2] == replica_uuid)
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('cleanup server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('restart server master with args="3600"')
+ |
+assert(box.info.gc().is_paused == true)
+ | ---
+ | - true
+ | ...
+
+--
+-- Drop the replica from _cluster and make sure
+-- cleanup fiber is not paused anymore.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+deleted_uuid = test_run:eval('master', 'return box.space._cluster:delete(2)')[1][2]
+ | ---
+ | ...
+assert(replica_uuid == deleted_uuid)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.gc().is_paused == false end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('cleanup server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server master')
+ | ---
+ | - true
+ | ...
diff --git a/test/replication/gh-5806-xlog-cleanup.test.lua b/test/replication/gh-5806-xlog-cleanup.test.lua
index f16be758a..b65563e7f 100644
--- a/test/replication/gh-5806-xlog-cleanup.test.lua
+++ b/test/replication/gh-5806-xlog-cleanup.test.lua
@@ -19,7 +19,7 @@ engine = test_run:get_cfg('engine')
--
test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
-test_run:cmd('start server master with wait=True, wait_load=True')
+test_run:cmd('start server master')
test_run:switch('master')
box.schema.user.grant('guest', 'replication')
@@ -36,7 +36,7 @@ _ = s:create_index('pk')
test_run:switch('default')
test_run:cmd('create server replica with rpl_master=master,\
script="replication/replica.lua"')
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
--
-- On replica we create an own space which allows us to
@@ -70,7 +70,7 @@ box.snapshot()
-- space and run snapshot which removes old xlog required
-- by replica to subscribe leading to XlogGapError which
-- we need to test.
-test_run:cmd('restart server master with wait_load=True')
+test_run:cmd('restart server master')
box.space.test:insert({2})
box.snapshot()
assert(box.info.gc().is_paused == false)
@@ -105,7 +105,7 @@ test_run:cmd('delete server replica')
--
test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
-test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('start server master with args="3600"')
test_run:switch('master')
box.schema.user.grant('guest', 'replication')
@@ -119,7 +119,7 @@ _ = s:create_index('pk')
test_run:switch('default')
test_run:cmd('create server replica with rpl_master=master,\
script="replication/replica.lua"')
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
test_run:switch('replica')
box.cfg{checkpoint_count = 1}
@@ -134,12 +134,12 @@ test_run:cmd('stop server replica')
box.space.test:insert({1})
box.snapshot()
-test_run:cmd('restart server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('restart server master with args="3600"')
box.space.test:insert({2})
box.snapshot()
assert(box.info.gc().is_paused == true)
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
--
-- Make sure no error happened.
@@ -160,7 +160,7 @@ test_run:cmd('delete server replica')
-- cleanup fiber work again.
--
test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
-test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('start server master with args="3600"')
test_run:switch('master')
box.schema.user.grant('guest', 'replication')
@@ -168,14 +168,14 @@ box.schema.user.grant('guest', 'replication')
test_run:switch('default')
test_run:cmd('create server replica with rpl_master=master,\
script="replication/replica.lua"')
-test_run:cmd('start server replica with wait=True, wait_load=True')
+test_run:cmd('start server replica')
test_run:switch('master')
test_run:cmd('stop server replica')
test_run:cmd('cleanup server replica')
test_run:cmd('delete server replica')
-test_run:cmd('restart server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('restart server master with args="3600"')
assert(box.info.gc().is_paused == true)
test_run:switch('master')
@@ -186,3 +186,49 @@ test_run:switch('default')
test_run:cmd('stop server master')
test_run:cmd('cleanup server master')
test_run:cmd('delete server master')
+
+--
+-- Case 4: Fill _cluster with replica but then delete
+-- the replica so that master's cleanup leave in "paused"
+-- state, and finally cleanup the _cluster to kick cleanup.
+--
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+test_run:cmd('start server master')
+
+test_run:switch('master')
+box.schema.user.grant('guest', 'replication')
+
+test_run:switch('default')
+test_run:cmd('create server replica with rpl_master=master,\
+ script="replication/replica.lua"')
+test_run:cmd('start server replica')
+
+test_run:switch('default')
+master_uuid = test_run:eval('master', 'return box.info.uuid')[1]
+replica_uuid = test_run:eval('replica', 'return box.info.uuid')[1]
+master_custer = test_run:eval('master', 'return box.space._cluster:select()')[1]
+assert(master_custer[1][2] == master_uuid)
+assert(master_custer[2][2] == replica_uuid)
+
+test_run:cmd('stop server replica')
+test_run:cmd('cleanup server replica')
+test_run:cmd('delete server replica')
+
+test_run:switch('master')
+test_run:cmd('restart server master with args="3600"')
+assert(box.info.gc().is_paused == true)
+
+--
+-- Drop the replica from _cluster and make sure
+-- cleanup fiber is not paused anymore.
+test_run:switch('default')
+deleted_uuid = test_run:eval('master', 'return box.space._cluster:delete(2)')[1][2]
+assert(replica_uuid == deleted_uuid)
+
+test_run:switch('master')
+test_run:wait_cond(function() return box.info.gc().is_paused == false end)
+
+test_run:switch('default')
+test_run:cmd('stop server master')
+test_run:cmd('cleanup server master')
+test_run:cmd('delete server master')
diff --git a/test/replication/replica_rejoin.lua b/test/replication/replica_rejoin.lua
index 76f6e5b75..9c743c52b 100644
--- a/test/replication/replica_rejoin.lua
+++ b/test/replication/replica_rejoin.lua
@@ -1,22 +1,11 @@
#!/usr/bin/env tarantool
-local repl_include_self = arg[1] and arg[1] == 'true' or false
-local repl_list
-
-if repl_include_self then
- repl_list = {os.getenv("MASTER"), os.getenv("LISTEN")}
-else
- repl_list = os.getenv("MASTER")
-end
-
-- Start the console first to allow test-run to attach even before
-- box.cfg is finished.
require('console').listen(os.getenv('ADMIN'))
box.cfg({
listen = os.getenv("LISTEN"),
- replication = repl_list,
- memtx_memory = 107374182,
- replication_timeout = 0.1,
+ replication = {os.getenv("MASTER"), os.getenv("LISTEN")},
wal_cleanup_delay = 0,
})
diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
index 074cc3e67..843333a19 100644
--- a/test/replication/replica_rejoin.result
+++ b/test/replication/replica_rejoin.result
@@ -47,7 +47,7 @@ test_run:cmd("create server replica with rpl_master=default, script='replication
---
- true
...
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
---
- true
...
@@ -124,7 +124,7 @@ box.cfg{checkpoint_count = checkpoint_count}
...
-- Restart the replica. Since xlogs have been removed,
-- it is supposed to rejoin without changing id.
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
---
- true
...
@@ -229,7 +229,7 @@ test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.
box.cfg{checkpoint_count = checkpoint_count}
---
...
-test_run:cmd("start server replica with args='true', wait=False")
+test_run:cmd("start server replica with wait=False")
---
- true
...
@@ -271,7 +271,7 @@ test_run:cleanup_cluster()
box.space.test:truncate()
---
...
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
---
- true
...
diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
index 223316d86..c3ba9bf3f 100644
--- a/test/replication/replica_rejoin.test.lua
+++ b/test/replication/replica_rejoin.test.lua
@@ -24,7 +24,7 @@ _ = box.space.test:insert{3}
-- Join a replica, then stop it.
test_run:cmd("create server replica with rpl_master=default, script='replication/replica_rejoin.lua'")
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
test_run:cmd("switch replica")
box.info.replication[1].upstream.status == 'follow' or log.error(box.info)
box.space.test:select()
@@ -53,7 +53,7 @@ box.cfg{checkpoint_count = checkpoint_count}
-- Restart the replica. Since xlogs have been removed,
-- it is supposed to rejoin without changing id.
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
box.info.replication[2].downstream.vclock ~= nil or log.error(box.info)
test_run:cmd("switch replica")
box.info.replication[1].upstream.status == 'follow' or log.error(box.info)
@@ -88,7 +88,7 @@ for i = 1, 3 do box.space.test:insert{i * 100} end
fio = require('fio')
test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
box.cfg{checkpoint_count = checkpoint_count}
-test_run:cmd("start server replica with args='true', wait=False")
+test_run:cmd("start server replica with wait=False")
test_run:cmd("switch replica")
test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
box.space.test:select()
@@ -104,7 +104,7 @@ test_run:cmd("stop server replica")
test_run:cmd("cleanup server replica")
test_run:cleanup_cluster()
box.space.test:truncate()
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica")
-- Subscribe the master to the replica.
replica_listen = test_run:cmd("eval replica 'return box.cfg.listen'")
replica_listen ~= nil