* [tarantool-patches] [PATCH v3] Use server kill routine instead of server stop
@ 2019-04-25 11:25 avtikhon
From: avtikhon @ 2019-04-25 11:25 UTC
To: Alexander Turenko; +Cc: avtikhon, tarantool-patches
Imitate a replica crash followed by a restart.
A plain 'stop server replica' (SIGTERM) is not sufficient to stop
a tarantool instance while ERRINJ_WAL_DELAY is set, because the
"tarantool" thread waits indefinitely for the paused "wal" thread.
Change the server stop routine to the kill routine so that SIGKILL
is sent to the replica server instead of SIGTERM. This way the
replica is killed immediately and the *.xlog files are removed as
expected.
[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029] ...
[029] wait_xlog(1) or fio.listdir('./master')
[029] ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] + - 00000000000000000305.xlog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000310.vylog
[029] + - 00000000000000000310.snap
[029] ...
[029] -- Stop the replica.
[029] test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029] ...
[029] wait_xlog(2) or fio.listdir('./master')
[029] ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] + - 00000000000000000316.xlog
[029] + - 00000000000000000316.vylog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000317.vylog
[029] + - 00000000000000000317.snap
[029] ...
[029] -- The xlog should only be deleted after the replica
[029] -- is unregistered.
[029]
Closes #4162
---
Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-4162-stop-kill
Issue: https://github.com/tarantool/tarantool/issues/4162
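Besides the stop/kill change, the diff below moves the "or <diagnostics>"
fallback from every call site into the wait_gc() and wait_xlog() helpers.
The sketch below shows the idiom in plain Lua. It is only an illustration:
wait_cond() here is a hypothetical busy-waiting stand-in for
test_run:wait_cond() (assumed to return true on success and false on
timeout), and the xlogs table is a made-up stand-in for
fio.glob('./master/*.xlog').

```lua
-- Hypothetical stand-in for test_run:wait_cond(): polls cond() until it
-- holds or the timeout (in seconds of CPU time) expires.
local function wait_cond(cond, timeout)
    local deadline = os.clock() + timeout
    repeat
        if cond() then
            return true
        end
    until os.clock() >= deadline
    return false
end

-- Made-up file list standing in for fio.glob('./master/*.xlog').
local xlogs = {'00000000000000000305.xlog', '00000000000000000310.xlog'}

-- Same shape as the helper in the patch: on success the call yields true,
-- on timeout the "or" branch yields the data needed to debug the failure.
local function wait_xlog(n)
    return wait_cond(function()
        return #xlogs == n
    end, 0.01) or xlogs
end

local ok = wait_xlog(2)      -- condition holds, so ok is true
local listing = wait_xlog(1) -- times out, so listing is the xlogs table
```

With the fallback inside the helper, a failing check prints the offending
file list in the .reject file without every call site having to repeat
"or fio.listdir('./master')".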
test/replication/gc.result | 58 +++++++++++++++++-------------------
test/replication/gc.test.lua | 54 ++++++++++++++++-----------------
2 files changed, 55 insertions(+), 57 deletions(-)
diff --git a/test/replication/gc.result b/test/replication/gc.result
index 65785f47b..8e1808078 100644
--- a/test/replication/gc.result
+++ b/test/replication/gc.result
@@ -34,14 +34,14 @@ test_run:cmd("setopt delimiter ';'")
function wait_gc(n)
return test_run:wait_cond(function()
return #box.info.gc().checkpoints == n
- end, 10)
+ end, 10) or box.info.gc()
end;
---
...
-function wait_xlog(n, timeout)
+function wait_xlog(n)
return test_run:wait_cond(function()
return #fio.glob('./master/*.xlog') == n
- end, 10)
+ end, 10) or fio.glob('./master/*.xlog')
end;
---
...
@@ -117,7 +117,7 @@ test_run:cmd("switch replica")
---
- true
...
-test_run:wait_cond(function() return box.space.test:count() == 200 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 200 end, 10) or box.space.test:count()
---
- true
...
@@ -131,11 +131,11 @@ test_run:cmd("switch default")
...
-- Check that garbage collection removed the snapshot once
-- the replica released the corresponding checkpoint.
-wait_gc(1) or box.info.gc()
+wait_gc(1)
---
- true
...
-wait_xlog(1) or fio.listdir('./master') -- Make sure the replica will not receive data until
+wait_xlog(1)
---
- true
...
@@ -168,11 +168,11 @@ box.snapshot()
---
- ok
...
-wait_gc(1) or box.info.gc()
+wait_gc(1)
---
- true
...
-wait_xlog(2) or fio.listdir('./master')
+wait_xlog(2)
---
- true
...
@@ -187,7 +187,7 @@ test_run:cmd("switch replica")
---
- true
...
-test_run:wait_cond(function() return box.space.test:count() == 300 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 300 end, 10) or box.space.test:count()
---
- true
...
@@ -201,11 +201,11 @@ test_run:cmd("switch default")
...
-- Now garbage collection should resume and delete files left
-- from the old checkpoint.
-wait_gc(1) or box.info.gc()
+wait_gc(1)
---
- true
...
-wait_xlog(0) or fio.listdir('./master')
+wait_xlog(0)
---
- true
...
@@ -244,34 +244,32 @@ fiber.sleep(0.1) -- wait for master to relay data
-- Garbage collection must not delete the old xlog file
-- because it is still needed by the replica, but remove
-- the old snapshot.
-wait_gc(1) or box.info.gc()
+wait_gc(1)
---
- true
...
-wait_xlog(2) or fio.listdir('./master')
+wait_xlog(2)
---
- true
...
-test_run:cmd("switch replica")
+-- Imitate the replica crash and, then, wake up.
+-- Just 'stop server replica' (SIGTERM) is not sufficient to stop
+-- a tarantool instance when ERRINJ_WAL_DELAY is set, because
+-- "tarantool" thread wait for paused "wal" thread infinitely.
+test_run:cmd("kill server replica")
---
- true
...
--- Unblock the replica and break replication.
-box.error.injection.set("ERRINJ_WAL_DELAY", false)
----
-- ok
-...
-box.cfg{replication = {}}
+test_run:cmd("start server replica")
---
+- true
...
--- Restart the replica to reestablish replication.
-test_run:cmd("restart server replica")
-- Wait for the replica to catch up.
test_run:cmd("switch replica")
---
- true
...
-test_run:wait_cond(function() return box.space.test:count() == 310 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 310 end, 10) or box.space.test:count()
---
- true
...
@@ -284,11 +282,11 @@ test_run:cmd("switch default")
- true
...
-- Now it's safe to drop the old xlog.
-wait_gc(1) or box.info.gc()
+wait_gc(1)
---
- true
...
-wait_xlog(1) or fio.listdir('./master')
+wait_xlog(1)
---
- true
...
@@ -320,11 +318,11 @@ box.snapshot()
---
- ok
...
-wait_gc(1) or box.info.gc()
+wait_gc(1)
---
- true
...
-wait_xlog(2) or fio.listdir('./master')
+wait_xlog(2)
---
- true
...
@@ -333,11 +331,11 @@ wait_xlog(2) or fio.listdir('./master')
test_run:cleanup_cluster()
---
...
-wait_gc(1) or box.info.gc()
+wait_gc(1)
---
- true
...
-wait_xlog(1) or fio.listdir('./master')
+wait_xlog(1)
---
- true
...
@@ -438,7 +436,7 @@ box.snapshot()
---
- ok
...
-wait_xlog(0, 10) or fio.listdir('./master')
+wait_xlog(0)
---
- true
...
diff --git a/test/replication/gc.test.lua b/test/replication/gc.test.lua
index 890fe29ae..165b245cc 100644
--- a/test/replication/gc.test.lua
+++ b/test/replication/gc.test.lua
@@ -15,12 +15,12 @@ test_run:cmd("setopt delimiter ';'")
function wait_gc(n)
return test_run:wait_cond(function()
return #box.info.gc().checkpoints == n
- end, 10)
+ end, 10) or box.info.gc()
end;
-function wait_xlog(n, timeout)
+function wait_xlog(n)
return test_run:wait_cond(function()
return #fio.glob('./master/*.xlog') == n
- end, 10)
+ end, 10) or fio.glob('./master/*.xlog')
end;
test_run:cmd("setopt delimiter ''");
@@ -63,14 +63,14 @@ test_run:cmd("start server replica")
-- bootstrapped from, the replica should still receive all
-- data from the master. Check it.
test_run:cmd("switch replica")
-test_run:wait_cond(function() return box.space.test:count() == 200 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 200 end, 10) or box.space.test:count()
box.space.test:count()
test_run:cmd("switch default")
-- Check that garbage collection removed the snapshot once
-- the replica released the corresponding checkpoint.
-wait_gc(1) or box.info.gc()
-wait_xlog(1) or fio.listdir('./master') -- Make sure the replica will not receive data until
+wait_gc(1)
+wait_xlog(1)
-- we test garbage collection.
box.error.injection.set("ERRINJ_RELAY_SEND_DELAY", true)
@@ -86,8 +86,8 @@ box.snapshot()
-- Invoke garbage collection. Check that it doesn't remove
-- xlogs needed by the replica.
box.snapshot()
-wait_gc(1) or box.info.gc()
-wait_xlog(2) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(2)
-- Resume replication so that the replica catches
-- up quickly.
@@ -95,14 +95,14 @@ box.error.injection.set("ERRINJ_RELAY_SEND_DELAY", false)
-- Check that the replica received all data from the master.
test_run:cmd("switch replica")
-test_run:wait_cond(function() return box.space.test:count() == 300 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 300 end, 10) or box.space.test:count()
box.space.test:count()
test_run:cmd("switch default")
-- Now garbage collection should resume and delete files left
-- from the old checkpoint.
-wait_gc(1) or box.info.gc()
-wait_xlog(0) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(0)
--
-- Check that the master doesn't delete xlog files sent to the
-- replica until it receives a confirmation that the data has
@@ -120,22 +120,22 @@ fiber.sleep(0.1) -- wait for master to relay data
-- Garbage collection must not delete the old xlog file
-- because it is still needed by the replica, but remove
-- the old snapshot.
-wait_gc(1) or box.info.gc()
-wait_xlog(2) or fio.listdir('./master')
-test_run:cmd("switch replica")
--- Unblock the replica and break replication.
-box.error.injection.set("ERRINJ_WAL_DELAY", false)
-box.cfg{replication = {}}
--- Restart the replica to reestablish replication.
-test_run:cmd("restart server replica")
+wait_gc(1)
+wait_xlog(2)
+-- Imitate the replica crash and, then, wake up.
+-- Just 'stop server replica' (SIGTERM) is not sufficient to stop
+-- a tarantool instance when ERRINJ_WAL_DELAY is set, because
+-- "tarantool" thread wait for paused "wal" thread infinitely.
+test_run:cmd("kill server replica")
+test_run:cmd("start server replica")
-- Wait for the replica to catch up.
test_run:cmd("switch replica")
-test_run:wait_cond(function() return box.space.test:count() == 310 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 310 end, 10) or box.space.test:count()
box.space.test:count()
test_run:cmd("switch default")
-- Now it's safe to drop the old xlog.
-wait_gc(1) or box.info.gc()
-wait_xlog(1) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(1)
-- Stop the replica.
test_run:cmd("stop server replica")
test_run:cmd("cleanup server replica")
@@ -149,14 +149,14 @@ _ = s:auto_increment{}
box.snapshot()
_ = s:auto_increment{}
box.snapshot()
-wait_gc(1) or box.info.gc()
-wait_xlog(2) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(2)
-- The xlog should only be deleted after the replica
-- is unregistered.
test_run:cleanup_cluster()
-wait_gc(1) or box.info.gc()
-wait_xlog(1) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(1)
--
-- Test that concurrent invocation of the garbage collector works fine.
--
@@ -201,7 +201,7 @@ wait_xlog(3) or fio.listdir('./master')
-- all xlog files are removed.
test_run:cleanup_cluster()
box.snapshot()
-wait_xlog(0, 10) or fio.listdir('./master')
+wait_xlog(0)
-- Restore the config.
box.cfg{replication = {}}
--
2.17.1