[tarantool-patches] [PATCH v2] Use SIGKILL to stop server replica

avtikhon avtikhon at tarantool.org
Tue Apr 30 15:39:50 MSK 2019


Stop the replica with SIGKILL, using the signal option of the
'stop server replica' command, so that it dies immediately. This
imitates a replica crash followed by a restart.
A plain 'stop server replica' (SIGTERM) is not sufficient to stop
a tarantool instance when ERRINJ_WAL_DELAY is set, because the
"tarantool" thread waits indefinitely for the paused "wal" thread.
The test now passes signal=SIGKILL to the stop command instead of
relying on the default SIGTERM. This way the replica is killed
immediately and the *.xlog files are removed as expected.
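
The difference between the two signals can be reproduced outside of
tarantool. The following is only a rough sketch, not the test-run
implementation: the child process below is a hypothetical stand-in
that simply ignores SIGTERM (a real tarantool instance instead hangs
waiting for the "wal" thread), and the helper name stop_with is an
assumption made for this illustration. It runs on POSIX systems only.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a server that cannot complete a graceful
# shutdown: the child ignores SIGTERM and then sleeps.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def stop_with(sig, timeout=1.0):
    """Send `sig` to a fresh child; return (stopped_in_time, returncode)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    )
    # Wait until the child has installed its SIGTERM disposition.
    proc.stdout.readline()
    proc.send_signal(sig)
    try:
        proc.wait(timeout=timeout)
        stopped = True
    except subprocess.TimeoutExpired:
        stopped = False
        # Clean up the stuck child; SIGKILL cannot be ignored.
        proc.kill()
        proc.wait()
    proc.stdout.close()
    return stopped, proc.returncode


if __name__ == "__main__":
    print("SIGTERM:", stop_with(signal.SIGTERM, timeout=0.5))
    print("SIGKILL:", stop_with(signal.SIGKILL))
```

In this sketch the SIGTERM attempt times out while the SIGKILL attempt
stops the child right away, which is the behaviour the test relies on
when it uses signal=SIGKILL.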

[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029] ...
[029] wait_xlog(1) or fio.listdir('./master')
[029] ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] + - 00000000000000000305.xlog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000310.vylog
[029] + - 00000000000000000310.snap
[029] ...
[029] -- Stop the replica.
[029] test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029] ...
[029] wait_xlog(2) or fio.listdir('./master')
[029] ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] + - 00000000000000000316.xlog
[029] + - 00000000000000000316.vylog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000317.vylog
[029] + - 00000000000000000317.snap
[029] ...
[029] -- The xlog should only be deleted after the replica
[029] -- is unregistered.
[029]

Close #4162
---

Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-4162-stop-kill
Issue: https://github.com/tarantool/tarantool/issues/4162

 test/replication/gc.result   | 58 +++++++++++++++++-------------------
 test/replication/gc.test.lua | 54 ++++++++++++++++-----------------
 2 files changed, 55 insertions(+), 57 deletions(-)

diff --git a/test/replication/gc.result b/test/replication/gc.result
index 65785f47b..85efc4940 100644
--- a/test/replication/gc.result
+++ b/test/replication/gc.result
@@ -34,14 +34,14 @@ test_run:cmd("setopt delimiter ';'")
 function wait_gc(n)
     return test_run:wait_cond(function()
         return #box.info.gc().checkpoints == n
-    end, 10)
+    end, 10) or box.info.gc()
 end;
 ---
 ...
-function wait_xlog(n, timeout)
+function wait_xlog(n)
     return test_run:wait_cond(function()
         return #fio.glob('./master/*.xlog') == n
-    end, 10)
+    end, 10) or fio.glob('./master/*.xlog')
 end;
 ---
 ...
@@ -117,7 +117,7 @@ test_run:cmd("switch replica")
 ---
 - true
 ...
-test_run:wait_cond(function() return box.space.test:count() == 200 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 200 end, 10) or box.space.test:count()
 ---
 - true
 ...
@@ -131,11 +131,11 @@ test_run:cmd("switch default")
 ...
 -- Check that garbage collection removed the snapshot once
 -- the replica released the corresponding checkpoint.
-wait_gc(1) or box.info.gc()
+wait_gc(1)
 ---
 - true
 ...
-wait_xlog(1) or fio.listdir('./master') -- Make sure the replica will not receive data until
+wait_xlog(1)
 ---
 - true
 ...
@@ -168,11 +168,11 @@ box.snapshot()
 ---
 - ok
 ...
-wait_gc(1) or box.info.gc()
+wait_gc(1)
 ---
 - true
 ...
-wait_xlog(2) or fio.listdir('./master')
+wait_xlog(2)
 ---
 - true
 ...
@@ -187,7 +187,7 @@ test_run:cmd("switch replica")
 ---
 - true
 ...
-test_run:wait_cond(function() return box.space.test:count() == 300 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 300 end, 10) or box.space.test:count()
 ---
 - true
 ...
@@ -201,11 +201,11 @@ test_run:cmd("switch default")
 ...
 -- Now garbage collection should resume and delete files left
 -- from the old checkpoint.
-wait_gc(1) or box.info.gc()
+wait_gc(1)
 ---
 - true
 ...
-wait_xlog(0) or fio.listdir('./master')
+wait_xlog(0)
 ---
 - true
 ...
@@ -244,34 +244,32 @@ fiber.sleep(0.1) -- wait for master to relay data
 -- Garbage collection must not delete the old xlog file
 -- because it is still needed by the replica, but remove
 -- the old snapshot.
-wait_gc(1) or box.info.gc()
+wait_gc(1)
 ---
 - true
 ...
-wait_xlog(2) or fio.listdir('./master')
+wait_xlog(2)
 ---
 - true
 ...
-test_run:cmd("switch replica")
+-- Imitate the replica crash and, then, wake up.
+-- Just 'stop server replica' (SIGTERM) is not sufficient to stop
+-- a tarantool instance when ERRINJ_WAL_DELAY is set, because
+-- "tarantool" thread wait for paused "wal" thread infinitely.
+test_run:cmd("stop server replica with signal=SIGKILL")
 ---
 - true
 ...
--- Unblock the replica and break replication.
-box.error.injection.set("ERRINJ_WAL_DELAY", false)
----
-- ok
-...
-box.cfg{replication = {}}
+test_run:cmd("start server replica")
 ---
+- true
 ...
--- Restart the replica to reestablish replication.
-test_run:cmd("restart server replica")
 -- Wait for the replica to catch up.
 test_run:cmd("switch replica")
 ---
 - true
 ...
-test_run:wait_cond(function() return box.space.test:count() == 310 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 310 end, 10) or box.space.test:count()
 ---
 - true
 ...
@@ -284,11 +282,11 @@ test_run:cmd("switch default")
 - true
 ...
 -- Now it's safe to drop the old xlog.
-wait_gc(1) or box.info.gc()
+wait_gc(1)
 ---
 - true
 ...
-wait_xlog(1) or fio.listdir('./master')
+wait_xlog(1)
 ---
 - true
 ...
@@ -320,11 +318,11 @@ box.snapshot()
 ---
 - ok
 ...
-wait_gc(1) or box.info.gc()
+wait_gc(1)
 ---
 - true
 ...
-wait_xlog(2) or fio.listdir('./master')
+wait_xlog(2)
 ---
 - true
 ...
@@ -333,11 +331,11 @@ wait_xlog(2) or fio.listdir('./master')
 test_run:cleanup_cluster()
 ---
 ...
-wait_gc(1) or box.info.gc()
+wait_gc(1)
 ---
 - true
 ...
-wait_xlog(1) or fio.listdir('./master')
+wait_xlog(1)
 ---
 - true
 ...
@@ -438,7 +436,7 @@ box.snapshot()
 ---
 - ok
 ...
-wait_xlog(0, 10) or fio.listdir('./master')
+wait_xlog(0)
 ---
 - true
 ...
diff --git a/test/replication/gc.test.lua b/test/replication/gc.test.lua
index 890fe29ae..46368d45e 100644
--- a/test/replication/gc.test.lua
+++ b/test/replication/gc.test.lua
@@ -15,12 +15,12 @@ test_run:cmd("setopt delimiter ';'")
 function wait_gc(n)
     return test_run:wait_cond(function()
         return #box.info.gc().checkpoints == n
-    end, 10)
+    end, 10) or box.info.gc()
 end;
-function wait_xlog(n, timeout)
+function wait_xlog(n)
     return test_run:wait_cond(function()
         return #fio.glob('./master/*.xlog') == n
-    end, 10)
+    end, 10) or fio.glob('./master/*.xlog')
 end;
 test_run:cmd("setopt delimiter ''");
 
@@ -63,14 +63,14 @@ test_run:cmd("start server replica")
 -- bootstrapped from, the replica should still receive all
 -- data from the master. Check it.
 test_run:cmd("switch replica")
-test_run:wait_cond(function() return box.space.test:count() == 200 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 200 end, 10) or box.space.test:count()
 box.space.test:count()
 test_run:cmd("switch default")
 
 -- Check that garbage collection removed the snapshot once
 -- the replica released the corresponding checkpoint.
-wait_gc(1) or box.info.gc()
-wait_xlog(1) or fio.listdir('./master') -- Make sure the replica will not receive data until
+wait_gc(1)
+wait_xlog(1)
 -- we test garbage collection.
 box.error.injection.set("ERRINJ_RELAY_SEND_DELAY", true)
 
@@ -86,8 +86,8 @@ box.snapshot()
 -- Invoke garbage collection. Check that it doesn't remove
 -- xlogs needed by the replica.
 box.snapshot()
-wait_gc(1) or box.info.gc()
-wait_xlog(2) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(2)
 
 -- Resume replication so that the replica catches
 -- up quickly.
@@ -95,14 +95,14 @@ box.error.injection.set("ERRINJ_RELAY_SEND_DELAY", false)
 
 -- Check that the replica received all data from the master.
 test_run:cmd("switch replica")
-test_run:wait_cond(function() return box.space.test:count() == 300 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 300 end, 10) or box.space.test:count()
 box.space.test:count()
 test_run:cmd("switch default")
 
 -- Now garbage collection should resume and delete files left
 -- from the old checkpoint.
-wait_gc(1) or box.info.gc()
-wait_xlog(0) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(0)
 --
 -- Check that the master doesn't delete xlog files sent to the
 -- replica until it receives a confirmation that the data has
@@ -120,22 +120,22 @@ fiber.sleep(0.1) -- wait for master to relay data
 -- Garbage collection must not delete the old xlog file
 -- because it is still needed by the replica, but remove
 -- the old snapshot.
-wait_gc(1) or box.info.gc()
-wait_xlog(2) or fio.listdir('./master')
-test_run:cmd("switch replica")
--- Unblock the replica and break replication.
-box.error.injection.set("ERRINJ_WAL_DELAY", false)
-box.cfg{replication = {}}
--- Restart the replica to reestablish replication.
-test_run:cmd("restart server replica")
+wait_gc(1)
+wait_xlog(2)
+-- Imitate the replica crash and, then, wake up.
+-- Just 'stop server replica' (SIGTERM) is not sufficient to stop
+-- a tarantool instance when ERRINJ_WAL_DELAY is set, because
+-- "tarantool" thread wait for paused "wal" thread infinitely.
+test_run:cmd("stop server replica with signal=SIGKILL")
+test_run:cmd("start server replica")
 -- Wait for the replica to catch up.
 test_run:cmd("switch replica")
-test_run:wait_cond(function() return box.space.test:count() == 310 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 310 end, 10) or box.space.test:count()
 box.space.test:count()
 test_run:cmd("switch default")
 -- Now it's safe to drop the old xlog.
-wait_gc(1) or box.info.gc()
-wait_xlog(1) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(1)
 -- Stop the replica.
 test_run:cmd("stop server replica")
 test_run:cmd("cleanup server replica")
@@ -149,14 +149,14 @@ _ = s:auto_increment{}
 box.snapshot()
 _ = s:auto_increment{}
 box.snapshot()
-wait_gc(1) or box.info.gc()
-wait_xlog(2) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(2)
 
 -- The xlog should only be deleted after the replica
 -- is unregistered.
 test_run:cleanup_cluster()
-wait_gc(1) or box.info.gc()
-wait_xlog(1) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(1)
 --
 -- Test that concurrent invocation of the garbage collector works fine.
 --
@@ -201,7 +201,7 @@ wait_xlog(3) or fio.listdir('./master')
 -- all xlog files are removed.
 test_run:cleanup_cluster()
 box.snapshot()
-wait_xlog(0, 10) or fio.listdir('./master')
+wait_xlog(0)
 
 -- Restore the config.
 box.cfg{replication = {}}
-- 
2.17.1




