* [tarantool-patches] [PATCH v2] Use SIGKILL to stop server replica
@ 2019-04-29 19:07 avtikhon
From: avtikhon @ 2019-04-29 19:07 UTC
To: Alexander Turenko; +Cc: avtikhon, tarantool-patches
Used the signal option set to SIGKILL to stop server replica
routine to be able to stop the replica imediately to imitate
the replica crash and, then, wake up.
Just 'stop server replica' (SIGTERM) is not sufficient to stop
a tarantool instance when ERRINJ_WAL_DELAY is set, because
"tarantool" thread wait for paused "wal" thread infinitely.
Changed server stop routine to to kill routine to be able
to use SIGKILL instead of SIGTERM to the replica server. In
this way the server replica will be killed immediately and
*.xlog files will be removed as it has to be.
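To make the SIGTERM/SIGKILL difference concrete, here is a minimal, self-contained Python sketch. It is not Tarantool or test-run code. The child process is a hypothetical stand-in whose SIGTERM handler waits for a worker thread that never finishes, which loosely mirrors the "tarantool" thread waiting for the paused "wal" thread. The names `CHILD`, `stop_then_kill` and `grace` are invented for illustration, and the sketch assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a hung instance: graceful shutdown on SIGTERM
# joins a worker thread that is paused forever, so the process never exits.
CHILD = r'''
import signal, sys, threading

paused = threading.Event()  # never set: plays the role of the paused worker
worker = threading.Thread(target=paused.wait, daemon=True)
worker.start()

def on_sigterm(signum, frame):
    worker.join()           # blocks forever, so sys.exit() is never reached
    sys.exit(0)

signal.signal(signal.SIGTERM, on_sigterm)
print("ready", flush=True)
while True:
    signal.pause()
'''


def stop_then_kill(grace=1.0):
    """Send SIGTERM, wait `grace` seconds, then fall back to SIGKILL.

    Returns (stopped_by_sigterm, returncode).
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    )
    proc.stdout.readline()  # wait until the child has installed its handler
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace)
        stopped_by_sigterm = True
    except subprocess.TimeoutExpired:
        stopped_by_sigterm = False
    proc.kill()             # SIGKILL cannot be caught or blocked
    proc.wait()
    proc.stdout.close()
    return stopped_by_sigterm, proc.returncode


if __name__ == "__main__":
    print(stop_then_kill())
```

In this sketch the graceful stop times out and only the SIGKILL fallback ends the process, which is the behaviour the test relies on when it imitates a replica crash.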
[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029] ...
[029] wait_xlog(1) or fio.listdir('./master')
[029] ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] + - 00000000000000000305.xlog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000310.vylog
[029] + - 00000000000000000310.snap
[029] ...
[029] -- Stop the replica.
[029] test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029] ...
[029] wait_xlog(2) or fio.listdir('./master')
[029] ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] + - 00000000000000000316.xlog
[029] + - 00000000000000000316.vylog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000317.vylog
[029] + - 00000000000000000317.snap
[029] ...
[029] -- The xlog should only be deleted after the replica
[029] -- is unregistered.
[029]
Close #4162
---
Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-4162-stop-kill
Issue: https://github.com/tarantool/tarantool/issues/4162
test/replication/gc.result | 58 +++++++++++++++++-------------------
test/replication/gc.test.lua | 54 ++++++++++++++++-----------------
2 files changed, 55 insertions(+), 57 deletions(-)
diff --git a/test/replication/gc.result b/test/replication/gc.result
index 65785f47b..8e1808078 100644
--- a/test/replication/gc.result
+++ b/test/replication/gc.result
@@ -34,14 +34,14 @@ test_run:cmd("setopt delimiter ';'")
function wait_gc(n)
return test_run:wait_cond(function()
return #box.info.gc().checkpoints == n
- end, 10)
+ end, 10) or box.info.gc()
end;
---
...
-function wait_xlog(n, timeout)
+function wait_xlog(n)
return test_run:wait_cond(function()
return #fio.glob('./master/*.xlog') == n
- end, 10)
+ end, 10) or fio.glob('./master/*.xlog')
end;
---
...
@@ -117,7 +117,7 @@ test_run:cmd("switch replica")
---
- true
...
-test_run:wait_cond(function() return box.space.test:count() == 200 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 200 end, 10) or box.space.test:count()
---
- true
...
@@ -131,11 +131,11 @@ test_run:cmd("switch default")
...
-- Check that garbage collection removed the snapshot once
-- the replica released the corresponding checkpoint.
-wait_gc(1) or box.info.gc()
+wait_gc(1)
---
- true
...
-wait_xlog(1) or fio.listdir('./master') -- Make sure the replica will not receive data until
+wait_xlog(1)
---
- true
...
@@ -168,11 +168,11 @@ box.snapshot()
---
- ok
...
-wait_gc(1) or box.info.gc()
+wait_gc(1)
---
- true
...
-wait_xlog(2) or fio.listdir('./master')
+wait_xlog(2)
---
- true
...
@@ -187,7 +187,7 @@ test_run:cmd("switch replica")
---
- true
...
-test_run:wait_cond(function() return box.space.test:count() == 300 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 300 end, 10) or box.space.test:count()
---
- true
...
@@ -201,11 +201,11 @@ test_run:cmd("switch default")
...
-- Now garbage collection should resume and delete files left
-- from the old checkpoint.
-wait_gc(1) or box.info.gc()
+wait_gc(1)
---
- true
...
-wait_xlog(0) or fio.listdir('./master')
+wait_xlog(0)
---
- true
...
@@ -244,34 +244,32 @@ fiber.sleep(0.1) -- wait for master to relay data
-- Garbage collection must not delete the old xlog file
-- because it is still needed by the replica, but remove
-- the old snapshot.
-wait_gc(1) or box.info.gc()
+wait_gc(1)
---
- true
...
-wait_xlog(2) or fio.listdir('./master')
+wait_xlog(2)
---
- true
...
-test_run:cmd("switch replica")
+-- Imitate the replica crash and, then, wake up.
+-- Just 'stop server replica' (SIGTERM) is not sufficient to stop
+-- a tarantool instance when ERRINJ_WAL_DELAY is set, because
+-- "tarantool" thread wait for paused "wal" thread infinitely.
+test_run:cmd("kill server replica")
---
- true
...
--- Unblock the replica and break replication.
-box.error.injection.set("ERRINJ_WAL_DELAY", false)
----
-- ok
-...
-box.cfg{replication = {}}
+test_run:cmd("start server replica")
---
+- true
...
--- Restart the replica to reestablish replication.
-test_run:cmd("restart server replica")
-- Wait for the replica to catch up.
test_run:cmd("switch replica")
---
- true
...
-test_run:wait_cond(function() return box.space.test:count() == 310 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 310 end, 10) or box.space.test:count()
---
- true
...
@@ -284,11 +282,11 @@ test_run:cmd("switch default")
- true
...
-- Now it's safe to drop the old xlog.
-wait_gc(1) or box.info.gc()
+wait_gc(1)
---
- true
...
-wait_xlog(1) or fio.listdir('./master')
+wait_xlog(1)
---
- true
...
@@ -320,11 +318,11 @@ box.snapshot()
---
- ok
...
-wait_gc(1) or box.info.gc()
+wait_gc(1)
---
- true
...
-wait_xlog(2) or fio.listdir('./master')
+wait_xlog(2)
---
- true
...
@@ -333,11 +331,11 @@ wait_xlog(2) or fio.listdir('./master')
test_run:cleanup_cluster()
---
...
-wait_gc(1) or box.info.gc()
+wait_gc(1)
---
- true
...
-wait_xlog(1) or fio.listdir('./master')
+wait_xlog(1)
---
- true
...
@@ -438,7 +436,7 @@ box.snapshot()
---
- ok
...
-wait_xlog(0, 10) or fio.listdir('./master')
+wait_xlog(0)
---
- true
...
diff --git a/test/replication/gc.test.lua b/test/replication/gc.test.lua
index 890fe29ae..46368d45e 100644
--- a/test/replication/gc.test.lua
+++ b/test/replication/gc.test.lua
@@ -15,12 +15,12 @@ test_run:cmd("setopt delimiter ';'")
function wait_gc(n)
return test_run:wait_cond(function()
return #box.info.gc().checkpoints == n
- end, 10)
+ end, 10) or box.info.gc()
end;
-function wait_xlog(n, timeout)
+function wait_xlog(n)
return test_run:wait_cond(function()
return #fio.glob('./master/*.xlog') == n
- end, 10)
+ end, 10) or fio.glob('./master/*.xlog')
end;
test_run:cmd("setopt delimiter ''");
@@ -63,14 +63,14 @@ test_run:cmd("start server replica")
-- bootstrapped from, the replica should still receive all
-- data from the master. Check it.
test_run:cmd("switch replica")
-test_run:wait_cond(function() return box.space.test:count() == 200 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 200 end, 10) or box.space.test:count()
box.space.test:count()
test_run:cmd("switch default")
-- Check that garbage collection removed the snapshot once
-- the replica released the corresponding checkpoint.
-wait_gc(1) or box.info.gc()
-wait_xlog(1) or fio.listdir('./master') -- Make sure the replica will not receive data until
+wait_gc(1)
+wait_xlog(1)
-- we test garbage collection.
box.error.injection.set("ERRINJ_RELAY_SEND_DELAY", true)
@@ -86,8 +86,8 @@ box.snapshot()
-- Invoke garbage collection. Check that it doesn't remove
-- xlogs needed by the replica.
box.snapshot()
-wait_gc(1) or box.info.gc()
-wait_xlog(2) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(2)
-- Resume replication so that the replica catches
-- up quickly.
@@ -95,14 +95,14 @@ box.error.injection.set("ERRINJ_RELAY_SEND_DELAY", false)
-- Check that the replica received all data from the master.
test_run:cmd("switch replica")
-test_run:wait_cond(function() return box.space.test:count() == 300 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 300 end, 10) or box.space.test:count()
box.space.test:count()
test_run:cmd("switch default")
-- Now garbage collection should resume and delete files left
-- from the old checkpoint.
-wait_gc(1) or box.info.gc()
-wait_xlog(0) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(0)
--
-- Check that the master doesn't delete xlog files sent to the
-- replica until it receives a confirmation that the data has
@@ -120,22 +120,22 @@ fiber.sleep(0.1) -- wait for master to relay data
-- Garbage collection must not delete the old xlog file
-- because it is still needed by the replica, but remove
-- the old snapshot.
-wait_gc(1) or box.info.gc()
-wait_xlog(2) or fio.listdir('./master')
-test_run:cmd("switch replica")
--- Unblock the replica and break replication.
-box.error.injection.set("ERRINJ_WAL_DELAY", false)
-box.cfg{replication = {}}
--- Restart the replica to reestablish replication.
-test_run:cmd("restart server replica")
+wait_gc(1)
+wait_xlog(2)
+-- Imitate the replica crash and, then, wake up.
+-- Just 'stop server replica' (SIGTERM) is not sufficient to stop
+-- a tarantool instance when ERRINJ_WAL_DELAY is set, because
+-- "tarantool" thread wait for paused "wal" thread infinitely.
+test_run:cmd("stop server replica with signal=SIGKILL")
+test_run:cmd("start server replica")
-- Wait for the replica to catch up.
test_run:cmd("switch replica")
-test_run:wait_cond(function() return box.space.test:count() == 310 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 310 end, 10) or box.space.test:count()
box.space.test:count()
test_run:cmd("switch default")
-- Now it's safe to drop the old xlog.
-wait_gc(1) or box.info.gc()
-wait_xlog(1) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(1)
-- Stop the replica.
test_run:cmd("stop server replica")
test_run:cmd("cleanup server replica")
@@ -149,14 +149,14 @@ _ = s:auto_increment{}
box.snapshot()
_ = s:auto_increment{}
box.snapshot()
-wait_gc(1) or box.info.gc()
-wait_xlog(2) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(2)
-- The xlog should only be deleted after the replica
-- is unregistered.
test_run:cleanup_cluster()
-wait_gc(1) or box.info.gc()
-wait_xlog(1) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(1)
--
-- Test that concurrent invocation of the garbage collector works fine.
--
@@ -201,7 +201,7 @@ wait_xlog(3) or fio.listdir('./master')
-- all xlog files are removed.
test_run:cleanup_cluster()
box.snapshot()
-wait_xlog(0, 10) or fio.listdir('./master')
+wait_xlog(0)
-- Restore the config.
box.cfg{replication = {}}
--
2.17.1
* [tarantool-patches] Re: [PATCH v2] Use SIGKILL to stop server replica
@ 2019-04-30 2:58 ` Alexander Turenko
From: Alexander Turenko @ 2019-04-30 2:58 UTC
To: avtikhon; +Cc: tarantool-patches
The change to the test case itself looks okay to me, but I have some
comments on the rest. See below.
WBR, Alexander Turenko.
On Mon, Apr 29, 2019 at 10:07:05PM +0300, avtikhon wrote:
> Used the signal option set to SIGKILL to stop server replica
> routine to be able to stop the replica imediately to imitate
Typo: imediately -> immediately.
> the replica crash and, then, wake up.
> Just 'stop server replica' (SIGTERM) is not sufficient to stop
> a tarantool instance when ERRINJ_WAL_DELAY is set, because
> "tarantool" thread wait for paused "wal" thread infinitely.
> Changed server stop routine to to kill routine to be able
Typo: to to -> to.
> to use SIGKILL instead of SIGTERM to the replica server. In
> this way the server replica will be killed immediately and
> *.xlog files will be removed as it has to be.
I would clarify that you solve two problems here: the incorrect test
case and the flaky failures. It is also worth mentioning the issue about
the flaky failures.
> Close #4162
And, I guess, #4034.
> Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-4162-stop-kill
> Issue: https://github.com/tarantool/tarantool/issues/4162
The patch contains many changes that are not related to the test case
itself. Please factor them out or, better, just drop them (if they don't
affect the behaviour of the test).
This will make the patch clearer for future readers. And, to be
honest, I don't think there is any real difference between 10
seconds and 60 seconds for our test cases: each of them should be quite
short even when the system is under heavy load. I'm not sure about a
swapping system, though; but it is unlikely that we can do anything about
that case in general by tweaking timeouts.
If these changes really matter, it is worth looking into the reason and
fixing it separately, so that we can be sure we don't just pollute the
git blame output.
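For readers unfamiliar with the helpers touched by those hunks, the pattern they follow (poll a condition until a timeout, then return either `true` or some diagnostic data) can be sketched in plain Python as below. This is an illustration only, not the test-run implementation; `wait_cond`, `wait_count`, `get_items` and `delay` are hypothetical names.

```python
import time


def wait_cond(cond, timeout=10.0, delay=0.01):
    """Poll cond() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if cond():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(delay)


def wait_count(get_items, n, timeout=10.0):
    """Return True once get_items() has n entries.

    On timeout, return the current items instead, so that a failing
    test shows what was actually there rather than a bare False.
    """
    return wait_cond(lambda: len(get_items()) == n, timeout) or get_items()
```

With such a helper the timeout only bounds how long a failing check takes; a passing check returns as soon as the condition holds.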
> -test_run:cmd("switch replica")
> +-- Imitate the replica crash and, then, wake up.
> +-- Just 'stop server replica' (SIGTERM) is not sufficient to stop
> +-- a tarantool instance when ERRINJ_WAL_DELAY is set, because
> +-- "tarantool" thread wait for paused "wal" thread infinitely.
> +test_run:cmd("kill server replica")
Please update this according to the test-run changes: stop server replica
with signal=SIGKILL (right?).
> -- Check that garbage collection removed the snapshot once
> -- the replica released the corresponding checkpoint.
> -wait_gc(1) or box.info.gc()
> -wait_xlog(1) or fio.listdir('./master') -- Make sure the replica will not receive data until
> +wait_gc(1)
> +wait_xlog(1)
> -- we test garbage collection.
Note that the comment used to be:
> -- Make sure the replica will not receive data until
> -- we test garbage collection.
Now only the second part remains :)