From mboxrd@z Thu Jan 1 00:00:00 1970
From: avtikhon
Subject: [tarantool-patches] [PATCH v2] Use SIGKILL to stop server replica
Date: Mon, 29 Apr 2019 22:07:05 +0300
Reply-To: tarantool-patches@freelists.org
List-Id: tarantool-patches
To: Alexander Turenko
Cc: avtikhon, tarantool-patches@freelists.org

Use the signal option set to SIGKILL in the 'stop server replica'
command so that the replica stops immediately, imitating a replica
crash followed by a wake-up. A plain 'stop server replica' (SIGTERM)
is not sufficient to stop a tarantool instance when ERRINJ_WAL_DELAY
is set, because the "tarantool" thread waits indefinitely for the
paused "wal" thread.

The server stop routine is changed to the kill routine so that SIGKILL
is sent to the replica server instead of SIGTERM. This way the replica
is killed immediately and the *.xlog files are removed as they should
be.

[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029]  ...
[029]  wait_xlog(1) or fio.listdir('./master')
[029]  ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] +  - 00000000000000000305.xlog
[029] +  - '512'
[029] +  - 00000000000000000310.xlog
[029] +  - 00000000000000000310.vylog
[029] +  - 00000000000000000310.snap
[029]  ...
[029]  -- Stop the replica.
[029]  test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029]  ...
[029]  wait_xlog(2) or fio.listdir('./master')
[029]  ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] +  - 00000000000000000316.xlog
[029] +  - 00000000000000000316.vylog
[029] +  - '512'
[029] +  - 00000000000000000310.xlog
[029] +  - 00000000000000000317.vylog
[029] +  - 00000000000000000317.snap
[029]  ...
[029]  -- The xlog should only be deleted after the replica
[029]  -- is unregistered.
[029]

Close #4162
---
Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-4162-stop-kill
Issue: https://github.com/tarantool/tarantool/issues/4162

 test/replication/gc.result   | 58 +++++++++++++++++-------------------
 test/replication/gc.test.lua | 54 ++++++++++++++++-----------------
 2 files changed, 55 insertions(+), 57 deletions(-)

diff --git a/test/replication/gc.result b/test/replication/gc.result
index 65785f47b..8e1808078 100644
--- a/test/replication/gc.result
+++ b/test/replication/gc.result
@@ -34,14 +34,14 @@ test_run:cmd("setopt delimiter ';'")
 function wait_gc(n)
     return test_run:wait_cond(function()
         return #box.info.gc().checkpoints == n
-    end, 10)
+    end, 10) or box.info.gc()
 end;
 ---
 ...
-function wait_xlog(n, timeout)
+function wait_xlog(n)
     return test_run:wait_cond(function()
         return #fio.glob('./master/*.xlog') == n
-    end, 10)
+    end, 10) or fio.glob('./master/*.xlog')
 end;
 ---
 ...
@@ -117,7 +117,7 @@ test_run:cmd("switch replica")
 ---
 - true
 ...
-test_run:wait_cond(function() return box.space.test:count() == 200 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 200 end, 10) or box.space.test:count()
 ---
 - true
 ...
@@ -131,11 +131,11 @@ test_run:cmd("switch default")
 ...
 -- Check that garbage collection removed the snapshot once
 -- the replica released the corresponding checkpoint.
-wait_gc(1) or box.info.gc()
+wait_gc(1)
 ---
 - true
 ...
-wait_xlog(1) or fio.listdir('./master') -- Make sure the replica will not receive data until
+wait_xlog(1)
 ---
 - true
 ...
@@ -168,11 +168,11 @@ box.snapshot()
 ---
 - ok
 ...
-wait_gc(1) or box.info.gc()
+wait_gc(1)
 ---
 - true
 ...
-wait_xlog(2) or fio.listdir('./master')
+wait_xlog(2)
 ---
 - true
 ...
@@ -187,7 +187,7 @@ test_run:cmd("switch replica")
 ---
 - true
 ...
-test_run:wait_cond(function() return box.space.test:count() == 300 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 300 end, 10) or box.space.test:count()
 ---
 - true
 ...
@@ -201,11 +201,11 @@ test_run:cmd("switch default")
 ...
 -- Now garbage collection should resume and delete files left
 -- from the old checkpoint.
-wait_gc(1) or box.info.gc()
+wait_gc(1)
 ---
 - true
 ...
-wait_xlog(0) or fio.listdir('./master')
+wait_xlog(0)
 ---
 - true
 ...
@@ -244,34 +244,32 @@ fiber.sleep(0.1) -- wait for master to relay data
 -- Garbage collection must not delete the old xlog file
 -- because it is still needed by the replica, but remove
 -- the old snapshot.
-wait_gc(1) or box.info.gc()
+wait_gc(1)
 ---
 - true
 ...
-wait_xlog(2) or fio.listdir('./master')
+wait_xlog(2)
 ---
 - true
 ...
-test_run:cmd("switch replica")
+-- Imitate the replica crash and, then, wake up.
+-- Just 'stop server replica' (SIGTERM) is not sufficient to stop
+-- a tarantool instance when ERRINJ_WAL_DELAY is set, because
+-- "tarantool" thread wait for paused "wal" thread infinitely.
+test_run:cmd("kill server replica")
 ---
 - true
 ...
-box.error.injection.set("ERRINJ_WAL_DELAY", false) ---- -- ok -... -box.cfg{replication = {}} +test_run:cmd("start server replica") --- +- true ... --- Restart the replica to reestablish replication. -test_run:cmd("restart server replica") -- Wait for the replica to catch up. test_run:cmd("switch replica") --- - true ... -test_run:wait_cond(function() return box.space.test:count() == 310 end, 10) +test_run:wait_cond(function() return box.space.test:count() == 310 end, 10) or box.space.test:count() --- - true ... @@ -284,11 +282,11 @@ test_run:cmd("switch default") - true ... -- Now it's safe to drop the old xlog. -wait_gc(1) or box.info.gc() +wait_gc(1) --- - true ... -wait_xlog(1) or fio.listdir('./master') +wait_xlog(1) --- - true ... @@ -320,11 +318,11 @@ box.snapshot() --- - ok ... -wait_gc(1) or box.info.gc() +wait_gc(1) --- - true ... -wait_xlog(2) or fio.listdir('./master') +wait_xlog(2) --- - true ... @@ -333,11 +331,11 @@ wait_xlog(2) or fio.listdir('./master') test_run:cleanup_cluster() --- ... -wait_gc(1) or box.info.gc() +wait_gc(1) --- - true ... -wait_xlog(1) or fio.listdir('./master') +wait_xlog(1) --- - true ... @@ -438,7 +436,7 @@ box.snapshot() --- - ok ... -wait_xlog(0, 10) or fio.listdir('./master') +wait_xlog(0) --- - true ... 
diff --git a/test/replication/gc.test.lua b/test/replication/gc.test.lua
index 890fe29ae..46368d45e 100644
--- a/test/replication/gc.test.lua
+++ b/test/replication/gc.test.lua
@@ -15,12 +15,12 @@ test_run:cmd("setopt delimiter ';'")
 function wait_gc(n)
     return test_run:wait_cond(function()
         return #box.info.gc().checkpoints == n
-    end, 10)
+    end, 10) or box.info.gc()
 end;
-function wait_xlog(n, timeout)
+function wait_xlog(n)
     return test_run:wait_cond(function()
         return #fio.glob('./master/*.xlog') == n
-    end, 10)
+    end, 10) or fio.glob('./master/*.xlog')
 end;
 test_run:cmd("setopt delimiter ''");

@@ -63,14 +63,14 @@ test_run:cmd("start server replica")
 -- bootstrapped from, the replica should still receive all
 -- data from the master. Check it.
 test_run:cmd("switch replica")
-test_run:wait_cond(function() return box.space.test:count() == 200 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 200 end, 10) or box.space.test:count()
 box.space.test:count()
 test_run:cmd("switch default")

 -- Check that garbage collection removed the snapshot once
 -- the replica released the corresponding checkpoint.
-wait_gc(1) or box.info.gc()
-wait_xlog(1) or fio.listdir('./master') -- Make sure the replica will not receive data until
+wait_gc(1)
+wait_xlog(1)
 -- we test garbage collection.
 box.error.injection.set("ERRINJ_RELAY_SEND_DELAY", true)

@@ -86,8 +86,8 @@ box.snapshot()
 -- Invoke garbage collection. Check that it doesn't remove
 -- xlogs needed by the replica.
 box.snapshot()
-wait_gc(1) or box.info.gc()
-wait_xlog(2) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(2)

 -- Resume replication so that the replica catches
 -- up quickly.
@@ -95,14 +95,14 @@ box.error.injection.set("ERRINJ_RELAY_SEND_DELAY", false)

 -- Check that the replica received all data from the master.
 test_run:cmd("switch replica")
-test_run:wait_cond(function() return box.space.test:count() == 300 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 300 end, 10) or box.space.test:count()
 box.space.test:count()
 test_run:cmd("switch default")

 -- Now garbage collection should resume and delete files left
 -- from the old checkpoint.
-wait_gc(1) or box.info.gc()
-wait_xlog(0) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(0)
 --
 -- Check that the master doesn't delete xlog files sent to the
 -- replica until it receives a confirmation that the data has
@@ -120,22 +120,22 @@ fiber.sleep(0.1) -- wait for master to relay data
 -- Garbage collection must not delete the old xlog file
 -- because it is still needed by the replica, but remove
 -- the old snapshot.
-wait_gc(1) or box.info.gc()
-wait_xlog(2) or fio.listdir('./master')
-test_run:cmd("switch replica")
--- Unblock the replica and break replication.
-box.error.injection.set("ERRINJ_WAL_DELAY", false)
-box.cfg{replication = {}}
--- Restart the replica to reestablish replication.
-test_run:cmd("restart server replica")
+wait_gc(1)
+wait_xlog(2)
+-- Imitate the replica crash and, then, wake up.
+-- Just 'stop server replica' (SIGTERM) is not sufficient to stop
+-- a tarantool instance when ERRINJ_WAL_DELAY is set, because
+-- "tarantool" thread wait for paused "wal" thread infinitely.
+test_run:cmd("stop server replica with signal=SIGKILL")
+test_run:cmd("start server replica")
 -- Wait for the replica to catch up.
 test_run:cmd("switch replica")
-test_run:wait_cond(function() return box.space.test:count() == 310 end, 10)
+test_run:wait_cond(function() return box.space.test:count() == 310 end, 10) or box.space.test:count()
 box.space.test:count()
 test_run:cmd("switch default")
 -- Now it's safe to drop the old xlog.
-wait_gc(1) or box.info.gc()
-wait_xlog(1) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(1)
 -- Stop the replica.
 test_run:cmd("stop server replica")
 test_run:cmd("cleanup server replica")
@@ -149,14 +149,14 @@ _ = s:auto_increment{}
 box.snapshot()
 _ = s:auto_increment{}
 box.snapshot()
-wait_gc(1) or box.info.gc()
-wait_xlog(2) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(2)

 -- The xlog should only be deleted after the replica
 -- is unregistered.
 test_run:cleanup_cluster()
-wait_gc(1) or box.info.gc()
-wait_xlog(1) or fio.listdir('./master')
+wait_gc(1)
+wait_xlog(1)
 --
 -- Test that concurrent invocation of the garbage collector works fine.
 --
@@ -201,7 +201,7 @@ wait_xlog(3) or fio.listdir('./master')
 -- all xlog files are removed.
 test_run:cleanup_cluster()
 box.snapshot()
-wait_xlog(0, 10) or fio.listdir('./master')
+wait_xlog(0)

 -- Restore the config.
 box.cfg{replication = {}}
--
2.17.1
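
A note for readers of the patch: the "wait_cond(...) or <diagnostic>" idiom that the patch moves from the call sites into wait_gc() and wait_xlog() can be sketched in plain Lua, outside of test-run. This is only an illustration under stated assumptions: poll_until() and wait_count() are hypothetical names that do not exist in Tarantool or test-run, and poll_until() is a busy-waiting stand-in for test_run:wait_cond(), not its real implementation.

```lua
-- Hypothetical stand-in for test_run:wait_cond(): call cond() repeatedly
-- until it returns a truthy value or `timeout` seconds of CPU time pass.
-- It busy-waits for simplicity and always evaluates cond() at least once.
local function poll_until(cond, timeout)
    local deadline = os.clock() + timeout
    repeat
        if cond() then
            return true
        end
    until os.clock() >= deadline
    return false
end

-- Hypothetical helper mirroring the reworked wait_xlog(): return true when
-- list() yields exactly n items in time, otherwise return the list itself
-- so that the mismatch shows up in the test output.
local function wait_count(list, n, timeout)
    return poll_until(function()
        return #list() == n
    end, timeout) or list()
end

-- Usage with an in-memory list instead of fio.glob('./master/*.xlog').
local files = {'00000000000000000305.xlog', '00000000000000000310.xlog'}
local function list_files() return files end

local ok = wait_count(list_files, 2, 0.01)       -- condition holds
local leftover = wait_count(list_files, 0, 0.01) -- condition never holds
print(ok, type(leftover), #leftover)
```

Because the diagnostic value is produced inside the helper, call sites shrink to wait_xlog(1), and a wait that times out still reports what was actually found instead of a bare false.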