From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 943112C800 for ; Thu, 25 Apr 2019 07:25:41 -0400 (EDT) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id rQiLjDZZZoXX for ; Thu, 25 Apr 2019 07:25:41 -0400 (EDT) Received: from smtp45.i.mail.ru (smtp45.i.mail.ru [94.100.177.105]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 303F32A433 for ; Thu, 25 Apr 2019 07:25:41 -0400 (EDT) From: avtikhon Subject: [tarantool-patches] [PATCH v3] Use server kill routine instead of server stop Date: Thu, 25 Apr 2019 14:25:37 +0300 Message-Id: Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-Help: List-Unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-Subscribe: List-Owner: List-post: List-Archive: To: Alexander Turenko Cc: avtikhon , tarantool-patches@freelists.org Imitate the replica crash and, then, wake up. Just 'stop server replica' (SIGTERM) is not sufficient to stop a tarantool instance when ERRINJ_WAL_DELAY is set, because "tarantool" thread wait for paused "wal" thread infinitely. Changed server stop routine to to kill routine to be able to use SIGKILL instead of SIGTERM to the replica server. In this way the server replica will be killed immediately and *.xlog files will be removed as it has to be. [029] --- replication/gc.result Mon Apr 15 14:58:09 2019 [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019 [029] @@ -290,7 +290,12 @@ [029] ... [029] wait_xlog(1) or fio.listdir('./master') [029] --- [048] replication/gc.test.lua vinyl [ fail ] [048] [048] Test failed! Result content mismatch: [029] -- true [029] +- - 00000000000000000305.vylog [029] + - 00000000000000000305.xlog [029] + - '512' [029] + - 00000000000000000310.xlog [029] + - 00000000000000000310.vylog [029] + - 00000000000000000310.snap [029] ... [029] -- Stop the replica. [029] test_run:cmd("stop server replica") [029] @@ -326,7 +331,13 @@ [029] ... [029] wait_xlog(2) or fio.listdir('./master') [029] --- [029] -- true [029] +- - 00000000000000000305.xlog [029] + - 00000000000000000316.xlog [029] + - 00000000000000000316.vylog [029] + - '512' [029] + - 00000000000000000310.xlog [029] + - 00000000000000000317.vylog [029] + - 00000000000000000317.snap [029] ... [029] -- The xlog should only be deleted after the replica [029] -- is unregistered. [029] Close #4162 --- Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-4162-stop-kill Issue: https://github.com/tarantool/tarantool/issues/4162 test/replication/gc.result | 58 +++++++++++++++++------------------- test/replication/gc.test.lua | 54 ++++++++++++++++----------------- 2 files changed, 55 insertions(+), 57 deletions(-) diff --git a/test/replication/gc.result b/test/replication/gc.result index 65785f47b..8e1808078 100644 --- a/test/replication/gc.result +++ b/test/replication/gc.result @@ -34,14 +34,14 @@ test_run:cmd("setopt delimiter ';'") function wait_gc(n) return test_run:wait_cond(function() return #box.info.gc().checkpoints == n - end, 10) + end, 10) or box.info.gc() end; --- ... -function wait_xlog(n, timeout) +function wait_xlog(n) return test_run:wait_cond(function() return #fio.glob('./master/*.xlog') == n - end, 10) + end, 10) or fio.glob('./master/*.xlog') end; --- ... @@ -117,7 +117,7 @@ test_run:cmd("switch replica") --- - true ... -test_run:wait_cond(function() return box.space.test:count() == 200 end, 10) +test_run:wait_cond(function() return box.space.test:count() == 200 end, 10) or box.space.test:count() --- - true ... @@ -131,11 +131,11 @@ test_run:cmd("switch default") ... -- Check that garbage collection removed the snapshot once -- the replica released the corresponding checkpoint. -wait_gc(1) or box.info.gc() +wait_gc(1) --- - true ... -wait_xlog(1) or fio.listdir('./master') -- Make sure the replica will not receive data until +wait_xlog(1) --- - true ... @@ -168,11 +168,11 @@ box.snapshot() --- - ok ... -wait_gc(1) or box.info.gc() +wait_gc(1) --- - true ... -wait_xlog(2) or fio.listdir('./master') +wait_xlog(2) --- - true ... @@ -187,7 +187,7 @@ test_run:cmd("switch replica") --- - true ... -test_run:wait_cond(function() return box.space.test:count() == 300 end, 10) +test_run:wait_cond(function() return box.space.test:count() == 300 end, 10) or box.space.test:count() --- - true ... @@ -201,11 +201,11 @@ test_run:cmd("switch default") ... -- Now garbage collection should resume and delete files left -- from the old checkpoint. -wait_gc(1) or box.info.gc() +wait_gc(1) --- - true ... -wait_xlog(0) or fio.listdir('./master') +wait_xlog(0) --- - true ... @@ -244,34 +244,32 @@ fiber.sleep(0.1) -- wait for master to relay data -- Garbage collection must not delete the old xlog file -- because it is still needed by the replica, but remove -- the old snapshot. -wait_gc(1) or box.info.gc() +wait_gc(1) --- - true ... -wait_xlog(2) or fio.listdir('./master') +wait_xlog(2) --- - true ... -test_run:cmd("switch replica") +-- Imitate the replica crash and, then, wake up. +-- Just 'stop server replica' (SIGTERM) is not sufficient to stop +-- a tarantool instance when ERRINJ_WAL_DELAY is set, because +-- "tarantool" thread wait for paused "wal" thread infinitely. +test_run:cmd("kill server replica") --- - true ... --- Unblock the replica and break replication. -box.error.injection.set("ERRINJ_WAL_DELAY", false) ---- -- ok -... -box.cfg{replication = {}} +test_run:cmd("start server replica") --- +- true ... --- Restart the replica to reestablish replication. -test_run:cmd("restart server replica") -- Wait for the replica to catch up. test_run:cmd("switch replica") --- - true ... -test_run:wait_cond(function() return box.space.test:count() == 310 end, 10) +test_run:wait_cond(function() return box.space.test:count() == 310 end, 10) or box.space.test:count() --- - true ... @@ -284,11 +282,11 @@ test_run:cmd("switch default") - true ... -- Now it's safe to drop the old xlog. -wait_gc(1) or box.info.gc() +wait_gc(1) --- - true ... -wait_xlog(1) or fio.listdir('./master') +wait_xlog(1) --- - true ... @@ -320,11 +318,11 @@ box.snapshot() --- - ok ... -wait_gc(1) or box.info.gc() +wait_gc(1) --- - true ... -wait_xlog(2) or fio.listdir('./master') +wait_xlog(2) --- - true ... @@ -333,11 +331,11 @@ wait_xlog(2) or fio.listdir('./master') test_run:cleanup_cluster() --- ... -wait_gc(1) or box.info.gc() +wait_gc(1) --- - true ... -wait_xlog(1) or fio.listdir('./master') +wait_xlog(1) --- - true ... @@ -438,7 +436,7 @@ box.snapshot() --- - ok ... -wait_xlog(0, 10) or fio.listdir('./master') +wait_xlog(0) --- - true ... diff --git a/test/replication/gc.test.lua b/test/replication/gc.test.lua index 890fe29ae..165b245cc 100644 --- a/test/replication/gc.test.lua +++ b/test/replication/gc.test.lua @@ -15,12 +15,12 @@ test_run:cmd("setopt delimiter ';'") function wait_gc(n) return test_run:wait_cond(function() return #box.info.gc().checkpoints == n - end, 10) + end, 10) or box.info.gc() end; -function wait_xlog(n, timeout) +function wait_xlog(n) return test_run:wait_cond(function() return #fio.glob('./master/*.xlog') == n - end, 10) + end, 10) or fio.glob('./master/*.xlog') end; test_run:cmd("setopt delimiter ''"); @@ -63,14 +63,14 @@ test_run:cmd("start server replica") -- bootstrapped from, the replica should still receive all -- data from the master. Check it. test_run:cmd("switch replica") -test_run:wait_cond(function() return box.space.test:count() == 200 end, 10) +test_run:wait_cond(function() return box.space.test:count() == 200 end, 10) or box.space.test:count() box.space.test:count() test_run:cmd("switch default") -- Check that garbage collection removed the snapshot once -- the replica released the corresponding checkpoint. -wait_gc(1) or box.info.gc() -wait_xlog(1) or fio.listdir('./master') -- Make sure the replica will not receive data until +wait_gc(1) +wait_xlog(1) -- we test garbage collection. box.error.injection.set("ERRINJ_RELAY_SEND_DELAY", true) @@ -86,8 +86,8 @@ box.snapshot() -- Invoke garbage collection. Check that it doesn't remove -- xlogs needed by the replica. box.snapshot() -wait_gc(1) or box.info.gc() -wait_xlog(2) or fio.listdir('./master') +wait_gc(1) +wait_xlog(2) -- Resume replication so that the replica catches -- up quickly. @@ -95,14 +95,14 @@ box.error.injection.set("ERRINJ_RELAY_SEND_DELAY", false) -- Check that the replica received all data from the master. test_run:cmd("switch replica") -test_run:wait_cond(function() return box.space.test:count() == 300 end, 10) +test_run:wait_cond(function() return box.space.test:count() == 300 end, 10) or box.space.test:count() box.space.test:count() test_run:cmd("switch default") -- Now garbage collection should resume and delete files left -- from the old checkpoint. -wait_gc(1) or box.info.gc() -wait_xlog(0) or fio.listdir('./master') +wait_gc(1) +wait_xlog(0) -- -- Check that the master doesn't delete xlog files sent to the -- replica until it receives a confirmation that the data has @@ -120,22 +120,22 @@ fiber.sleep(0.1) -- wait for master to relay data -- Garbage collection must not delete the old xlog file -- because it is still needed by the replica, but remove -- the old snapshot. -wait_gc(1) or box.info.gc() -wait_xlog(2) or fio.listdir('./master') -test_run:cmd("switch replica") --- Unblock the replica and break replication. -box.error.injection.set("ERRINJ_WAL_DELAY", false) -box.cfg{replication = {}} --- Restart the replica to reestablish replication. -test_run:cmd("restart server replica") +wait_gc(1) +wait_xlog(2) +-- Imitate the replica crash and, then, wake up. +-- Just 'stop server replica' (SIGTERM) is not sufficient to stop +-- a tarantool instance when ERRINJ_WAL_DELAY is set, because +-- "tarantool" thread wait for paused "wal" thread infinitely. +test_run:cmd("kill server replica") +test_run:cmd("start server replica") -- Wait for the replica to catch up. test_run:cmd("switch replica") -test_run:wait_cond(function() return box.space.test:count() == 310 end, 10) +test_run:wait_cond(function() return box.space.test:count() == 310 end, 10) or box.space.test:count() box.space.test:count() test_run:cmd("switch default") -- Now it's safe to drop the old xlog. -wait_gc(1) or box.info.gc() -wait_xlog(1) or fio.listdir('./master') +wait_gc(1) +wait_xlog(1) -- Stop the replica. test_run:cmd("stop server replica") test_run:cmd("cleanup server replica") @@ -149,14 +149,14 @@ _ = s:auto_increment{} box.snapshot() _ = s:auto_increment{} box.snapshot() -wait_gc(1) or box.info.gc() -wait_xlog(2) or fio.listdir('./master') +wait_gc(1) +wait_xlog(2) -- The xlog should only be deleted after the replica -- is unregistered. test_run:cleanup_cluster() -wait_gc(1) or box.info.gc() -wait_xlog(1) or fio.listdir('./master') +wait_gc(1) +wait_xlog(1) -- -- Test that concurrent invocation of the garbage collector works fine. -- @@ -201,7 +201,7 @@ wait_xlog(3) or fio.listdir('./master') -- all xlog files are removed. test_run:cleanup_cluster() box.snapshot() -wait_xlog(0, 10) or fio.listdir('./master') +wait_xlog(0) -- Restore the config. box.cfg{replication = {}} -- 2.17.1