From: avtikhon <avtikhon@tarantool.org>
To: Alexander Turenko <alexander.turenko@tarantool.org>
Cc: avtikhon <avtikhon@tarantool.org>, tarantool-patches@freelists.org
Subject: [tarantool-patches] [PATCH v3] Use SIGKILL to stop server replica
Date: Wed, 1 May 2019 21:15:59 +0300
Message-ID: <5e615fda2058511403a5520840f9fe8621fd1db1.1556734535.git.avtikhon@tarantool.org>

Pass the signal option set to SIGKILL to the 'stop server replica'
routine so that the replica stops immediately, imitating a replica
crash followed by a wake-up.

The problem arises when ERRINJ_WAL_DELAY is set for a tarantool
instance and the instance is then stopped. By default SIGTERM is
used, which is not sufficient here, because after the signal the main
thread still waits for the stuck WAL thread. In that case the replica
finished reading the *.xlog file, but the master server did not know
about it and kept the previous *.xlog file for the replica's restart.

With the signal changed from the default to 9 (SIGKILL), the replica
has no chance to read all the data from the *.xlog file, because it
is killed immediately. So after the replica restarts, the previous
*.xlog file is removed once it has been read.

An attempt to change the replication logic ran into new issues, so
the fix suggested in commit:
  b5b4809cf2e6d48230eb9e4301eac188b080e0f4
was reverted in commit:
  766cd3e1015f6f76460a748c37212fb4c8791500

[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029]  ...
[029]  wait_xlog(1) or fio.listdir('./master')
[029]  ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] +  - 00000000000000000305.xlog
[029] +  - '512'
[029] +  - 00000000000000000310.xlog
[029] +  - 00000000000000000310.vylog
[029] +  - 00000000000000000310.snap
[029]  ...
[029]  -- Stop the replica.
[029]  test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029]  ...
[029]  wait_xlog(2) or fio.listdir('./master')
[029]  ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] +  - 00000000000000000316.xlog
[029] +  - 00000000000000000316.vylog
[029] +  - '512'
[029] +  - 00000000000000000310.xlog
[029] +  - 00000000000000000317.vylog
[029] +  - 00000000000000000317.snap
[029]  ...
[029]  -- The xlog should only be deleted after the replica
[029]  -- is unregistered.
[029]

Close #4162
---

Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-4162-stop-kill
Issue: https://github.com/tarantool/tarantool/issues/4162

 test/replication/gc.result   | 16 +++++++---------
 test/replication/gc.test.lua | 12 ++++++------
 2 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/test/replication/gc.result b/test/replication/gc.result
index 65785f47b..396e7ff10 100644
--- a/test/replication/gc.result
+++ b/test/replication/gc.result
@@ -252,20 +252,18 @@ wait_xlog(2) or fio.listdir('./master')
 ---
 - true
 ...
-test_run:cmd("switch replica")
+-- Imitate the replica crash and, then, wake up.
+-- Just 'stop server replica' (SIGTERM) is not sufficient to stop
+-- a tarantool instance when ERRINJ_WAL_DELAY is set, because
+-- "tarantool" thread wait for paused "wal" thread infinitely.
+test_run:cmd("stop server replica with signal=9")
 ---
 - true
 ...
--- Unblock the replica and break replication.
-box.error.injection.set("ERRINJ_WAL_DELAY", false)
----
-- ok
-...
-box.cfg{replication = {}}
+test_run:cmd("start server replica")
 ---
+- true
 ...
--- Restart the replica to reestablish replication.
-test_run:cmd("restart server replica")
 -- Wait for the replica to catch up.
 test_run:cmd("switch replica")
 ---
diff --git a/test/replication/gc.test.lua b/test/replication/gc.test.lua
index 890fe29ae..1ebf32cc8 100644
--- a/test/replication/gc.test.lua
+++ b/test/replication/gc.test.lua
@@ -122,12 +122,12 @@ fiber.sleep(0.1) -- wait for master to relay data
 -- the old snapshot.
 wait_gc(1) or box.info.gc()
 wait_xlog(2) or fio.listdir('./master')
-test_run:cmd("switch replica")
--- Unblock the replica and break replication.
-box.error.injection.set("ERRINJ_WAL_DELAY", false)
-box.cfg{replication = {}}
--- Restart the replica to reestablish replication.
-test_run:cmd("restart server replica")
+-- Imitate the replica crash and, then, wake up.
+-- Just 'stop server replica' (SIGTERM) is not sufficient to stop
+-- a tarantool instance when ERRINJ_WAL_DELAY is set, because
+-- "tarantool" thread wait for paused "wal" thread infinitely.
+test_run:cmd("stop server replica with signal=9")
+test_run:cmd("start server replica")
 -- Wait for the replica to catch up.
 test_run:cmd("switch replica")
 test_run:wait_cond(function() return box.space.test:count() == 310 end, 10)
--
2.17.1
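
The commit message's reasoning (SIGTERM is not enough when the main thread waits for a stuck thread, while SIGKILL stops the process at once) can be illustrated with a small standalone sketch. This is not Tarantool or test-run code: the child script, `CHILD`, `on_sigterm`, and `stop_child` are hypothetical names made up for the illustration, and the sketch assumes a Unix-like system. The child's SIGTERM handler tries a graceful shutdown by joining a worker that never finishes, loosely mirroring the main thread waiting for the paused WAL thread.

```python
import signal
import subprocess
import sys
import textwrap

# Hypothetical child program: its SIGTERM handler waits for a worker
# thread that is blocked forever, so graceful shutdown never completes.
CHILD = textwrap.dedent("""
    import signal, sys, threading

    release = threading.Event()  # never set, so the worker stays paused
    worker = threading.Thread(target=release.wait, daemon=True)
    worker.start()

    def on_sigterm(signum, frame):
        worker.join()            # blocks: the worker never finishes
        sys.exit(0)

    signal.signal(signal.SIGTERM, on_sigterm)
    print("ready", flush=True)
    while True:
        signal.pause()
""")


def stop_child(sig, timeout=1.0):
    """Start the child, send it `sig`, and return its exit status.

    Returns None if the child is still running after `timeout` seconds.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    )
    try:
        proc.stdout.readline()   # wait until the handler is installed
        proc.send_signal(sig)
        try:
            return proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            return None
    finally:
        if proc.poll() is None:  # clean up a child that refused to stop
            proc.kill()
        proc.wait()
        proc.stdout.close()


if __name__ == "__main__":
    print("SIGTERM:", stop_child(signal.SIGTERM))
    print("SIGKILL:", stop_child(signal.SIGKILL))
```

With SIGTERM the child is still alive when the timeout expires, so `stop_child` returns `None`; SIGKILL cannot be caught, so the child terminates immediately and the returned status is the negated signal number.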