From: "Alexander V. Tikhonov" <avtikhon@tarantool.org> To: Kirill Yukhin <kyukhin@tarantool.org> Cc: tarantool-patches@dev.tarantool.org Subject: [Tarantool-patches] [PATCH v1 11/12] test: fix replication/gc flaky failures Date: Tue, 26 Nov 2019 09:21:47 +0300 [thread overview] Message-ID: <c57bc14e42efda8e1ed2a7dc26e982e9e515ad04.1574749278.git.avtikhon@tarantool.org> (raw) In-Reply-To: <1c42ad20160f47d942cab405ce9896d6d31cc05f.1574749278.git.avtikhon@tarantool.org> In-Reply-To: <1c42ad20160f47d942cab405ce9896d6d31cc05f.1574749278.git.avtikhon@tarantool.org> From: avtikhon <avtikhon@tarantool.org> Two problems are fixed here. The first one is about correctness of the test case. The second is about flaky failures. About correctness. The test case contains the following lines: | test_run:cmd("switch replica") | -- Unblock the replica and break replication. | box.error.injection.set("ERRINJ_WAL_DELAY", false) | box.cfg{replication = {}} Usually rows are applied and the new vclock is sent to the master before replication will be disabled. So the master removes old xlog before the replica restart and the next case tests nothing. This commit uses the new test-run's ability to stop a tarantool instance with a custom signal and stops the replica with SIGKILL w/o dropping ERRINJ_WAL_DELAY. This change fixes the race between applying rows and disabling replication and so makes the test case correct. About flaky failures. They were look like so: | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019 | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019 | [029] @@ -290,7 +290,12 @@ | [029] ... | [029] wait_xlog(1) or fio.listdir('./master') | [029] --- | [029] -- true | [029] +- - 00000000000000000305.vylog | [029] + - 00000000000000000305.xlog | [029] + - '512' | [029] + - 00000000000000000310.xlog | [029] + - 00000000000000000310.vylog | [029] + - 00000000000000000310.snap | [029] ... | [029] -- Stop the replica. | [029] test_run:cmd("stop server replica") | <...next cases could have induced mismathes too...> The reason of the fail is that a replica applied all rows from the old xlog, but didn't sent an ACK with a new vclock to a master, because the replication was disabled before that. The master stops relay and keeps the old xlog. When the replica starts again it subscribes with the vclock value that instructs a relay to open the new xlog. Tarantool can remove an old xlog just after a replica's ACK when observes that the xlog was fully read by all replicas. But tarantool does not remove xlogs when a replica is subscribed. This is not a big problem, because such 'stuck' xlog file will be removed with a next xlog removal. There was the attempt to fix this behaviour and remove old xlogs at subscribe, see the following commits: * b5b4809cf2e6d48230eb9e4301eac188b080e0f4 ('replication: update replica gc state on subscribe'); * 766cd3e1015f6f76460a748c37212fb4c8791500 ('Revert "replication: update replica gc state on subscribe"'). Anyway, this commit fixes this flaky failures, because stops the replica before applying rows from the old xlog. So when the replica starts it continues reading from the old xlog and the xlog file will be removed when will be fully read. Closes #4162 (cherry picked from commit 35b5095ab0f437df6e78b1bacf4d41c3737a540e) --- test/replication/gc.result | 16 +++++++--------- test/replication/gc.test.lua | 16 ++++++++-------- 2 files changed, 15 insertions(+), 17 deletions(-) diff --git a/test/replication/gc.result b/test/replication/gc.result index cbdeffb11..5d55403b0 100644 --- a/test/replication/gc.result +++ b/test/replication/gc.result @@ -236,20 +236,18 @@ fiber.sleep(0.1) -- wait for master to relay data --- - true ... -test_run:cmd("switch replica") +-- Imitate the replica crash and, then, wake up. +-- Just 'stop server replica' (SIGTERM) is not sufficient to stop +-- a tarantool instance when ERRINJ_WAL_DELAY is set, because +-- "tarantool" thread wait for paused "wal" thread infinitely. +test_run:cmd("stop server replica with signal=KILL") --- - true ... --- Unblock the replica and break replication. -box.error.injection.set("ERRINJ_WAL_DELAY", false) ---- -- ok -... -box.cfg{replication = {}} +test_run:cmd("start server replica") --- +- true ... --- Restart the replica to reestablish replication. -test_run:cmd("restart server replica") -- Wait for the replica to catch up. test_run:cmd("switch replica") --- diff --git a/test/replication/gc.test.lua b/test/replication/gc.test.lua index 4710fd9e3..40a349167 100644 --- a/test/replication/gc.test.lua +++ b/test/replication/gc.test.lua @@ -110,14 +110,14 @@ fiber.sleep(0.1) -- wait for master to relay data -- Garbage collection must not delete the old xlog file -- because it is still needed by the replica, but remove -- the old snapshot. -#box.info.gc().checkpoints == 1 or box.info.gc() -#fio.glob('./master/*.xlog') == 2 or fio.listdir('./master') -test_run:cmd("switch replica") --- Unblock the replica and break replication. -box.error.injection.set("ERRINJ_WAL_DELAY", false) -box.cfg{replication = {}} --- Restart the replica to reestablish replication. -test_run:cmd("restart server replica") +wait_gc(1) or box.info.gc() +wait_xlog(2) or fio.listdir('./master') +-- Imitate the replica crash and, then, wake up. +-- Just 'stop server replica' (SIGTERM) is not sufficient to stop +-- a tarantool instance when ERRINJ_WAL_DELAY is set, because +-- "tarantool" thread wait for paused "wal" thread infinitely. +test_run:cmd("stop server replica with signal=KILL") +test_run:cmd("start server replica") -- Wait for the replica to catch up. test_run:cmd("switch replica") test_run:wait_cond(function() return box.space.test:count() == 310 end, 10) -- 2.17.1
next prev parent reply other threads:[~2019-11-26 6:22 UTC|newest] Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top 2019-11-26 6:21 [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests Alexander V. Tikhonov 2019-11-26 6:21 ` [Tarantool-patches] [PATCH v1 02/12] test: enable parallel run for long test suites Alexander V. Tikhonov 2019-11-26 6:21 ` [Tarantool-patches] [PATCH v1 03/12] test: replication parallel mode on Alexander V. Tikhonov 2019-11-26 6:21 ` [Tarantool-patches] [PATCH v1 04/12] test: enable cleaning of a test environment Alexander V. Tikhonov 2019-11-26 6:21 ` [Tarantool-patches] [PATCH v1 05/12] test: allow to run replication/misc multiple times Alexander V. Tikhonov 2019-11-26 6:21 ` [Tarantool-patches] [PATCH v1 06/12] test: increase timeouts in replication/errinj Alexander V. Tikhonov 2019-11-26 6:21 ` [Tarantool-patches] [PATCH v1 07/12] test: wait for xlog/snap/log file changes Alexander V. Tikhonov 2019-11-26 6:21 ` [Tarantool-patches] [PATCH v1 08/12] test: use wait_cond to check follow status Alexander V. Tikhonov 2019-11-26 6:21 ` [Tarantool-patches] [PATCH v1 09/12] test: increase timeouts in replication/misc Alexander V. Tikhonov 2019-11-26 6:21 ` [Tarantool-patches] [PATCH v1 10/12] test: put require in proper places Alexander V. Tikhonov 2019-11-26 6:21 ` Alexander V. Tikhonov [this message] 2019-11-26 6:21 ` [Tarantool-patches] [PATCH v1 12/12] test: errinj for pause relay_send Alexander V. Tikhonov 2019-11-26 6:54 ` [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests Kirill Yukhin
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=c57bc14e42efda8e1ed2a7dc26e982e9e515ad04.1574749278.git.avtikhon@tarantool.org \ --to=avtikhon@tarantool.org \ --cc=kyukhin@tarantool.org \ --cc=tarantool-patches@dev.tarantool.org \ --subject='Re: [Tarantool-patches] [PATCH v1 11/12] test: fix replication/gc flaky failures' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox