From: Alexander Turenko <alexander.turenko@tarantool.org> To: avtikhon <avtikhon@tarantool.org> Cc: tarantool-patches@freelists.org Subject: [tarantool-patches] Re: [PATCH v1] test: gc.test.lua needs to use default timeout Date: Fri, 19 Apr 2019 10:20:23 +0300 [thread overview] Message-ID: <20190419072022.y3gap53snyfwh76c@tkn_work_nb> (raw) In-Reply-To: <6a22a07474b5a9a334273db4e9062d5d2d3ad13f.1555401138.git.avtikhon@tarantool.org> Hi, 10 seconds should be enough for several context switches, several small writes and reads via a socket and to delete a file from a filesystem. Even with amount of processes that several times larger then CPUs count tarantool should obtain enough time quanta / memory bandwith / disk iops from schedulers. Intuitively I think that 10 seconds is already very large time for that and so I looked into the case deeper. The test case that fails is the following (it is the test code from the current master with markers): -- -- Check that the master doesn't delete xlog files sent to the -- replica until it receives a confirmation that the data has -- been applied (gh-2825). -- test_run:cmd("switch replica") -- Prevent the replica from applying any rows. box.error.injection.set("ERRINJ_WAL_DELAY", true) test_run:cmd("switch default") -- Generate some data on the master. for i = 1, 5 do s:auto_increment{} end box.snapshot() -- rotate xlog for i = 1, 5 do s:auto_increment{} end fiber.sleep(0.1) -- wait for master to relay data -- Garbage collection must not delete the old xlog file -- because it is still needed by the replica, but remove -- the old snapshot. wait_gc(1) or box.info.gc() wait_xlog(2) or fio.listdir('./master') test_run:cmd("switch replica") -- Unblock the replica and break replication. box.error.injection.set("ERRINJ_WAL_DELAY", false) <mark 1> box.cfg{replication = {}} -- Restart the replica to reestablish replication. test_run:cmd("restart server replica") -- Wait for the replica to catch up. test_run:cmd("switch replica") test_run:wait_cond(function() return box.space.test:count() == 310 end, 10) box.space.test:count() test_run:cmd("switch default") -- Now it's safe to drop the old xlog. wait_gc(1) or box.info.gc() wait_xlog(1) or fio.listdir('./master') -- Stop the replica. test_run:cmd("stop server replica") test_run:cmd("cleanup server replica") Actually we do two checks here: the first one the we have 1 snapshot and 2 xlogs and the second one that we have 1 snapshot and 1 xlogs (the first xlog from the two ones is removed after restarting the replica). Logs show that in the successful case (when the test passes) the master removes a first xlog before server restart. So the second xlog count check checks virtually nothing. The reason of that is that the replica sends ACKs to the master at <mark 1> when we unset ERRINJ_WAL_DELAY and before we drop replication at the next line. The master see that the replica reads all data from the first xlog and removes it. I guess that here we imitate the situation when a replica is bit before a master (doesn't receive most actual data yet) and crashed. Then we wake it up and everything should be good: the replica receives all data, the master removes unneeded xlogs. So I propose to fix the test case: diff --git a/test/replication/gc.test.lua b/test/replication/gc.test.lua index 890fe29ae..e0bd00fac 100644 --- a/test/replication/gc.test.lua +++ b/test/replication/gc.test.lua @@ -122,12 +122,15 @@ fiber.sleep(0.1) -- wait for master to relay data -- the old snapshot. wait_gc(1) or box.info.gc() wait_xlog(2) or fio.listdir('./master') -test_run:cmd("switch replica") --- Unblock the replica and break replication. -box.error.injection.set("ERRINJ_WAL_DELAY", false) -box.cfg{replication = {}} --- Restart the replica to reestablish replication. -test_run:cmd("restart server replica") +-- Imitate the replica crash and, then, wake up. +-- Just 'stop server replica' (SIGTERM) is not sufficient to stop +-- a tarantool instance when ERRINJ_WAL_DELAY is set, because +-- "tarantool" thread wait for paused "wal" thread infinitely. +-- But 'stop server replica' is needed to remove the intance from +-- test-run records. +test_run:eval('replica', [[ffi = require('ffi') ffi.cdef([=[ int getpid(void); int kill(int, int); ]=]) ffi.C.kill(ffi.C.getpid(), 9)]]) +test_run:cmd("stop server replica") +test_run:cmd("start server replica") -- Wait for the replica to catch up. test_run:cmd("switch replica") test_run:wait_cond(function() return box.space.test:count() == 310 end, 10) It is just for idea. I think that 'kill -9' within the test looks ugly and it is better to support 'signal=KILL' argument for 'stop server replica' test-run command. Surprisingly this change fixes the flaky fail (or almost fix, see below). However once I observed the following hang (but it seems to be rare): No output during 120 seconds. Will abort after 120 seconds without output. List of workers not reporting the status: - 010_replication [replication/gc.test.lua, memtx] at var/010_replication/gc.result:438 Test hung! Result content mismatch: --- replication/gc.result Fri Apr 19 06:27:55 2019 +++ var/010_replication/gc.result Fri Apr 19 06:44:12 2019 @@ -436,186 +436,3 @@ ... -- Stop the replica and write a few WALs. test_run:cmd("stop server replica") I guess that the main thread ("tarantool") waits for some other, maybe "wal", which stucks because of unknown reason, but I didn't check. (There is the issue #4127 that describes the same behaviour, so may be related.) So I see the following things need to be done here: * Support 'stop server replica with signal=KILL' command in test-run. * Use it in the test case to fix current flaky fails (also don't forget to ask a test case author about this change). * Reduce the case to reproduce current flaky fails and provide a reproducer (file an issue). * Elaborate how rare are hangs as above and check whether it connected somehow to some of known problem (or, maybe, it is a test problem?). Alexander, hope you'll catch these activities. WBR, Alexander Turenko. On Tue, Apr 16, 2019 at 11:01:22AM +0300, avtikhon wrote: > gc.test.lua test cleaned up to use default 60 secs timeout for > wait_cond routine instead of local 10 secs, because it doesn't > need to have its own special value. Also the diagnostic messages > added on wait_* routines fails. > > [029] --- replication/gc.result Mon Apr 15 14:58:09 2019 > [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019 > [029] @@ -290,7 +290,12 @@ > [029] ... > [029] wait_xlog(1) or fio.listdir('./master') > [029] --- > [048] replication/gc.test.lua vinyl [ fail ] > [048] > [048] Test failed! Result content mismatch: > [029] -- true > [029] +- - 00000000000000000305.vylog > [029] + - 00000000000000000305.xlog > [029] + - '512' > [029] + - 00000000000000000310.xlog > [029] + - 00000000000000000310.vylog > [029] + - 00000000000000000310.snap > [029] ... > [029] -- Stop the replica. > [029] test_run:cmd("stop server replica") > [029] @@ -326,7 +331,13 @@ > [029] ... > [029] wait_xlog(2) or fio.listdir('./master') > [029] --- > [029] -- true > [029] +- - 00000000000000000305.xlog > [029] + - 00000000000000000316.xlog > [029] + - 00000000000000000316.vylog > [029] + - '512' > [029] + - 00000000000000000310.xlog > [029] + - 00000000000000000317.vylog > [029] + - 00000000000000000317.snap > [029] ... > [029] -- The xlog should only be deleted after the replica > [029] -- is unregistered. > [029] > > Close #4162 > --- > > Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-4162-gc-default-timeout > Issue: https://github.com/tarantool/tarantool/issues/4162
prev parent reply other threads:[~2019-04-19 7:20 UTC|newest] Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top 2019-04-16 8:01 [tarantool-patches] " avtikhon 2019-04-19 7:20 ` Alexander Turenko [this message]
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20190419072022.y3gap53snyfwh76c@tkn_work_nb \ --to=alexander.turenko@tarantool.org \ --cc=avtikhon@tarantool.org \ --cc=tarantool-patches@freelists.org \ --subject='[tarantool-patches] Re: [PATCH v1] test: gc.test.lua needs to use default timeout' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox