From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 8C47520459 for ; Fri, 19 Apr 2019 03:20:35 -0400 (EDT) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 3-w1rhEWNnpZ for ; Fri, 19 Apr 2019 03:20:35 -0400 (EDT) Received: from smtp59.i.mail.ru (smtp59.i.mail.ru [217.69.128.39]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id C124720084 for ; Fri, 19 Apr 2019 03:20:34 -0400 (EDT) Date: Fri, 19 Apr 2019 10:20:23 +0300 From: Alexander Turenko Subject: [tarantool-patches] Re: [PATCH v1] test: gc.test.lua needs to use default timeout Message-ID: <20190419072022.y3gap53snyfwh76c@tkn_work_nb> References: <6a22a07474b5a9a334273db4e9062d5d2d3ad13f.1555401138.git.avtikhon@tarantool.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <6a22a07474b5a9a334273db4e9062d5d2d3ad13f.1555401138.git.avtikhon@tarantool.org> Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-Help: List-Unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-Subscribe: List-Owner: List-post: List-Archive: To: avtikhon Cc: tarantool-patches@freelists.org Hi, 10 seconds should be enough for several context switches, several small writes and reads via a socket and to delete a file from a filesystem. Even with amount of processes that several times larger then CPUs count tarantool should obtain enough time quanta / memory bandwith / disk iops from schedulers. Intuitively I think that 10 seconds is already very large time for that and so I looked into the case deeper. The test case that fails is the following (it is the test code from the current master with markers): -- -- Check that the master doesn't delete xlog files sent to the -- replica until it receives a confirmation that the data has -- been applied (gh-2825). -- test_run:cmd("switch replica") -- Prevent the replica from applying any rows. box.error.injection.set("ERRINJ_WAL_DELAY", true) test_run:cmd("switch default") -- Generate some data on the master. for i = 1, 5 do s:auto_increment{} end box.snapshot() -- rotate xlog for i = 1, 5 do s:auto_increment{} end fiber.sleep(0.1) -- wait for master to relay data -- Garbage collection must not delete the old xlog file -- because it is still needed by the replica, but remove -- the old snapshot. wait_gc(1) or box.info.gc() wait_xlog(2) or fio.listdir('./master') test_run:cmd("switch replica") -- Unblock the replica and break replication. box.error.injection.set("ERRINJ_WAL_DELAY", false) box.cfg{replication = {}} -- Restart the replica to reestablish replication. test_run:cmd("restart server replica") -- Wait for the replica to catch up. test_run:cmd("switch replica") test_run:wait_cond(function() return box.space.test:count() == 310 end, 10) box.space.test:count() test_run:cmd("switch default") -- Now it's safe to drop the old xlog. wait_gc(1) or box.info.gc() wait_xlog(1) or fio.listdir('./master') -- Stop the replica. test_run:cmd("stop server replica") test_run:cmd("cleanup server replica") Actually we do two checks here: the first one the we have 1 snapshot and 2 xlogs and the second one that we have 1 snapshot and 1 xlogs (the first xlog from the two ones is removed after restarting the replica). Logs show that in the successful case (when the test passes) the master removes a first xlog before server restart. So the second xlog count check checks virtually nothing. The reason of that is that the replica sends ACKs to the master at when we unset ERRINJ_WAL_DELAY and before we drop replication at the next line. The master see that the replica reads all data from the first xlog and removes it. I guess that here we imitate the situation when a replica is bit before a master (doesn't receive most actual data yet) and crashed. Then we wake it up and everything should be good: the replica receives all data, the master removes unneeded xlogs. So I propose to fix the test case: diff --git a/test/replication/gc.test.lua b/test/replication/gc.test.lua index 890fe29ae..e0bd00fac 100644 --- a/test/replication/gc.test.lua +++ b/test/replication/gc.test.lua @@ -122,12 +122,15 @@ fiber.sleep(0.1) -- wait for master to relay data -- the old snapshot. wait_gc(1) or box.info.gc() wait_xlog(2) or fio.listdir('./master') -test_run:cmd("switch replica") --- Unblock the replica and break replication. -box.error.injection.set("ERRINJ_WAL_DELAY", false) -box.cfg{replication = {}} --- Restart the replica to reestablish replication. -test_run:cmd("restart server replica") +-- Imitate the replica crash and, then, wake up. +-- Just 'stop server replica' (SIGTERM) is not sufficient to stop +-- a tarantool instance when ERRINJ_WAL_DELAY is set, because +-- "tarantool" thread wait for paused "wal" thread infinitely. +-- But 'stop server replica' is needed to remove the intance from +-- test-run records. +test_run:eval('replica', [[ffi = require('ffi') ffi.cdef([=[ int getpid(void); int kill(int, int); ]=]) ffi.C.kill(ffi.C.getpid(), 9)]]) +test_run:cmd("stop server replica") +test_run:cmd("start server replica") -- Wait for the replica to catch up. test_run:cmd("switch replica") test_run:wait_cond(function() return box.space.test:count() == 310 end, 10) It is just for idea. I think that 'kill -9' within the test looks ugly and it is better to support 'signal=KILL' argument for 'stop server replica' test-run command. Surprisingly this change fixes the flaky fail (or almost fix, see below). However once I observed the following hang (but it seems to be rare): No output during 120 seconds. Will abort after 120 seconds without output. List of workers not reporting the status: - 010_replication [replication/gc.test.lua, memtx] at var/010_replication/gc.result:438 Test hung! Result content mismatch: --- replication/gc.result Fri Apr 19 06:27:55 2019 +++ var/010_replication/gc.result Fri Apr 19 06:44:12 2019 @@ -436,186 +436,3 @@ ... -- Stop the replica and write a few WALs. test_run:cmd("stop server replica") I guess that the main thread ("tarantool") waits for some other, maybe "wal", which stucks because of unknown reason, but I didn't check. (There is the issue #4127 that describes the same behaviour, so may be related.) So I see the following things need to be done here: * Support 'stop server replica with signal=KILL' command in test-run. * Use it in the test case to fix current flaky fails (also don't forget to ask a test case author about this change). * Reduce the case to reproduce current flaky fails and provide a reproducer (file an issue). * Elaborate how rare are hangs as above and check whether it connected somehow to some of known problem (or, maybe, it is a test problem?). Alexander, hope you'll catch these activities. WBR, Alexander Turenko. On Tue, Apr 16, 2019 at 11:01:22AM +0300, avtikhon wrote: > gc.test.lua test cleaned up to use default 60 secs timeout for > wait_cond routine instead of local 10 secs, because it doesn't > need to have its own special value. Also the diagnostic messages > added on wait_* routines fails. > > [029] --- replication/gc.result Mon Apr 15 14:58:09 2019 > [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019 > [029] @@ -290,7 +290,12 @@ > [029] ... > [029] wait_xlog(1) or fio.listdir('./master') > [029] --- > [048] replication/gc.test.lua vinyl [ fail ] > [048] > [048] Test failed! Result content mismatch: > [029] -- true > [029] +- - 00000000000000000305.vylog > [029] + - 00000000000000000305.xlog > [029] + - '512' > [029] + - 00000000000000000310.xlog > [029] + - 00000000000000000310.vylog > [029] + - 00000000000000000310.snap > [029] ... > [029] -- Stop the replica. > [029] test_run:cmd("stop server replica") > [029] @@ -326,7 +331,13 @@ > [029] ... > [029] wait_xlog(2) or fio.listdir('./master') > [029] --- > [029] -- true > [029] +- - 00000000000000000305.xlog > [029] + - 00000000000000000316.xlog > [029] + - 00000000000000000316.vylog > [029] + - '512' > [029] + - 00000000000000000310.xlog > [029] + - 00000000000000000317.vylog > [029] + - 00000000000000000317.snap > [029] ... > [029] -- The xlog should only be deleted after the replica > [029] -- is unregistered. > [029] > > Close #4162 > --- > > Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-4162-gc-default-timeout > Issue: https://github.com/tarantool/tarantool/issues/4162