[tarantool-patches] Re: [PATCH v1] test: gc.test.lua needs to use default timeout

Alexander Turenko alexander.turenko at tarantool.org
Fri Apr 19 10:20:23 MSK 2019


Hi,

10 seconds should be enough for several context switches, several small
writes and reads via a socket and to delete a file from a filesystem.
Even with amount of processes that several times larger then CPUs count
tarantool should obtain enough time quanta / memory bandwith / disk iops
from schedulers.

Intuitively I think that 10 seconds is already very large time for that
and so I looked into the case deeper.

The test case that fails is the following (it is the test code from the
current master with markers):

--
-- Check that the master doesn't delete xlog files sent to the
-- replica until it receives a confirmation that the data has
-- been applied (gh-2825).
--
test_run:cmd("switch replica")
-- Prevent the replica from applying any rows.
box.error.injection.set("ERRINJ_WAL_DELAY", true)
test_run:cmd("switch default")
-- Generate some data on the master.
for i = 1, 5 do s:auto_increment{} end
box.snapshot() -- rotate xlog
for i = 1, 5 do s:auto_increment{} end
fiber.sleep(0.1) -- wait for master to relay data
-- Garbage collection must not delete the old xlog file
-- because it is still needed by the replica, but remove
-- the old snapshot.
wait_gc(1) or box.info.gc()
wait_xlog(2) or fio.listdir('./master')
test_run:cmd("switch replica")
-- Unblock the replica and break replication.
box.error.injection.set("ERRINJ_WAL_DELAY", false)
<mark 1>
box.cfg{replication = {}}
-- Restart the replica to reestablish replication.
test_run:cmd("restart server replica")
-- Wait for the replica to catch up.
test_run:cmd("switch replica")
test_run:wait_cond(function() return box.space.test:count() == 310 end, 10)
box.space.test:count()
test_run:cmd("switch default")
-- Now it's safe to drop the old xlog.
wait_gc(1) or box.info.gc()
wait_xlog(1) or fio.listdir('./master')
-- Stop the replica.
test_run:cmd("stop server replica")
test_run:cmd("cleanup server replica")

Actually we do two checks here: the first one the we have 1 snapshot and
2 xlogs and the second one that we have 1 snapshot and 1 xlogs (the
first xlog from the two ones is removed after restarting the replica).

Logs show that in the successful case (when the test passes) the master
removes a first xlog before server restart. So the second xlog count
check checks virtually nothing.

The reason of that is that the replica sends ACKs to the master at <mark
1> when we unset ERRINJ_WAL_DELAY and before we drop replication at the
next line. The master see that the replica reads all data from the
first xlog and removes it.

I guess that here we imitate the situation when a replica is bit before
a master (doesn't receive most actual data yet) and crashed. Then we
wake it up and everything should be good: the replica receives all data,
the master removes unneeded xlogs.

So I propose to fix the test case:

diff --git a/test/replication/gc.test.lua b/test/replication/gc.test.lua
index 890fe29ae..e0bd00fac 100644
--- a/test/replication/gc.test.lua
+++ b/test/replication/gc.test.lua
@@ -122,12 +122,15 @@ fiber.sleep(0.1) -- wait for master to relay data
 -- the old snapshot.
 wait_gc(1) or box.info.gc()
 wait_xlog(2) or fio.listdir('./master')
-test_run:cmd("switch replica")
--- Unblock the replica and break replication.
-box.error.injection.set("ERRINJ_WAL_DELAY", false)
-box.cfg{replication = {}}
--- Restart the replica to reestablish replication.
-test_run:cmd("restart server replica")
+-- Imitate the replica crash and, then, wake up.
+-- Just 'stop server replica' (SIGTERM) is not sufficient to stop
+-- a tarantool instance when ERRINJ_WAL_DELAY is set, because
+-- "tarantool" thread wait for paused "wal" thread infinitely.
+-- But 'stop server replica' is needed to remove the intance from
+-- test-run records.
+test_run:eval('replica', [[ffi = require('ffi') ffi.cdef([=[ int getpid(void); int kill(int, int); ]=]) ffi.C.kill(ffi.C.getpid(), 9)]])
+test_run:cmd("stop server replica")
+test_run:cmd("start server replica")
 -- Wait for the replica to catch up.
 test_run:cmd("switch replica")
 test_run:wait_cond(function() return box.space.test:count() == 310 end, 10)

It is just for idea. I think that 'kill -9' within the test looks ugly
and it is better to support 'signal=KILL' argument for 'stop server
replica' test-run command.

Surprisingly this change fixes the flaky fail (or almost fix, see below).

However once I observed the following hang (but it seems to be rare):

No output during 120 seconds. Will abort after 120 seconds without output. List of workers not reporting the status:
- 010_replication [replication/gc.test.lua, memtx] at var/010_replication/gc.result:438
Test hung! Result content mismatch:
--- replication/gc.result	Fri Apr 19 06:27:55 2019
+++ var/010_replication/gc.result	Fri Apr 19 06:44:12 2019
@@ -436,186 +436,3 @@
 ...
 -- Stop the replica and write a few WALs.
 test_run:cmd("stop server replica")

I guess that the main thread ("tarantool") waits for some other, maybe
"wal", which stucks because of unknown reason, but I didn't check.
(There is the issue #4127 that describes the same behaviour, so may be
related.)

So I see the following things need to be done here:

* Support 'stop server replica with signal=KILL' command in test-run.
* Use it in the test case to fix current flaky fails (also don't forget
  to ask a test case author about this change).
* Reduce the case to reproduce current flaky fails and provide a
  reproducer (file an issue).
* Elaborate how rare are hangs as above and check whether it connected
  somehow to some of known problem (or, maybe, it is a test problem?).

Alexander, hope you'll catch these activities.

WBR, Alexander Turenko.

On Tue, Apr 16, 2019 at 11:01:22AM +0300, avtikhon wrote:
> gc.test.lua test cleaned up to use default 60 secs timeout for
> wait_cond routine instead of local 10 secs, because it doesn't
> need to have its own special value. Also the diagnostic messages
> added on wait_* routines fails.
> 
> [029] --- replication/gc.result Mon Apr 15 14:58:09 2019
> [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
> [029] @@ -290,7 +290,12 @@
> [029] ...
> [029] wait_xlog(1) or fio.listdir('./master')
> [029] ---
> [048] replication/gc.test.lua vinyl [ fail ]
> [048]
> [048] Test failed! Result content mismatch:
> [029] -- true
> [029] +- - 00000000000000000305.vylog
> [029] + - 00000000000000000305.xlog
> [029] + - '512'
> [029] + - 00000000000000000310.xlog
> [029] + - 00000000000000000310.vylog
> [029] + - 00000000000000000310.snap
> [029] ...
> [029] -- Stop the replica.
> [029] test_run:cmd("stop server replica")
> [029] @@ -326,7 +331,13 @@
> [029] ...
> [029] wait_xlog(2) or fio.listdir('./master')
> [029] ---
> [029] -- true
> [029] +- - 00000000000000000305.xlog
> [029] + - 00000000000000000316.xlog
> [029] + - 00000000000000000316.vylog
> [029] + - '512'
> [029] + - 00000000000000000310.xlog
> [029] + - 00000000000000000317.vylog
> [029] + - 00000000000000000317.snap
> [029] ...
> [029] -- The xlog should only be deleted after the replica
> [029] -- is unregistered.
> [029]
> 
> Close #4162
> ---
> 
> Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-4162-gc-default-timeout
> Issue: https://github.com/tarantool/tarantool/issues/4162




More information about the Tarantool-patches mailing list