[tarantool-patches] Re: [PATCH v1 2/6] Insert status calls inside wait_cond routines

Wed Apr 10 15:26:05 MSK 2019

In brief:

I factored out (and changed in some places) changes where I sure it is
right thing to do and pushed them on
Totktonada/test-replication-fix-flaky-fails. I'll send it to the mailing
list and push to master and 2.1 soon.

Filed https://github.com/tarantool/tarantool/issues/4129 re
sync.test.lua. It seems we cannot fix the test in some simple way.

Some more or less random comments are below.

WBR, Alexander Turenko.

> diff --git a/test/replication/errinj.test.lua b/test/replication/errinj.test.lua
> index e9965483e..6522df9d8 100644
> --- a/test/replication/errinj.test.lua
> +++ b/test/replication/errinj.test.lua
> @@ -129,8 +129,8 @@ test_run:cmd("switch replica")
>  -- wait for reconnect
>  while box.info.replication[1].upstream.status ~= 'follow' do fiber.sleep(0.0001) end
>  box.info.replication[1].upstream.status
> -box.info.replication[1].upstream.lag > 0
> -box.info.replication[1].upstream.lag < 1
> +test_run:wait_cond(function() return box.info.replication[1].upstream.lag > 0 end) or box.info.replication[1].upstream.lag
> +test_run:wait_cond(function() return box.info.replication[1].upstream.lag < 1 end) or box.info.replication[1].upstream.lag

I think that the correct fix would be increasing of replication_timeout
to 0.1 instead.

>  -- wait for ack timeout
>  while box.info.replication[1].upstream.message ~= 'timed out' do fiber.sleep(0.0001) end
>  
> @@ -203,9 +203,9 @@ test_run:cmd("switch replica_timeout")
>  
>  fiber = require('fiber')
>  while box.info.replication[1].upstream.status ~= 'follow' do fiber.sleep(0.0001) end
> -box.info.replication[1].upstream.status -- follow
> +test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end) or box.info.replication[1].upstream.status
>  for i = 0, 15 do fiber.sleep(0.01) if box.info.replication[1].upstream.status ~= 'follow' then break end end
> -box.info.replication[1].upstream.status -- follow
> +test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end) or box.info.replication[1].upstream.status

We do the same thing twice: with while and with wait_cond().

After `for i = 0, 15` we should check a status once I guess (it tests
that we was not dropped into 'stopped' status).

Anyway, I changed replication_timeout to 0.1 and it seems to fix all
problems with the test.

> diff --git a/test/replication/force_recovery.test.lua b/test/replication/force_recovery.test.lua
> index 9ac5dadf5..7669d7f22 100644
> --- a/test/replication/force_recovery.test.lua
> +++ b/test/replication/force_recovery.test.lua
> @@ -14,7 +14,7 @@ test_run:cmd("start server test")
>  
>  -- Stop the replica and wait for the relay thread to exit.
>  test_run:cmd("stop server test")
> -test_run:wait_cond(function() return box.info.replication[2].downstream.status == 'stopped' end)
> +test_run:wait_cond(function() return box.info.replication[2].downstream.status == 'stopped' end) or box.info.replication[2].downstream.status
>  
>  -- Delete an xlog file that is needed by the replica.
>  box.snapshot()
> @@ -30,7 +30,7 @@ box.cfg{force_recovery = true}
>  test_run:cmd("start server test")
>  test_run:cmd("switch test")
>  box.space.test:select()
> -box.info.replication[1].upstream.status == 'stopped' or box.info
> +test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'stopped' end) or box.info

It also seems that we wait for an upstream / downstream status quite
often across our tests. Maybe it worth to add wait_upstream() and
wait_downstream() functions to test-run? I filed the issue:
https://github.com/tarantool/test-run/issues/158

> diff --git a/test/replication/quorum.test.lua b/test/replication/quorum.test.lua
> index 777f8ef7c..d67bf0f73 100644
> --- a/test/replication/quorum.test.lua
> +++ b/test/replication/quorum.test.lua
> @@ -19,37 +19,37 @@ test_run:cmd('stop server quorum1')
>  test_run:cmd('switch quorum2')
>  
>  test_run:cmd('restart server quorum2 with args="0.1 0.5"')
> -box.info.status -- orphan
> +test_run:wait_cond(function() return box.info.status == 'orphan' end) or box.info.status
>  box.ctl.wait_rw(0.001) -- timeout
>  box.info.ro -- true
>  box.space.test:replace{100} -- error
>  box.cfg{replication={}}
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status
>  
>  test_run:cmd('restart server quorum2 with args="0.1 0.5"')
> -box.info.status -- orphan
> +test_run:wait_cond(function() return box.info.status == 'orphan' end) or box.info.status
>  box.ctl.wait_rw(0.001) -- timeout
>  box.info.ro -- true
>  box.space.test:replace{100} -- error
>  box.cfg{replication_connect_quorum = 2}
>  box.ctl.wait_rw()
>  box.info.ro -- false
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status
>  
>  test_run:cmd('restart server quorum2 with args="0.1 0.5"')
> -box.info.status -- orphan
> +test_run:wait_cond(function() return box.info.status == 'orphan' end) or box.info.status
>  box.ctl.wait_rw(0.001) -- timeout
>  box.info.ro -- true
>  box.space.test:replace{100} -- error
>  test_run:cmd('start server quorum1 with args="0.1 0.5"')
>  box.ctl.wait_rw()
>  box.info.ro -- false
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status

Don't sure it is correct to check that 'running' status WILL be in the
future. I guess the test checks that a status is 'running' right after
box.ctl.wait_rw().

> diff --git a/test/replication/sync.test.lua b/test/replication/sync.test.lua
> index 14dea9716..2d15aeec7 100644
> --- a/test/replication/sync.test.lua
> +++ b/test/replication/sync.test.lua
> @@ -59,6 +59,7 @@ test_run:cmd("switch default")
>  fill()
>  test_run:cmd("switch replica")
>  
> +-----------------------------------------------------------------------------------------------------

We usually fit comments into 66 symbols and usually don't use such
separators. Removed.

>  -- Resume replication.
>  --
>  -- Since max allowed lag is small, all records should arrive
> @@ -66,8 +67,8 @@ test_run:cmd("switch replica")
>  --
>  box.cfg{replication_sync_lag = 0.001}
>  box.cfg{replication = replication}
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status
>  box.space.test:count()
> -box.info.status -- running
>  box.info.ro -- false
>  
>  -- Stop replication.
> @@ -79,6 +80,7 @@ test_run:cmd("switch default")
>  fill()
>  test_run:cmd("switch replica")
>  
> +-----------------------------------------------------------------------------------------------------
>  -- Resume replication
>  --
>  -- Since max allowed lag is big, not all records will arrive
> @@ -87,12 +89,10 @@ test_run:cmd("switch replica")
>  --
>  box.cfg{replication_sync_lag = 1}
>  box.cfg{replication = replication}
> -box.space.test:count() < 400
> -box.info.status -- running
> -box.info.ro -- false
> +test_run:wait_cond(function() return box.space.test:count() == 400 or (box.space.test:count() < 400 and box.info.status == 'running' and box.info.ro) end) or box.info.status

Error: box.info.ro expected to be false.

In case when box.space.test:count() == 400 it test nothing. We discussed
it a bit with Vladimir and it seems that the right way to fix the test
is add an errinj that stops applier on a certain lsn and rewrite the
test from timeouts to usage of this errinj.

There is also another variant: just remove 'box.space.test:count() <
400' check. In this case the test sometimes will test nothing, but most
of time will do its work (because replication is slowed down). It is
certainly poor way, but maybe it worth to sporadially miss a test case
instead of sporadically fail due to unability of test things.

So I filed #4129.

>  -- Wait for remaining rows to arrive.
> -test_run:wait_cond(function() return box.space.test:count() == 600 end)
> +test_run:wait_cond(function() return box.space.test:count() == 600 end) or box.space.test:count()

Same here.

>  -- Make sure replica leaves oprhan state.
>  test_run:wait_cond(function() return box.info.status ~= 'orphan' end)
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status
>  box.info.ro -- false

If we leave from the orphan state, so should observe 'running'
immediately.

>  -- gh-3636: Check that replica set sync doesn't stop on cfg errors.
> @@ -151,10 +150,10 @@ test_run:cmd("switch replica")
>  replication = box.cfg.replication
>  box.cfg{replication = {}}
>  box.cfg{replication = replication}
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end)
>  box.info.ro -- false
> -box.info.replication[1].upstream.status -- follow
> -test_run:grep_log('replica', 'ER_CFG.*')
> +test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end) or box.info.replication[1].upstream.status
> +test_run:wait_log("replica", "ER_CFG.*", nil, 200)

I guess here we want to check a status right after leaving from
box.cfg(). But retrying logs grepping make sense.