Tarantool development patches archive
From: Alexander Turenko <alexander.turenko@tarantool.org>
To: avtikhon <avtikhon@tarantool.org>
Cc: avtikhon <avtikhon@gmail.com>, tarantool-patches@freelists.org
Subject: [tarantool-patches] Re: [PATCH v1 2/6] Insert status calls inside wait_cond routines
Date: Wed, 10 Apr 2019 15:26:05 +0300	[thread overview]
Message-ID: <20190410122605.7yhmmns7xabs4vo3@tkn_work_nb> (raw)
In-Reply-To: <3c1ddecd66058ad9df1164062828d278924b7f46.1554127720.git.avtikhon@gmail.com>

In brief:

I factored out (and changed in some places) the changes where I am sure
they are the right thing to do and pushed them to
Totktonada/test-replication-fix-flaky-fails. I'll send the branch to the
mailing list and push it to master and 2.1 soon.

Filed https://github.com/tarantool/tarantool/issues/4129 regarding
sync.test.lua. It seems we cannot fix that test in any simple way.

Some more or less random comments are below.

WBR, Alexander Turenko.

> diff --git a/test/replication/errinj.test.lua b/test/replication/errinj.test.lua
> index e9965483e..6522df9d8 100644
> --- a/test/replication/errinj.test.lua
> +++ b/test/replication/errinj.test.lua
> @@ -129,8 +129,8 @@ test_run:cmd("switch replica")
>  -- wait for reconnect
>  while box.info.replication[1].upstream.status ~= 'follow' do fiber.sleep(0.0001) end
>  box.info.replication[1].upstream.status
> -box.info.replication[1].upstream.lag > 0
> -box.info.replication[1].upstream.lag < 1
> +test_run:wait_cond(function() return box.info.replication[1].upstream.lag > 0 end) or box.info.replication[1].upstream.lag
> +test_run:wait_cond(function() return box.info.replication[1].upstream.lag < 1 end) or box.info.replication[1].upstream.lag

I think the correct fix here would be to increase replication_timeout
to 0.1 instead.
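
As a sketch of what I mean (the exact value is a judgment call; 0.1 just
gives the applier enough slack on a loaded CI machine):

```lua
-- In the replica's box.cfg (or the instance file used by the test):
-- a larger replication_timeout makes the lag checks above much less
-- sensitive to scheduling hiccups, so the plain assertions can stay.
box.cfg{replication_timeout = 0.1}
```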

>  -- wait for ack timeout
>  while box.info.replication[1].upstream.message ~= 'timed out' do fiber.sleep(0.0001) end
>  
> @@ -203,9 +203,9 @@ test_run:cmd("switch replica_timeout")
>  
>  fiber = require('fiber')
>  while box.info.replication[1].upstream.status ~= 'follow' do fiber.sleep(0.0001) end
> -box.info.replication[1].upstream.status -- follow
> +test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end) or box.info.replication[1].upstream.status
>  for i = 0, 15 do fiber.sleep(0.01) if box.info.replication[1].upstream.status ~= 'follow' then break end end
> -box.info.replication[1].upstream.status -- follow
> +test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end) or box.info.replication[1].upstream.status

We do the same thing twice: once with the while loop and once with
wait_cond().

After `for i = 0, 15` I guess we should check the status just once (it
verifies that we were not dropped into the 'stopped' status).

Anyway, I changed replication_timeout to 0.1 and it seems to fix all the
problems with the test.
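
In other words, something like this (a sketch of the shape I have in
mind, not the exact diff I pushed):

```lua
-- Wait once for the 'follow' state (instead of the duplicated while loop).
test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end) or box.info.replication[1].upstream.status
-- After the pause a single plain check is enough: it verifies that we
-- did not fall into the 'stopped' status meanwhile.
for i = 0, 15 do fiber.sleep(0.01) if box.info.replication[1].upstream.status ~= 'follow' then break end end
box.info.replication[1].upstream.status -- follow
```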

> diff --git a/test/replication/force_recovery.test.lua b/test/replication/force_recovery.test.lua
> index 9ac5dadf5..7669d7f22 100644
> --- a/test/replication/force_recovery.test.lua
> +++ b/test/replication/force_recovery.test.lua
> @@ -14,7 +14,7 @@ test_run:cmd("start server test")
>  
>  -- Stop the replica and wait for the relay thread to exit.
>  test_run:cmd("stop server test")
> -test_run:wait_cond(function() return box.info.replication[2].downstream.status == 'stopped' end)
> +test_run:wait_cond(function() return box.info.replication[2].downstream.status == 'stopped' end) or box.info.replication[2].downstream.status
>  
>  -- Delete an xlog file that is needed by the replica.
>  box.snapshot()
> @@ -30,7 +30,7 @@ box.cfg{force_recovery = true}
>  test_run:cmd("start server test")
>  test_run:cmd("switch test")
>  box.space.test:select()
> -box.info.replication[1].upstream.status == 'stopped' or box.info
> +test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'stopped' end) or box.info

It also seems that we wait for an upstream / downstream status quite
often across our tests. Maybe it is worth adding wait_upstream() and
wait_downstream() functions to test-run? I filed the issue:
https://github.com/tarantool/test-run/issues/158
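
A possible shape for such a helper (hypothetical — nothing like this
exists in test-run yet; the name and signature are just a strawman for
the issue):

```lua
-- Strawman helper on top of the existing wait_cond(); 'self' is the
-- test_run object. Returns true on success, or the last observed
-- status so that a failure shows up in the .result diff.
local function wait_upstream(self, id, status)
    return self:wait_cond(function()
        local upstream = box.info.replication[id].upstream
        return upstream ~= nil and upstream.status == status
    end) or box.info.replication[id].upstream.status
end
```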

> diff --git a/test/replication/quorum.test.lua b/test/replication/quorum.test.lua
> index 777f8ef7c..d67bf0f73 100644
> --- a/test/replication/quorum.test.lua
> +++ b/test/replication/quorum.test.lua
> @@ -19,37 +19,37 @@ test_run:cmd('stop server quorum1')
>  test_run:cmd('switch quorum2')
>  
>  test_run:cmd('restart server quorum2 with args="0.1 0.5"')
> -box.info.status -- orphan
> +test_run:wait_cond(function() return box.info.status == 'orphan' end) or box.info.status
>  box.ctl.wait_rw(0.001) -- timeout
>  box.info.ro -- true
>  box.space.test:replace{100} -- error
>  box.cfg{replication={}}
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status
>  
>  test_run:cmd('restart server quorum2 with args="0.1 0.5"')
> -box.info.status -- orphan
> +test_run:wait_cond(function() return box.info.status == 'orphan' end) or box.info.status
>  box.ctl.wait_rw(0.001) -- timeout
>  box.info.ro -- true
>  box.space.test:replace{100} -- error
>  box.cfg{replication_connect_quorum = 2}
>  box.ctl.wait_rw()
>  box.info.ro -- false
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status
>  
>  test_run:cmd('restart server quorum2 with args="0.1 0.5"')
> -box.info.status -- orphan
> +test_run:wait_cond(function() return box.info.status == 'orphan' end) or box.info.status
>  box.ctl.wait_rw(0.001) -- timeout
>  box.info.ro -- true
>  box.space.test:replace{100} -- error
>  test_run:cmd('start server quorum1 with args="0.1 0.5"')
>  box.ctl.wait_rw()
>  box.info.ro -- false
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status

I'm not sure it is correct to check that the status WILL become
'running' at some point in the future. I guess the test checks that the
status is 'running' right after box.ctl.wait_rw().

> diff --git a/test/replication/sync.test.lua b/test/replication/sync.test.lua
> index 14dea9716..2d15aeec7 100644
> --- a/test/replication/sync.test.lua
> +++ b/test/replication/sync.test.lua
> @@ -59,6 +59,7 @@ test_run:cmd("switch default")
>  fill()
>  test_run:cmd("switch replica")
>  
> +-----------------------------------------------------------------------------------------------------

We usually fit comments into 66 columns and don't usually use such
separators. Removed.

>  -- Resume replication.
>  --
>  -- Since max allowed lag is small, all records should arrive
> @@ -66,8 +67,8 @@ test_run:cmd("switch replica")
>  --
>  box.cfg{replication_sync_lag = 0.001}
>  box.cfg{replication = replication}
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status
>  box.space.test:count()
> -box.info.status -- running
>  box.info.ro -- false
>  
>  -- Stop replication.
> @@ -79,6 +80,7 @@ test_run:cmd("switch default")
>  fill()
>  test_run:cmd("switch replica")
>  
> +-----------------------------------------------------------------------------------------------------
>  -- Resume replication
>  --
>  -- Since max allowed lag is big, not all records will arrive
> @@ -87,12 +89,10 @@ test_run:cmd("switch replica")
>  --
>  box.cfg{replication_sync_lag = 1}
>  box.cfg{replication = replication}
> -box.space.test:count() < 400
> -box.info.status -- running
> -box.info.ro -- false
> +test_run:wait_cond(function() return box.space.test:count() == 400 or (box.space.test:count() < 400 and box.info.status == 'running' and box.info.ro) end) or box.info.status

Error: box.info.ro expected to be false.

In the case when box.space.test:count() == 400 it tests nothing. We
discussed it a bit with Vladimir, and it seems the right way to fix the
test is to add an errinj that stops the applier on a certain LSN and
rewrite the test to use this errinj instead of timeouts.

There is also another variant: just remove the 'box.space.test:count() <
400' check. In this case the test will sometimes test nothing, but most
of the time it will do its work (because replication is slowed down). It
is certainly a poor way, but maybe it is worth sporadically missing a
test case instead of sporadically failing due to the inability to test
things.
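
For the record, the second variant would boil down to something like
this (a sketch; only the checks that are stable regardless of how far
the sync has progressed are kept):

```lua
-- Drop the racy 'count() < 400' assertion: with a big
-- replication_sync_lag we only verify that sync completes and the
-- replica becomes writable.
box.cfg{replication_sync_lag = 1}
box.cfg{replication = replication}
test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status
box.info.ro -- false
```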

So I filed #4129.

>  -- Wait for remaining rows to arrive.
> -test_run:wait_cond(function() return box.space.test:count() == 600 end)
> +test_run:wait_cond(function() return box.space.test:count() == 600 end) or box.space.test:count()

Same here.

>  -- Make sure replica leaves oprhan state.
>  test_run:wait_cond(function() return box.info.status ~= 'orphan' end)
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status
>  box.info.ro -- false

If we have left the orphan state, we should observe the 'running' status
immediately.

>  -- gh-3636: Check that replica set sync doesn't stop on cfg errors.
> @@ -151,10 +150,10 @@ test_run:cmd("switch replica")
>  replication = box.cfg.replication
>  box.cfg{replication = {}}
>  box.cfg{replication = replication}
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end)
>  box.info.ro -- false
> -box.info.replication[1].upstream.status -- follow
> -test_run:grep_log('replica', 'ER_CFG.*')
> +test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end) or box.info.replication[1].upstream.status
> +test_run:wait_log("replica", "ER_CFG.*", nil, 200)

I guess here we want to check the status right after returning from
box.cfg(). But retrying the log grepping makes sense.
