From: Alexander Turenko <alexander.turenko@tarantool.org>
To: avtikhon <avtikhon@tarantool.org>
Cc: avtikhon <avtikhon@gmail.com>, tarantool-patches@freelists.org
Subject: [tarantool-patches] Re: [PATCH v1 2/6] Insert status calls inside wait_cond routines
Date: Wed, 10 Apr 2019 15:26:05 +0300
Message-ID: <20190410122605.7yhmmns7xabs4vo3@tkn_work_nb>
In-Reply-To: <3c1ddecd66058ad9df1164062828d278924b7f46.1554127720.git.avtikhon@gmail.com>

In brief: I factored out (and changed in some places) the changes where I am sure they are the right thing to do and pushed them to the Totktonada/test-replication-fix-flaky-fails branch. I'll send it to the mailing list and push it to master and 2.1 soon.

I filed https://github.com/tarantool/tarantool/issues/4129 regarding sync.test.lua. It seems we cannot fix that test in any simple way.

Some more or less random comments are below.

WBR, Alexander Turenko.

> diff --git a/test/replication/errinj.test.lua b/test/replication/errinj.test.lua
> index e9965483e..6522df9d8 100644
> --- a/test/replication/errinj.test.lua
> +++ b/test/replication/errinj.test.lua
> @@ -129,8 +129,8 @@ test_run:cmd("switch replica")
>  -- wait for reconnect
>  while box.info.replication[1].upstream.status ~= 'follow' do fiber.sleep(0.0001) end
>  box.info.replication[1].upstream.status
> -box.info.replication[1].upstream.lag > 0
> -box.info.replication[1].upstream.lag < 1
> +test_run:wait_cond(function() return box.info.replication[1].upstream.lag > 0 end) or box.info.replication[1].upstream.lag
> +test_run:wait_cond(function() return box.info.replication[1].upstream.lag < 1 end) or box.info.replication[1].upstream.lag

I think that the correct fix here would be to increase replication_timeout to 0.1 instead (see the sketch at the end of this mail).

>  -- wait for ack timeout
>  while box.info.replication[1].upstream.message ~= 'timed out' do fiber.sleep(0.0001) end
>
> @@ -203,9 +203,9 @@ test_run:cmd("switch replica_timeout")
>
>  fiber = require('fiber')
>  while box.info.replication[1].upstream.status ~= 'follow' do fiber.sleep(0.0001) end
> -box.info.replication[1].upstream.status -- follow
> +test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end) or box.info.replication[1].upstream.status
>  for i = 0, 15 do fiber.sleep(0.01) if box.info.replication[1].upstream.status ~= 'follow' then break end end
> -box.info.replication[1].upstream.status -- follow
> +test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end) or box.info.replication[1].upstream.status

We do the same thing twice here: with `while` and with wait_cond(). After the `for i = 0, 15` loop we should check the status just once, I guess (it tests that we were not dropped into the 'stopped' status). Anyway, I changed replication_timeout to 0.1 and it seems to fix all the problems with the test.

> diff --git a/test/replication/force_recovery.test.lua b/test/replication/force_recovery.test.lua
> index 9ac5dadf5..7669d7f22 100644
> --- a/test/replication/force_recovery.test.lua
> +++ b/test/replication/force_recovery.test.lua
> @@ -14,7 +14,7 @@ test_run:cmd("start server test")
>
>  -- Stop the replica and wait for the relay thread to exit.
>  test_run:cmd("stop server test")
> -test_run:wait_cond(function() return box.info.replication[2].downstream.status == 'stopped' end)
> +test_run:wait_cond(function() return box.info.replication[2].downstream.status == 'stopped' end) or box.info.replication[2].downstream.status
>
>  -- Delete an xlog file that is needed by the replica.
>  box.snapshot()
> @@ -30,7 +30,7 @@ box.cfg{force_recovery = true}
>  test_run:cmd("start server test")
>  test_run:cmd("switch test")
>  box.space.test:select()
> -box.info.replication[1].upstream.status == 'stopped' or box.info
> +test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'stopped' end) or box.info

It also seems that we wait for an upstream / downstream status quite often across our tests. Maybe it is worth adding wait_upstream() and wait_downstream() functions to test-run? I filed an issue: https://github.com/tarantool/test-run/issues/158 (a rough sketch is at the end of this mail).

> diff --git a/test/replication/quorum.test.lua b/test/replication/quorum.test.lua
> index 777f8ef7c..d67bf0f73 100644
> --- a/test/replication/quorum.test.lua
> +++ b/test/replication/quorum.test.lua
> @@ -19,37 +19,37 @@ test_run:cmd('stop server quorum1')
>  test_run:cmd('switch quorum2')
>
>  test_run:cmd('restart server quorum2 with args="0.1 0.5"')
> -box.info.status -- orphan
> +test_run:wait_cond(function() return box.info.status == 'orphan' end) or box.info.status
>  box.ctl.wait_rw(0.001) -- timeout
>  box.info.ro -- true
>  box.space.test:replace{100} -- error
>  box.cfg{replication={}}
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status
>
>  test_run:cmd('restart server quorum2 with args="0.1 0.5"')
> -box.info.status -- orphan
> +test_run:wait_cond(function() return box.info.status == 'orphan' end) or box.info.status
>  box.ctl.wait_rw(0.001) -- timeout
>  box.info.ro -- true
>  box.space.test:replace{100} -- error
>  box.cfg{replication_connect_quorum = 2}
>  box.ctl.wait_rw()
>  box.info.ro -- false
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status
>
>  test_run:cmd('restart server quorum2 with args="0.1 0.5"')
> -box.info.status -- orphan
> +test_run:wait_cond(function() return box.info.status == 'orphan' end) or box.info.status
>  box.ctl.wait_rw(0.001) -- timeout
>  box.info.ro -- true
>  box.space.test:replace{100} -- error
>  test_run:cmd('start server quorum1 with args="0.1 0.5"')
>  box.ctl.wait_rw()
>  box.info.ro -- false
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status

I'm not sure it is correct to check that the 'running' status WILL appear at some point in the future. I guess the test checks that the status is 'running' right after box.ctl.wait_rw().

> diff --git a/test/replication/sync.test.lua b/test/replication/sync.test.lua
> index 14dea9716..2d15aeec7 100644
> --- a/test/replication/sync.test.lua
> +++ b/test/replication/sync.test.lua
> @@ -59,6 +59,7 @@ test_run:cmd("switch default")
>  fill()
>  test_run:cmd("switch replica")
>
> +-----------------------------------------------------------------------------------------------------

We usually fit comments into 66 symbols and usually don't use such separators. Removed.

>  -- Resume replication.
>  --
>  -- Since max allowed lag is small, all records should arrive
>  --
>  box.cfg{replication_sync_lag = 0.001}
>  box.cfg{replication = replication}
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status
>  box.space.test:count()
> -box.info.status -- running
>  box.info.ro -- false
>
>  -- Stop replication.
> @@ -79,6 +80,7 @@ test_run:cmd("switch default")
>  fill()
>  test_run:cmd("switch replica")
>
> +-----------------------------------------------------------------------------------------------------
>  -- Resume replication
>  --
>  -- Since max allowed lag is big, not all records will arrive
>  --
> @@ -87,12 +89,10 @@ test_run:cmd("switch replica")
>  --
>  box.cfg{replication_sync_lag = 1}
>  box.cfg{replication = replication}
> -box.space.test:count() < 400
> -box.info.status -- running
> -box.info.ro -- false
> +test_run:wait_cond(function() return box.space.test:count() == 400 or (box.space.test:count() < 400 and box.info.status == 'running' and box.info.ro) end) or box.info.status

Error: box.info.ro is expected to be false here. And in the case when box.space.test:count() == 400 it tests nothing.

We discussed it a bit with Vladimir and it seems that the right way to fix the test is to add an errinj that stops the applier on a certain LSN and to rewrite the test from timeouts to usage of this errinj.

There is also another variant: just remove the 'box.space.test:count() < 400' check (see the sketch at the end of this mail). In this case the test will sometimes test nothing, but most of the time it will do its work (because replication is slowed down). It is certainly a poor way, but maybe it is worth sporadically missing a test case instead of sporadically failing due to the inability to test things.

So I filed #4129.

>  -- Wait for remaining rows to arrive.
> -test_run:wait_cond(function() return box.space.test:count() == 600 end)
> +test_run:wait_cond(function() return box.space.test:count() == 600 end) or box.space.test:count()

Same here.

>  -- Make sure replica leaves oprhan state.
>  test_run:wait_cond(function() return box.info.status ~= 'orphan' end)
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status
>  box.info.ro -- false

If we leave the orphan state, we should observe 'running' immediately.

>  -- gh-3636: Check that replica set sync doesn't stop on cfg errors.
> @@ -151,10 +150,10 @@ test_run:cmd("switch replica")
>  replication = box.cfg.replication
>  box.cfg{replication = {}}
>  box.cfg{replication = replication}
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end)
>  box.info.ro -- false
> -box.info.replication[1].upstream.status -- follow
> -test_run:grep_log('replica', 'ER_CFG.*')
> +test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end) or box.info.replication[1].upstream.status
> +test_run:wait_log("replica", "ER_CFG.*", nil, 200)

I guess here we want to check the status right after returning from box.cfg(). But retrying the log grepping makes sense.
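
P.S. A few rough sketches for the points above. Everything below is only a proposal and is not tested, so take it as an illustration rather than as a ready patch.

Re errinj.test.lua: by "increase replication_timeout to 0.1" I mean simply reconfiguring the option in the test (replication_timeout is an ordinary box.cfg option; where exactly to put it in errinj.test.lua is up to you):

    -- Use a larger replication timeout so that the lag and
    -- 'timed out' checks do not need to be wrapped into wait_cond().
    box.cfg{replication_timeout = 0.1}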
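
Re test-run#158: wait_upstream() / wait_downstream() could look somehow like this (the names are from the issue; the signature and the implementation are just my sketch on top of the existing test_run:wait_cond() helper, and 'test_run' here is the usual require('test_run').new() object):

    function wait_upstream(id, status)
        -- Wait until box.info.replication[id].upstream reaches the
        -- given status ('follow', 'stopped', ...).
        return test_run:wait_cond(function()
            local upstream = box.info.replication[id].upstream
            return upstream ~= nil and upstream.status == status
        end)
    end

    function wait_downstream(id, status)
        -- The same, but for the relay (downstream) status.
        return test_run:wait_cond(function()
            local downstream = box.info.replication[id].downstream
            return downstream ~= nil and downstream.status == status
        end)
    end

With that, a check like

    test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end) or box.info.replication[1].upstream.status

becomes just wait_upstream(1, 'follow') (or test_run:wait_upstream(1, 'follow') once it lands in test-run itself).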
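
Re sync.test.lua: the "just remove the racy check" variant I mention above would look roughly like this; it is the current block from the test with the 'box.space.test:count() < 400' line dropped, nothing else changes:

    box.cfg{replication_sync_lag = 1}
    box.cfg{replication = replication}
    box.info.status -- running
    box.info.ro -- false
    -- Wait for remaining rows to arrive.
    test_run:wait_cond(function() return box.space.test:count() == 600 end)

Sometimes the sync will manage to fetch everything before box.cfg{} returns and the case degrades into a plain smoke test, which is exactly the trade-off described in #4129.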