Date: Wed, 10 Apr 2019 15:26:05 +0300
From: Alexander Turenko
Subject: [tarantool-patches] Re: [PATCH v1 2/6] Insert status calls inside wait_cond routines
Message-ID: <20190410122605.7yhmmns7xabs4vo3@tkn_work_nb>
In-Reply-To: <3c1ddecd66058ad9df1164062828d278924b7f46.1554127720.git.avtikhon@gmail.com>
To: avtikhon
Cc: avtikhon, tarantool-patches@freelists.org

In brief: I factored out (and changed in some places) the changes where
I am sure they are the right thing to do and pushed them to the
Totktonada/test-replication-fix-flaky-fails branch. I'll send it to the
mailing list and push it to master and 2.1 soon.

Filed https://github.com/tarantool/tarantool/issues/4129 re
sync.test.lua. It seems we cannot fix that test in any simple way.

Some more or less random comments are below.

WBR, Alexander Turenko.
> diff --git a/test/replication/errinj.test.lua b/test/replication/errinj.test.lua
> index e9965483e..6522df9d8 100644
> --- a/test/replication/errinj.test.lua
> +++ b/test/replication/errinj.test.lua
> @@ -129,8 +129,8 @@ test_run:cmd("switch replica")
> -- wait for reconnect
> while box.info.replication[1].upstream.status ~= 'follow' do fiber.sleep(0.0001) end
> box.info.replication[1].upstream.status
> -box.info.replication[1].upstream.lag > 0
> -box.info.replication[1].upstream.lag < 1
> +test_run:wait_cond(function() return box.info.replication[1].upstream.lag > 0 end) or box.info.replication[1].upstream.lag
> +test_run:wait_cond(function() return box.info.replication[1].upstream.lag < 1 end) or box.info.replication[1].upstream.lag

I think the correct fix here would be to increase replication_timeout
to 0.1 instead.

> -- wait for ack timeout
> while box.info.replication[1].upstream.message ~= 'timed out' do fiber.sleep(0.0001) end
>
> @@ -203,9 +203,9 @@ test_run:cmd("switch replica_timeout")
>
> fiber = require('fiber')
> while box.info.replication[1].upstream.status ~= 'follow' do fiber.sleep(0.0001) end
> -box.info.replication[1].upstream.status -- follow
> +test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end) or box.info.replication[1].upstream.status
> for i = 0, 15 do fiber.sleep(0.01) if box.info.replication[1].upstream.status ~= 'follow' then break end end
> -box.info.replication[1].upstream.status -- follow
> +test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end) or box.info.replication[1].upstream.status

We do the same thing twice here: once with the while loop and once with
wait_cond(). After the `for i = 0, 15` loop we should check the status
just once, I guess (it tests that we were not dropped into the
'stopped' status).

Anyway, I changed replication_timeout to 0.1 and it seems to fix all
the problems with the test.
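As an aside for readers of the thread: the `wait_cond(...) or
<diagnostic>` pattern used throughout the patch relies on wait_cond()
returning true when the condition held and a falsy value on a timeout.
Below is a minimal self-contained sketch of the idiom; the wait_cond
here is a toy stand-in for illustration, not test-run's implementation
(which also sleeps between polls):

```lua
-- Toy stand-in for test_run:wait_cond(): poll `cond` until it returns
-- a truthy value or `timeout` seconds of CPU time pass; return the
-- last result of `cond`.
local function wait_cond(cond, timeout)
    local deadline = os.clock() + (timeout or 0.05)
    local ok = cond()
    while not ok and os.clock() < deadline do
        ok = cond()
    end
    return ok
end

-- The idiom: when the condition holds, the whole expression is `true`;
-- on a timeout it evaluates to the diagnostic after `or`, so the
-- .result file shows the actual state instead of a bare `false`.
local status = 'follow'
print(wait_cond(function() return status == 'follow' end) or status)
-- -> true
status = 'stopped'
print(wait_cond(function() return status == 'follow' end) or status)
-- -> stopped
```

That is exactly why the `or box.info...` tails are appended above: a
timed-out wait produces a self-describing diff in the .result file.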
> diff --git a/test/replication/force_recovery.test.lua b/test/replication/force_recovery.test.lua
> index 9ac5dadf5..7669d7f22 100644
> --- a/test/replication/force_recovery.test.lua
> +++ b/test/replication/force_recovery.test.lua
> @@ -14,7 +14,7 @@ test_run:cmd("start server test")
>
> -- Stop the replica and wait for the relay thread to exit.
> test_run:cmd("stop server test")
> -test_run:wait_cond(function() return box.info.replication[2].downstream.status == 'stopped' end)
> +test_run:wait_cond(function() return box.info.replication[2].downstream.status == 'stopped' end) or box.info.replication[2].downstream.status
>
> -- Delete an xlog file that is needed by the replica.
> box.snapshot()
> @@ -30,7 +30,7 @@ box.cfg{force_recovery = true}
> test_run:cmd("start server test")
> test_run:cmd("switch test")
> box.space.test:select()
> -box.info.replication[1].upstream.status == 'stopped' or box.info
> +test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'stopped' end) or box.info

It also seems that we wait for an upstream / downstream status quite
often across our tests. Maybe it is worth adding wait_upstream() and
wait_downstream() functions to test-run?
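A sketch of what such helpers might look like; the names, signatures,
and the idea of returning the observed status are my assumptions, not
an existing test-run API. The wait_cond and replication-info accessors
are passed in as functions so the sketch runs outside tarantool; in
test-run they would be test_run:wait_cond() and
box.info.replication[id]:

```lua
-- Hypothetical wait_upstream()/wait_downstream() helpers (a sketch of
-- the idea only). `channel` is 'upstream' or 'downstream'.
local function make_status_waiter(channel)
    return function(wait_cond, get_entry, wanted)
        wait_cond(function()
            local peer = get_entry()[channel]
            return peer ~= nil and peer.status == wanted
        end)
        -- Return the status actually observed: on a timeout it lands
        -- in the .result file and makes the failure self-describing.
        local peer = get_entry()[channel]
        return peer and peer.status
    end
end

local wait_upstream = make_status_waiter('upstream')
local wait_downstream = make_status_waiter('downstream')

-- Usage with stubs standing in for the test harness:
local entry = {upstream = {status = 'follow'}}
local function wait_cond_stub(cond) return cond() end
print(wait_upstream(wait_cond_stub, function() return entry end, 'follow'))
-- -> follow
```

With helpers like these, the long one-liners above would shrink to a
single call per wait.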
I filed the issue: https://github.com/tarantool/test-run/issues/158

> diff --git a/test/replication/quorum.test.lua b/test/replication/quorum.test.lua
> index 777f8ef7c..d67bf0f73 100644
> --- a/test/replication/quorum.test.lua
> +++ b/test/replication/quorum.test.lua
> @@ -19,37 +19,37 @@ test_run:cmd('stop server quorum1')
> test_run:cmd('switch quorum2')
>
> test_run:cmd('restart server quorum2 with args="0.1 0.5"')
> -box.info.status -- orphan
> +test_run:wait_cond(function() return box.info.status == 'orphan' end) or box.info.status
> box.ctl.wait_rw(0.001) -- timeout
> box.info.ro -- true
> box.space.test:replace{100} -- error
> box.cfg{replication={}}
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status
>
> test_run:cmd('restart server quorum2 with args="0.1 0.5"')
> -box.info.status -- orphan
> +test_run:wait_cond(function() return box.info.status == 'orphan' end) or box.info.status
> box.ctl.wait_rw(0.001) -- timeout
> box.info.ro -- true
> box.space.test:replace{100} -- error
> box.cfg{replication_connect_quorum = 2}
> box.ctl.wait_rw()
> box.info.ro -- false
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status
>
> test_run:cmd('restart server quorum2 with args="0.1 0.5"')
> -box.info.status -- orphan
> +test_run:wait_cond(function() return box.info.status == 'orphan' end) or box.info.status
> box.ctl.wait_rw(0.001) -- timeout
> box.info.ro -- true
> box.space.test:replace{100} -- error
> test_run:cmd('start server quorum1 with args="0.1 0.5"')
> box.ctl.wait_rw()
> box.info.ro -- false
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status

I'm not sure it is correct to check that the 'running' status WILL
appear at some point in the future. I guess the test checks that the
status is 'running' right after box.ctl.wait_rw() returns.
> diff --git a/test/replication/sync.test.lua b/test/replication/sync.test.lua
> index 14dea9716..2d15aeec7 100644
> --- a/test/replication/sync.test.lua
> +++ b/test/replication/sync.test.lua
> @@ -59,6 +59,7 @@ test_run:cmd("switch default")
> fill()
> test_run:cmd("switch replica")
>
> +-----------------------------------------------------------------------------------------------------

We usually fit comments into 66 symbols and usually don't use such
separators. Removed.

> -- Resume replication.
> --
> -- Since max allowed lag is small, all records should arrive
> @@ -66,8 +67,8 @@ test_run:cmd("switch replica")
> --
> box.cfg{replication_sync_lag = 0.001}
> box.cfg{replication = replication}
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status
> box.space.test:count()
> -box.info.status -- running
> box.info.ro -- false
>
> -- Stop replication.
> @@ -79,6 +80,7 @@ test_run:cmd("switch default")
> fill()
> test_run:cmd("switch replica")
>
> +-----------------------------------------------------------------------------------------------------
> -- Resume replication
> --
> -- Since max allowed lag is big, not all records will arrive
> @@ -87,12 +89,10 @@ test_run:cmd("switch replica")
> --
> box.cfg{replication_sync_lag = 1}
> box.cfg{replication = replication}
> -box.space.test:count() < 400
> -box.info.status -- running
> -box.info.ro -- false
> +test_run:wait_cond(function() return box.space.test:count() == 400 or (box.space.test:count() < 400 and box.info.status == 'running' and box.info.ro) end) or box.info.status

Error: box.info.ro is expected to be false here. And in the case when
box.space.test:count() == 400 the condition tests nothing.

We discussed it a bit with Vladimir and it seems that the right way to
fix the test is to add an errinj that stops the applier on a certain
lsn and rewrite the test from timeouts to usage of this errinj.

There is also another variant: just remove the
'box.space.test:count() < 400' check.
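For reference, here is presumably what the quoted condition intended
(note the `not` before box.info.ro); even then, once count() reaches
400 the status and ro checks are bypassed, which is why the errinj
approach looks preferable. The box and test_run stubs below only make
the fragment runnable outside the test harness, with values picked for
illustration; the condition itself is the point:

```lua
-- Stubs standing in for tarantool's box and test-run's test_run, so
-- the fragment runs standalone (illustrative values only).
local box = {
    info = {status = 'running', ro = false},
    space = {test = {count = function() return 300 end}},
}
local test_run = {wait_cond = function(self, cond) return cond() end}

-- Presumed intent of the quoted wait_cond: the replica must be
-- writable, hence `not box.info.ro`. When count() == 400 the other
-- checks are still skipped, so the condition remains weak.
print(test_run:wait_cond(function()
    local cnt = box.space.test:count()
    return cnt == 400 or (cnt < 400 and box.info.status == 'running'
                          and not box.info.ro)
end) or box.info.status)
-- -> true
```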
In this case the test will sometimes verify nothing, but most of the
time it will do its work (because replication is slowed down). It is
certainly a poor way, but maybe it is worth sporadically missing a test
case instead of sporadically failing due to the inability to test
things. So I filed #4129.

> -- Wait for remaining rows to arrive.
> -test_run:wait_cond(function() return box.space.test:count() == 600 end)
> +test_run:wait_cond(function() return box.space.test:count() == 600 end) or box.space.test:count()

Same here.

> -- Make sure replica leaves oprhan state.
> test_run:wait_cond(function() return box.info.status ~= 'orphan' end)
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end) or box.info.status
> box.info.ro -- false

If we leave the orphan state, we should observe the 'running' status
immediately.

> -- gh-3636: Check that replica set sync doesn't stop on cfg errors.
> @@ -151,10 +150,10 @@ test_run:cmd("switch replica")
> replication = box.cfg.replication
> box.cfg{replication = {}}
> box.cfg{replication = replication}
> -box.info.status -- running
> +test_run:wait_cond(function() return box.info.status == 'running' end)
> box.info.ro -- false
> -box.info.replication[1].upstream.status -- follow
> -test_run:grep_log('replica', 'ER_CFG.*')
> +test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end) or box.info.replication[1].upstream.status
> +test_run:wait_log("replica", "ER_CFG.*", nil, 200)

I guess here we want to check the status right after box.cfg() returns.
But retrying the log grepping makes sense.
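On that note, the retrying behaviour of wait_log() can be thought of as
wait_cond() wrapped around grep_log(). The sketch below is a simplified
illustration of that idea, not test-run's actual implementation;
grep_log and wait_cond are stubbed so it runs outside the harness:

```lua
-- Simplified wait_log(): retry grep_log() until the pattern shows up
-- or wait_cond() gives up; return the matched line (nil on timeout).
local function wait_log(wait_cond, grep_log, instance, pattern)
    local found
    wait_cond(function()
        found = grep_log(instance, pattern)
        return found ~= nil
    end)
    return found
end

-- Stubs: a fake log where the wanted line appears only on the third
-- read, and a wait_cond that polls a bounded number of times.
local reads = 0
local function grep_log_stub(instance, pattern)
    reads = reads + 1
    if reads >= 3 then
        return string.match('ER_CFG: incorrect value', pattern)
    end
end
local function wait_cond_stub(cond)
    for _ = 1, 10 do if cond() then return true end end
    return false
end

print(wait_log(wait_cond_stub, grep_log_stub, 'replica', 'ER_CFG.*'))
-- -> ER_CFG: incorrect value
```

A one-shot grep races with the instance still writing its log; the
retry loop is what removes that flakiness.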