Subject: Re: [PATCH v3] replication: fix exit with ER_NO_SUCH_USER during bootstrap
From: Serge Petrenko
In-Reply-To: <20180824125418.7wm7ylwr3j4nsgyt@esperanza>
Date: Fri, 24 Aug 2018 19:15:58 +0300
Message-Id: <3D16E36C-6C7E-4D5D-911E-B8854103516B@tarantool.org>
References: <20180824115645.43531-1-sergepetrenko@tarantool.org> <20180824125418.7wm7ylwr3j4nsgyt@esperanza>
To: Vladimir Davydov
Cc: tarantool-patches@freelists.org

Hi, I pushed all the fixes to the branch.

> On 24 Aug 2018, at 15:54, Vladimir Davydov wrote:
>
> On Fri, Aug 24, 2018 at 02:56:45PM +0300, Serge Petrenko wrote:
>> When replication is configured via some user created in box.once()
>> function and box.once() takes more than replication_timeout seconds
>> to execute, appliers receive ER_NO_SUCH_USER error, which they don't
>> handle. This leads to occasional test failures in replication suite.
>> Fix this by handling the aforementioned case in applier_f() and add a
>> test case.
>>
>> Closes #3637
>> ---
>> https://github.com/tarantool/tarantool/issues/3637
>> https://github.com/tarantool/tarantool/tree/sp/gh-3637-replication-tests-fix
>
> The issue was reassigned to 1.9. Please rebase.
Done.
>
>> diff --git a/test/replication/autobootstrap.test.lua b/test/replication/autobootstrap.test.lua
>> index 752d5f317..21417a738 100644
>> --- a/test/replication/autobootstrap.test.lua
>> +++ b/test/replication/autobootstrap.test.lua
>
> IMHO we'd better put this small test case in replication/misc, because
> it's not really about cluster autobootstrap.
Ok, I moved the case to replication/misc.
>
>> @@ -108,3 +108,31 @@ _ = test_run:cmd("switch default")
>>  -- Stop servers
>>  --
>>  test_run:drop_cluster(SERVERS)
>> +
>> +--
>> +-- Test case for gh-3637. Before the fix replica would exit with
>> +-- an error. Now check that we don't hang and successfully connect.
>> +--
>> +fiber = require("fiber")
>> +
>> +test_run:cmd("setopt delimiter ';'")
>> +function wait_replica()
>> +    while box.info.replication[2] == nil do
>> +        fiber.sleep(0.01)
>> +    end
>> +end;
>> +test_run:cmd("setopt delimiter ''");
>
> Please use test_run:wait_vclock instead.
It doesn't work: we cannot start replica_auth with wait_load=True, because
it won't load until we create the user and grant it privileges. Since we
don't wait for the replica to load, test_run:wait_vclock() sometimes fails
with an error, because the instance hasn't had time to load and there is
no vclock yet.
So I inlined the old while loop and removed the function declaration. I also
added a small delay afterwards to make sure the replica fully loads.
>
>> +
>> +test_run:cmd("create server replica_auth with rpl_master=default, script='replication/replica_auth.lua'")
>> +test_run:cmd("start server replica_auth with wait=False, wait_load=False, args='cluster:pass 0.1'")
>
> I'd set the timeout to 0.01 to make sure replica tries to reconnect
> several times before it succeeds.
>
Ok. Set to 0.03: with 0.01 the replica reaches the timeout before applying
any rows and never loads.
>> +-- Wait a bit to make sure replica waits till user is created.
>> +fiber.sleep(0.1)
>> +box.schema.user.create('cluster', {password='pass'})
>> +box.schema.user.grant('cluster', 'replication')
>> +wait_replica()
>> +
>> +test_run:cmd("stop server replica_auth")
>> +test_run:cmd("cleanup server replica_auth")
>> +test_run:cmd("delete server replica_auth")
>> +
>> +box.schema.user.drop('cluster')
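
For reference, the inlined wait on box.info.replication[2] could also be
bounded by a deadline, so that the test fails instead of hanging if the
replica never connects. This is only a minimal sketch, not what is on the
branch: wait_until and its parameters are hypothetical names, and the sleep
and clock functions are passed in so the snippet also runs in plain Lua.

```lua
-- Hypothetical helper, not part of the patch: poll `cond` until it returns
-- a truthy value or `timeout` seconds have passed.
-- `sleep(seconds)` and `now()` are injected by the caller (assumption: in a
-- Tarantool test these would be fiber.sleep and fiber.time).
local function wait_until(cond, timeout, step, sleep, now)
    local deadline = now() + timeout
    while not cond() do
        if now() >= deadline then
            return false -- gave up: condition never became true
        end
        sleep(step)
    end
    return true
end

-- Stand-alone demonstration with a fake clock, so no real time passes.
local t = 0
local function fake_now() return t end
local function fake_sleep(dt) t = t + dt end

-- Condition becomes true after some simulated time: the wait succeeds.
local appeared = wait_until(function() return t >= 0.05 end,
                            1, 0.01, fake_sleep, fake_now)
-- Condition never becomes true: the wait stops at the deadline.
local timed_out = wait_until(function() return false end,
                             0.1, 0.01, fake_sleep, fake_now)
print(appeared, timed_out)
```

In the test itself the call would presumably look like
wait_until(function() return box.info.replication[2] ~= nil end, 10, 0.01, fiber.sleep, fiber.time),
where the 10 second limit is an arbitrary value chosen for illustration.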