[PATCH v3] replication: fix exit with ER_NO_SUCH_USER during bootstrap

Serge Petrenko sergepetrenko at tarantool.org
Fri Aug 24 19:15:58 MSK 2018


Hi, I pushed all the fixes to the branch.


> On 24 Aug 2018, at 15:54, Vladimir Davydov <vdavydov.dev at gmail.com> wrote:
> 
> On Fri, Aug 24, 2018 at 02:56:45PM +0300, Serge Petrenko wrote:
>> When replication is configured via some user created in box.once()
>> function and box.once() takes more than replication_timeout seconds
>> to execute, appliers receive ER_NO_SUCH_USER error, which they don't
>> handle. This leads to occasional test failures in replication suite.
>> Fix this by handling the aforementioned case in applier_f() and add a
>> test case.
>> 
>> Closes #3637
>> ---
>> https://github.com/tarantool/tarantool/issues/3637
>> https://github.com/tarantool/tarantool/tree/sp/gh-3637-replication-tests-fix
> 
> The issue was reassigned to 1.9. Please rebase.

Done.

> 
>> diff --git a/test/replication/autobootstrap.test.lua b/test/replication/autobootstrap.test.lua
>> index 752d5f317..21417a738 100644
>> --- a/test/replication/autobootstrap.test.lua
>> +++ b/test/replication/autobootstrap.test.lua
> 
> IMHO we'd better put this small test case in replication/misc, because
> it's not really about cluster autobootstrap.

Ok, I moved the case to replication/misc.

> 
>> @@ -108,3 +108,31 @@ _ = test_run:cmd("switch default")
>> -- Stop servers
>> --
>> test_run:drop_cluster(SERVERS)
>> +
>> +--
>> +-- Test case for gh-3637. Before the fix replica would exit with
>> +-- an error. Now check that we don't hang and successfully connect.
>> +--
>> +fiber = require("fiber")
>> +
>> +test_run:cmd("setopt delimiter ';'")
>> +function wait_replica()
>> +    while box.info.replication[2] == nil do
>> +        fiber.sleep(0.01)
>> +    end
>> +end;
>> +test_run:cmd("setopt delimiter ''");
> 
> Please use test_run:wait_vclock instead.

It doesn't work: we cannot start replica_auth with wait_load=True, because it
won't load until we create the user and grant it privileges. Since we don't wait for
the replica to load, test_run:wait_vclock() may sometimes fail with an error when the
instance hasn't had time to load and has no vclock yet.
So I inlined the old while loop and removed the function declaration. I also added a
small delay afterwards to make sure the replica fully loads.
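
For reference, the inlined wait has roughly the shape below. This is only a
sketch, not the exact code on the branch: the `box` and `fiber` tables defined
at the top are stand-ins introduced here so the snippet runs in plain Lua (in
the test they are Tarantool's own), the `polls` counter is hypothetical, and
the 0.1 value of the trailing delay is an assumption.

```lua
-- Stand-ins for Tarantool's built-ins (assumption, for illustration only).
-- Inside a real test drop these and use fiber = require("fiber") instead.
polls = 0
box = {info = {replication = {[1] = {id = 1}}}}
fiber = {
    sleep = function(_)
        polls = polls + 1
        -- Pretend the replica registers after the third poll.
        if polls == 3 then
            box.info.replication[2] = {id = 2}
        end
    end,
}

-- The inlined loop: poll until the replica shows up in box.info.replication.
while box.info.replication[2] == nil do
    fiber.sleep(0.01)
end
-- Small extra delay so the replica can finish loading (value is an assumption).
fiber.sleep(0.1)
```

The loop only checks that the replica has registered, so it does not depend on
a vclock being available yet.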

> 
>> +
>> +test_run:cmd("create server replica_auth with rpl_master=default, script='replication/replica_auth.lua'")
>> +test_run:cmd("start server replica_auth with wait=False, wait_load=False, args='cluster:pass 0.1'")
> 
> I'd set the timeout to 0.01 to make sure replica tries to reconnect
> several times before it succeeds.
> 

Ok. Set to 0.03: with 0.01 the replica reaches the timeout before applying any rows and never loads.
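
With that change the start command would look roughly as follows. This is a
sketch that assumes the second value in `args` is the timeout discussed above
and that nothing else on the line changes; it only runs inside the test-run
harness, where `test_run` is defined.

```lua
-- Same command as in the quoted patch, with the timeout lowered from 0.1 to 0.03.
test_run:cmd("start server replica_auth with wait=False, wait_load=False, args='cluster:pass 0.03'")
```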

>> +-- Wait a bit to make sure replica waits till user is created.
>> +fiber.sleep(0.1)
>> +box.schema.user.create('cluster', {password='pass'})
>> +box.schema.user.grant('cluster', 'replication')
>> +wait_replica()
>> +
>> +test_run:cmd("stop server replica_auth")
>> +test_run:cmd("cleanup server replica_auth")
>> +test_run:cmd("delete server replica_auth")
>> +
>> +box.schema.user.drop('cluster')



