[Tarantool-patches] [PATCH v2 1/7] replication: fix a hang on final join retry

Serge Petrenko sergepetrenko at tarantool.org
Sat Mar 27 19:52:26 MSK 2021

26.03.2021 23:44, Vladislav Shpilevoy wrote:
> Hi! Thanks for working on this!
>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>> index 5a88a013e..326cf18d2 100644
>> --- a/src/box/applier.cc
>> +++ b/src/box/applier.cc
>> @@ -566,9 +566,16 @@ applier_register(struct applier *applier)
>>   	row.type = IPROTO_REGISTER;
>>   	coio_write_xrow(coio, &row);
>> -	applier_set_state(applier, APPLIER_REGISTER);
>> +	/*
>> +	 * Register may serve as a retry for final join. Set corresponding
>> +	 * states to unblock anyone who's waiting for final join to start or
>> +	 * end.
>> +	 */
>> +	applier_set_state(applier, was_anon ? APPLIER_REGISTER :
>> +					      APPLIER_FINAL_JOIN);
>>   	applier_wait_register(applier, 0);
>> -	applier_set_state(applier, APPLIER_REGISTERED);
>> +	applier_set_state(applier, was_anon ? APPLIER_REGISTERED :
>> +					      APPLIER_JOINED);
>>   	applier_set_state(applier, APPLIER_READY);
> Hm. I don't understand. Transition from anon to non-anon leads to
> re-creation of all appliers. It calls box_sync_replication() and
> creates new struct applier objects. How is it possible that during one
> life of a reader fiber it manages to see 2 states and is not terminated?

You're correct. It isn't possible for a single applier to see both
states, anon and not anon.
The flag is still needed, though, for the case when a normal replica
receives a transient error during final join. In this case the applier
reconnects and we get to the next applier loop iteration. First it
checks whether REPLICASET_UUID is nil. It isn't, because initial join
succeeded. Then it checks whether instance_id is 0. It is, because
final join failed. The applier now assumes that the replica was
anonymous and tries to register.

The hang I'm talking about is in `bootstrap_from_master()`. It waits
until the applier enters the APPLIER_JOINED state, which never happened
on this retry path before this patch.

So, `was_anon` comes into play only when final join fails and is retried.
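
To make the retry path easier to follow, here is a minimal standalone
sketch of the two decisions described above. It is a simplified model,
not the code from applier.cc: the types and helpers (`replica_view`,
`choose_next_step`, `register_states`, and the `STEP_*` / `STATE_*`
names) are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical, simplified model of the applier loop decision.
 * None of these names are Tarantool's real API.
 */
enum next_step { STEP_JOIN, STEP_REGISTER, STEP_SUBSCRIBE };

enum wait_state {
	STATE_REGISTER,
	STATE_REGISTERED,
	STATE_FINAL_JOIN,
	STATE_JOINED,
};

struct replica_view {
	bool replicaset_uuid_is_nil;	/* true until initial join succeeds */
	uint32_t instance_id;		/* 0 until final join succeeds */
};

/* Decide what the applier does on the next loop iteration. */
static enum next_step
choose_next_step(const struct replica_view *r)
{
	if (r->replicaset_uuid_is_nil)
		return STEP_JOIN;
	/*
	 * Initial join is done but there is no id: either a real
	 * anonymous replica, or a normal one whose final join failed.
	 */
	if (r->instance_id == 0)
		return STEP_REGISTER;
	return STEP_SUBSCRIBE;
}

struct register_states {
	enum wait_state start;
	enum wait_state end;
};

/*
 * States to publish around the register request. When the replica
 * was not anonymous, register acts as a final join retry, so waiters
 * for final join must see FINAL_JOIN and JOINED.
 */
static struct register_states
register_states(bool was_anon)
{
	struct register_states s;
	s.start = was_anon ? STATE_REGISTER : STATE_FINAL_JOIN;
	s.end = was_anon ? STATE_REGISTERED : STATE_JOINED;
	return s;
}
```

With a non-nil replicaset UUID and `instance_id == 0`, the model picks
the register step; with `was_anon == false` it ends in the joined
state, which is what the waiter in `bootstrap_from_master()` needs.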

> Also could you please provide a test? Maybe it would be easier to see
> what is happening then.

Ok. I'm not sure this test is needed, because this is implicitly tested
in the gh-5566-final-join-synchro test.

A test would be as follows:
     box.cfg{listen=3301, replication_synchro_quorum=10}
     box.schema.user.grant("guest", "replication")
master: wait until replica receives ER_SYNC_QUORUM_TIMEOUT, and then:

This test passes on the branch, meaning the replica's box.cfg completes,
but it would hang indefinitely without this commit.

Serge Petrenko
