[Tarantool-patches] [PATCH v2 1/7] replication: fix a hang on final join retry
sergepetrenko at tarantool.org
Sat Mar 27 19:52:26 MSK 2021
26.03.2021 23:44, Vladislav Shpilevoy wrote:
> Hi! Thanks for working on this!
>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>> index 5a88a013e..326cf18d2 100644
>> --- a/src/box/applier.cc
>> +++ b/src/box/applier.cc
>> @@ -566,9 +566,16 @@ applier_register(struct applier *applier)
>> row.type = IPROTO_REGISTER;
>> coio_write_xrow(coio, &row);
>> - applier_set_state(applier, APPLIER_REGISTER);
>> + /*
>> + * Register may serve as a retry for final join. Set corresponding
>> + * states to unblock anyone who's waiting for final join to start or
>> + * end.
>> + */
>> + applier_set_state(applier, was_anon ? APPLIER_REGISTER :
>> + APPLIER_FINAL_JOIN);
>> applier_wait_register(applier, 0);
>> - applier_set_state(applier, APPLIER_REGISTERED);
>> + applier_set_state(applier, was_anon ? APPLIER_REGISTERED :
>> + APPLIER_JOINED);
>> applier_set_state(applier, APPLIER_READY);
> Hm. I don't understand. Transition from anon to non-anon leads to
> re-creation of all appliers. It calls box_sync_replication() and
> creates new struct applier objects. How is it possible that during one
> life of a reader fiber it manages to see 2 states and is not terminated?
You're correct. It isn't possible for a single applier to see both states,
anon and non-anon.
The flag is still needed, though, for the case when a normal replica
receives a transient error during final join. In this case the applier
reconnects and we get to the next applier loop iteration. First it checks
whether REPLICASET_UUID is nil. It isn't, because initial join succeeded.
Then it checks whether instance_id is 0. It is, because final join failed.
The applier now assumes that the replica was anonymous and tries to register.
The hang I'm talking about is in `bootstrap_from_master()`. It waits
until the applier enters the APPLIER_JOINED state, which never happened
before this patch.
So, `was_anon` comes into play only when final join fails and is retried.
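To make the state sequence easier to follow, here is a minimal standalone sketch of the logic described above. It is not the real applier code: the enum is a reduced subset of the states named in the patch, and `register_states()`, `waiter_wakes()` and the `patched` switch are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced subset of applier states; names follow the patch. */
enum applier_state {
	APPLIER_REGISTER,
	APPLIER_REGISTERED,
	APPLIER_FINAL_JOIN,
	APPLIER_JOINED,
	APPLIER_READY,
};

/*
 * Hypothetical helper: fill `out` with the states announced while
 * registering and return their count. `patched` is an illustration-only
 * switch between the pre-patch and post-patch behaviour.
 */
static size_t
register_states(bool was_anon, bool patched, enum applier_state out[3])
{
	assert(out != NULL);
	/* After the patch, a non-anon replica treats register as a join retry. */
	bool is_join_retry = patched && !was_anon;
	out[0] = is_join_retry ? APPLIER_FINAL_JOIN : APPLIER_REGISTER;
	out[1] = is_join_retry ? APPLIER_JOINED : APPLIER_REGISTERED;
	out[2] = APPLIER_READY;
	return 3;
}

/*
 * Hypothetical stand-in for a waiter such as bootstrap_from_master():
 * it wakes up only if `target` is ever announced.
 */
static bool
waiter_wakes(const enum applier_state *states, size_t count,
	     enum applier_state target)
{
	for (size_t i = 0; i < count; i++) {
		if (states[i] == target)
			return true;
	}
	return false;
}
```

In this model, a non-anon retry without the patch never announces APPLIER_JOINED, so a waiter on that state never wakes up. With the patch it does, while a truly anonymous replica still goes through APPLIER_REGISTER and APPLIER_REGISTERED.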
> Also could you please provide a test? Maybe it would be easier to see
> what is happening then.
Ok. I'm not sure this test is needed, because this is implicitly tested
in the gh-5566-final-join-synchro test.
A test would be as follows:
master: wait until replica receives ER_SYNC_QUORUM_TIMEOUT, and then:
This test passes on the branch, meaning the replica's box.cfg completes,
but it would hang indefinitely without this commit.