[Tarantool-patches] [PATCH v4 13/12] replication: send accumulated Raft messages after relay start

Tue Apr 20 13:38:46 MSK 2021

20.04.2021 01:36, Vladislav Shpilevoy wrote:
> Thanks for the patch!
>
> See 2 comments below.
>
>> diff --git a/src/box/relay.cc b/src/box/relay.cc
>> index 7be33ee31..85f335cd7 100644
>> --- a/src/box/relay.cc
>> +++ b/src/box/relay.cc
>> @@ -628,13 +659,38 @@ struct relay_is_raft_enabled_msg {
>>       bool is_finished;
>>   };
>>
>> +static void
>> +relay_push_raft_msg(struct relay *relay, bool do_restart_recovery)
> 1. Why is the recovery restart flag ignored if a message is already
> sent? This might lead to recovery restart loss if I am not mistaken.

I think it's okay. As long as the message is pushed from relay_push_raft()
rather than from tx_set_is_raft_enabled(), we may freely restart
recovery.

So, we only care whether do_restart_recovery is set when the message
gets pushed in the same call.

We don't care whether do_restart_recovery is set when the call exits
without pushing the message. The next call will have the correct value
for do_restart_recovery anyway.

Please see a more detailed explanation below.
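
To make the flag handling above concrete, here is a minimal standalone
model of the tx-side bookkeeping. It is only a sketch: the `model_*`
names, the `last_pushed` field and the two helper entry points
(`model_raft_update`, `model_msg_returned`) are made up for
illustration, and cpipe_push() is replaced by a plain pointer store.
Only `model_push_raft_msg` and `model_set_is_raft_enabled` follow the
diff directly.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for struct relay_raft_msg (the real one embeds a cmsg). */
struct raft_msg_model {
    bool do_restart_recovery;
};

/* Stand-in for the relay->tx fields touched by the patch. */
struct relay_tx_model {
    bool is_raft_enabled;
    bool is_raft_push_sent;
    bool is_raft_push_pending;
    int raft_ready_msg;
    struct raft_msg_model raft_msgs[2];
    /* Model-only: the last message handed to the simulated pipe. */
    struct raft_msg_model *last_pushed;
};

/* Same logic as relay_push_raft_msg() in the diff. */
static void
model_push_raft_msg(struct relay_tx_model *tx, bool do_restart_recovery)
{
    /* Early exit: the flag passed by this caller is simply dropped. */
    if (!tx->is_raft_enabled || tx->is_raft_push_sent)
        return;
    assert(tx->raft_ready_msg == 0 || tx->raft_ready_msg == 1);
    struct raft_msg_model *msg = &tx->raft_msgs[tx->raft_ready_msg];
    msg->do_restart_recovery = do_restart_recovery;
    tx->last_pushed = msg; /* Stands in for cpipe_push(). */
    tx->raft_ready_msg = (tx->raft_ready_msg + 1) % 2;
    tx->is_raft_push_sent = true;
    tx->is_raft_push_pending = false;
}

/* Hypothetical: a new Raft update arrives in tx. */
static void
model_raft_update(struct relay_tx_model *tx, bool do_restart_recovery)
{
    tx->is_raft_push_pending = true;
    model_push_raft_msg(tx, do_restart_recovery);
}

/* Hypothetical: the in-flight message has returned to tx. */
static void
model_msg_returned(struct relay_tx_model *tx, bool do_restart_recovery)
{
    tx->is_raft_push_sent = false;
    if (tx->is_raft_push_pending)
        model_push_raft_msg(tx, do_restart_recovery);
}

/* Same logic as the patched tx_set_is_raft_enabled(). */
static void
model_set_is_raft_enabled(struct relay_tx_model *tx, bool value)
{
    tx->is_raft_enabled = value;
    if (tx->is_raft_push_pending)
        model_push_raft_msg(tx, false);
}
```

In this model a call that returns early never touches a message, so its
flag cannot leak into a later push; whichever call actually pushes
supplies its own value.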

>
>> +{
>> +    if (!relay->tx.is_raft_enabled || relay->tx.is_raft_push_sent)
>> +        return;
>> +    struct relay_raft_msg *msg =
>> +        &relay->tx.raft_msgs[relay->tx.raft_ready_msg];
>> +    msg->do_restart_recovery = do_restart_recovery;
>> +    cpipe_push(&relay->relay_pipe, &msg->base);
>> +    relay->tx.raft_ready_msg = (relay->tx.raft_ready_msg + 1) % 2;
>> +    relay->tx.is_raft_push_sent = true;
>> +    relay->tx.is_raft_push_pending = false;
>> +}
>> +
>>   /** TX thread part of the Raft flag setting, first hop. */
>>   static void
>>   tx_set_is_raft_enabled(struct cmsg *base)
>>   {
>>       struct relay_is_raft_enabled_msg *msg =
>>           (struct relay_is_raft_enabled_msg *)base;
>> -    msg->relay->tx.is_raft_enabled = msg->value;
>> +    struct relay *relay  = msg->relay;
>> +    relay->tx.is_raft_enabled = msg->value;
>> +    /*
>> +     * Send saved raft message as soon as relay becomes operational.
>> +     * Do not restart recovery upon the message arrival. Recovery is
>> +     * positioned at replica_clock initially, i.e. already "restarted" and
>> +     * restarting it once again would position it at the oldest xlog
>> +     * possible, because relay reader hasn't received replica vclock yet.
>> +     */
>> +    if (relay->tx.is_raft_push_pending) {
>> +        relay_push_raft_msg(msg->relay, false);
> 2. I don't understand. Why wasn't there such a problem before? Recovery
> must be restarted when the node becomes a leader. If you do not restart
> it, the data would be ignored by the replicas. How do you know it is
> positioned right now at replica_clock? You are in tx thread, you can't
> tell. What do I miss?

This is because the message pushed by this `relay_push_raft_msg` call
is delivered before `relay_set_is_raft_enabled`.
Both messages get processed by the cbus_process() loop that waits
for `relay_set_is_raft_enabled`.

This happens in relay_send_is_raft_enabled() even before
the relay reader fiber is created, so recv_vclock is zero.
Restarting recovery here would reset it to the very first
WAL this instance has, which is wrong.

Such a problem might have existed before, but it was extremely
hard to catch: relay_push_raft_msg() wasn't called until
relay->tx.is_raft_enabled was set. And by the time tx.is_raft_enabled
was set, relay_set_is_raft_enabled had most probably already been
delivered and the relay had exited this first cbus_process() loop,
which runs before the reader fiber is created.

To solve the problem in some other way, I would need to
make relay_push_raft_msg() deliver the message to the
second cbus_process() loop, the main one. And I couldn't
come up with a way to do that.
The message has to be pushed right in tx_set_is_raft_enabled,
and this means it'll get delivered before relay_set_is_raft_enabled.
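
The ordering argument can be illustrated with a toy FIFO. This is a
rough model only, not real cbus code: all `model_*` names, the fixed
queue size and the counters are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_msg_type { MODEL_RAFT_PUSH, MODEL_SET_IS_RAFT_ENABLED };

/* Toy FIFO standing in for the tx -> relay pipe. */
struct model_pipe {
    enum model_msg_type queue[4];
    size_t head;
    size_t tail;
};

struct model_relay {
    bool is_finished;       /* Set by the relay_set_is_raft_enabled hop. */
    bool reader_started;    /* Model-only: reader fiber exists. */
    int raft_in_first_loop; /* Raft messages seen before the reader. */
    int raft_in_main_loop;  /* Raft messages seen after the reader. */
};

static void
model_pipe_push(struct model_pipe *p, enum model_msg_type type)
{
    assert(p->tail < sizeof(p->queue) / sizeof(p->queue[0]));
    p->queue[p->tail++] = type;
}

static void
model_deliver(struct model_relay *r, enum model_msg_type type)
{
    if (type == MODEL_SET_IS_RAFT_ENABLED) {
        r->is_finished = true;
        return;
    }
    if (r->reader_started)
        r->raft_in_main_loop++;
    else
        r->raft_in_first_loop++;
}

/* tx hop: the Raft message is queued before the reply hop. */
static void
model_tx_set_is_raft_enabled(struct model_pipe *p, bool is_push_pending)
{
    if (is_push_pending)
        model_pipe_push(p, MODEL_RAFT_PUSH);
    model_pipe_push(p, MODEL_SET_IS_RAFT_ENABLED);
}

/* First loop: runs until the reply arrives, before the reader exists. */
static void
model_first_loop(struct model_pipe *p, struct model_relay *r)
{
    while (!r->is_finished && p->head < p->tail)
        model_deliver(r, p->queue[p->head++]);
}

/* Main loop: the reader is running, so recv_vclock is meaningful. */
static void
model_main_loop(struct model_pipe *p, struct model_relay *r)
{
    r->reader_started = true;
    while (p->head < p->tail)
        model_deliver(r, p->queue[p->head++]);
}
```

With a pending push, the Raft message always lands in the first loop,
which is why that particular push must not restart recovery.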

-- 
Serge Petrenko