[Tarantool-patches] [PATCH 1/2] replication: correctly check for rows to skip in applier
Sergey Petrenko
sergepetrenko at tarantool.org
Fri Feb 14 01:03:59 MSK 2020
>Thursday, 13 February 2020, 9:58 +03:00 from Konstantin Osipov <kostja.osipov at gmail.com>:
>
>* Konstantin Osipov <kostja.osipov at gmail.com> [20/02/13 09:47]:
>
>From relay.cc:
>
>         /*
>          * We're feeding a WAL, thus responding to FINAL JOIN or SUBSCRIBE
>          * request. If this is FINAL JOIN (i.e. relay->replica is NULL),
>          * we must relay all rows, even those originating from the replica
>          * itself (there may be such rows if this is rebootstrap). If this
>          * SUBSCRIBE, only send a row if it is not from the same replica
>          * (i.e. don't send replica's own rows back) or if this row is
>          * missing on the other side (i.e. in case of sudden power-loss,
>          * data was not written to WAL, so remote master can't recover
>          * it). In the latter case packet's LSN is less than or equal to
>          * local master's LSN at the moment it received 'SUBSCRIBE' request.
>          */
>         if (relay->replica == NULL ||
>             packet->replica_id != relay->replica->id ||
>             packet->lsn <= vclock_get(&relay->local_vclock_at_subscribe,
>                                       packet->replica_id)) {
>                 struct errinj *inj = errinj(ERRINJ_RELAY_BREAK_LSN,
>                                             ERRINJ_INT);
>                 if (inj != NULL && packet->lsn == inj->iparam) {
>                         packet->lsn = inj->iparam - 1;
>                         say_warn("injected broken lsn: %lld",
>                                  (long long) packet->lsn);
>                 }
>                 relay_send(relay, packet);
>         }
> }
>
>
>As you can see, we never send our own rows back as long as their
>LSNs are greater than the corresponding component of
>relay->local_vclock_at_subscribe.
True. First of all, local_vclock_at_subscribe was updated later than
it should've been. I describe it in more detail in one of the commits.
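
For clarity, here is a tiny standalone model of that filter with made-up
LSN values (toy types and names, not the real relay structs): a
subscriber's own row is sent back only when its LSN does not exceed the
component recorded at SUBSCRIBE time.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct row { uint32_t replica_id; int64_t lsn; };

/* Mirrors the condition quoted above: send on FINAL JOIN, for foreign
 * rows, or for the subscriber's own rows it is known to be missing. */
static bool
should_send(bool final_join, uint32_t subscriber_id,
            const int64_t *vclock_at_subscribe, const struct row *r)
{
    return final_join || r->replica_id != subscriber_id ||
           r->lsn <= vclock_at_subscribe[r->replica_id];
}

int main(void)
{
    int64_t at_subscribe[8] = { [2] = 100 }; /* replica 2 was at lsn 100 */
    struct row lost  = { .replica_id = 2, .lsn = 90 };  /* missing on replica */
    struct row fresh = { .replica_id = 2, .lsn = 110 }; /* replica's newer own row */
    printf("lost: %d, fresh: %d\n",
           should_send(false, 2, at_subscribe, &lost),   /* 1: sent back */
           should_send(false, 2, at_subscribe, &fresh)); /* 0: filtered out */
    return 0;
}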
>
>
>So what exactly does go wrong here?
>
>Is the bug triggered during initial replication configuration, or
>during a reconfiguration?
Both: during reconfiguration, and on resubscribe when a remote instance
dies and restarts.
>
>
>I suspect the issue is that at reconfiguration we send
>local_vclock_at_subscribe, but keep changing it.
Yep
>
>The fix then would be to make sure the local component in
>local_vclock_at_subscribe is set to infinity during
>reconfiguration
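
For illustration only, a minimal sketch of that idea, assuming the
vclock remembered at SUBSCRIBE is snapshotted once; the toy type and
helper name are made up and INT64_MAX stands in for "infinity", this is
not the fix that actually went in.

#include <stdint.h>

#define TOY_VCLOCK_MAX 32

struct toy_vclock { int64_t lsn[TOY_VCLOCK_MAX]; };

/* Snapshot the current vclock for SUBSCRIBE and pin the local
 * component to a maximal value, so that local WAL writes made after
 * this point cannot race with the recorded snapshot. */
static void
vclock_at_subscribe_build(struct toy_vclock *dst,
                          const struct toy_vclock *cur, uint32_t self_id)
{
    *dst = *cur;
    dst->lsn[self_id] = INT64_MAX;
}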
>
>
>> * sergepetrenko <sergepetrenko at tarantool.org> [20/02/13 09:34]:
>> > Fix replicaset.applier.vclock initialization issues: it wasn't
>> > initialized at all previously.
>>
>> In the next line you say that you remove the initialization. What
>> do you mean here?
>>
>> > Moreover, there is no valid point in code
>> > to initialize it, since it may get stale right away if new entries are
>> > written to WAL.
>>
>> Well, it reflects the state of the wal *as seen by* the set of
>> appliers. This is stated in the comment. So it doesn't have to
>> reflect local changes.
>>
>> > So, check for both applier and replicaset vclocks.
>> > The greater one protects the instance from applying the rows it has
>> > already applied or has already scheduled to write.
>> > Also remove an unnecessary applier vclock initialization from
>> > replication_init().
>>
>> First of all, the race you describe applies to
>> local changes only. Yet you add the check for all replica ids.
>> This further obfuscates this piece of code.
>>
>> Second, the core of the issue is a "hole" in the vclock protection
>> enforced by latch_lock/latch_unlock. Basically, the assumption behind
>> latch_lock/latch_unlock is that while a latch is locked, no other
>> source can apply a transaction under this replica id. This is
>> violated by the local WAL.
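
In toy form (simplified stand-in types, not the real applier code), the
pre-patch check only consults the vclock maintained by the appliers
themselves:

#include <stdbool.h>
#include <stdint.h>

/* Pre-patch dedup check: performed under the per-replica latch, but it
 * looks only at the applier-maintained vclock. */
static bool
already_applied(const int64_t *applier_vclock, uint32_t replica_id,
                int64_t lsn)
{
    return applier_vclock[replica_id] >= lsn;
}

/*
 * The hole: a local transaction commits some lsn N under our own
 * replica id directly through the local WAL, bypassing the latch and
 * never advancing applier_vclock. When a remote later echoes that row
 * back, already_applied() returns false and the row is applied twice.
 */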
>>
>> We used to skip all changes under the local replica id in the applier.
>>
>> Later it was changed to make it possible to get your own logs back on
>> recovery, e.g. if some replica has them and the local node lost a
>> piece of WAL.
>>
>> It will take me a while to find that commit and ticket, but it is
>> the one which introduced the regression.
>>
>> The proper fix is to only apply local changes received from
>> remotes in orphan mode, and begin skipping them when entering
>> read-write mode.
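
If I read that right, the check would look roughly like the sketch below
(hypothetical helper name, not code from this patch): rows carrying the
instance's own replica id are accepted only while the node is still an
orphan and are skipped once it becomes read-write.

#include <stdbool.h>
#include <stdint.h>

/* Sketch of the suggestion above: our own rows arriving from a remote
 * are only useful while recovering a lost piece of WAL in orphan mode. */
static bool
should_skip_own_row(uint32_t row_replica_id, uint32_t self_id,
                    bool is_orphan)
{
    return row_replica_id == self_id && !is_orphan;
}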
>>
>> > Closes #4739
>> > ---
>> > src/box/applier.cc | 14 ++++++++++++--
>> > src/box/replication.cc | 1 -
>> > 2 files changed, 12 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/src/box/applier.cc b/src/box/applier.cc
>> > index ae3d281a5..acb26b7e2 100644
>> > --- a/src/box/applier.cc
>> > +++ b/src/box/applier.cc
>> > @@ -731,8 +731,18 @@ applier_apply_tx(struct stailq *rows)
>> >          struct latch *latch = (replica ? &replica->order_latch :
>> >                                 &replicaset.applier.order_latch);
>> >          latch_lock(latch);
>> > -        if (vclock_get(&replicaset.applier.vclock,
>> > -                       first_row->replica_id) >= first_row->lsn) {
>> > +        /*
>> > +         * We cannot tell which vclock is greater. There is no
>> > +         * proper place to initialize applier vclock, since it
>> > +         * may get stale right away if we write something to WAL
>> > +         * and it gets replicated and then arrives back from the
>> > +         * replica. So check against both vclocks. Replicaset
>> > +         * vclock will guard us from corner cases like the one
>> > +         * above.
>> > +         */
>> > +        if (MAX(vclock_get(&replicaset.applier.vclock, first_row->replica_id),
>> > +                vclock_get(&replicaset.vclock, first_row->replica_id)) >=
>> > +            first_row->lsn) {
>> >                  latch_unlock(latch);
>> >                  return 0;
>> >          }
>> > diff --git a/src/box/replication.cc b/src/box/replication.cc
>> > index e7bfa22ab..7b04573a4 100644
>> > --- a/src/box/replication.cc
>> > +++ b/src/box/replication.cc
>> > @@ -93,7 +93,6 @@ replication_init(void)
>> >          latch_create(&replicaset.applier.order_latch);
>> >
>> >          vclock_create(&replicaset.applier.vclock);
>> > -        vclock_copy(&replicaset.applier.vclock, &replicaset.vclock);
>> >          rlist_create(&replicaset.applier.on_rollback);
>> >          rlist_create(&replicaset.applier.on_commit);
>> >
>> > --
>> > 2.20.1 (Apple Git-117)
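
As a concrete walk-through of the new check with made-up numbers: the
local WAL is at lsn 100 under our own id, the appliers have never seen
any of our own rows, and a remote echoes back our row with lsn 95.

#include <stdint.h>
#include <stdio.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
    uint32_t self = 1;
    int64_t applier_vclock[8]    = { [1] = 0 };   /* appliers saw none of our rows */
    int64_t replicaset_vclock[8] = { [1] = 100 }; /* local WAL is at lsn 100 */
    int64_t incoming = 95; /* our own row, echoed back by a remote */

    int old_check = applier_vclock[self] >= incoming;         /* 0: would re-apply */
    int new_check = MAX(applier_vclock[self],
                        replicaset_vclock[self]) >= incoming; /* 1: correctly skipped */
    printf("old check skips: %d, new check skips: %d\n", old_check, new_check);
    return 0;
}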
>>
>> --
>> Konstantin Osipov, Moscow, Russia
>
>--
>Konstantin Osipov, Moscow, Russia
--
Sergey Petrenko