[Tarantool-patches] [PATCH 1/2] replication: correctly check for rows to skip in applier
Sergey Petrenko
sergepetrenko at tarantool.org
Fri Feb 14 01:03:59 MSK 2020
>Thursday, 13 February 2020, 9:58 +03:00 from Konstantin Osipov <kostja.osipov at gmail.com>:
>
>* Konstantin Osipov <kostja.osipov at gmail.com> [20/02/13 09:47]:
>
>From relay.cc:
>
>         /*
>          * We're feeding a WAL, thus responding to FINAL JOIN or SUBSCRIBE
>          * request. If this is FINAL JOIN (i.e. relay->replica is NULL),
>          * we must relay all rows, even those originating from the replica
>          * itself (there may be such rows if this is rebootstrap). If this
>          * SUBSCRIBE, only send a row if it is not from the same replica
>          * (i.e. don't send replica's own rows back) or if this row is
>          * missing on the other side (i.e. in case of sudden power-loss,
>          * data was not written to WAL, so remote master can't recover
>          * it). In the latter case packet's LSN is less than or equal to
>          * local master's LSN at the moment it received 'SUBSCRIBE' request.
>          */
>         if (relay->replica == NULL ||
>             packet->replica_id != relay->replica->id ||
>             packet->lsn <= vclock_get(&relay->local_vclock_at_subscribe,
>                                       packet->replica_id)) {
>                 struct errinj *inj = errinj(ERRINJ_RELAY_BREAK_LSN,
>                                             ERRINJ_INT);
>                 if (inj != NULL && packet->lsn == inj->iparam) {
>                         packet->lsn = inj->iparam - 1;
>                         say_warn("injected broken lsn: %lld",
>                                  (long long) packet->lsn);
>                 }
>                 relay_send(relay, packet);
>         }
> }
>
>
>As you can see, we never send our own rows back as long as their
>LSNs are greater than the corresponding component of
>relay->local_vclock_at_subscribe.
True. First of all, local_vclock_at_subscribe was updated later than
it should've been. I describe it in more detail in one of the commits.
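
For clarity, here is a tiny standalone model of that filter with made-up
LSN values (toy types and names, not the real relay structs): a
subscriber's own row is sent back only when its LSN does not exceed the
component recorded at SUBSCRIBE time.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct row { uint32_t replica_id; int64_t lsn; };

/* Mirrors the condition quoted above: send on FINAL JOIN, for foreign
 * rows, or for the subscriber's own rows it is known to be missing. */
static bool
should_send(bool final_join, uint32_t subscriber_id,
            const int64_t *vclock_at_subscribe, const struct row *r)
{
    return final_join || r->replica_id != subscriber_id ||
           r->lsn <= vclock_at_subscribe[r->replica_id];
}

int main(void)
{
    int64_t at_subscribe[8] = { [2] = 100 }; /* replica 2 was at lsn 100 */
    struct row lost  = { .replica_id = 2, .lsn = 90 };  /* missing on replica */
    struct row fresh = { .replica_id = 2, .lsn = 110 }; /* replica's newer own row */
    printf("lost: %d, fresh: %d\n",
           should_send(false, 2, at_subscribe, &lost),   /* 1: sent back */
           should_send(false, 2, at_subscribe, &fresh)); /* 0: filtered out */
    return 0;
}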
>
>
>So what exactly does go wrong here?
>
>Is the bug triggered during initial replication configuration, or
>during a reconfiguration?
Both: during reconfiguration, and on resubscribe when a remote instance
dies and restarts.
>
>
>I suspect the issue is that at reconfiguration we send
>local_vclock_at_subscribe, but keep changing it.
Yep
>
>The fix then would be to make sure the local component in
>local_vclock_at_subscribe is set to infinity during
>reconfiguration
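
For illustration only, a minimal sketch of that idea, assuming the
vclock remembered at SUBSCRIBE is snapshotted once; the toy type and
helper name are made up and INT64_MAX stands in for "infinity", this is
not the fix that actually went in.

#include <stdint.h>

#define TOY_VCLOCK_MAX 32

struct toy_vclock { int64_t lsn[TOY_VCLOCK_MAX]; };

/* Snapshot the current vclock for SUBSCRIBE and pin the local
 * component to a maximal value, so that local WAL writes made after
 * this point cannot race with the recorded snapshot. */
static void
vclock_at_subscribe_build(struct toy_vclock *dst,
                          const struct toy_vclock *cur, uint32_t self_id)
{
    *dst = *cur;
    dst->lsn[self_id] = INT64_MAX;
}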
>
>
>> * sergepetrenko <sergepetrenko at tarantool.org> [20/02/13 09:34]:
>> > Fix replicaset.applier.vclock initialization issues: it wasn't
>> > initialized at all previously.
>>
>> In the next line you say that you remove the initialization. What
>> do you mean here?
>>
>> > Moreover, there is no valid point in code
>> > to initialize it, since it may get stale right away if new entries are
>> > written to WAL.
>>
>> Well, it reflects the state of the wal *as seen by* the set of
>> appliers. This is stated in the comment. So it doesn't have to
>> reflect local changes.
>>
>> > So, check for both applier and replicaset vclocks.
>> > The greater one protects the instance from applying the rows it has
>> > already applied or has already scheduled to write.
>> > Also remove an unnecessary applier vclock initialization from
>> > replication_init().
>>
>> First of all, the race you describe applies to
>> local changes only. Yet you add the check for all replica ids.
>> This further obfuscates this piece of code.
>>
>> Second, the core of the issue is a "hole" in the vclock protection
>> enforced by latch_lock/latch_unlock. Basically, the assumption behind
>> latch_lock/latch_unlock is that while a latch is locked, no other
>> source can apply a transaction under this replica id. This is
>> violated by the local WAL.
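
In toy form (simplified stand-in types, not the real applier code), the
pre-patch check only consults the vclock maintained by the appliers
themselves:

#include <stdbool.h>
#include <stdint.h>

/* Pre-patch dedup check: performed under the per-replica latch, but it
 * looks only at the applier-maintained vclock. */
static bool
already_applied(const int64_t *applier_vclock, uint32_t replica_id,
                int64_t lsn)
{
    return applier_vclock[replica_id] >= lsn;
}

/*
 * The hole: a local transaction commits some lsn N under our own
 * replica id directly through the local WAL, bypassing the latch and
 * never advancing applier_vclock. When a remote later echoes that row
 * back, already_applied() returns false and the row is applied twice.
 */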
>>
>> We used to skip all changes under the local replica id in the applier.
>>
>> Later it was changed to make it possible to get your own logs back on
>> recovery, e.g. if some replica has them and the local node lost a
>> piece of WAL.
>>
>> It will take me a while to find that commit and ticket, but it is
>> the one which introduced the regression.
>>
>> The proper fix is to only apply local changes received from
>> remotes in orphan mode, and begin skipping them when entering
>> read-write mode.
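
If I read that right, the check would look roughly like the sketch below
(hypothetical helper name, not code from this patch): rows carrying the
instance's own replica id are accepted only while the node is still an
orphan and are skipped once it becomes read-write.

#include <stdbool.h>
#include <stdint.h>

/* Sketch of the suggestion above: our own rows arriving from a remote
 * are only useful while recovering a lost piece of WAL in orphan mode. */
static bool
should_skip_own_row(uint32_t row_replica_id, uint32_t self_id,
                    bool is_orphan)
{
    return row_replica_id == self_id && !is_orphan;
}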
>>
>> > Closes #4739
>> > ---
>> > src/box/applier.cc | 14 ++++++++++++--
>> > src/box/replication.cc | 1 -
>> > 2 files changed, 12 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/src/box/applier.cc b/src/box/applier.cc
>> > index ae3d281a5..acb26b7e2 100644
>> > --- a/src/box/applier.cc
>> > +++ b/src/box/applier.cc
>> > @@ -731,8 +731,18 @@ applier_apply_tx(struct stailq *rows)
>> >          struct latch *latch = (replica ? &replica->order_latch :
>> >                                 &replicaset.applier.order_latch);
>> >          latch_lock(latch);
>> > -        if (vclock_get(&replicaset.applier.vclock,
>> > -                       first_row->replica_id) >= first_row->lsn) {
>> > +        /*
>> > +         * We cannot tell which vclock is greater. There is no
>> > +         * proper place to initialize applier vclock, since it
>> > +         * may get stale right away if we write something to WAL
>> > +         * and it gets replicated and then arrives back from the
>> > +         * replica. So check against both vclocks. Replicaset
>> > +         * vclock will guard us from corner cases like the one
>> > +         * above.
>> > +         */
>> > +        if (MAX(vclock_get(&replicaset.applier.vclock, first_row->replica_id),
>> > +                vclock_get(&replicaset.vclock, first_row->replica_id)) >=
>> > +            first_row->lsn) {
>> >                  latch_unlock(latch);
>> >                  return 0;
>> >          }
>> > diff --git a/src/box/replication.cc b/src/box/replication.cc
>> > index e7bfa22ab..7b04573a4 100644
>> > --- a/src/box/replication.cc
>> > +++ b/src/box/replication.cc
>> > @@ -93,7 +93,6 @@ replication_init(void)
>> >          latch_create(&replicaset.applier.order_latch);
>> >
>> >          vclock_create(&replicaset.applier.vclock);
>> > -        vclock_copy(&replicaset.applier.vclock, &replicaset.vclock);
>> >          rlist_create(&replicaset.applier.on_rollback);
>> >          rlist_create(&replicaset.applier.on_commit);
>> >
>> > --
>> > 2.20.1 (Apple Git-117)
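
As a concrete walk-through of the new check with made-up numbers: the
local WAL is at lsn 100 under our own id, the appliers have never seen
any of our own rows, and a remote echoes back our row with lsn 95.

#include <stdint.h>
#include <stdio.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
    uint32_t self = 1;
    int64_t applier_vclock[8]    = { [1] = 0 };   /* appliers saw none of our rows */
    int64_t replicaset_vclock[8] = { [1] = 100 }; /* local WAL is at lsn 100 */
    int64_t incoming = 95; /* our own row, echoed back by a remote */

    int old_check = applier_vclock[self] >= incoming;         /* 0: would re-apply */
    int new_check = MAX(applier_vclock[self],
                        replicaset_vclock[self]) >= incoming; /* 1: correctly skipped */
    printf("old check skips: %d, new check skips: %d\n", old_check, new_check);
    return 0;
}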
>>
>> --
>> Konstantin Osipov, Moscow, Russia
>
>--
>Konstantin Osipov, Moscow, Russia
--
Sergey Petrenko