From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lj1-f196.google.com (mail-lj1-f196.google.com [209.85.208.196]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id B7E04469719 for ; Thu, 13 Feb 2020 09:58:51 +0300 (MSK) Received: by mail-lj1-f196.google.com with SMTP id r19so5314767ljg.3 for ; Wed, 12 Feb 2020 22:58:51 -0800 (PST) Date: Thu, 13 Feb 2020 09:58:49 +0300 From: Konstantin Osipov Message-ID: <20200213065849.GA18311@atlas> References: <85cfed8fd0cfa23c5a8504b6516fad18028edbaf.1581551227.git.sergepetrenko@tarantool.org> <20200213064730.GA17713@atlas> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200213064730.GA17713@atlas> Subject: Re: [Tarantool-patches] [PATCH 1/2] replication: correctly check for rows to skip in applier List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: sergepetrenko , alexander.turenko@tarantool.org, v.shpilevoy@tarantool.org, tarantool-patches@dev.tarantool.org * Konstantin Osipov [20/02/13 09:47]: >From relay.cc: /* * We're feeding a WAL, thus responding to FINAL JOIN or SUBSCRIBE * request. If this is FINAL JOIN (i.e. relay->replica is NULL), * we must relay all rows, even those originating from the replica * itself (there may be such rows if this is rebootstrap). If this * SUBSCRIBE, only send a row if it is not from the same replica * (i.e. don't send replica's own rows back) or if this row is * missing on the other side (i.e. in case of sudden power-loss, * data was not written to WAL, so remote master can't recover * it). In the latter case packet's LSN is less than or equal to * local master's LSN at the moment it received 'SUBSCRIBE' request. */ if (relay->replica == NULL || packet->replica_id != relay->replica->id || packet->lsn <= vclock_get(&relay->local_vclock_at_subscribe, packet->replica_id)) { struct errinj *inj = errinj(ERRINJ_RELAY_BREAK_LSN, ERRINJ_INT); if (inj != NULL && packet->lsn == inj->iparam) { packet->lsn = inj->iparam - 1; say_warn("injected broken lsn: %lld", (long long) packet->lsn); } relay_send(relay, packet); } } As you can see we never send our own rows back, as long as they are greater than relay->local_vclock_at_subscribe. So what exactly does go wrong here? Is the bug triggered during initial replication configuration, or during a reconfiguration? I suspect the issue is that at reconfiguration we send local_vclock_at_subscribe, but keep changing it. The fix then would be to make sure the local component in local_vclock_at_subscribe is set to infinity during reconfiguration. > * sergepetrenko [20/02/13 09:34]: > > Fix replicaset.applier.vclock initialization issues: it wasn't > > initialized at all previously. > > In the next line you say that you remove the initialization. What > do you mean here? > > > Moreover, there is no valid point in code > > to initialize it, since it may get stale right away if new entries are > > written to WAL. > > Well, it reflects the state of the wal *as seen by* the set of > appliers. This is stated in the comment. So it doesn't have to > reflect local changes. > > > So, check for both applier and replicaset vclocks. > > The greater one protects the instance from applying the rows it has > > already applied or has already scheduled to write. > > Also remove an unnecessary aplier vclock initialization from > > replication_init(). > > First of all, the race you describe applies to > local changes only. Yet you add the check for all replica ids. > This further obliterates this piece of code. > > Second, the core of the issue is a "hole" in vclock protection > enforced by latch_lock/latch_unlock. Basically the assumption that > latch_lock/latch_unlock has is that while a latch is locked, no > source can apply a transaction under this replica id. This, is > violated by the local WAL. > > We used to skip all changes by local vclock id before in applier. > > Later it was changed to be able to get-your-own logs on recovery, > e.g. if some replica has them , and the local node lost a piece of > wal. > > It will take me a while to find this commit and ticket, but this > is the commit and ticket which introduced the regression. > > The proper fix is to only apply local changes received from > remotes in orphan mode, and begin skipping them when entering > read-write mode. > > > Closes #4739 > > --- > > src/box/applier.cc | 14 ++++++++++++-- > > src/box/replication.cc | 1 - > > 2 files changed, 12 insertions(+), 3 deletions(-) > > > > diff --git a/src/box/applier.cc b/src/box/applier.cc > > index ae3d281a5..acb26b7e2 100644 > > --- a/src/box/applier.cc > > +++ b/src/box/applier.cc > > @@ -731,8 +731,18 @@ applier_apply_tx(struct stailq *rows) > > struct latch *latch = (replica ? &replica->order_latch : > > &replicaset.applier.order_latch); > > latch_lock(latch); > > - if (vclock_get(&replicaset.applier.vclock, > > - first_row->replica_id) >= first_row->lsn) { > > + /* > > + * We cannot tell which vclock is greater. There is no > > + * proper place to initialize applier vclock, since it > > + * may get stale right away if we write something to WAL > > + * and it gets replicated and then arrives back from the > > + * replica. So check against both vclocks. Replicaset > > + * vclock will guard us from corner cases like the one > > + * above. > > + */ > > + if (MAX(vclock_get(&replicaset.applier.vclock, first_row->replica_id), > > + vclock_get(&replicaset.vclock, first_row->replica_id)) >= > > + first_row->lsn) { > > latch_unlock(latch); > > return 0; > > } > > diff --git a/src/box/replication.cc b/src/box/replication.cc > > index e7bfa22ab..7b04573a4 100644 > > --- a/src/box/replication.cc > > +++ b/src/box/replication.cc > > @@ -93,7 +93,6 @@ replication_init(void) > > latch_create(&replicaset.applier.order_latch); > > > > vclock_create(&replicaset.applier.vclock); > > - vclock_copy(&replicaset.applier.vclock, &replicaset.vclock); > > rlist_create(&replicaset.applier.on_rollback); > > rlist_create(&replicaset.applier.on_commit); > > > > -- > > 2.20.1 (Apple Git-117) > > -- > Konstantin Osipov, Moscow, Russia -- Konstantin Osipov, Moscow, Russia