From: "Sergey Petrenko" <sergepetrenko@tarantool.org>
To: "Konstantin Osipov" <kostja.osipov@gmail.com>
Cc: tarantool-patches@dev.tarantool.org, v.shpilevoy@tarantool.org
Subject: Re: [Tarantool-patches] [PATCH 1/2] replication: correctly check for rows to skip in applier
Date: Fri, 14 Feb 2020 01:03:59 +0300
Message-ID: <1581631439.134350666@f191.i.mail.ru>
In-Reply-To: <20200213065849.GA18311@atlas>

> Thursday, February 13, 2020, 9:58 +03:00 from Konstantin Osipov <kostja.osipov@gmail.com>:
>
> * Konstantin Osipov <kostja.osipov@gmail.com> [20/02/13 09:47]:
>
> From relay.cc:
>
> 	/*
> 	 * We're feeding a WAL, thus responding to FINAL JOIN or SUBSCRIBE
> 	 * request. If this is FINAL JOIN (i.e. relay->replica is NULL),
> 	 * we must relay all rows, even those originating from the replica
> 	 * itself (there may be such rows if this is rebootstrap). If this
> 	 * is SUBSCRIBE, only send a row if it is not from the same replica
> 	 * (i.e. don't send replica's own rows back) or if this row is
> 	 * missing on the other side (i.e. in case of sudden power-loss,
> 	 * data was not written to WAL, so remote master can't recover
> 	 * it). In the latter case packet's LSN is less than or equal to
> 	 * local master's LSN at the moment it received 'SUBSCRIBE' request.
> 	 */
> 	if (relay->replica == NULL ||
> 	    packet->replica_id != relay->replica->id ||
> 	    packet->lsn <= vclock_get(&relay->local_vclock_at_subscribe,
> 				      packet->replica_id)) {
> 		struct errinj *inj = errinj(ERRINJ_RELAY_BREAK_LSN,
> 					    ERRINJ_INT);
> 		if (inj != NULL && packet->lsn == inj->iparam) {
> 			packet->lsn = inj->iparam - 1;
> 			say_warn("injected broken lsn: %lld",
> 				 (long long) packet->lsn);
> 		}
> 		relay_send(relay, packet);
> 	}
> }
>
> As you can see we never send our own rows back, as long as
> they are greater than relay->local_vclock_at_subscribe.

True. First of all, local_vclock_at_subscribe was updated later than it
should've been.
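As an illustration, the relaying rule quoted above can be modeled outside of Tarantool. The sketch below uses hypothetical toy stand-ins (a plain array instead of struct vclock, replica id 0 standing in for relay->replica == NULL), not the real relay API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VCLOCK_MAX 8

/* Toy stand-in for struct vclock: lsn[i] is the last LSN seen from replica i. */
struct toy_vclock {
	int64_t lsn[VCLOCK_MAX];
};

/*
 * Mirror of the quoted relay.cc condition. subscriber_id == 0 models
 * FINAL JOIN (relay->replica == NULL): relay every row. On SUBSCRIBE,
 * skip the subscriber's own rows unless their LSN is covered by
 * local_vclock_at_subscribe, i.e. the subscriber may have lost them.
 */
static bool
relay_should_send(uint32_t subscriber_id,
		  const struct toy_vclock *vclock_at_subscribe,
		  uint32_t row_replica_id, int64_t row_lsn)
{
	if (subscriber_id == 0)
		return true;	/* FINAL JOIN: relay all rows. */
	if (row_replica_id != subscriber_id)
		return true;	/* Not the subscriber's own row. */
	/* Own row: send it back only if the subscriber may have lost it. */
	return row_lsn <= vclock_at_subscribe->lsn[row_replica_id];
}
```

With component 5 recorded for replica 1 at SUBSCRIBE time, that replica's own row with LSN 5 is sent back to it (it may be missing after a power loss), while LSN 6 is filtered out. The race in question is about that recorded component continuing to change after the SUBSCRIBE response has been sent.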
I describe it in more detail in one of the commits.

> So what exactly does go wrong here?
>
> Is the bug triggered during initial replication configuration, or
> during a reconfiguration?

During a reconfiguration: resubscribing in case a remote instance dies
and restarts.

> I suspect the issue is that at reconfiguration we send
> local_vclock_at_subscribe, but keep changing it.

Yep.

> The fix then would be to make sure the local component in
> local_vclock_at_subscribe is set to infinity during
> reconfiguration.
>
>> * sergepetrenko <sergepetrenko@tarantool.org> [20/02/13 09:34]:
>> > Fix replicaset.applier.vclock initialization issues: it wasn't
>> > initialized at all previously.
>>
>> In the next line you say that you remove the initialization. What
>> do you mean here?
>>
>> > Moreover, there is no valid point in code
>> > to initialize it, since it may get stale right away if new entries are
>> > written to WAL.
>>
>> Well, it reflects the state of the WAL *as seen by* the set of
>> appliers. This is stated in the comment. So it doesn't have to
>> reflect local changes.
>>
>> > So, check for both applier and replicaset vclocks.
>> > The greater one protects the instance from applying the rows it has
>> > already applied or has already scheduled to write.
>> > Also remove an unnecessary applier vclock initialization from
>> > replication_init().
>>
>> First of all, the race you describe applies to
>> local changes only. Yet you add the check for all replica ids.
>> This further obfuscates this piece of code.
>>
>> Second, the core of the issue is a "hole" in the vclock protection
>> enforced by latch_lock/latch_unlock. Basically, the assumption
>> latch_lock/latch_unlock makes is that while a latch is locked, no
>> source can apply a transaction under this replica id. This is
>> violated by the local WAL.
>>
>> We used to skip all changes by the local vclock id in the applier.
>>
>> Later it was changed to be able to get your own logs on recovery,
>> e.g.
>> if some replica has them, and the local node lost a piece of
>> WAL.
>>
>> It will take me a while to find this commit and ticket, but this
>> is the commit and ticket which introduced the regression.
>>
>> The proper fix is to only apply local changes received from
>> remotes in orphan mode, and begin skipping them when entering
>> read-write mode.
>>
>> > Closes #4739
>> > ---
>> >  src/box/applier.cc     | 14 ++++++++++++--
>> >  src/box/replication.cc |  1 -
>> >  2 files changed, 12 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/src/box/applier.cc b/src/box/applier.cc
>> > index ae3d281a5..acb26b7e2 100644
>> > --- a/src/box/applier.cc
>> > +++ b/src/box/applier.cc
>> > @@ -731,8 +731,18 @@ applier_apply_tx(struct stailq *rows)
>> >  	struct latch *latch = (replica ? &replica->order_latch :
>> >  			       &replicaset.applier.order_latch);
>> >  	latch_lock(latch);
>> > -	if (vclock_get(&replicaset.applier.vclock,
>> > -		       first_row->replica_id) >= first_row->lsn) {
>> > +	/*
>> > +	 * We cannot tell which vclock is greater. There is no
>> > +	 * proper place to initialize applier vclock, since it
>> > +	 * may get stale right away if we write something to WAL
>> > +	 * and it gets replicated and then arrives back from the
>> > +	 * replica. So check against both vclocks. Replicaset
>> > +	 * vclock will guard us from corner cases like the one
>> > +	 * above.
>> > +	 */
>> > +	if (MAX(vclock_get(&replicaset.applier.vclock, first_row->replica_id),
>> > +		vclock_get(&replicaset.vclock, first_row->replica_id)) >=
>> > +	    first_row->lsn) {
>> >  		latch_unlock(latch);
>> >  		return 0;
>> >  	}
>> > diff --git a/src/box/replication.cc b/src/box/replication.cc
>> > index e7bfa22ab..7b04573a4 100644
>> > --- a/src/box/replication.cc
>> > +++ b/src/box/replication.cc
>> > @@ -93,7 +93,6 @@ replication_init(void)
>> >  	latch_create(&replicaset.applier.order_latch);
>> >
>> >  	vclock_create(&replicaset.applier.vclock);
>> > -	vclock_copy(&replicaset.applier.vclock, &replicaset.vclock);
>> >  	rlist_create(&replicaset.applier.on_rollback);
>> >  	rlist_create(&replicaset.applier.on_commit);
>> >
>> > --
>> > 2.20.1 (Apple Git-117)
>>
>> --
>> Konstantin Osipov, Moscow, Russia
>
> --
> Konstantin Osipov, Moscow, Russia

--
Sergey Petrenko
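For illustration, the semantics of the MAX() check added in applier_apply_tx() above can be sketched in isolation. The types and names below are hypothetical toy stand-ins, not Tarantool's actual vclock API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VCLOCK_MAX 8
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Toy vclocks: per-replica LSNs applied by appliers / written to WAL. */
struct toy_vclock {
	int64_t lsn[VCLOCK_MAX];
};

/*
 * A transaction is skipped if either the applier vclock (rows some
 * applier has already scheduled) or the replicaset vclock (rows
 * already written to the local WAL) covers its first row's LSN.
 */
static bool
applier_should_skip(const struct toy_vclock *applier_vclock,
		    const struct toy_vclock *replicaset_vclock,
		    uint32_t replica_id, int64_t lsn)
{
	return MAX(applier_vclock->lsn[replica_id],
		   replicaset_vclock->lsn[replica_id]) >= lsn;
}
```

This models the corner case from the patch comment: a row written locally (replicaset vclock component 7) arrives back from a replica before any applier has recorded it (applier vclock component 3), and gets skipped instead of applied a second time.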
Thread overview (11+ messages):

2020-02-12 23:50 [Tarantool-patches] [PATCH 0/2] replication: fix applying of rows originating from local instance  sergepetrenko
2020-02-12 23:51 ` [Tarantool-patches] [PATCH 1/2] replication: correctly check for rows to skip in applier  sergepetrenko
2020-02-13  6:47   ` Konstantin Osipov
2020-02-13  6:58     ` Konstantin Osipov
2020-02-13 22:03       ` Sergey Petrenko [this message]
2020-02-13 22:10         ` Sergey Petrenko
2020-02-12 23:51 ` [Tarantool-patches] [PATCH 2/2] wal: panic when trying to write a record with a broken lsn  sergepetrenko
2020-02-13  6:48   ` Konstantin Osipov
2020-02-13 22:05     ` Sergey Petrenko
2020-02-14  7:26       ` Konstantin Osipov
2020-02-13  6:48 ` [Tarantool-patches] [PATCH 0/2] replication: fix applying of rows originating from local instance  Konstantin Osipov