Thursday, February 13, 2020, 9:58 +03:00 from Konstantin Osipov <kostja.osipov@gmail.com>:

> * Konstantin Osipov <kostja.osipov@gmail.com> [20/02/13 09:47]:
>
> From relay.cc:
>
>         /*
>          * We're feeding a WAL, thus responding to FINAL JOIN or SUBSCRIBE
>          * request. If this is FINAL JOIN (i.e. relay->replica is NULL),
>          * we must relay all rows, even those originating from the replica
>          * itself (there may be such rows if this is rebootstrap). If this
>          * is SUBSCRIBE, only send a row if it is not from the same replica
>          * (i.e. don't send the replica's own rows back) or if this row is
>          * missing on the other side (i.e. in case of sudden power loss,
>          * data was not written to WAL, so the remote master can't recover
>          * it). In the latter case the packet's LSN is less than or equal
>          * to the local master's LSN at the moment it received the
>          * 'SUBSCRIBE' request.
>          */
>         if (relay->replica == NULL ||
>             packet->replica_id != relay->replica->id ||
>             packet->lsn <= vclock_get(&relay->local_vclock_at_subscribe,
>                                       packet->replica_id)) {
>                 struct errinj *inj = errinj(ERRINJ_RELAY_BREAK_LSN,
>                                             ERRINJ_INT);
>                 if (inj != NULL && packet->lsn == inj->iparam) {
>                         packet->lsn = inj->iparam - 1;
>                         say_warn("injected broken lsn: %lld",
>                                  (long long) packet->lsn);
>                 }
>                 relay_send(relay, packet);
>         }
> }
>
> As you can see, we never send our own rows back, as long as
> they are greater than relay->local_vclock_at_subscribe.

True. First of all, local_vclock_at_subscribe was updated later than
it should have been; I describe this in more detail in one of the
commits.
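For illustration, a minimal sketch of what updating it earlier could
look like, assuming the snapshot is taken right before the SUBSCRIBE
response is sent. Only vclock_copy() mirrors the real API here;
relay_send_subscribe_response() and relay_subscribe_start() are
hypothetical names:

        /*
         * Sketch only: freeze local_vclock_at_subscribe before the
         * SUBSCRIBE response leaves, so the vclock the replica
         * receives and the vclock the relay filters by cannot
         * diverge.
         */
        static void
        subscribe_sketch(struct relay *relay)
        {
                /* Take the snapshot first... */
                vclock_copy(&relay->local_vclock_at_subscribe,
                            &replicaset.vclock);
                /* ...report exactly this vclock to the replica... */
                relay_send_subscribe_response(relay,
                                &relay->local_vclock_at_subscribe);
                /* ...and only then start feeding the WAL from it. */
                relay_subscribe_start(relay);
        }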
> So what exactly does go wrong here?
>
> Is the bug triggered during initial replication configuration, or
> during a reconfiguration?

Both: during reconfiguration, and when resubscribing after a remote
instance dies and restarts.
> I suspect the issue is that at reconfiguration we send
> local_vclock_at_subscribe, but keep changing it.

Yep.
> The fix then would be to make sure the local component in
> local_vclock_at_subscribe is set to infinity during
> reconfiguration.
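If I read that right, something along these lines (a sketch, not a
final patch; vclock_reset() stands for any helper that overwrites a
single vclock component, and instance_id is the id of the component
called "local" above):

        /*
         * Sketch only: pin the local component of the snapshot to
         * "infinity" for the duration of reconfiguration, so that no
         * row under this id can ever look newer than the snapshot.
         * INT64_MAX plays the role of infinity here.
         */
        static void
        pin_local_component(struct relay *relay, uint32_t instance_id)
        {
                vclock_reset(&relay->local_vclock_at_subscribe,
                             instance_id, INT64_MAX);
        }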
> > * sergepetrenko <sergepetrenko@tarantool.org> [20/02/13 09:34]:
> > > Fix replicaset.applier.vclock initialization issues: it wasn't
> > > initialized at all previously.
> >
> > In the next line you say that you remove the initialization. What
> > do you mean here?
> >
> > > Moreover, there is no valid point in code
> > > to initialize it, since it may get stale right away if new entries are
> > > written to WAL.
> >
> > Well, it reflects the state of the WAL *as seen by* the set of
> > appliers. This is stated in the comment. So it doesn't have to
> > reflect local changes.
> >
> > > So, check for both applier and replicaset vclocks.
> > > The greater one protects the instance from applying the rows it has
> > > already applied or has already scheduled to write.
> > > Also remove an unnecessary applier vclock initialization from
> > > replication_init().
> >
> > First of all, the race you describe applies to local changes only,
> > yet you add the check for all replica ids. This further obscures
> > this piece of code.
> >
> > Second, the core of the issue is a "hole" in the vclock protection
> > enforced by latch_lock/latch_unlock. The basic assumption behind
> > latch_lock/latch_unlock is that while a latch is locked, no other
> > source can apply a transaction under this replica id. This is
> > violated by the local WAL.
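To spell the assumption out, here is a simplified sketch of the
applier path; apply_rows_and_follow() is a stand-in for the real
application code:

        /*
         * The invariant the latch is supposed to provide: while the
         * latch for a replica id is held, nothing else advances that
         * vclock component, so the check-then-apply below is atomic.
         */
        latch_lock(latch);
        if (vclock_get(&replicaset.applier.vclock,
                       first_row->replica_id) >= first_row->lsn) {
                latch_unlock(latch);    /* duplicate row, skip it */
                return 0;
        }
        /*
         * The hole: when first_row->replica_id == instance_id, local
         * transactions advance replicaset.vclock concurrently,
         * without taking this latch, so the check above can miss
         * rows already written or scheduled locally.
         */
        apply_rows_and_follow(first_row);
        latch_unlock(latch);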
> >
> > We used to skip all changes by the local vclock id in the applier.
> >
> > Later this was changed to make it possible to get your own logs
> > back on recovery, e.g. if some replica has them and the local node
> > lost a piece of WAL.
> >
> > It will take me a while to find the commit and ticket, but that is
> > the commit and ticket which introduced this regression.
> >
> > The proper fix is to only apply local changes received from
> > remotes in orphan mode, and to begin skipping them when entering
> > read-write mode.
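In applier terms the skip could look like this (sketch only;
is_orphan stands in for whatever flag tracks orphan mode, and
instance_id is the node's own replica id):

        /*
         * Sketch only: once the node is read-write, its own rows
         * arriving from a remote master are already in the local
         * WAL, so drop them before the vclock check.
         */
        if (first_row->replica_id == instance_id && !is_orphan) {
                latch_unlock(latch);
                return 0;
        }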
> >
> > > Closes #4739
> > > ---
> > >  src/box/applier.cc     | 14 ++++++++++++--
> > >  src/box/replication.cc |  1 -
> > >  2 files changed, 12 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/src/box/applier.cc b/src/box/applier.cc
> > > index ae3d281a5..acb26b7e2 100644
> > > --- a/src/box/applier.cc
> > > +++ b/src/box/applier.cc
> > > @@ -731,8 +731,18 @@ applier_apply_tx(struct stailq *rows)
> > >          struct latch *latch = (replica ? &replica->order_latch :
> > >                                 &replicaset.applier.order_latch);
> > >          latch_lock(latch);
> > > -        if (vclock_get(&replicaset.applier.vclock,
> > > -                       first_row->replica_id) >= first_row->lsn) {
> > > +        /*
> > > +         * We cannot tell which vclock is greater. There is no
> > > +         * proper place to initialize applier vclock, since it
> > > +         * may get stale right away if we write something to WAL
> > > +         * and it gets replicated and then arrives back from the
> > > +         * replica. So check against both vclocks. Replicaset
> > > +         * vclock will guard us from corner cases like the one
> > > +         * above.
> > > +         */
> > > +        if (MAX(vclock_get(&replicaset.applier.vclock, first_row->replica_id),
> > > +                vclock_get(&replicaset.vclock, first_row->replica_id)) >=
> > > +            first_row->lsn) {
> > >                  latch_unlock(latch);
> > >                  return 0;
> > >          }
> > > diff --git a/src/box/replication.cc b/src/box/replication.cc
> > > index e7bfa22ab..7b04573a4 100644
> > > --- a/src/box/replication.cc
> > > +++ b/src/box/replication.cc
> > > @@ -93,7 +93,6 @@ replication_init(void)
> > >          latch_create(&replicaset.applier.order_latch);
> > >
> > >          vclock_create(&replicaset.applier.vclock);
> > > -        vclock_copy(&replicaset.applier.vclock, &replicaset.vclock);
> > >          rlist_create(&replicaset.applier.on_rollback);
> > >          rlist_create(&replicaset.applier.on_commit);
> > >
> > > --
> > > 2.20.1 (Apple Git-117)
> >
> > --
> > Konstantin Osipov, Moscow, Russia
>
> --
> Konstantin Osipov, Moscow, Russia
--
Sergey Petrenko