Hi!

Saturday, February 29, 2020, 2:43 +03:00 from Vladislav Shpilevoy <v.shpilevoy@tarantool.org>:

> Thanks for the patch!
>
> On 28/02/2020 18:01, Serge Petrenko wrote:
>> When checking whether rejoin is needed, the replica loops through all
>> the instances in box.cfg.replication, which makes it believe that
>> there is a master holding the files it needs, since it accounts for
>> itself just like for all the other instances.
>> So make the replica skip itself when looking for an instance which
>> holds the files it needs and when determining whether rebootstrap is
>> needed.
>>
>> We already have a working test for the issue; it missed the bug due to
>> replica.lua replication settings. Fix replica.lua to optionally
>> include itself in box.cfg.replication, so that the corresponding test
>> works correctly.
>>
>> Closes #4759
>> ---
>> https://github.com/tarantool/tarantool/issues/4759
>> https://github.com/tarantool/tarantool/tree/sp/gh-4759-rebootstrap-fix
>>
>> @ChangeLog
>> - fix rebootstrap procedure not working in case the replica itself
>>   is listed in `box.cfg.replication`
>>
>>  src/box/replication.cc                   | 13 ++++++++++++-
>>  test/replication/replica.lua             | 11 ++++++++++-
>>  test/replication/replica_rejoin.result   | 12 ++++++------
>>  test/replication/replica_rejoin.test.lua | 12 ++++++------
>>  4 files changed, 34 insertions(+), 14 deletions(-)
>>
>> diff --git a/src/box/replication.cc b/src/box/replication.cc
>> index e7bfa22ab..01edc0fb2 100644
>> --- a/src/box/replication.cc
>> +++ b/src/box/replication.cc
>> @@ -768,8 +768,19 @@ replicaset_needs_rejoin(struct replica **master)
>>  	struct replica *leader = NULL;
>>  	replicaset_foreach(replica) {
>>  		struct applier *applier = replica->applier;
>> -		if (applier == NULL)
>> +		/*
>> +		 * The function is called right after
>> +		 * box_sync_replication(), which in turn calls
>> +		 * replicaset_connect(), which ensures that
>> +		 * appliers are either stopped (APPLIER_OFF) or
>> +		 * connected.
>> +		 * Also ignore self, as the self applier might not
>> +		 * have disconnected yet.
>> +		 */
>> +		if (applier == NULL || applier->state == APPLIER_OFF ||
>> +		    tt_uuid_is_equal(&replica->uuid, &INSTANCE_UUID))
>>  			continue;
>> +		assert(applier->state == APPLIER_CONNECTED);
>
> Could you please help me understand one thing? Below I see this:
>
>> 		const struct ballot *ballot = &applier->ballot;
>> 		if (vclock_compare(&ballot->gc_vclock,
>> 				   &replicaset.vclock) <= 0) {
>> 			/*
>> 			 * There's at least one master that still stores
>> 			 * WALs needed by this instance. Proceed to local
>> 			 * recovery.
>> 			 */
>> 			return false;
>> 		}
>
> The question is: why do we need rebootstrap if some remote node's
> vclock is bigger than ours? It does not mean anything. It doesn't say
> whether that remote instance still keeps any xlogs. All it tells us is
> that the remote instance has committed some more data since our
> restart.

In the piece above we check the master's gc vclock, which is the vclock
of the oldest WAL the instance still has. If the master's gc vclock is
less than or equal to our vclock, no rebootstrap is needed: the master
will be able to send us all the fresh data.

Later we check whether our vclock is greater than or incompatible with
the remote instance's vclock. If it is, we cannot rebootstrap from that
instance, since we have some rows that the instance doesn't have.

--
Serge Petrenko