<HTML><BODY><div> </div><blockquote style="border-left:1px solid #0857A6; margin:10px; padding:0 0 0 10px;">Четверг, 27 февраля 2020, 17:13 +03:00 от Serge Petrenko <sergepetrenko@tarantool.org>:<br> <div id=""><div class="js-helper js-readmsg-msg"><style type="text/css"></style><div><div id="style_15828128211925570827_BODY"><div class="class_1582853589"><div style="font-family:Menlo; font-size:11px; font-stretch:normal; line-height:normal; margin-bottom:0px; margin-left:0px; margin-right:0px; margin-top:0px"> </div><div><blockquote type="cite"><div>27 февр. 2020 г., в 02:54, Vladislav Shpilevoy <<a href="//e.mail.ru/compose/?mailto=mailto%3av.shpilevoy@tarantool.org" rel="noopener noreferrer">v.shpilevoy@tarantool.org</a>> написал(а):</div> <div><div>Thanks for the patch!<br> </div></div></blockquote><div> </div><div>Hi! Thanks for the review!</div><div> </div><div>Please find my comments and the new diff below.</div> <blockquote type="cite"><div><div>See 4 comments below.<br> <blockquote type="cite">   replication: do not relay rows coming from a remote instance back to it<br><br>   We have a mechanism for restoring rows originating from an instance that<br>   suffered a sudden power loss: remote masters resend the isntance's rows<br>   received before a certain point in time, defined by remote master vclock<br>   at the moment of subscribe.<br>   However, this is useful only on initial replication configuraiton, when<br>   an instance has just recovered, so that it can receive what it has<br>   relayed but haven't synced to disk.<br>   In other cases, when an instance is operating normally and master-master<br>   replication is configured, the mechanism described above may lead to<br>   instance re-applying instance's own rows, coming from a master it has just<br>   subscribed to.<br>   To fix the problem do not relay rows coming from a remote instance, if<br>   the instance has already recovered.<br><br>   Closes #4739<br><br>diff --git a/src/box/<a href="http://applier.cc" rel="noopener noreferrer" target="_blank">applier.cc</a> b/src/box/<a href="http://applier.cc" rel="noopener noreferrer" target="_blank">applier.cc</a><br>index 911353425..73ffc0d68 100644<br>--- a/src/box/<a href="http://applier.cc" rel="noopener noreferrer" target="_blank">applier.cc</a><br>+++ b/src/box/<a href="http://applier.cc" rel="noopener noreferrer" target="_blank">applier.cc</a><br>@@ -866,8 +866,13 @@ applier_subscribe(struct applier *applier)<br>struct vclock vclock;<br>vclock_create(&vclock);<br>vclock_copy(&vclock, &replicaset.vclock);<br>+<span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span>/*<br>+<span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span> * Stop accepting local rows coming from a remote<br>+<span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span> * instance as soon as local WAL starts accepting writes.<br>+<span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span> */<br>+<span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span>unsigned int id_filter = box_is_orphan() ? 0 : 1 << instance_id;</blockquote><br>1. I was always wondering, what if the instance got orphaned after it<br>started accepting writes? WAL is fully functional, it syncs whatever is<br>needed, and then a resubscribe happens. Can this break anything?<br> <blockquote type="cite">xrow_encode_subscribe_xc(&row, &REPLICASET_UUID, &INSTANCE_UUID,<br>-<span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span><span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span><span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span><span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span> &vclock, replication_anon, 0);<br>+<span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span><span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span><span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span><span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span> &vclock, replication_anon, id_filter);<br>coio_write_xrow(coio, &row);<br><br>/* Read SUBSCRIBE response */<br>diff --git a/src/box/wal.c b/src/box/wal.c<br>index 27bff662a..35ba7b072 100644<br>--- a/src/box/wal.c<br>+++ b/src/box/wal.c<br>@@ -278,8 +278,13 @@ tx_schedule_commit(struct cmsg *msg)<br>/* Closes the input valve. */<br>stailq_concat(&writer->rollback, &batch->rollback);<br>}<br>+<br>+<span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span>ERROR_INJECT(ERRINJ_REPLICASET_VCLOCK_UPDATE, { goto skip_update; });<br>/* Update the tx vclock to the latest written by wal. */<br>vclock_copy(&replicaset.vclock, &batch->vclock);<br>+#ifndef NDEBUG<br>+skip_update:<br>+#endif</blockquote><br>2. Consider this hack which I just invented. In that way you won't<br>depend on ERRINJ and NDEBUG interconnection.<br><br>====================<br>@@ -282,9 +282,7 @@ tx_schedule_commit(struct cmsg *msg)<br>ERROR_INJECT(ERRINJ_REPLICASET_VCLOCK_UPDATE, { goto skip_update; });<br>/* Update the tx vclock to the latest written by wal. */<br>vclock_copy(&replicaset.vclock, &batch->vclock);<br>-#ifndef NDEBUG<br>-skip_update:<br>-#endif<br>+<span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span>ERROR_INJECT(ERRINJ_REPLICASET_VCLOCK_UPDATE, {skip_update:;});<br>tx_schedule_queue(&batch->commit);<br>mempool_free(&writer->msg_pool, container_of(msg, struct wal_msg, base));<br>}<br>====================</div></div></blockquote><div> </div><div>Good one, applied.</div> <blockquote type="cite"><div><div><br>Talking of the injection itself - don't know really. Perhaps<br>it would be better to add a delay to the wal_write_to_disk()<br>function, to its very end, after wal_notify_watchers(). In<br>that case relay will wake up, send whatever it wants, and TX<br>won't update the vclock until you let wal_write_to_disk()<br>finish. Seems more natural this way.</div></div></blockquote><div> </div><div>I tried to add a sleep first. It’s impossible to sleep in tx_schedule_commit(),</div><div>since it’s processed in tx_prio endpoint, where yielding is impossible.</div><div>I also tried to add a sleep at the end of wal_write_to_disk(), just like you</div><div>suggest. This didn’t work out either. I’ll give you more details in the evening,</div><div>when I give it another try. I’ll send a follow-up if I succeed with adding a sleep.</div></div></div></div></div></div></div></blockquote><div> </div><div>Ok. I tried to add either ERROR_INJECT_YIELD or ERROR_INJECT_SLEEP at the end</div><div>of wal_write_to_disk().<br>It looks like you cannot yield in wal_write_to_disk(). (is it possible to yield in WAL thread at all?)</div><div>Firing this injection with ERROR_INJECT_YIELD and then resetting it leads to wal thread stopping</div><div>processing messages. This leads to tarantool hanging infinitely on shutdown when tx waits for wal</div><div>thread to exit, but wal never gets the shutdown signal.</div><div> </div><div>Using ERROR_INJECT_SLEEP leads to wal watchers not being notified until the injection is reset. This</div><div>probably happens because of  wal_notify_watchers’ use of cpipe_flush_input(), which doesn’t flush the input until</div><div>the end of event loop iteration, if there are not enough messages (only one message in our case).</div><div>The event loop iteration never ends, because we sleep right after wal_notify_watchers() call.</div><div> </div><div>So, I see skipping vclock assignment in tx_schedule_commit() as the only possible alternative.</div><div>Hope my explanation was clear enough and, more imortantly, correct. If not, lest discuss.</div><blockquote style="border-left:1px solid #0857A6; margin:10px; padding:0 0 0 10px;"><div><div class="js-helper js-readmsg-msg"><div><div><div class="class_1582853589"><div><blockquote type="cite"><div><div> <blockquote type="cite">tx_schedule_queue(&batch->commit);<br>mempool_free(&writer->msg_pool, container_of(msg, struct wal_msg, base));<br>}<br>diff --git a/test/replication/gh-4739-vclock-assert.result b/test/replication/gh-4739-vclock-assert.result<br>new file mode 100644<br>index 000000000..7dc2f7118<br>--- /dev/null<br>+++ b/test/replication/gh-4739-vclock-assert.result<br>@@ -0,0 +1,82 @@<br>+-- test-run result file version 2<br>+env = require('test_run')<br>+ | ---<br>+ | ...<br>+test_run = env.new()<br>+ | ---<br>+ | ...<br>+<br>+SERVERS = {'rebootstrap1', 'rebootstrap2'}<br>+ | ---<br>+ | ...<br>+test_run:create_cluster(SERVERS, "replication")<br>+ | ---<br>+ | ...<br>+test_run:wait_fullmesh(SERVERS)<br>+ | ---<br>+ | ...<br>+<br>+test_run:cmd('switch rebootstrap1')<br>+ | ---<br>+ | - true<br>+ | ...<br>+fiber = require('fiber')<br>+ | ---<br>+ | ...<br>+-- Stop updating replicaset vclock to simulate a situation, when<br>+-- a row is already relayed to the remote master, but the local<br>+-- vclock update hasn't happened yet.<br>+box.error.injection.set('ERRINJ_REPLICASET_VCLOCK_UPDATE', true)<br>+ | ---<br>+ | - ok<br>+ | ...<br>+lsn = box.info.lsn<br>+ | ---<br>+ | ...<br>+box.space._schema:replace{'something'}<br>+ | ---<br>+ | - ['something']<br>+ | ...<br>+-- Vclock isn't updated.<br>+box.info.lsn == lsn<br>+ | ---<br>+ | - true<br>+ | ...<br>+<br>+-- Wait until the remote instance gets the row.<br>+while test_run:get_vclock('rebootstrap2')[<a href="http://box.info.id" rel="noopener noreferrer" target="_blank">box.info.id</a>] == lsn do\<br>+    fiber.sleep(0.01)\<br>+end</blockquote><br>3. There is a cool thing which I discovered relatively recently:<br>test_run:wait_cond(). It does fiber sleep and while cycle, and<br>has a finite timeout, so such a test won't hang for 10 minutes<br>in Travis in case of a problem.</div></div></blockquote><div> </div><div>Thanks!</div> <blockquote type="cite"><div><div> <blockquote type="cite">+ | ---<br>+ | ...<br>+<br>+-- Restart the remote instance. This will make the first instance<br>+-- resubscribe without entering orphan mode.<br>+test_run:cmd('restart server rebootstrap2')<br>+ | ---<br>+ | - true<br>+ | ...<br>+test_run:cmd('switch rebootstrap1')<br>+ | ---<br>+ | - true<br>+ | ...<br>+-- Wait until resubscribe is sent<br>+fiber.sleep(2 * box.cfg.replication_timeout)</blockquote><br>4. Don't we collect any statistics on replication requests, just<br>like we do in box.stat()? Perhaps <a href="http://box.stat.net" rel="noopener noreferrer" target="_blank">box.stat.net</a>() can help? To<br>wait properly. Maybe just do test_run:wait_cond() for status 'sync'?</div></div></blockquote><div> </div><div>wait_cond for ’sync’ is enough. Applied.</div> <blockquote type="cite"><div><div> <blockquote type="cite">+ | ---<br>+ | ...<br>+box.info.replication[2].upstream.status<br>+ | ---<br>+ | - sync<br>+ | ...<br>+<br>+box.error.injection.set('ERRINJ_REPLICASET_VCLOCK_UPDATE', false)<br>+ | ---<br>+ | - ok<br>+ | ...<br>+test_run:cmd('switch default')<br>+ | ---<br>+ | - true<br>+ | ...<br>+test_run:drop_cluster(SERVERS)<br>+ | ---<br>+ | …</blockquote></div></div></blockquote></div><span>diff --git a/src/box/<a href="http://applier.cc" rel="noopener noreferrer" target="_blank">applier.cc</a> b/src/box/<a href="http://applier.cc" rel="noopener noreferrer" target="_blank">applier.cc</a><br>index 73ffc0d68..78f3d8a73 100644<br>--- a/src/box/<a href="http://applier.cc" rel="noopener noreferrer" target="_blank">applier.cc</a><br>+++ b/src/box/<a href="http://applier.cc" rel="noopener noreferrer" target="_blank">applier.cc</a><br>@@ -870,7 +870,7 @@ applier_subscribe(struct applier *applier)<br> <span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span> * Stop accepting local rows coming from a remote<br> <span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span> * instance as soon as local WAL starts accepting writes.<br> <span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span> */<br>-<span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span>unsigned int id_filter = box_is_orphan() ? 0 : 1 << instance_id;<br>+<span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span>uint32_t id_filter = box_is_orphan() ? 0 : 1 << instance_id;<br> <span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span>xrow_encode_subscribe_xc(&row, &REPLICASET_UUID, &INSTANCE_UUID,<br> <span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span> &vclock, replication_anon, id_filter);<br> <span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span>coio_write_xrow(coio, &row);<br>diff --git a/src/box/wal.c b/src/box/wal.c<br>index 35ba7b072..bf127b259 100644<br>--- a/src/box/wal.c<br>+++ b/src/box/wal.c<br>@@ -282,9 +282,7 @@ tx_schedule_commit(struct cmsg *msg)<br> <span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span>ERROR_INJECT(ERRINJ_REPLICASET_VCLOCK_UPDATE, { goto skip_update; });<br> <span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span>/* Update the tx vclock to the latest written by wal. */<br> <span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span>vclock_copy(&replicaset.vclock, &batch->vclock);<br>-#ifndef NDEBUG<br>-skip_update:<br>-#endif<br>+<span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span>ERROR_INJECT(ERRINJ_REPLICASET_VCLOCK_UPDATE, {skip_update:;});<br> <span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span>tx_schedule_queue(&batch->commit);<br> <span class="Apple-tab-span_mailru_css_attribute_postfix" style="white-space:pre"> </span>mempool_free(&writer->msg_pool, container_of(msg, struct wal_msg, base));<br> }<br>diff --git a/test/replication/gh-4739-vclock-assert.result b/test/replication/gh-4739-vclock-assert.result<br>index 7dc2f7118..a612826a0 100644<br>--- a/test/replication/gh-4739-vclock-assert.result<br>+++ b/test/replication/gh-4739-vclock-assert.result<br>@@ -44,10 +44,11 @@ box.info.lsn == lsn<br>  | ...<br> <br> -- Wait until the remote instance gets the row.<br>-while test_run:get_vclock('rebootstrap2')[<a href="http://box.info.id" rel="noopener noreferrer" target="_blank">box.info.id</a>] == lsn do\<br>-    fiber.sleep(0.01)\<br>-end<br>+test_run:wait_cond(function()\<br>+    return test_run:get_vclock('rebootstrap2')[<a href="http://box.info.id" rel="noopener noreferrer" target="_blank">box.info.id</a>] > lsn\<br>+end, 10)<br>  | ---<br>+ | - true<br>  | ...<br> <br> -- Restart the remote instance. This will make the first instance<br>@@ -61,14 +62,12 @@ test_run:cmd('switch rebootstrap1')<br>  | - true<br>  | ...<br> -- Wait until resubscribe is sent<br>-fiber.sleep(2 * box.cfg.replication_timeout)<br>- | ---<br>- | ...<br>-box.info.replication[2].upstream.status<br>+test_run:wait_cond(function()\<br>+    return box.info.replication[2].upstream.status == 'sync'\<br>+end, 10)<br>  | ---<br>- | - sync<br>+ | - true<br>  | ...<br>-<br> box.error.injection.set('ERRINJ_REPLICASET_VCLOCK_UPDATE', false)<br>  | ---<br>  | - ok<br>diff --git a/test/replication/gh-4739-vclock-assert.test.lua b/test/replication/gh-4739-vclock-assert.test.lua<br>index 26dc781e2..b6a7caf3b 100644<br>--- a/test/replication/gh-4739-vclock-assert.test.lua<br>+++ b/test/replication/gh-4739-vclock-assert.test.lua<br>@@ -17,18 +17,18 @@ box.space._schema:replace{'something'}<br> box.info.lsn == lsn<br> <br> -- Wait until the remote instance gets the row.<br>-while test_run:get_vclock('rebootstrap2')[<a href="http://box.info.id" rel="noopener noreferrer" target="_blank">box.info.id</a>] == lsn do\<br>-    fiber.sleep(0.01)\<br>-end<br>+test_run:wait_cond(function()\<br>+    return test_run:get_vclock('rebootstrap2')[<a href="http://box.info.id" rel="noopener noreferrer" target="_blank">box.info.id</a>] > lsn\<br>+end, 10)<br> <br> -- Restart the remote instance. This will make the first instance<br> -- resubscribe without entering orphan mode.<br> test_run:cmd('restart server rebootstrap2')<br> test_run:cmd('switch rebootstrap1')<br> -- Wait until resubscribe is sent<br>-fiber.sleep(2 * box.cfg.replication_timeout)<br>-box.info.replication[2].upstream.status<br>-<br>+test_run:wait_cond(function()\<br>+    return box.info.replication[2].upstream.status == 'sync'\<br>+end, 10)<br> box.error.injection.set('ERRINJ_REPLICASET_VCLOCK_UPDATE', false)<br> test_run:cmd('switch default')<br> test_run:drop_cluster(SERVERS)</span><br> <div><span>--<br>Serge Petrenko<br><a href="//e.mail.ru/compose/?mailto=mailto%3asergepetrenko@tarantool.org" rel="noopener noreferrer">sergepetrenko@tarantool.org</a></span></div><div> </div></div></div></div></div></div></blockquote><div> </div></BODY></HTML>