<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div style="margin: 0px; font-stretch: normal; font-size: 11px; line-height: normal; font-family: Menlo;" class=""><br class=""></div><div><blockquote type="cite" class=""><div class="">27 февр. 2020 г., в 02:54, Vladislav Shpilevoy <<a href="mailto:v.shpilevoy@tarantool.org" class="">v.shpilevoy@tarantool.org</a>> написал(а):</div><br class="Apple-interchange-newline"><div class=""><div class="">Thanks for the patch!<br class=""><br class=""></div></div></blockquote><div><br class=""></div><div>Hi! Thanks for the review!</div><div><br class=""></div><div>Please find my comments and the new diff below.</div><br class=""><blockquote type="cite" class=""><div class=""><div class="">See 4 comments below.<br class=""><br class=""><blockquote type="cite" class=""> replication: do not relay rows coming from a remote instance back to it<br class=""><br class=""> We have a mechanism for restoring rows originating from an instance that<br class=""> suffered a sudden power loss: remote masters resend the isntance's rows<br class=""> received before a certain point in time, defined by remote master vclock<br class=""> at the moment of subscribe.<br class=""> However, this is useful only on initial replication configuraiton, when<br class=""> an instance has just recovered, so that it can receive what it has<br class=""> relayed but haven't synced to disk.<br class=""> In other cases, when an instance is operating normally and master-master<br class=""> replication is configured, the mechanism described above may lead to<br class=""> instance re-applying instance's own rows, coming from a master it has just<br class=""> subscribed to.<br class=""> To fix the problem do not relay rows coming from a remote instance, if<br class=""> the instance has already recovered.<br class=""><br class=""> Closes #4739<br class=""><br class="">diff --git a/src/box/<a href="http://applier.cc" class="">applier.cc</a> b/src/box/<a href="http://applier.cc" class="">applier.cc</a><br class="">index 911353425..73ffc0d68 100644<br class="">--- a/src/box/<a href="http://applier.cc" class="">applier.cc</a><br class="">+++ b/src/box/<a href="http://applier.cc" class="">applier.cc</a><br class="">@@ -866,8 +866,13 @@ applier_subscribe(struct applier *applier)<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>struct vclock vclock;<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>vclock_create(&vclock);<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>vclock_copy(&vclock, &replicaset.vclock);<br class="">+<span class="Apple-tab-span" style="white-space:pre"> </span>/*<br class="">+<span class="Apple-tab-span" style="white-space:pre"> </span> * Stop accepting local rows coming from a remote<br class="">+<span class="Apple-tab-span" style="white-space:pre"> </span> * instance as soon as local WAL starts accepting writes.<br class="">+<span class="Apple-tab-span" style="white-space:pre"> </span> */<br class="">+<span class="Apple-tab-span" style="white-space:pre"> </span>unsigned int id_filter = box_is_orphan() ? 0 : 1 << instance_id;<br class=""></blockquote><br class="">1. I was always wondering, what if the instance got orphaned after it<br class="">started accepting writes? WAL is fully functional, it syncs whatever is<br class="">needed, and then a resubscribe happens. Can this break anything?<br class=""><br class=""><blockquote type="cite" class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>xrow_encode_subscribe_xc(&row, &REPLICASET_UUID, &INSTANCE_UUID,<br class="">-<span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span> &vclock, replication_anon, 0);<br class="">+<span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span> &vclock, replication_anon, id_filter);<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>coio_write_xrow(coio, &row);<br class=""><br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>/* Read SUBSCRIBE response */<br class="">diff --git a/src/box/wal.c b/src/box/wal.c<br class="">index 27bff662a..35ba7b072 100644<br class="">--- a/src/box/wal.c<br class="">+++ b/src/box/wal.c<br class="">@@ -278,8 +278,13 @@ tx_schedule_commit(struct cmsg *msg)<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>/* Closes the input valve. */<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>stailq_concat(&writer->rollback, &batch->rollback);<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>}<br class="">+<br class="">+<span class="Apple-tab-span" style="white-space:pre"> </span>ERROR_INJECT(ERRINJ_REPLICASET_VCLOCK_UPDATE, { goto skip_update; });<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>/* Update the tx vclock to the latest written by wal. */<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>vclock_copy(&replicaset.vclock, &batch->vclock);<br class="">+#ifndef NDEBUG<br class="">+skip_update:<br class="">+#endif<br class=""></blockquote><br class="">2. Consider this hack which I just invented. In that way you won't<br class="">depend on ERRINJ and NDEBUG interconnection.<br class=""><br class="">====================<br class="">@@ -282,9 +282,7 @@ tx_schedule_commit(struct cmsg *msg)<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>ERROR_INJECT(ERRINJ_REPLICASET_VCLOCK_UPDATE, { goto skip_update; });<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>/* Update the tx vclock to the latest written by wal. */<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>vclock_copy(&replicaset.vclock, &batch->vclock);<br class="">-#ifndef NDEBUG<br class="">-skip_update:<br class="">-#endif<br class="">+<span class="Apple-tab-span" style="white-space:pre"> </span>ERROR_INJECT(ERRINJ_REPLICASET_VCLOCK_UPDATE, {skip_update:;});<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>tx_schedule_queue(&batch->commit);<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>mempool_free(&writer->msg_pool, container_of(msg, struct wal_msg, base));<br class=""> }<br class="">====================<br class=""></div></div></blockquote><div><br class=""></div><div>Good one, applied.</div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><br class="">Talking of the injection itself - don't know really. Perhaps<br class="">it would be better to add a delay to the wal_write_to_disk()<br class="">function, to its very end, after wal_notify_watchers(). In<br class="">that case relay will wake up, send whatever it wants, and TX<br class="">won't update the vclock until you let wal_write_to_disk()<br class="">finish. Seems more natural this way.<br class=""></div></div></blockquote><div><br class=""></div><div>I tried to add a sleep first. It’s impossible to sleep in tx_schedule_commit(),</div><div>since it’s processed in tx_prio endpoint, where yielding is impossible.</div><div>I also tried to add a sleep at the end of wal_write_to_disk(), just like you</div><div>suggest. This didn’t work out either. I’ll give you more details in the evening,</div><div>when I give it another try. I’ll send a follow-up if I succeed with adding a sleep.</div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><br class=""><blockquote type="cite" class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>tx_schedule_queue(&batch->commit);<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>mempool_free(&writer->msg_pool, container_of(msg, struct wal_msg, base));<br class=""> }<br class="">diff --git a/test/replication/gh-4739-vclock-assert.result b/test/replication/gh-4739-vclock-assert.result<br class="">new file mode 100644<br class="">index 000000000..7dc2f7118<br class="">--- /dev/null<br class="">+++ b/test/replication/gh-4739-vclock-assert.result<br class="">@@ -0,0 +1,82 @@<br class="">+-- test-run result file version 2<br class="">+env = require('test_run')<br class="">+ | ---<br class="">+ | ...<br class="">+test_run = env.new()<br class="">+ | ---<br class="">+ | ...<br class="">+<br class="">+SERVERS = {'rebootstrap1', 'rebootstrap2'}<br class="">+ | ---<br class="">+ | ...<br class="">+test_run:create_cluster(SERVERS, "replication")<br class="">+ | ---<br class="">+ | ...<br class="">+test_run:wait_fullmesh(SERVERS)<br class="">+ | ---<br class="">+ | ...<br class="">+<br class="">+test_run:cmd('switch rebootstrap1')<br class="">+ | ---<br class="">+ | - true<br class="">+ | ...<br class="">+fiber = require('fiber')<br class="">+ | ---<br class="">+ | ...<br class="">+-- Stop updating replicaset vclock to simulate a situation, when<br class="">+-- a row is already relayed to the remote master, but the local<br class="">+-- vclock update hasn't happened yet.<br class="">+box.error.injection.set('ERRINJ_REPLICASET_VCLOCK_UPDATE', true)<br class="">+ | ---<br class="">+ | - ok<br class="">+ | ...<br class="">+lsn = box.info.lsn<br class="">+ | ---<br class="">+ | ...<br class="">+box.space._schema:replace{'something'}<br class="">+ | ---<br class="">+ | - ['something']<br class="">+ | ...<br class="">+-- Vclock isn't updated.<br class="">+box.info.lsn == lsn<br class="">+ | ---<br class="">+ | - true<br class="">+ | ...<br class="">+<br class="">+-- Wait until the remote instance gets the row.<br class="">+while test_run:get_vclock('rebootstrap2')[<a href="http://box.info.id" class="">box.info.id</a>] == lsn do\<br class="">+ fiber.sleep(0.01)\<br class="">+end<br class=""></blockquote><br class="">3. There is a cool thing which I discovered relatively recently:<br class="">test_run:wait_cond(). It does fiber sleep and while cycle, and<br class="">has a finite timeout, so such a test won't hang for 10 minutes<br class="">in Travis in case of a problem.<br class=""></div></div></blockquote><div><br class=""></div><div>Thanks!</div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><br class=""><blockquote type="cite" class="">+ | ---<br class="">+ | ...<br class="">+<br class="">+-- Restart the remote instance. This will make the first instance<br class="">+-- resubscribe without entering orphan mode.<br class="">+test_run:cmd('restart server rebootstrap2')<br class="">+ | ---<br class="">+ | - true<br class="">+ | ...<br class="">+test_run:cmd('switch rebootstrap1')<br class="">+ | ---<br class="">+ | - true<br class="">+ | ...<br class="">+-- Wait until resubscribe is sent<br class="">+fiber.sleep(2 * box.cfg.replication_timeout)<br class=""></blockquote><br class="">4. Don't we collect any statistics on replication requests, just<br class="">like we do in box.stat()? Perhaps <a href="http://box.stat.net" class="">box.stat.net</a>() can help? To<br class="">wait properly. Maybe just do test_run:wait_cond() for status 'sync'?<br class=""></div></div></blockquote><div><br class=""></div><div>wait_cond for ’sync’ is enough. Applied.</div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><br class=""><blockquote type="cite" class="">+ | ---<br class="">+ | ...<br class="">+box.info.replication[2].upstream.status<br class="">+ | ---<br class="">+ | - sync<br class="">+ | ...<br class="">+<br class="">+box.error.injection.set('ERRINJ_REPLICASET_VCLOCK_UPDATE', false)<br class="">+ | ---<br class="">+ | - ok<br class="">+ | ...<br class="">+test_run:cmd('switch default')<br class="">+ | ---<br class="">+ | - true<br class="">+ | ...<br class="">+test_run:drop_cluster(SERVERS)<br class="">+ | ---<br class="">+ | …</blockquote></div></div></blockquote><br class=""></div><span class="">diff --git a/src/box/<a href="http://applier.cc" class="">applier.cc</a> b/src/box/<a href="http://applier.cc" class="">applier.cc</a><br class="">index 73ffc0d68..78f3d8a73 100644<br class="">--- a/src/box/<a href="http://applier.cc" class="">applier.cc</a><br class="">+++ b/src/box/<a href="http://applier.cc" class="">applier.cc</a><br class="">@@ -870,7 +870,7 @@ applier_subscribe(struct applier *applier)<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span> * Stop accepting local rows coming from a remote<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span> * instance as soon as local WAL starts accepting writes.<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span> */<br class="">-<span class="Apple-tab-span" style="white-space:pre"> </span>unsigned int id_filter = box_is_orphan() ? 0 : 1 << instance_id;<br class="">+<span class="Apple-tab-span" style="white-space:pre"> </span>uint32_t id_filter = box_is_orphan() ? 0 : 1 << instance_id;<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>xrow_encode_subscribe_xc(&row, &REPLICASET_UUID, &INSTANCE_UUID,<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span> &vclock, replication_anon, id_filter);<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>coio_write_xrow(coio, &row);<br class="">diff --git a/src/box/wal.c b/src/box/wal.c<br class="">index 35ba7b072..bf127b259 100644<br class="">--- a/src/box/wal.c<br class="">+++ b/src/box/wal.c<br class="">@@ -282,9 +282,7 @@ tx_schedule_commit(struct cmsg *msg)<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>ERROR_INJECT(ERRINJ_REPLICASET_VCLOCK_UPDATE, { goto skip_update; });<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>/* Update the tx vclock to the latest written by wal. */<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>vclock_copy(&replicaset.vclock, &batch->vclock);<br class="">-#ifndef NDEBUG<br class="">-skip_update:<br class="">-#endif<br class="">+<span class="Apple-tab-span" style="white-space:pre"> </span>ERROR_INJECT(ERRINJ_REPLICASET_VCLOCK_UPDATE, {skip_update:;});<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>tx_schedule_queue(&batch->commit);<br class=""> <span class="Apple-tab-span" style="white-space:pre"> </span>mempool_free(&writer->msg_pool, container_of(msg, struct wal_msg, base));<br class=""> }<br class="">diff --git a/test/replication/gh-4739-vclock-assert.result b/test/replication/gh-4739-vclock-assert.result<br class="">index 7dc2f7118..a612826a0 100644<br class="">--- a/test/replication/gh-4739-vclock-assert.result<br class="">+++ b/test/replication/gh-4739-vclock-assert.result<br class="">@@ -44,10 +44,11 @@ box.info.lsn == lsn<br class=""> | ...<br class=""> <br class=""> -- Wait until the remote instance gets the row.<br class="">-while test_run:get_vclock('rebootstrap2')[<a href="http://box.info.id" class="">box.info.id</a>] == lsn do\<br class="">- fiber.sleep(0.01)\<br class="">-end<br class="">+test_run:wait_cond(function()\<br class="">+ return test_run:get_vclock('rebootstrap2')[<a href="http://box.info.id" class="">box.info.id</a>] > lsn\<br class="">+end, 10)<br class=""> | ---<br class="">+ | - true<br class=""> | ...<br class=""> <br class=""> -- Restart the remote instance. This will make the first instance<br class="">@@ -61,14 +62,12 @@ test_run:cmd('switch rebootstrap1')<br class=""> | - true<br class=""> | ...<br class=""> -- Wait until resubscribe is sent<br class="">-fiber.sleep(2 * box.cfg.replication_timeout)<br class="">- | ---<br class="">- | ...<br class="">-box.info.replication[2].upstream.status<br class="">+test_run:wait_cond(function()\<br class="">+ return box.info.replication[2].upstream.status == 'sync'\<br class="">+end, 10)<br class=""> | ---<br class="">- | - sync<br class="">+ | - true<br class=""> | ...<br class="">-<br class=""> box.error.injection.set('ERRINJ_REPLICASET_VCLOCK_UPDATE', false)<br class=""> | ---<br class=""> | - ok<br class="">diff --git a/test/replication/gh-4739-vclock-assert.test.lua b/test/replication/gh-4739-vclock-assert.test.lua<br class="">index 26dc781e2..b6a7caf3b 100644<br class="">--- a/test/replication/gh-4739-vclock-assert.test.lua<br class="">+++ b/test/replication/gh-4739-vclock-assert.test.lua<br class="">@@ -17,18 +17,18 @@ box.space._schema:replace{'something'}<br class=""> box.info.lsn == lsn<br class=""> <br class=""> -- Wait until the remote instance gets the row.<br class="">-while test_run:get_vclock('rebootstrap2')[<a href="http://box.info.id" class="">box.info.id</a>] == lsn do\<br class="">- fiber.sleep(0.01)\<br class="">-end<br class="">+test_run:wait_cond(function()\<br class="">+ return test_run:get_vclock('rebootstrap2')[<a href="http://box.info.id" class="">box.info.id</a>] > lsn\<br class="">+end, 10)<br class=""> <br class=""> -- Restart the remote instance. This will make the first instance<br class=""> -- resubscribe without entering orphan mode.<br class=""> test_run:cmd('restart server rebootstrap2')<br class=""> test_run:cmd('switch rebootstrap1')<br class=""> -- Wait until resubscribe is sent<br class="">-fiber.sleep(2 * box.cfg.replication_timeout)<br class="">-box.info.replication[2].upstream.status<br class="">-<br class="">+test_run:wait_cond(function()\<br class="">+ return box.info.replication[2].upstream.status == 'sync'\<br class="">+end, 10)<br class=""> box.error.injection.set('ERRINJ_REPLICASET_VCLOCK_UPDATE', false)<br class=""> test_run:cmd('switch default')<br class=""> test_run:drop_cluster(SERVERS)<br class=""><br class=""><div class="">--<br class="">Serge Petrenko<br class=""><a href="mailto:sergepetrenko@tarantool.org" class="">sergepetrenko@tarantool.org</a><br class=""></div></span><div><br class=""></div><br class=""></body></html>