<HTML><BODY><div> </div><blockquote style="border-left:1px solid #0857A6; margin:10px; padding:0 0 0 10px;">Wednesday, 26 February 2020, 14:58 +03:00, from Konstantin Osipov &lt;kostja.osipov@gmail.com&gt;:<br> <div id=""><div class="js-helper js-readmsg-msg"><style type="text/css"></style><div><div id="style_15827183020625111453_BODY">* Serge Petrenko &lt;<a href="/compose?To=sergepetrenko@tarantool.org">sergepetrenko@tarantool.org</a>&gt; [20/02/26 14:22]:<div class="mail-quote-collapse">> I don’t think I can. The test that comes with an issue is a stress test,<br>> relying on running it with multiple workers simultaneously.<br>> It reproduces the problem when run with 4 workers on one of my PCs,<br>> and with 20 workers on the other.<br>> I think we don’t have the appropriate testing infrastructure to run the same<br>> test with multiple workers at the same time, and I couldn’t come up with a<br>> single test which would reproduce the same problem.</div><br>Is there a place in which you can inject a sleep to make the<br>problem much easier to reproduce?<br><br>What about injecting a sleep in wal code on replica, the place<br>which increments local replicaset vclock?</div></div></div></div></blockquote><div> </div><div>Thanks for the suggestion! Haven’t thought about it for some reason.</div><div>I made a test. The diff’s below.</div><blockquote style="border-left:1px solid #0857A6; margin:10px; padding:0 0 0 10px;"><div><div class="js-helper js-readmsg-msg"><div><div><br>Then you will be much more likely to receive a record from the<br>peer before you incremented the record vclock locally, and the bug<br>will be reproducible with a single master.<br><br>--<br>Konstantin Osipov, Moscow, Russia</div></div></div></div></blockquote><div> </div><div><p>diff --git a/src/box/wal.c b/src/box/wal.c</p><p>index 27bff662a..35ba7b072 100644</p><p>--- a/src/box/wal.c</p><p>+++ b/src/box/wal.c</p><p>@@ -278,8 +278,13 @@ tx_schedule_commit(struct cmsg *msg)</p><p> /* Closes the input valve. 
*/</p><p> stailq_concat(&writer->rollback, &batch->rollback);</p><p> }</p><p>+</p><p>+ ERROR_INJECT(ERRINJ_REPLICASET_VCLOCK_UPDATE, { goto skip_update; });</p><p> /* Update the tx vclock to the latest written by wal. */</p><p> vclock_copy(&replicaset.vclock, &batch->vclock);</p><p>+#ifndef NDEBUG</p><p>+skip_update:</p><p>+#endif</p><p> tx_schedule_queue(&batch->commit);</p><p> mempool_free(&writer->msg_pool, container_of(msg, struct wal_msg, base));</p><p>}</p><p>diff --git a/src/lib/core/errinj.h b/src/lib/core/errinj.h</p><p>index ed0cba903..58fe158fd 100644</p><p>--- a/src/lib/core/errinj.h</p><p>+++ b/src/lib/core/errinj.h</p><p>@@ -136,7 +136,8 @@ struct errinj {</p><p> _(ERRINJ_SWIM_FD_ONLY, ERRINJ_BOOL, {.bparam = false}) \</p><p> _(ERRINJ_DYN_MODULE_COUNT, ERRINJ_INT, {.iparam = 0}) \</p><p> _(ERRINJ_FIBER_MADVISE, ERRINJ_BOOL, {.bparam = false}) \</p><p>- _(ERRINJ_FIBER_MPROTECT, ERRINJ_INT, {.iparam = -1})</p><p>+ _(ERRINJ_FIBER_MPROTECT, ERRINJ_INT, {.iparam = -1}) \</p><p>+ _(ERRINJ_REPLICASET_VCLOCK_UPDATE, ERRINJ_BOOL, {.bparam = false}) \</p><p> </p><p>ENUM0(errinj_id, ERRINJ_LIST);</p><p>extern struct errinj errinjs[];</p><p>diff --git a/test/box/errinj.result b/test/box/errinj.result</p><p>index daa27ed24..eb0905238 100644</p><p>--- a/test/box/errinj.result</p><p>+++ b/test/box/errinj.result</p><p>@@ -64,6 +64,7 @@ evals</p><p> - ERRINJ_RELAY_REPORT_INTERVAL: 0</p><p> - ERRINJ_RELAY_SEND_DELAY: false</p><p> - ERRINJ_RELAY_TIMEOUT: 0</p><p>+ - ERRINJ_REPLICASET_VCLOCK_UPDATE: false</p><p> - ERRINJ_REPLICA_JOIN_DELAY: false</p><p> - ERRINJ_SIO_READ_MAX: -1</p><p> - ERRINJ_SNAP_COMMIT_DELAY: false</p><p>diff --git a/test/replication/gh-4739-vclock-assert.result b/test/replication/gh-4739-vclock-assert.result</p><p>new file mode 100644</p><p>index 000000000..7dc2f7118</p><p>--- /dev/null</p><p>+++ b/test/replication/gh-4739-vclock-assert.result</p><p>@@ -0,0 +1,82 @@</p><p>+-- test-run result file version 2</p><p>+env = require('test_run')</p><p>+ | 
---</p><p>+ | ...</p><p>+test_run = env.new()</p><p>+ | ---</p><p>+ | ...</p><p>+</p><p>+SERVERS = {'rebootstrap1', 'rebootstrap2'}</p><p>+ | ---</p><p>+ | ...</p><p>+test_run:create_cluster(SERVERS, "replication")</p><p>+ | ---</p><p>+ | ...</p><p>+test_run:wait_fullmesh(SERVERS)</p><p>+ | ---</p><p>+ | ...</p><p>+</p><p>+test_run:cmd('switch rebootstrap1')</p><p>+ | ---</p><p>+ | - true</p><p>+ | ...</p><p>+fiber = require('fiber')</p><p>+ | ---</p><p>+ | ...</p><p>+-- Stop updating replicaset vclock to simulate a situation, when</p><p>+-- a row is already relayed to the remote master, but the local</p><p>+-- vclock update hasn't happened yet.</p><p>+box.error.injection.set('ERRINJ_REPLICASET_VCLOCK_UPDATE', true)</p><p>+ | ---</p><p>+ | - ok</p><p>+ | ...</p><p>+lsn = box.info.lsn</p><p>+ | ---</p><p>+ | ...</p><p>+box.space._schema:replace{'something'}</p><p>+ | ---</p><p>+ | - ['something']</p><p>+ | ...</p><p>+-- Vclock isn't updated.</p><p>+box.info.lsn == lsn</p><p>+ | ---</p><p>+ | - true</p><p>+ | ...</p><p>+</p><p>+-- Wait until the remote instance gets the row.</p><p>+while test_run:get_vclock('rebootstrap2')[box.info.id] == lsn do\</p><p>+ fiber.sleep(0.01)\</p><p>+end</p><p>+ | ---</p><p>+ | ...</p><p>+</p><p>+-- Restart the remote instance. 
This will make the first instance</p><p>+-- resubscribe without entering orphan mode.</p><p>+test_run:cmd('restart server rebootstrap2')</p><p>+ | ---</p><p>+ | - true</p><p>+ | ...</p><p>+test_run:cmd('switch rebootstrap1')</p><p>+ | ---</p><p>+ | - true</p><p>+ | ...</p><p>+-- Wait until resubscribe is sent</p><p>+fiber.sleep(2 * box.cfg.replication_timeout)</p><p>+ | ---</p><p>+ | ...</p><p>+box.info.replication[2].upstream.status</p><p>+ | ---</p><p>+ | - sync</p><p>+ | ...</p><p>+</p><p>+box.error.injection.set('ERRINJ_REPLICASET_VCLOCK_UPDATE', false)</p><p>+ | ---</p><p>+ | - ok</p><p>+ | ...</p><p>+test_run:cmd('switch default')</p><p>+ | ---</p><p>+ | - true</p><p>+ | ...</p><p>+test_run:drop_cluster(SERVERS)</p><p>+ | ---</p><p>+ | ...</p><p>diff --git a/test/replication/gh-4739-vclock-assert.test.lua b/test/replication/gh-4739-vclock-assert.test.lua</p><p>new file mode 100644</p><p>index 000000000..26dc781e2</p><p>--- /dev/null</p><p>+++ b/test/replication/gh-4739-vclock-assert.test.lua</p><p>@@ -0,0 +1,34 @@</p><p>+env = require('test_run')</p><p>+test_run = env.new()</p><p>+</p><p>+SERVERS = {'rebootstrap1', 'rebootstrap2'}</p><p>+test_run:create_cluster(SERVERS, "replication")</p><p>+test_run:wait_fullmesh(SERVERS)</p><p>+</p><p>+test_run:cmd('switch rebootstrap1')</p><p>+fiber = require('fiber')</p><p>+-- Stop updating replicaset vclock to simulate a situation, when</p><p>+-- a row is already relayed to the remote master, but the local</p><p>+-- vclock update hasn't happened yet.</p><p>+box.error.injection.set('ERRINJ_REPLICASET_VCLOCK_UPDATE', true)</p><p>+lsn = box.info.lsn</p><p>+box.space._schema:replace{'something'}</p><p>+-- Vclock isn't updated.</p><p>+box.info.lsn == lsn</p><p>+</p><p>+-- Wait until the remote instance gets the row.</p><p>+while test_run:get_vclock('rebootstrap2')[box.info.id] == lsn do\</p><p>+ fiber.sleep(0.01)\</p><p>+end</p><p>+</p><p>+-- Restart the remote instance. 
This will make the first instance</p><p>+-- resubscribe without entering orphan mode.</p><p>+test_run:cmd('restart server rebootstrap2')</p><p>+test_run:cmd('switch rebootstrap1')</p><p>+-- Wait until resubscribe is sent</p><p>+fiber.sleep(2 * box.cfg.replication_timeout)</p><p>+box.info.replication[2].upstream.status</p><p>+</p><p>+box.error.injection.set('ERRINJ_REPLICASET_VCLOCK_UPDATE', false)</p><p>+test_run:cmd('switch default')</p><p>+test_run:drop_cluster(SERVERS)</p><p>diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg</p><p>index 429c64df3..90fd53ca6 100644</p><p>--- a/test/replication/suite.cfg</p><p>+++ b/test/replication/suite.cfg</p><p>@@ -15,6 +15,7 @@</p><p> "gh-4402-info-errno.test.lua": {},</p><p> "gh-4605-empty-password.test.lua": {},</p><p> "gh-4606-admin-creds.test.lua": {},</p><p>+ "gh-4739-vclock-assert.test.lua": {},</p><p> "*": {</p><p> "memtx": {"engine": "memtx"},</p><p> "vinyl": {"engine": "vinyl"}</p><p>diff --git a/test/replication/suite.ini b/test/replication/suite.ini</p><p>index ed1de3140..b4e09744a 100644</p><p>--- a/test/replication/suite.ini</p><p>+++ b/test/replication/suite.ini</p><p>@@ -3,7 +3,7 @@ core = tarantool</p><p>script = master.lua</p><p>description = tarantool/box, replication</p><p>disabled = consistent.test.lua</p><p>-release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua</p><p>+release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua</p><p>config = suite.cfg</p><p>lua_libs = lua/fast_replica.lua lua/rlimit.lua</p><p>use_unix_sockets = True</p></div><div> </div><div>--<br>Serge Petrenko</div></BODY></HTML>