On Feb 27, 2020, at 02:54, Vladislav Shpilevoy <v.shpilevoy@tarantool.org> wrote:

Thanks for the patch!
See 4 comments below.

replication: do not relay rows coming from a remote instance back to it
We have a mechanism for restoring rows originating from an instance that
suffered a sudden power loss: remote masters resend the instance's rows
received before a certain point in time, defined by the remote master's
vclock at the moment of subscribe.
However, this is useful only during initial replication configuration,
when an instance has just recovered, so that it can receive what it has
relayed but hasn't yet synced to disk.
In other cases, when an instance is operating normally and master-master
replication is configured, the mechanism described above may lead to the
instance re-applying its own rows coming from a master it has just
subscribed to.
To fix the problem, do not relay rows coming from a remote instance if
the instance has already recovered.
Closes #4739
diff --git a/src/box/applier.cc b/src/box/applier.cc
index 911353425..73ffc0d68 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -866,8 +866,13 @@ applier_subscribe(struct applier *applier)
struct vclock vclock;
vclock_create(&vclock);
vclock_copy(&vclock, &replicaset.vclock);
+ /*
+ * Stop accepting local rows coming from a remote
+ * instance as soon as local WAL starts accepting writes.
+ */
+ unsigned int id_filter = box_is_orphan() ? 0 : 1 << instance_id;
1. I was always wondering: what if the instance gets orphaned after it
has started accepting writes? The WAL is fully functional, it syncs
whatever is needed, and then a resubscribe happens. Can this break
anything?

xrow_encode_subscribe_xc(&row, &REPLICASET_UUID, &INSTANCE_UUID,
- &vclock, replication_anon, 0);
+ &vclock, replication_anon, id_filter);
coio_write_xrow(coio, &row);
/* Read SUBSCRIBE response */
diff --git a/src/box/wal.c b/src/box/wal.c
index 27bff662a..35ba7b072 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -278,8 +278,13 @@ tx_schedule_commit(struct cmsg *msg)
/* Closes the input valve. */
stailq_concat(&writer->rollback, &batch->rollback);
}
+
+ ERROR_INJECT(ERRINJ_REPLICASET_VCLOCK_UPDATE, { goto skip_update; });
/* Update the tx vclock to the latest written by wal. */
vclock_copy(&replicaset.vclock, &batch->vclock);
+#ifndef NDEBUG
+skip_update:
+#endif
2. Consider this hack which I just invented. That way you won't
depend on how ERRINJ and NDEBUG are interconnected.
====================
@@ -282,9 +282,7 @@ tx_schedule_commit(struct cmsg *msg)
ERROR_INJECT(ERRINJ_REPLICASET_VCLOCK_UPDATE, { goto skip_update; });
/* Update the tx vclock to the latest written by wal. */
vclock_copy(&replicaset.vclock, &batch->vclock);
-#ifndef NDEBUG
-skip_update:
-#endif
+ ERROR_INJECT(ERRINJ_REPLICASET_VCLOCK_UPDATE, {skip_update:;});
tx_schedule_queue(&batch->commit);
mempool_free(&writer->msg_pool, container_of(msg, struct wal_msg, base));
}
====================
Talking of the injection itself - I don't really know. Perhaps it
would be better to add a delay to the wal_write_to_disk() function,
at its very end, after wal_notify_watchers(). In that case the relay
will wake up and send whatever it wants, while TX won't update the
vclock until you let wal_write_to_disk() finish. That seems more
natural.
tx_schedule_queue(&batch->commit);
mempool_free(&writer->msg_pool, container_of(msg, struct wal_msg, base));
}
diff --git a/test/replication/gh-4739-vclock-assert.result b/test/replication/gh-4739-vclock-assert.result
new file mode 100644
index 000000000..7dc2f7118
--- /dev/null
+++ b/test/replication/gh-4739-vclock-assert.result
@@ -0,0 +1,82 @@
+-- test-run result file version 2
+env = require('test_run')
+ | ---
+ | ...
+test_run = env.new()
+ | ---
+ | ...
+
+SERVERS = {'rebootstrap1', 'rebootstrap2'}
+ | ---
+ | ...
+test_run:create_cluster(SERVERS, "replication")
+ | ---
+ | ...
+test_run:wait_fullmesh(SERVERS)
+ | ---
+ | ...
+
+test_run:cmd('switch rebootstrap1')
+ | ---
+ | - true
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+-- Stop updating the replicaset vclock to simulate a situation when
+-- a row has already been relayed to the remote master, but the local
+-- vclock update hasn't happened yet.
+box.error.injection.set('ERRINJ_REPLICASET_VCLOCK_UPDATE', true)
+ | ---
+ | - ok
+ | ...
+lsn = box.info.lsn
+ | ---
+ | ...
+box.space._schema:replace{'something'}
+ | ---
+ | - ['something']
+ | ...
+-- Vclock isn't updated.
+box.info.lsn == lsn
+ | ---
+ | - true
+ | ...
+
+-- Wait until the remote instance gets the row.
+while test_run:get_vclock('rebootstrap2')[box.info.id] == lsn do\
+ fiber.sleep(0.01)\
+end
3. There is a cool thing which I discovered relatively recently:
test_run:wait_cond(). It does the fiber sleep and the while loop for
you, and has a finite timeout, so such a test won't hang for 10
minutes on Travis in case of a problem. See the sketch below.
+ | ---
+ | ...
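Something like this, for example (an untested sketch; it mirrors the
condition of your loop, and the timeout value here is arbitrary):
====================
-- Wait until the remote instance gets the row, relying on
-- wait_cond()'s built-in finite timeout instead of a hand-written
-- sleep loop.
test_run:wait_cond(function()                                       \
    return test_run:get_vclock('rebootstrap2')[box.info.id] ~= lsn  \
end, 10)
====================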
+
+-- Restart the remote instance. This will make the first instance
+-- resubscribe without entering orphan mode.
+test_run:cmd('restart server rebootstrap2')
+ | ---
+ | - true
+ | ...
+test_run:cmd('switch rebootstrap1')
+ | ---
+ | - true
+ | ...
+-- Wait until resubscribe is sent
+fiber.sleep(2 * box.cfg.replication_timeout)
4. Don't we collect any statistics on replication requests, just
like we do in box.stat()? Perhaps box.stat.net() can help to wait
properly. Or maybe just do test_run:wait_cond() for the 'sync'
status? See the sketch below.
+ | ---
+ | ...
+box.info.replication[2].upstream.status
+ | ---
+ | - sync
+ | ...
+
+box.error.injection.set('ERRINJ_REPLICASET_VCLOCK_UPDATE', false)
+ | ---
+ | - ok
+ | ...
+test_run:cmd('switch default')
+ | ---
+ | - true
+ | ...
+test_run:drop_cluster(SERVERS)
+ | ---
+ | ...