Tarantool development patches archive
* [Tarantool-patches] [PATCH 1/1] applier: process synchro rows after WAL write
@ 2021-04-07 22:47 Vladislav Shpilevoy via Tarantool-patches
  2021-04-08  8:39 ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 10+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-07 22:47 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

The applier used to process the synchronous rows CONFIRM and ROLLBACK
right after receipt, before they were written to WAL.

That led to a bug: the confirmed data became visible and could be
accessed by user requests, but if the node restarted before the
CONFIRM finished its WAL write, the data was not visible again. That
is effectively a rollback of confirmed data, which is not acceptable.

Another case: the CONFIRM WAL write could simply fail for any
reason (no disk space, OOM), but the transactions would remain
confirmed anyway.

That also required some hacks in the limbo code to support the
confirmation and rollback of transactions not yet written to WAL.

The patch makes the synchro rows processed only after they are
written to WAL. The 'rollback' case above might still happen if the
xlogs were in the kernel caches and the machine was powered off
before they were flushed to disk, but that is not related to qsync
specifically.

To handle the synchro rows after the WAL write, the patch makes them
go to WAL in a blocking way (journal_write() instead of
journal_write_try_async()). Otherwise it could happen that a
CONFIRM/ROLLBACK is still being written to WAL and would clear the
limbo afterwards, while a new transaction with a different owner
arrives and conflicts with the current limbo owner.
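
In short, the resulting flow of apply_synchro_row() looks as follows
(a condensed sketch of the code in the diff below; the error label and
the on_wal_write trigger are trimmed for brevity):

static int
apply_synchro_row(struct xrow_header *row)
{
	struct synchro_request req;
	if (xrow_decode_synchro(row, &req) != 0)
		return -1;
	/* The entry lives on the stack of the applier fiber. */
	struct synchro_entry entry;
	entry.base.rows[0] = row;
	journal_entry_create(&entry.base, 1, xrow_approx_len(row),
			     apply_synchro_row_cb, &entry);
	entry.req = &req;
	entry.owner = fiber();
	/*
	 * Blocking write. txn_limbo_process() is invoked only in
	 * apply_synchro_row_cb(), i.e. after the WAL write succeeds.
	 */
	if (journal_write(&entry.base) != 0 || entry.base.res < 0) {
		diag_set(ClientError, ER_WAL_IO);
		return -1;
	}
	return 0;
}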

Closes #5213
---
Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5213-applier-qsync-write-before
Issue: https://github.com/tarantool/tarantool/issues/5213

 .../unreleased/applier-qsync-write-order.md   |   6 +
 src/box/applier.cc                            | 107 +++---
 src/box/txn_limbo.c                           |  48 +--
 .../gh-5213-qsync-applier-order.result        | 329 ++++++++++++++++++
 .../gh-5213-qsync-applier-order.test.lua      | 129 +++++++
 test/replication/gh-5213-replica.lua          |  10 +
 test/replication/suite.cfg                    |   1 +
 test/replication/suite.ini                    |   2 +-
 8 files changed, 529 insertions(+), 103 deletions(-)
 create mode 100644 changelogs/unreleased/applier-qsync-write-order.md
 create mode 100644 test/replication/gh-5213-qsync-applier-order.result
 create mode 100644 test/replication/gh-5213-qsync-applier-order.test.lua
 create mode 100644 test/replication/gh-5213-replica.lua

diff --git a/changelogs/unreleased/applier-qsync-write-order.md b/changelogs/unreleased/applier-qsync-write-order.md
new file mode 100644
index 000000000..10b9ade80
--- /dev/null
+++ b/changelogs/unreleased/applier-qsync-write-order.md
@@ -0,0 +1,6 @@
+## bugfix/replication
+
+* Fix the bug when a synchronous transaction could be confirmed and visible on
+  a replica, but then not confirmed / invisible again after restart. Could
+  happen more likely on memtx spaces with `memtx_use_mvcc_engine` enabled
+  (gh-5213).
diff --git a/src/box/applier.cc b/src/box/applier.cc
index 971b2e64c..f7c82548c 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -762,26 +762,21 @@ applier_txn_wal_write_cb(struct trigger *trigger, void *event)
 }
 
 struct synchro_entry {
-	/** Encoded form of a synchro record. */
-	struct synchro_body_bin	body_bin;
-
-	/** xrow to write, used by the journal engine. */
-	struct xrow_header row;
-
+	/** Request to process when WAL write is done. */
+	struct synchro_request *req;
+	/** Fiber created the entry. To wakeup when WAL write is done. */
+	struct fiber *owner;
 	/**
-	 * The journal entry itself. Note since
-	 * it has unsized array it must be the
-	 * last entry in the structure.
+	 * The base journal entry. It has unsized array and then must be the
+	 * last entry in the structure. But can workaround it via a union
+	 * adding the needed tail as char[].
 	 */
-	struct journal_entry journal_entry;
+	union {
+		struct journal_entry base;
+		char base_buf[sizeof(base) + sizeof(base.rows[0])];
+	};
 };
 
-static void
-synchro_entry_delete(struct synchro_entry *entry)
-{
-	free(entry);
-}
-
 /**
  * Async write journal completion.
  */
@@ -791,50 +786,15 @@ apply_synchro_row_cb(struct journal_entry *entry)
 	assert(entry->complete_data != NULL);
 	struct synchro_entry *synchro_entry =
 		(struct synchro_entry *)entry->complete_data;
-	if (entry->res < 0)
+	if (entry->res < 0) {
 		applier_rollback_by_wal_io();
-	else
+	} else {
+		txn_limbo_process(&txn_limbo, synchro_entry->req);
 		trigger_run(&replicaset.applier.on_wal_write, NULL);
-
-	synchro_entry_delete(synchro_entry);
-}
-
-/**
- * Allocate a new synchro_entry to be passed to
- * the journal engine in async write way.
- */
-static struct synchro_entry *
-synchro_entry_new(struct xrow_header *applier_row,
-		  struct synchro_request *req)
-{
-	struct synchro_entry *entry;
-	size_t size = sizeof(*entry) + sizeof(struct xrow_header *);
-
-	/*
-	 * For simplicity we use malloc here but
-	 * probably should provide some cache similar
-	 * to txn cache.
-	 */
-	entry = (struct synchro_entry *)malloc(size);
-	if (entry == NULL) {
-		diag_set(OutOfMemory, size, "malloc", "synchro_entry");
-		return NULL;
 	}
-
-	struct journal_entry *journal_entry = &entry->journal_entry;
-	struct synchro_body_bin *body_bin = &entry->body_bin;
-	struct xrow_header *row = &entry->row;
-
-	journal_entry->rows[0] = row;
-
-	xrow_encode_synchro(row, body_bin, req);
-
-	row->lsn = applier_row->lsn;
-	row->replica_id = applier_row->replica_id;
-
-	journal_entry_create(journal_entry, 1, xrow_approx_len(row),
-			     apply_synchro_row_cb, entry);
-	return entry;
+	/* The fiber is the same on final join. */
+	if (synchro_entry->owner != fiber())
+		fiber_wakeup(synchro_entry->owner);
 }
 
 /** Process a synchro request. */
@@ -847,14 +807,31 @@ apply_synchro_row(struct xrow_header *row)
 	if (xrow_decode_synchro(row, &req) != 0)
 		goto err;
 
-	txn_limbo_process(&txn_limbo, &req);
-
-	struct synchro_entry *entry;
-	entry = synchro_entry_new(row, &req);
-	if (entry == NULL)
-		goto err;
-
-	if (journal_write_try_async(&entry->journal_entry) != 0) {
+	struct synchro_entry entry;
+	entry.base.rows[0] = row;
+	journal_entry_create(&entry.base, 1, xrow_approx_len(row),
+			     apply_synchro_row_cb, &entry);
+	entry.req = &req;
+	entry.owner = fiber();
+	/*
+	 * The WAL write is blocking. Otherwise it might happen that a CONFIRM
+	 * or ROLLBACK is sent to WAL, and it would empty the limbo, but before
+	 * it is written, more transactions arrive with a different owner. They
+	 * won't be able to enter the limbo due to owner ID mismatch. Hence the
+	 * synchro rows must block receipt of new transactions.
+	 *
+	 * Don't forget to return -1 both if the journal write failed right
+	 * away, and if it failed inside of WAL thread (res < 0). Otherwise the
+	 * caller would propagate committed vclock to this row thinking it was
+	 * a success.
+	 *
+	 * XXX: in theory it could be done vice-versa. The write could be made
+	 * non-blocking, and instead the potentially conflicting transactions
+	 * could try to wait for all the current synchro WAL writes to end
+	 * before trying to commit. But that requires extra steps from the
+	 * transactions side, including the async ones.
+	 */
+	if (journal_write(&entry.base) != 0 || entry.base.res < 0) {
 		diag_set(ClientError, ER_WAL_IO);
 		goto err;
 	}
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index cf0ad9350..c63211191 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -400,20 +400,11 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
 		txn_limbo_remove(limbo, e);
 		txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK);
 		/*
-		 * If already written to WAL by now, finish tx processing.
-		 * Otherwise just clear the sync flags. Tx procesing will finish
-		 * automatically once the tx is written to WAL.
-		 *
-		 * XXX: Normally at this point all transactions covered by this
-		 * CONFIRM should be in WAL already, but there is a bug, that
-		 * replica always processes received synchro requests *before*
-		 * writing them to WAL. So it can happen, that a CONFIRM is
-		 * 'read', but the transaction is not written yet. Should be
-		 * fixed when the replica will behave properly, and then this
-		 * branch won't exist.
+		 * Should be written to WAL by now. Confirm is always written
+		 * after the affected transactions.
 		 */
-		if (e->txn->signature >= 0)
-			txn_complete_success(e->txn);
+		assert(e->txn->signature >= 0);
+		txn_complete_success(e->txn);
 	}
 	/* Update is_ro once the limbo is clear. */
 	if (txn_limbo_is_empty(limbo))
@@ -455,30 +446,13 @@ txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
 	rlist_foreach_entry_safe_reverse(e, &limbo->queue, in_queue, tmp) {
 		txn_limbo_abort(limbo, e);
 		txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK);
-		if (e->txn->signature >= 0) {
-			/* Rollback the transaction. */
-			e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK;
-			txn_complete_fail(e->txn);
-		} else {
-			/*
-			 * Rollback the transaction, but don't free it yet. It
-			 * will be freed after its WAL write is completed.
-			 *
-			 * XXX: Normally at this point all transactions covered
-			 * by this ROLLBACK should be in WAL already, but there
-			 * is a bug, that replica always processes received
-			 * synchro requests *before* writing them to WAL. So it
-			 * can happen, that a ROLLBACK is 'read', but the
-			 * transaction is not written yet. Should be fixed when
-			 * the replica will behave properly, and then this
-			 * branch won't exist.
-			 */
-			e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK;
-			struct fiber *fiber = e->txn->fiber;
-			e->txn->fiber = fiber();
-			txn_complete_fail(e->txn);
-			e->txn->fiber = fiber;
-		}
+		/*
+		 * Should be written to WAL by now. Rollback is always written
+		 * after the affected transactions.
+		 */
+		assert(e->txn->signature >= 0);
+		e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK;
+		txn_complete_fail(e->txn);
 		if (e == last_rollback)
 			break;
 	}
diff --git a/test/replication/gh-5213-qsync-applier-order.result b/test/replication/gh-5213-qsync-applier-order.result
new file mode 100644
index 000000000..cab8c588f
--- /dev/null
+++ b/test/replication/gh-5213-qsync-applier-order.result
@@ -0,0 +1,329 @@
+-- test-run result file version 2
+--
+-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
+-- As a result it could happen that the transactions became visible on CONFIRM,
+-- then somehow weren't written to WAL, and on restart the data might not be
+-- visible again. Which means rollback of confirmed data and is not acceptable
+-- (on the contrary with commit after rollback).
+--
+test_run = require('test_run').new()
+ | ---
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+old_synchro_quorum = box.cfg.replication_synchro_quorum
+ | ---
+ | ...
+old_synchro_timeout = box.cfg.replication_synchro_timeout
+ | ---
+ | ...
+
+box.schema.user.grant('guest', 'super')
+ | ---
+ | ...
+
+s = box.schema.space.create('test', {is_sync = true})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+
+test_run:cmd('create server replica with rpl_master=default,\
+              script="replication/gh-5213-replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+assert(box.info.id == 2)
+ | ---
+ | - true
+ | ...
+lsn = box.info.vclock[1]
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+box.cfg{                                                                        \
+    replication_synchro_quorum = 3,                                             \
+    replication_synchro_timeout = 1000,                                         \
+}
+ | ---
+ | ...
+f = fiber.new(function() s:replace{1} end)
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+-- Wait when the transaction is written to WAL.
+test_run:wait_cond(function() return box.info.vclock[1] == lsn + 1 end)
+ | ---
+ | - true
+ | ...
+s = box.space.test
+ | ---
+ | ...
+-- But not visible yet. Because MVCC is on, and no CONFIRM yet.
+assert(s:get({1}) == nil)
+ | ---
+ | - true
+ | ...
+-- Block the incoming CONFIRM to be able to ensure the data is not visible until
+-- WAL write ends.
+box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
+ | ---
+ | - ok
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum = 2}
+ | ---
+ | ...
+test_run:wait_cond(function() return f:status() == 'dead' end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+-- Wait when CONFIRM is in the WAL thread.
+test_run:wait_cond(function() return box.error.injection.get('ERRINJ_WAL_DELAY') end)
+ | ---
+ | - true
+ | ...
+assert(s:get({1}) == nil)
+ | ---
+ | - true
+ | ...
+box.error.injection.set("ERRINJ_WAL_DELAY", false)
+ | ---
+ | - ok
+ | ...
+-- After CONFIRM is in WAL, the transaction is committed and its data is
+-- visible.
+test_run:wait_cond(function() return s:get({1}) ~= nil end)
+ | ---
+ | - true
+ | ...
+
+--
+-- Ensure CONFIRM WAL write fail also works fine when couldn't even start a WAL
+-- write. The data remains not confirmed until WAL write succeeds.
+--
+lsn = box.info.vclock[1]
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum = 3}
+ | ---
+ | ...
+f = fiber.new(function() s:replace{2} end)
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.vclock[1] == lsn + 1 end)
+ | ---
+ | - true
+ | ...
+assert(s:get({2}) == nil)
+ | ---
+ | - true
+ | ...
+-- Make journal write fail immediately.
+box.error.injection.set('ERRINJ_WAL_IO', true)
+ | ---
+ | - ok
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum = 2}
+ | ---
+ | ...
+test_run:wait_cond(function() return f:status() == 'dead' end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_upstream(1, {status = 'stopped'})
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_WAL_IO', false)
+ | ---
+ | - ok
+ | ...
+assert(s:get({2}) == nil)
+ | ---
+ | - true
+ | ...
+-- Re-subscribe.
+replication = box.cfg.replication
+ | ---
+ | ...
+box.cfg{replication = {}}
+ | ---
+ | ...
+box.cfg{replication = replication}
+ | ---
+ | ...
+test_run:wait_cond(function() return box.info.vclock[1] == lsn + 2 end)
+ | ---
+ | - true
+ | ...
+test_run:wait_upstream(1, {status = 'follow'})
+ | ---
+ | - true
+ | ...
+assert(s:get({2}) ~= nil)
+ | ---
+ | - true
+ | ...
+
+--
+-- Ensure CONFIRM WAL write fail works fine when an error happens inside WAL
+-- thread. The data remains not confirmed until WAL write succeeds.
+--
+lsn = box.info.vclock[1]
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum = 3}
+ | ---
+ | ...
+f = fiber.new(function() s:replace{3} end)
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.vclock[1] == lsn + 1 end)
+ | ---
+ | - true
+ | ...
+assert(s:get({3}) == nil)
+ | ---
+ | - true
+ | ...
+-- Journal write start is going to succeed, but it will fail later on return
+-- from the WAL thread.
+box.error.injection.set('ERRINJ_WAL_ROTATE', true)
+ | ---
+ | - ok
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum = 2}
+ | ---
+ | ...
+test_run:wait_cond(function() return f:status() == 'dead' end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_upstream(1, {status = 'stopped'})
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_WAL_ROTATE', false)
+ | ---
+ | - ok
+ | ...
+assert(s:get({3}) == nil)
+ | ---
+ | - true
+ | ...
+box.cfg{replication = {}}
+ | ---
+ | ...
+box.cfg{replication = replication}
+ | ---
+ | ...
+test_run:wait_cond(function() return box.info.vclock[1] == lsn + 2 end)
+ | ---
+ | - true
+ | ...
+test_run:wait_upstream(1, {status = 'follow'})
+ | ---
+ | - true
+ | ...
+assert(s:get({3}) ~= nil)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+
+s:drop()
+ | ---
+ | ...
+box.schema.user.revoke('guest', 'super')
+ | ---
+ | ...
+box.cfg{                                                                        \
+    replication_synchro_quorum = old_synchro_quorum,                            \
+    replication_synchro_timeout = old_synchro_timeout,                          \
+}
+ | ---
+ | ...
diff --git a/test/replication/gh-5213-qsync-applier-order.test.lua b/test/replication/gh-5213-qsync-applier-order.test.lua
new file mode 100644
index 000000000..3a32626f2
--- /dev/null
+++ b/test/replication/gh-5213-qsync-applier-order.test.lua
@@ -0,0 +1,129 @@
+--
+-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
+-- As a result it could happen that the transactions became visible on CONFIRM,
+-- then somehow weren't written to WAL, and on restart the data might not be
+-- visible again. Which means rollback of confirmed data and is not acceptable
+-- (on the contrary with commit after rollback).
+--
+test_run = require('test_run').new()
+fiber = require('fiber')
+old_synchro_quorum = box.cfg.replication_synchro_quorum
+old_synchro_timeout = box.cfg.replication_synchro_timeout
+
+box.schema.user.grant('guest', 'super')
+
+s = box.schema.space.create('test', {is_sync = true})
+_ = s:create_index('pk')
+
+test_run:cmd('create server replica with rpl_master=default,\
+              script="replication/gh-5213-replica.lua"')
+test_run:cmd('start server replica')
+
+test_run:switch('replica')
+assert(box.info.id == 2)
+lsn = box.info.vclock[1]
+
+test_run:switch('default')
+fiber = require('fiber')
+box.cfg{                                                                        \
+    replication_synchro_quorum = 3,                                             \
+    replication_synchro_timeout = 1000,                                         \
+}
+f = fiber.new(function() s:replace{1} end)
+
+test_run:switch('replica')
+-- Wait when the transaction is written to WAL.
+test_run:wait_cond(function() return box.info.vclock[1] == lsn + 1 end)
+s = box.space.test
+-- But not visible yet. Because MVCC is on, and no CONFIRM yet.
+assert(s:get({1}) == nil)
+-- Block the incoming CONFIRM to be able to ensure the data is not visible until
+-- WAL write ends.
+box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
+
+test_run:switch('default')
+box.cfg{replication_synchro_quorum = 2}
+test_run:wait_cond(function() return f:status() == 'dead' end)
+
+test_run:switch('replica')
+-- Wait when CONFIRM is in the WAL thread.
+test_run:wait_cond(function() return box.error.injection.get('ERRINJ_WAL_DELAY') end)
+assert(s:get({1}) == nil)
+box.error.injection.set("ERRINJ_WAL_DELAY", false)
+-- After CONFIRM is in WAL, the transaction is committed and its data is
+-- visible.
+test_run:wait_cond(function() return s:get({1}) ~= nil end)
+
+--
+-- Ensure CONFIRM WAL write fail also works fine when couldn't even start a WAL
+-- write. The data remains not confirmed until WAL write succeeds.
+--
+lsn = box.info.vclock[1]
+
+test_run:switch('default')
+box.cfg{replication_synchro_quorum = 3}
+f = fiber.new(function() s:replace{2} end)
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.vclock[1] == lsn + 1 end)
+assert(s:get({2}) == nil)
+-- Make journal write fail immediately.
+box.error.injection.set('ERRINJ_WAL_IO', true)
+
+test_run:switch('default')
+box.cfg{replication_synchro_quorum = 2}
+test_run:wait_cond(function() return f:status() == 'dead' end)
+
+test_run:switch('replica')
+test_run:wait_upstream(1, {status = 'stopped'})
+box.error.injection.set('ERRINJ_WAL_IO', false)
+assert(s:get({2}) == nil)
+-- Re-subscribe.
+replication = box.cfg.replication
+box.cfg{replication = {}}
+box.cfg{replication = replication}
+test_run:wait_cond(function() return box.info.vclock[1] == lsn + 2 end)
+test_run:wait_upstream(1, {status = 'follow'})
+assert(s:get({2}) ~= nil)
+
+--
+-- Ensure CONFIRM WAL write fail works fine when an error happens inside WAL
+-- thread. The data remains not confirmed until WAL write succeeds.
+--
+lsn = box.info.vclock[1]
+
+test_run:switch('default')
+box.cfg{replication_synchro_quorum = 3}
+f = fiber.new(function() s:replace{3} end)
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.vclock[1] == lsn + 1 end)
+assert(s:get({3}) == nil)
+-- Journal write start is going to succeed, but it will fail later on return
+-- from the WAL thread.
+box.error.injection.set('ERRINJ_WAL_ROTATE', true)
+
+test_run:switch('default')
+box.cfg{replication_synchro_quorum = 2}
+test_run:wait_cond(function() return f:status() == 'dead' end)
+
+test_run:switch('replica')
+test_run:wait_upstream(1, {status = 'stopped'})
+box.error.injection.set('ERRINJ_WAL_ROTATE', false)
+assert(s:get({3}) == nil)
+box.cfg{replication = {}}
+box.cfg{replication = replication}
+test_run:wait_cond(function() return box.info.vclock[1] == lsn + 2 end)
+test_run:wait_upstream(1, {status = 'follow'})
+assert(s:get({3}) ~= nil)
+
+test_run:switch('default')
+test_run:cmd('stop server replica')
+test_run:cmd('delete server replica')
+
+s:drop()
+box.schema.user.revoke('guest', 'super')
+box.cfg{                                                                        \
+    replication_synchro_quorum = old_synchro_quorum,                            \
+    replication_synchro_timeout = old_synchro_timeout,                          \
+}
diff --git a/test/replication/gh-5213-replica.lua b/test/replication/gh-5213-replica.lua
new file mode 100644
index 000000000..3ba90d7ab
--- /dev/null
+++ b/test/replication/gh-5213-replica.lua
@@ -0,0 +1,10 @@
+#!/usr/bin/env tarantool
+
+box.cfg({
+    listen = os.getenv("LISTEN"),
+    replication = os.getenv("MASTER"),
+    read_only = true,
+    memtx_use_mvcc_engine = true,
+})
+
+require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index aff5fda26..ab25fa331 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -15,6 +15,7 @@
     "gh-3760-misc-return-on-quorum-0.test.lua": {},
     "gh-4399-misc-no-failure-on-error-reading-wal.test.lua": {},
     "gh-4424-misc-orphan-on-reconfiguration-error.test.lua": {},
+    "gh-5213-qsync-applier-order.test.lua": {},
     "gh-5426-election-on-off.test.lua": {},
     "gh-5433-election-restart-recovery.test.lua": {},
     "gh-5506-election-on-off.test.lua": {},
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index a9e44e8cf..a7d5f6129 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -3,7 +3,7 @@ core = tarantool
 script =  master.lua
 description = tarantool/box, replication
 disabled = consistent.test.lua
-release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua
+release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua
 config = suite.cfg
 lua_libs = lua/fast_replica.lua lua/rlimit.lua
 use_unix_sockets = True
-- 
2.24.3 (Apple Git-128)



* Re: [Tarantool-patches] [PATCH 1/1] applier: process synchro rows after WAL write
  2021-04-07 22:47 [Tarantool-patches] [PATCH 1/1] applier: process synchro rows after WAL write Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-08  8:39 ` Serge Petrenko via Tarantool-patches
  2021-04-08 10:19   ` Cyrill Gorcunov via Tarantool-patches
  2021-04-08 22:56   ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 2 replies; 10+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-08  8:39 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov



08.04.2021 01:47, Vladislav Shpilevoy wrote:
> Applier used to process synchronous rows CONFIRM and ROLLBACK
> right after receipt before they are written to WAL.
>
> That led to a bug that the confirmed data became visible, might be
> accessed by user requests, then the node restarted before CONFIRM
> finished its WAL write, and the data was not visible again. That
> is just like if it would be rolled back, which is not acceptable.
>
> Another case - CONFIRM WAL write could simply fail due to any
> reason (no disk space, OOM), but the transactions would remain
> confirmed anyway.
>
> Also that produced some hacks in the limbo's code to support the
> confirmation and rollback of transactions not yet written to WAL.
>
> The patch makes the synchro rows processed only after they are
> written to WAL. Although the 'rollback' case above might still
> happen if the xlogs were in the kernel caches, and the machine was
> powered off before they were flushed to disk. But that is not
> related to qsync specifically.
>
> To handle the synchro rows after WAL write the patch makes them go
> to WAL in a blocking way (journal_write() instead of
> journal_write_try_async()). Otherwise it could happen that a
> CONFIRM/ROLLBACK is being written to WAL and would clear the limbo
> afterwards, but a new transaction arrives with a different owner,
> and it conflicts with the current limbo owner.


Thanks for the patch!

I'm a bit worried about two different synchro rows coming from two
appliers. Is everything ok in this case?
Or even normal rows coming from other appliers. Say some other replica
has already applied this synchro row and even has written some rows on
top of it. Its applier won't block on the replica_id latch, and may fail
to apply some txs following this synchro row, because it's not yet
written to WAL and thus not applied (the limbo is still not empty or
belongs to another instance).

Looks like this won't be a problem once synchro rows start pinning the
limbo to some specific replica, because in this case only the replica
that has issued the confirm will be able to generate new rows, and these
rows will be ordered by the replica_id latch.

But still, maybe this is worth fixing?
Am I missing something?


Please find two more comments below.

>
> Closes #5213
> ---
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5213-applier-qsync-write-before
> Issue: https://github.com/tarantool/tarantool/issues/5213
>
>   .../unreleased/applier-qsync-write-order.md   |   6 +
>   src/box/applier.cc                            | 107 +++---
>   src/box/txn_limbo.c                           |  48 +--
>   .../gh-5213-qsync-applier-order.result        | 329 ++++++++++++++++++
>   .../gh-5213-qsync-applier-order.test.lua      | 129 +++++++
>   test/replication/gh-5213-replica.lua          |  10 +
>   test/replication/suite.cfg                    |   1 +
>   test/replication/suite.ini                    |   2 +-
>   8 files changed, 529 insertions(+), 103 deletions(-)
>   create mode 100644 changelogs/unreleased/applier-qsync-write-order.md
>   create mode 100644 test/replication/gh-5213-qsync-applier-order.result
>   create mode 100644 test/replication/gh-5213-qsync-applier-order.test.lua
>   create mode 100644 test/replication/gh-5213-replica.lua
>
> diff --git a/changelogs/unreleased/applier-qsync-write-order.md b/changelogs/unreleased/applier-qsync-write-order.md
> new file mode 100644
> index 000000000..10b9ade80
> --- /dev/null
> +++ b/changelogs/unreleased/applier-qsync-write-order.md
> @@ -0,0 +1,6 @@
> +## bugfix/replication
> +
> +* Fix the bug when a synchronous transaction could be confirmed and visible on
> +  a replica, but then not confirmed / invisible again after restart. Could
> +  happen more likely on memtx spaces with `memtx_use_mvcc_engine` enabled
> +  (gh-5213).
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index 971b2e64c..f7c82548c 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -762,26 +762,21 @@ applier_txn_wal_write_cb(struct trigger *trigger, void *event)
>   }
>   
>   struct synchro_entry {
> -	/** Encoded form of a synchro record. */
> -	struct synchro_body_bin	body_bin;
> -
> -	/** xrow to write, used by the journal engine. */
> -	struct xrow_header row;
> -
> +	/** Request to process when WAL write is done. */
> +	struct synchro_request *req;
> +	/** Fiber created the entry. To wakeup when WAL write is done. */
> +	struct fiber *owner;
>   	/**
> -	 * The journal entry itself. Note since
> -	 * it has unsized array it must be the
> -	 * last entry in the structure.
> +	 * The base journal entry. It has unsized array and then must be the
> +	 * last entry in the structure. But can workaround it via a union
> +	 * adding the needed tail as char[].
>   	 */
> -	struct journal_entry journal_entry;
> +	union {
> +		struct journal_entry base;
> +		char base_buf[sizeof(base) + sizeof(base.rows[0])];
> +	};
>   };

I don't understand this union stuff.
The journal_entry is the last entry in synchro_entry anyway.
Is this a hack to allow allocating synchro_entry on the stack?

>   
> -static void
> -synchro_entry_delete(struct synchro_entry *entry)
> -{
> -	free(entry);
> -}
> -
>   /**
>    * Async write journal completion.
>    */
> @@ -791,50 +786,15 @@ apply_synchro_row_cb(struct journal_entry *entry)
>   	assert(entry->complete_data != NULL);
>   	struct synchro_entry *synchro_entry =
>   		(struct synchro_entry *)entry->complete_data;
> -	if (entry->res < 0)
> +	if (entry->res < 0) {
>   		applier_rollback_by_wal_io();
> -	else
> +	} else {
> +		txn_limbo_process(&txn_limbo, synchro_entry->req);
>   		trigger_run(&replicaset.applier.on_wal_write, NULL);
> -
> -	synchro_entry_delete(synchro_entry);
> -}
> -
> -/**
> - * Allocate a new synchro_entry to be passed to
> - * the journal engine in async write way.
> - */
> -static struct synchro_entry *
> -synchro_entry_new(struct xrow_header *applier_row,
> -		  struct synchro_request *req)
> -{
> -	struct synchro_entry *entry;
> -	size_t size = sizeof(*entry) + sizeof(struct xrow_header *);
> -
> -	/*
> -	 * For simplicity we use malloc here but
> -	 * probably should provide some cache similar
> -	 * to txn cache.
> -	 */
> -	entry = (struct synchro_entry *)malloc(size);
> -	if (entry == NULL) {
> -		diag_set(OutOfMemory, size, "malloc", "synchro_entry");
> -		return NULL;
>   	}
> -
> -	struct journal_entry *journal_entry = &entry->journal_entry;
> -	struct synchro_body_bin *body_bin = &entry->body_bin;
> -	struct xrow_header *row = &entry->row;
> -
> -	journal_entry->rows[0] = row;
> -
> -	xrow_encode_synchro(row, body_bin, req);
> -
> -	row->lsn = applier_row->lsn;
> -	row->replica_id = applier_row->replica_id;
> -
> -	journal_entry_create(journal_entry, 1, xrow_approx_len(row),
> -			     apply_synchro_row_cb, entry);
> -	return entry;
> +	/* The fiber is the same on final join. */
> +	if (synchro_entry->owner != fiber())
> +		fiber_wakeup(synchro_entry->owner);
>   }
>   
>   /** Process a synchro request. */
> @@ -847,14 +807,31 @@ apply_synchro_row(struct xrow_header *row)
>   	if (xrow_decode_synchro(row, &req) != 0)
>   		goto err;
>   
> -	txn_limbo_process(&txn_limbo, &req);
> -
> -	struct synchro_entry *entry;
> -	entry = synchro_entry_new(row, &req);
> -	if (entry == NULL)
> -		goto err;
> -
> -	if (journal_write_try_async(&entry->journal_entry) != 0) {
> +	struct synchro_entry entry;
> +	entry.base.rows[0] = row;
> +	journal_entry_create(&entry.base, 1, xrow_approx_len(row),
> +			     apply_synchro_row_cb, &entry);
> +	entry.req = &req;
> +	entry.owner = fiber();
> +	/*
> +	 * The WAL write is blocking. Otherwise it might happen that a CONFIRM
> +	 * or ROLLBACK is sent to WAL, and it would empty the limbo, but before
> +	 * it is written, more transactions arrive with a different owner. They
> +	 * won't be able to enter the limbo due to owner ID mismatch. Hence the
> +	 * synchro rows must block receipt of new transactions.
> +	 *
> +	 * Don't forget to return -1 both if the journal write failed right
> +	 * away, and if it failed inside of WAL thread (res < 0). Otherwise the
> +	 * caller would propagate committed vclock to this row thinking it was
> +	 * a success.
> +	 *
> +	 * XXX: in theory it could be done vice-versa. The write could be made
> +	 * non-blocking, and instead the potentially conflicting transactions
> +	 * could try to wait for all the current synchro WAL writes to end
> +	 * before trying to commit. But that requires extra steps from the
> +	 * transactions side, including the async ones.
> +	 */
> +	if (journal_write(&entry.base) != 0 || entry.base.res < 0) {
>   		diag_set(ClientError, ER_WAL_IO);
>   		goto err;
>   	}
...
>
> diff --git a/test/replication/gh-5213-qsync-applier-order.test.lua b/test/replication/gh-5213-qsync-applier-order.test.lua
> new file mode 100644
> index 000000000..3a32626f2
> --- /dev/null
> +++ b/test/replication/gh-5213-qsync-applier-order.test.lua
> @@ -0,0 +1,129 @@
> +--
> +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
> +-- As a result it could happen that the transactions became visible on CONFIRM,
> +-- then somehow weren't written to WAL, and on restart the data might not be
> +-- visible again. Which means rollback of confirmed data and is not acceptable
> +-- (on the contrary with commit after rollback).
> +--
> +test_run = require('test_run').new()
> +fiber = require('fiber')
> +old_synchro_quorum = box.cfg.replication_synchro_quorum
> +old_synchro_timeout = box.cfg.replication_synchro_timeout
> +
> +box.schema.user.grant('guest', 'super')
> +
> +s = box.schema.space.create('test', {is_sync = true})
> +_ = s:create_index('pk')
> +
> +test_run:cmd('create server replica with rpl_master=default,\
> +              script="replication/gh-5213-replica.lua"')
> +test_run:cmd('start server replica')
> +
> +test_run:switch('replica')
> +assert(box.info.id == 2)
> +lsn = box.info.vclock[1]
> +
> +test_run:switch('default')
> +fiber = require('fiber')
> +box.cfg{                                                                        \
> +    replication_synchro_quorum = 3,                                             \
> +    replication_synchro_timeout = 1000,                                         \
> +}
> +f = fiber.new(function() s:replace{1} end)
> +
> +test_run:switch('replica')
> +-- Wait when the transaction is written to WAL.
> +test_run:wait_cond(function() return box.info.vclock[1] == lsn + 1 end)

This shouldn't go wrong, but I find
test_run:wait_lsn('replica', 'default') more robust.
And you wouldn't need to save the lsn above then.

Same for the other usages of lsn = ... and
wait_cond(box.info.vclock[1] == lsn + ...).

Up to you though.

> +s = box.space.test
> +-- But not visible yet. Because MVCC is on, and no CONFIRM yet.
> +assert(s:get({1}) == nil)
> +-- Block the incoming CONFIRM to be able to ensure the data is not visible until
> +-- WAL write ends.
> +box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
> +
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 2}
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> +
> +test_run:switch('replica')
> +-- Wait when CONFIRM is in the WAL thread.
> +test_run:wait_cond(function() return box.error.injection.get('ERRINJ_WAL_DELAY') end)
> +assert(s:get({1}) == nil)
> +box.error.injection.set("ERRINJ_WAL_DELAY", false)
> +-- After CONFIRM is in WAL, the transaction is committed and its data is
> +-- visible.
> +test_run:wait_cond(function() return s:get({1}) ~= nil end)
> +
> +--
> +-- Ensure CONFIRM WAL write fail also works fine when couldn't even start a WAL
> +-- write. The data remains not confirmed until WAL write succeeds.
> +--
> +lsn = box.info.vclock[1]
> +
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 3}
> +f = fiber.new(function() s:replace{2} end)
> +
> +test_run:switch('replica')
> +test_run:wait_cond(function() return box.info.vclock[1] == lsn + 1 end)
> +assert(s:get({2}) == nil)
> +-- Make journal write fail immediately.
> +box.error.injection.set('ERRINJ_WAL_IO', true)
> +
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 2}
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> +
> +test_run:switch('replica')
> +test_run:wait_upstream(1, {status = 'stopped'})
> +box.error.injection.set('ERRINJ_WAL_IO', false)
> +assert(s:get({2}) == nil)
> +-- Re-subscribe.
> +replication = box.cfg.replication
> +box.cfg{replication = {}}
> +box.cfg{replication = replication}
> +test_run:wait_cond(function() return box.info.vclock[1] == lsn + 2 end)
> +test_run:wait_upstream(1, {status = 'follow'})
> +assert(s:get({2}) ~= nil)
> +
> +--
> +-- Ensure CONFIRM WAL write fail works fine when an error happens inside WAL
> +-- thread. The data remains not confirmed until WAL write succeeds.
> +--
> +lsn = box.info.vclock[1]
> +
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 3}
> +f = fiber.new(function() s:replace{3} end)
> +
> +test_run:switch('replica')
> +test_run:wait_cond(function() return box.info.vclock[1] == lsn + 1 end)
> +assert(s:get({3}) == nil)
> +-- Journal write start is going to succeed, but it will fail later on return
> +-- from the WAL thread.
> +box.error.injection.set('ERRINJ_WAL_ROTATE', true)
> +
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 2}
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> +
> +test_run:switch('replica')
> +test_run:wait_upstream(1, {status = 'stopped'})
> +box.error.injection.set('ERRINJ_WAL_ROTATE', false)
> +assert(s:get({3}) == nil)
> +box.cfg{replication = {}}
> +box.cfg{replication = replication}
> +test_run:wait_cond(function() return box.info.vclock[1] == lsn + 2 end)
> +test_run:wait_upstream(1, {status = 'follow'})
> +assert(s:get({3}) ~= nil)
> +
> +test_run:switch('default')
> +test_run:cmd('stop server replica')
> +test_run:cmd('delete server replica')
> +
> +s:drop()
> +box.schema.user.revoke('guest', 'super')
> +box.cfg{                                                                        \
> +    replication_synchro_quorum = old_synchro_quorum,                            \
> +    replication_synchro_timeout = old_synchro_timeout,                          \
> +}
> diff --git a/test/replication/gh-5213-replica.lua b/test/replication/gh-5213-replica.lua
> new file mode 100644
> index 000000000..3ba90d7ab
> --- /dev/null
> +++ b/test/replication/gh-5213-replica.lua
> @@ -0,0 +1,10 @@
> +#!/usr/bin/env tarantool
> +
> +box.cfg({
> +    listen = os.getenv("LISTEN"),
> +    replication = os.getenv("MASTER"),
> +    read_only = true,
> +    memtx_use_mvcc_engine = true,
> +})
> +
> +require('console').listen(os.getenv('ADMIN'))
> diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
> index aff5fda26..ab25fa331 100644
> --- a/test/replication/suite.cfg
> +++ b/test/replication/suite.cfg
> @@ -15,6 +15,7 @@
>       "gh-3760-misc-return-on-quorum-0.test.lua": {},
>       "gh-4399-misc-no-failure-on-error-reading-wal.test.lua": {},
>       "gh-4424-misc-orphan-on-reconfiguration-error.test.lua": {},
> +    "gh-5213-qsync-applier-order.test.lua": {},
>       "gh-5426-election-on-off.test.lua": {},
>       "gh-5433-election-restart-recovery.test.lua": {},
>       "gh-5506-election-on-off.test.lua": {},
> diff --git a/test/replication/suite.ini b/test/replication/suite.ini
> index a9e44e8cf..a7d5f6129 100644
> --- a/test/replication/suite.ini
> +++ b/test/replication/suite.ini
> @@ -3,7 +3,7 @@ core = tarantool
>   script =  master.lua
>   description = tarantool/box, replication
>   disabled = consistent.test.lua
> -release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua
> +release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua
>   config = suite.cfg
>   lua_libs = lua/fast_replica.lua lua/rlimit.lua
>   use_unix_sockets = True

-- 
Serge Petrenko



* Re: [Tarantool-patches] [PATCH 1/1] applier: process synchro rows after WAL write
  2021-04-08  8:39 ` Serge Petrenko via Tarantool-patches
@ 2021-04-08 10:19   ` Cyrill Gorcunov via Tarantool-patches
  2021-04-08 10:32     ` Serge Petrenko via Tarantool-patches
  2021-04-08 22:56   ` Vladislav Shpilevoy via Tarantool-patches
  1 sibling, 1 reply; 10+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-04-08 10:19 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: Vladislav Shpilevoy, tarantool-patches

On Thu, Apr 08, 2021 at 11:39:03AM +0300, Serge Petrenko wrote:
> 
> Thanks for the patch!
> 
> I'm a bit worried about two different synchro rows coming from two
> appliers. Is everything ok in this case?

Serge, do you mean the scenario when some instances in the replicaset
have the patch applied and some do not?

> Or even normal rows coming from other appliers. Say some other replica
> has already applied this synchro row and even has written some rows on
> top of it. Its applier won't block on replica_id latch, and may fail to
> apply
> some txs following this synchro row, because it's not yet written to WAL
> and thus not applied (limbo is still not empty or belongs to other
> instance).
> 
> Looks like this won't be a problem once synchro rows start pinning the
> limbo to some specific replica. Because in this case only the replica that
> has issued confirm will be able to generate new rows. And these rows will
> be ordered by replica_id latch.
> 
> But still, maybe this is worth fixing?
> Am I missing something?

> > -	struct journal_entry journal_entry;
> > +	union {
> > +		struct journal_entry base;
> > +		char base_buf[sizeof(base) + sizeof(base.rows[0])];
> > +	};
> >   };
> 
> I don't understand this union stuff.
> The journal_entry is the last entry in synchro_entry anyway.
> Is this a hack for allowing to allocate synchro_entry on the stack?

Yeah, the journal_entry's last member is a zero-size array, so someone
has to preallocate memory for the rows, and using a union allows
squashing everything statically on the stack.
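
For illustration, here is a minimal standalone sketch of the same trick
with made-up names. It only shows the pattern, not the real journal_entry,
and relies on the flexible array fitting into the extra bytes provided by
the union (some compilers may warn about such accesses):

#include <stdio.h>

/* A struct ending with a flexible array member, like journal_entry. */
struct entry {
	int n_rows;
	const char *rows[];
};

/*
 * Overlay the struct with a char buffer sized for the base part plus
 * one row. A variable of this union type can live on the stack, so no
 * malloc() is needed.
 */
union entry_on_stack {
	struct entry base;
	char buf[sizeof(struct entry) + 1 * sizeof(const char *)];
};

int
main(void)
{
	union entry_on_stack e;
	e.base.n_rows = 1;
	e.base.rows[0] = "row payload";
	printf("%d %s\n", e.base.n_rows, e.base.rows[0]);
	return 0;
}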


* Re: [Tarantool-patches] [PATCH 1/1] applier: process synchro rows after WAL write
  2021-04-08 10:19   ` Cyrill Gorcunov via Tarantool-patches
@ 2021-04-08 10:32     ` Serge Petrenko via Tarantool-patches
  2021-04-08 10:46       ` Cyrill Gorcunov via Tarantool-patches
  2021-04-08 22:57       ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 2 replies; 10+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-08 10:32 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: Vladislav Shpilevoy, tarantool-patches



08.04.2021 13:19, Cyrill Gorcunov wrote:
> On Thu, Apr 08, 2021 at 11:39:03AM +0300, Serge Petrenko wrote:
>> Thanks for the patch!
>>
>> I'm a bit worried about two different synchro rows coming from two
>> appliers. Is everything ok in this case?
> Serge, you mean the scenario when some instances in replicaset
> have the patch applied and some are not?

No. Let's suppose everyone has this patch applied.
Now look at one particular instance. While one of its appliers is writing
this synchro row (either CONFIRM or ROLLBACK, doesn't matter), some other
applier may still apply requests coming from other replicaset members.

I was wondering what would happen if someone else sent this instance
another synchro row. Looks like nothing bad but I just wanted to
double-check.

And it looks like there's a bug, which I'm speaking of below. It's about
someone sending us normal rows (either synchronous or asynchronous
transactions, but not CONFIRM/ROLLBACK entries) while we're waiting for
the synchro row's write to end.

Say the limbo was owned by instance 1, and instance 2 has written CONFIRM
for everything there was. While we wait for 2's CONFIRM to be written to
WAL, we may receive some rows from instance 3, which has already applied
2's CONFIRM. Since we haven't written the CONFIRM yet, we haven't applied
it, and the limbo on our instance still isn't empty. All the rows coming
from 3 will get rejected, and replication between 3 and us will be broken.

>
>> Or even normal rows coming from other appliers. Say some other replica
>> has already applied this synchro row and even has written some rows on
>> top of it. Its applier won't block on replica_id latch, and may fail to
>> apply
>> some txs following this synchro row, because it's not yet written to WAL
>> and thus not applied (limbo is still not empty or belongs to other
>> instance).
>>
>> Looks like this won't be a problem once synchro rows start pinning the
>> limbo to some specific replica. Because in this case only the replica that
>> has issued confirm will be able to generate new rows. And these rows will
>> be ordered by replica_id latch.
>>
>> But still, maybe this is worth fixing?
>> Am I missing something?
>>> -	struct journal_entry journal_entry;
>>> +	union {
>>> +		struct journal_entry base;
>>> +		char base_buf[sizeof(base) + sizeof(base.rows[0])];
>>> +	};
>>>    };
>> I don't understand this union stuff.
>> The journal_entry is the last entry in synchro_entry anyway.
>> Is this a hack for allowing to allocate synchro_entry on the stack?
> Yeah, the journal_entry last member is zero size array so someone
> has to preallocate memory for rows and using union allows to squash
> everything statically on stack.

Ok, I see now, thanks for the explanation!


-- 
Serge Petrenko



* Re: [Tarantool-patches] [PATCH 1/1] applier: process synchro rows after WAL write
  2021-04-08 10:32     ` Serge Petrenko via Tarantool-patches
@ 2021-04-08 10:46       ` Cyrill Gorcunov via Tarantool-patches
  2021-04-08 22:57       ` Vladislav Shpilevoy via Tarantool-patches
  1 sibling, 0 replies; 10+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-04-08 10:46 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: Vladislav Shpilevoy, tarantool-patches

On Thu, Apr 08, 2021 at 01:32:06PM +0300, Serge Petrenko wrote:
> > Serge, you mean the scenario when some instances in replicaset
> > have the patch applied and some are not?
> 
> No. Let's suppose everyone has this patch applied.
> Now look at one particular instance. It may happen that while one of
> its appliers is writing this synchro row (either CONFIRM or ROLLBACK,
> doesn't matter), some other applier may still apply requests coming
> from other replicaset members.

Ah, indeed. I must confess I forgot that there are a number of applier
fibers, and a blocking write blocks only the fiber which initiated the
write procedure, not the other fibers.

> I was wondering what would happen if someone else sent this instance
> another synchro row. Looks like nothing bad but I just wanted to
> double-check.
> 
> And looks like there's a bug, which I'm speaking of below. It's about
> someone sending us normal rows (either synchronous transactions or
> asynchronous, but not CONFIRM/ROLLBACK entries) while we're waiting for
> syncro row's write to end.
> 
> Say, limbo was owned by instance 1, and instance 2 has written CONFIRM
> for everything there was. While we wait for 2's CONFIRM to be written to
> WAL, we may receive some rows from instance 3, who has already applied 2's
> CONFIRM. Since we haven't written the CONFIRM yet, we haven't applied it,
> and the limbo on our instance still isn't empty. All the rows coming from
> 3 will get rejected and replication between 3 and us will be broken.

Hmm, looks so, need to think about.


* Re: [Tarantool-patches] [PATCH 1/1] applier: process synchro rows after WAL write
  2021-04-08  8:39 ` Serge Petrenko via Tarantool-patches
  2021-04-08 10:19   ` Cyrill Gorcunov via Tarantool-patches
@ 2021-04-08 22:56   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-09  8:26     ` Serge Petrenko via Tarantool-patches
  1 sibling, 1 reply; 10+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-08 22:56 UTC (permalink / raw)
  To: Serge Petrenko, tarantool-patches, gorcunov

Hi! Thanks for the review!

> Thanks for the patch!
> 
> I'm a bit worried about two different synchro rows coming from two
> appliers. Is everything ok in this case?
> Or even normal rows coming from other appliers. Say some other replica
> has already applied this synchro row and even has written some rows on
> top of it. Its applier won't block on replica_id latch, and may fail to apply
> some txs following this synchro row, because it's not yet written to WAL
> and thus not applied (limbo is still not empty or belongs to other instance).

Should not happen, if I understood you correctly. See my response in the
other email + a test for test-run + a console test.

>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>> index 971b2e64c..f7c82548c 100644
>> --- a/src/box/applier.cc
>> +++ b/src/box/applier.cc
>> @@ -762,26 +762,21 @@ applier_txn_wal_write_cb(struct trigger *trigger, void *event)
>>   }
>>     struct synchro_entry {
>> -    /** Encoded form of a synchro record. */
>> -    struct synchro_body_bin    body_bin;
>> -
>> -    /** xrow to write, used by the journal engine. */
>> -    struct xrow_header row;
>> -
>> +    /** Request to process when WAL write is done. */
>> +    struct synchro_request *req;
>> +    /** Fiber created the entry. To wakeup when WAL write is done. */
>> +    struct fiber *owner;
>>       /**
>> -     * The journal entry itself. Note since
>> -     * it has unsized array it must be the
>> -     * last entry in the structure.
>> +     * The base journal entry. It has unsized array and then must be the
>> +     * last entry in the structure. But can workaround it via a union
>> +     * adding the needed tail as char[].
>>        */
>> -    struct journal_entry journal_entry;
>> +    union {
>> +        struct journal_entry base;
>> +        char base_buf[sizeof(base) + sizeof(base.rows[0])];
>> +    };
>>   };
> 
> I don't understand this union stuff.
> The journal_entry is the last entry in synchro_entry anyway.
> Is this a hack for allowing to allocate synchro_entry on the stack?

Yes. Otherwise I would need a more ugly hack:

	struct synchro_entry *e;
	alignas(alignof(*e)) char e_buf[sizeof(*e) + sizeof(e->base.rows[0])];
	e = (struct synchro_entry *)e_buf;

Then I managed to create the union hack, which I find
less ugly, and it seems to allow declaring it on the stack.
The interesting thing though is that I still can't make that
union anything but the last member.

I also see now there is a problem in one of the CI jobs:
https://github.com/tarantool/tarantool/runs/2291961909

It complains about an out-of-bounds array access, which has nothing to
do with the union, just a C++ issue I suppose. I fixed it in the new
version, see the other email.

>> diff --git a/test/replication/gh-5213-qsync-applier-order.test.lua b/test/replication/gh-5213-qsync-applier-order.test.lua
>> new file mode 100644
>> index 000000000..3a32626f2
>> --- /dev/null
>> +++ b/test/replication/gh-5213-qsync-applier-order.test.lua
>> @@ -0,0 +1,129 @@
>> +--
>> +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
>> +-- As a result it could happen that the transactions became visible on CONFIRM,
>> +-- then somehow weren't written to WAL, and on restart the data might not be
>> +-- visible again. Which means rollback of confirmed data and is not acceptable
>> +-- (on the contrary with commit after rollback).
>> +--
>> +test_run = require('test_run').new()
>> +fiber = require('fiber')
>> +old_synchro_quorum = box.cfg.replication_synchro_quorum
>> +old_synchro_timeout = box.cfg.replication_synchro_timeout
>> +
>> +box.schema.user.grant('guest', 'super')
>> +
>> +s = box.schema.space.create('test', {is_sync = true})
>> +_ = s:create_index('pk')
>> +
>> +test_run:cmd('create server replica with rpl_master=default,\
>> +              script="replication/gh-5213-replica.lua"')
>> +test_run:cmd('start server replica')
>> +
>> +test_run:switch('replica')
>> +assert(box.info.id == 2)
>> +lsn = box.info.vclock[1]
>> +
>> +test_run:switch('default')
>> +fiber = require('fiber')
>> +box.cfg{                                                                        \
>> +    replication_synchro_quorum = 3,                                             \
>> +    replication_synchro_timeout = 1000,                                         \
>> +}
>> +f = fiber.new(function() s:replace{1} end)
>> +
>> +test_run:switch('replica')
>> +-- Wait when the transaction is written to WAL.
>> +test_run:wait_cond(function() return box.info.vclock[1] == lsn + 1 end)
> 
> This shouldn't go wrong, but I find
> test_run:wait_lsn('replica, 'default') more durable.
> And you wouldn't need to save lsn above then.
> 
> Same for other usages of lsn = ... and wait_cond(box.info.vclock[1] == lsn + ...)
> 
> Up to you though.

I totally forgot about wait_lsn(), thanks for noticing.
I applied it. See the full diff in the second email.


* Re: [Tarantool-patches] [PATCH 1/1] applier: process synchro rows after WAL write
  2021-04-08 10:32     ` Serge Petrenko via Tarantool-patches
  2021-04-08 10:46       ` Cyrill Gorcunov via Tarantool-patches
@ 2021-04-08 22:57       ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-09  8:25         ` Serge Petrenko via Tarantool-patches
  1 sibling, 1 reply; 10+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-08 22:57 UTC (permalink / raw)
  To: Serge Petrenko, Cyrill Gorcunov; +Cc: tarantool-patches

On 08.04.2021 12:32, Serge Petrenko via Tarantool-patches wrote:
> 08.04.2021 13:19, Cyrill Gorcunov wrote:
>> On Thu, Apr 08, 2021 at 11:39:03AM +0300, Serge Petrenko wrote:
>>> Thanks for the patch!
>>>
>>> I'm a bit worried about two different synchro rows coming from two
>>> appliers. Is everything ok in this case?
>> Serge, you mean the scenario when some instances in replicaset
>> have the patch applied and some are not?
> 
> No. Let's suppose everyone has this patch applied.
> Now look at one particular instance. It may happen that while one of
> its appliers is writing this synchro row (either CONFIRM or ROLLBACK,
> doesn't matter), some other applier may still apply requests coming
> from other replicaset members.
> 
> I was wondering what would happen if someone else sent this instance
> another synchro row. Looks like nothing bad but I just wanted to
> double-check.
> 
> And looks like there's a bug, which I'm speaking of below. It's about
> someone sending us normal rows (either synchronous transactions or
> asynchronous, but not CONFIRM/ROLLBACK entries) while we're waiting for
synchro row's write to end.
> 
> Say, limbo was owned by instance 1, and instance 2 has written CONFIRM
> for everything there was. While we wait for 2's CONFIRM to be written to
> WAL, we may receive some rows from instance 3, who has already applied 2's
> CONFIRM. Since we haven't written the CONFIRM yet, we haven't applied it,
> and the limbo on our instance still isn't empty. All the rows coming from
> 3 will get rejected and replication between 3 and us will be broken.

## Regarding the described issue

This is a super good question, I am glad you noticed. At first I thought
it was a bug, but then I realized it is not possible.

If the instance 3 already wrote that CONFIRM from instance_id 1, it would
send it to us too before its own rows, and it would block on the latch of the
struct replica with instance_id 1 until the CONFIRM is finally written. Then
the second attempt to write it would be skipped and the rows of instance_id 3
would be applied fine.
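
Roughly, the ordering that saves us can be sketched like this. It is only an
illustration of the idea, not a verbatim copy of applier.cc; the field name
order_latch and the helper apply_row() are assumptions of the sketch:

/*
 * All rows carrying the same origin replica_id are serialized on one
 * latch, no matter which applier delivered them, and a row already
 * present in the local vclock is simply skipped.
 */
static int
apply_row_in_order(struct xrow_header *row)
{
	struct replica *origin = replica_by_id(row->replica_id);
	/* One latch per origin instance, shared by all appliers. */
	latch_lock(&origin->order_latch);
	if (vclock_get(&replicaset.vclock, row->replica_id) >= row->lsn) {
		/*
		 * Another applier already wrote this row, e.g. the same
		 * CONFIRM received twice - skip it.
		 */
		latch_unlock(&origin->order_latch);
		return 0;
	}
	/* The WAL write happens inside, still under the latch. */
	int rc = apply_row(row);
	latch_unlock(&origin->order_latch);
	return rc;
}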

I made a test which replayed the described situation and all worked fine. I
added it to the patch, and also I provide it below for easy reproduction in
3 consoles.

## A problem not related to the patch

However, a similar issue might happen, not related to the fixed bug.
Assume the limbo belongs to instance 1. Then instance 1 is gone, and
instance 2 tries to write CONFIRM on behalf of instance 1 using
box.ctl.clear_synchro_queue().

While the CONFIRM is in progress, it goes to WAL, from WAL it goes to
replication, from there it goes to instance 3. This one applies it, the limbo
becomes empty on the instance 3. It writes a new row, which is written to
its WAL, and is sent back to instance 2. Instance 2 still hasn't finished the
original CONFIRM WAL write, and its TX thread hasn't seen it yet. So its limbo
still has rows from instance 1, and the latch is not locked. Now the
replication between instances 2 and 3 is broken.

I didn't try to reproduce this craziness though. However I have an idea how
we could fix it "for good". Introduce a fiber read-write lock. All the
transactions take read-lock on the limbo. All CONFIRM/ROLLBACK/PROMOTE take
a write-lock on the limbo. Then all the limbo owner transitions would be
fixed.
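
A minimal sketch of the idea, built on top of the existing fiber_cond; the
limbo_rwlock struct and function names are made up for the example, nothing
like that exists in the tree:

struct limbo_rwlock {
	/** Number of plain transactions holding the read lock. */
	int readers;
	/** Set while a CONFIRM/ROLLBACK/PROMOTE holds the write lock. */
	bool has_writer;
	/** Signaled whenever the lock state changes. */
	struct fiber_cond cond;
};

static void
limbo_rwlock_read_lock(struct limbo_rwlock *l)
{
	while (l->has_writer)
		fiber_cond_wait(&l->cond);
	l->readers++;
}

static void
limbo_rwlock_read_unlock(struct limbo_rwlock *l)
{
	if (--l->readers == 0)
		fiber_cond_broadcast(&l->cond);
}

static void
limbo_rwlock_write_lock(struct limbo_rwlock *l)
{
	while (l->has_writer || l->readers > 0)
		fiber_cond_wait(&l->cond);
	l->has_writer = true;
}

static void
limbo_rwlock_write_unlock(struct limbo_rwlock *l)
{
	l->has_writer = false;
	fiber_cond_broadcast(&l->cond);
}

Plain transactions would take the read lock around adding themselves to the
limbo, synchro requests the write lock around their WAL write and limbo
processing, so an ownership change could not interleave with transactions
entering the limbo. Fairness and writer starvation are ignored in the sketch.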

I also thought more about how to evolve my idea about waiting for WAL write
only if we get a conflicting transaction. We need to know a transaction's
origin ID before calling txn_commit/txn_commit_try_async. To make it fast we
need to calculate it while creating txn_stmt objects so it is already known
right before commit. And here it is tricky, because locally-created
transactions might roll back individual statements using savepoints. That
makes the on-the-fly calculation of the origin_id not feasible. So far I
couldn't find what to do with that.

## The console test

It reproduces your scenario and all works fine, because the latch saves
our asses. You need 3 instances in different consoles and need to do the
steps in the order they are marked in the comments.

--
-- Instance 1
--
-- Step 1
box.cfg{
    listen = 3313,
    replication_synchro_quorum = 4,
    replication_synchro_timeout = 1000,
}
box.schema.user.grant('guest', 'super')
s = box.schema.create_space('test', {is_sync = true})
_ = s:create_index('pk')

-- Step 4
fiber = require('fiber')
f = fiber.new(function() s:replace{1} end)

-- Step 8
box.cfg{replication_synchro_quorum = 3}
while f:status() ~= 'dead' do fiber.sleep(0.1) end



--
-- Instance 2
--
-- Step 2
box.cfg{
    listen = 3314,
    replication = {'localhost:3313', 'localhost:3314', 'localhost:3315'},
    read_only = true,
}

-- Step 5
-- Wait vclock.

-- Step 7
box.error.injection.set("ERRINJ_WAL_DELAY", true)

-- Step 10
box.error.injection.set("ERRINJ_WAL_DELAY", false)



--
-- Instance 3
--
-- Step 3
box.cfg{
    listen = 3315,
    replication = {'localhost:3313', 'localhost:3314', 'localhost:3315'},
    read_only = true,
}

-- Step 6
-- Wait vclock.

-- Step 9
box.cfg{read_only = false}
s = box.space.test
box.cfg{
    replication_synchro_quorum = 2,
    replication_synchro_timeout = 1000,
}
s:replace{2}

## The patch

Here is a full diff of the patch. It includes the test above as a
new .test.lua file, and all review fixes.

====================

    applier: process synchro rows after WAL write
    
    Applier used to process synchronous rows CONFIRM and ROLLBACK
    right after receipt before they are written to WAL.
    
    That led to a bug that the confirmed data became visible, might be
    accessed by user requests, then the node restarted before CONFIRM
    finished its WAL write, and the data was not visible again. That
    is just like if it would be rolled back, which is not acceptable.
    
    Another case - CONFIRM WAL write could simply fail due to any
    reason (no disk space, OOM), but the transactions would remain
    confirmed anyway.
    
    Also that produced some hacks in the limbo's code to support the
    confirmation and rollback of transactions not yet written to WAL.
    
    The patch makes the synchro rows processed only after they are
    written to WAL. Although the 'rollback' case above might still
    happen if the xlogs were in the kernel caches, and the machine was
    powered off before they were flushed to disk. But that is not
    related to qsync specifically.
    
    To handle the synchro rows after WAL write the patch makes them go
    to WAL in a blocking way (journal_write() instead of
    journal_write_try_async()). Otherwise it could happen that a
    CONFIRM/ROLLBACK is being written to WAL and would clear the limbo
    afterwards, but a new transaction arrives with a different owner,
    and it conflicts with the current limbo owner.
    
    Closes #5213

diff --git a/changelogs/unreleased/applier-qsync-write-order.md b/changelogs/unreleased/applier-qsync-write-order.md
new file mode 100644
index 000000000..10b9ade80
--- /dev/null
+++ b/changelogs/unreleased/applier-qsync-write-order.md
@@ -0,0 +1,6 @@
+## bugfix/replication
+
+* Fix the bug when a synchronous transaction could be confirmed and visible on
+  a replica, but then not confirmed / invisible again after restart. Could
+  happen more likely on memtx spaces with `memtx_use_mvcc_engine` enabled
+  (gh-5213).
diff --git a/src/box/applier.cc b/src/box/applier.cc
index 971b2e64c..4898f9f7b 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -762,26 +762,21 @@ applier_txn_wal_write_cb(struct trigger *trigger, void *event)
 }
 
 struct synchro_entry {
-	/** Encoded form of a synchro record. */
-	struct synchro_body_bin	body_bin;
-
-	/** xrow to write, used by the journal engine. */
-	struct xrow_header row;
-
+	/** Request to process when WAL write is done. */
+	struct synchro_request *req;
+	/** Fiber created the entry. To wakeup when WAL write is done. */
+	struct fiber *owner;
 	/**
-	 * The journal entry itself. Note since
-	 * it has unsized array it must be the
-	 * last entry in the structure.
+	 * The base journal entry. It has unsized array and then must be the
+	 * last entry in the structure. But can workaround it via a union
+	 * adding the needed tail as char[].
 	 */
-	struct journal_entry journal_entry;
+	union {
+		struct journal_entry base;
+		char base_buf[sizeof(base) + sizeof(base.rows[0])];
+	};
 };
 
-static void
-synchro_entry_delete(struct synchro_entry *entry)
-{
-	free(entry);
-}
-
 /**
  * Async write journal completion.
  */
@@ -791,50 +786,15 @@ apply_synchro_row_cb(struct journal_entry *entry)
 	assert(entry->complete_data != NULL);
 	struct synchro_entry *synchro_entry =
 		(struct synchro_entry *)entry->complete_data;
-	if (entry->res < 0)
+	if (entry->res < 0) {
 		applier_rollback_by_wal_io();
-	else
+	} else {
+		txn_limbo_process(&txn_limbo, synchro_entry->req);
 		trigger_run(&replicaset.applier.on_wal_write, NULL);
-
-	synchro_entry_delete(synchro_entry);
-}
-
-/**
- * Allocate a new synchro_entry to be passed to
- * the journal engine in async write way.
- */
-static struct synchro_entry *
-synchro_entry_new(struct xrow_header *applier_row,
-		  struct synchro_request *req)
-{
-	struct synchro_entry *entry;
-	size_t size = sizeof(*entry) + sizeof(struct xrow_header *);
-
-	/*
-	 * For simplicity we use malloc here but
-	 * probably should provide some cache similar
-	 * to txn cache.
-	 */
-	entry = (struct synchro_entry *)malloc(size);
-	if (entry == NULL) {
-		diag_set(OutOfMemory, size, "malloc", "synchro_entry");
-		return NULL;
 	}
-
-	struct journal_entry *journal_entry = &entry->journal_entry;
-	struct synchro_body_bin *body_bin = &entry->body_bin;
-	struct xrow_header *row = &entry->row;
-
-	journal_entry->rows[0] = row;
-
-	xrow_encode_synchro(row, body_bin, req);
-
-	row->lsn = applier_row->lsn;
-	row->replica_id = applier_row->replica_id;
-
-	journal_entry_create(journal_entry, 1, xrow_approx_len(row),
-			     apply_synchro_row_cb, entry);
-	return entry;
+	/* The fiber is the same on final join. */
+	if (synchro_entry->owner != fiber())
+		fiber_wakeup(synchro_entry->owner);
 }
 
 /** Process a synchro request. */
@@ -847,14 +807,37 @@ apply_synchro_row(struct xrow_header *row)
 	if (xrow_decode_synchro(row, &req) != 0)
 		goto err;
 
-	txn_limbo_process(&txn_limbo, &req);
-
-	struct synchro_entry *entry;
-	entry = synchro_entry_new(row, &req);
-	if (entry == NULL)
-		goto err;
-
-	if (journal_write_try_async(&entry->journal_entry) != 0) {
+	struct synchro_entry entry;
+	/*
+	 * Rows array is cast from *[] to **, because otherwise g++ complains
+	 * about out of array bounds access.
+	 */
+	struct xrow_header **rows;
+	rows = entry.base.rows;
+	rows[0] = row;
====================
This is how I fixed the compilation issue. I couldn't reproduce it
locally in my GCC installed via brew, but it was failing in CI in
one job. A simple cast didn't help. Only creation of a new variable
did help. Also it is not assigned right away - the declaration is
separate. Otherwise 'goto's before that line do not work.
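
For reference, the C++ rule behind it in a tiny standalone form (my own
illustration, not code from the tree). g++ rejects the first variant with
"jump to label crosses initialization":

struct row;
extern struct row *fetch_row(void);
extern void use_row(struct row *r);

int
bad(int fail)
{
	if (fail)
		goto err;
	/* Does NOT compile as C++: the goto above skips this initialization. */
	struct row *r = fetch_row();
	use_row(r);
	return 0;
err:
	return -1;
}

int
good(int fail)
{
	/* Plain declaration without an initializer - the jump is allowed. */
	struct row *r;
	if (fail)
		goto err;
	r = fetch_row();	/* Assignment happens separately. */
	use_row(r);
	return 0;
err:
	return -1;
}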
====================
+	journal_entry_create(&entry.base, 1, xrow_approx_len(row),
+			     apply_synchro_row_cb, &entry);
+	entry.req = &req;
+	entry.owner = fiber();
+	/*
+	 * The WAL write is blocking. Otherwise it might happen that a CONFIRM
+	 * or ROLLBACK is sent to WAL, and it would empty the limbo, but before
+	 * it is written, more transactions arrive with a different owner. They
+	 * won't be able to enter the limbo due to owner ID mismatch. Hence the
+	 * synchro rows must block receipt of new transactions.
+	 *
+	 * Don't forget to return -1 both if the journal write failed right
+	 * away, and if it failed inside of WAL thread (res < 0). Otherwise the
+	 * caller would propagate committed vclock to this row thinking it was
+	 * a success.
+	 *
+	 * XXX: in theory it could be done vice-versa. The write could be made
+	 * non-blocking, and instead the potentially conflicting transactions
+	 * could try to wait for all the current synchro WAL writes to end
+	 * before trying to commit. But that requires extra steps from the
+	 * transactions side, including the async ones.
+	 */
+	if (journal_write(&entry.base) != 0 || entry.base.res < 0) {
 		diag_set(ClientError, ER_WAL_IO);
 		goto err;
 	}
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index cf0ad9350..c63211191 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -400,20 +400,11 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
 		txn_limbo_remove(limbo, e);
 		txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK);
 		/*
-		 * If already written to WAL by now, finish tx processing.
-		 * Otherwise just clear the sync flags. Tx procesing will finish
-		 * automatically once the tx is written to WAL.
-		 *
-		 * XXX: Normally at this point all transactions covered by this
-		 * CONFIRM should be in WAL already, but there is a bug, that
-		 * replica always processes received synchro requests *before*
-		 * writing them to WAL. So it can happen, that a CONFIRM is
-		 * 'read', but the transaction is not written yet. Should be
-		 * fixed when the replica will behave properly, and then this
-		 * branch won't exist.
+		 * Should be written to WAL by now. Confirm is always written
+		 * after the affected transactions.
 		 */
-		if (e->txn->signature >= 0)
-			txn_complete_success(e->txn);
+		assert(e->txn->signature >= 0);
+		txn_complete_success(e->txn);
 	}
 	/* Update is_ro once the limbo is clear. */
 	if (txn_limbo_is_empty(limbo))
@@ -455,30 +446,13 @@ txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
 	rlist_foreach_entry_safe_reverse(e, &limbo->queue, in_queue, tmp) {
 		txn_limbo_abort(limbo, e);
 		txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK);
-		if (e->txn->signature >= 0) {
-			/* Rollback the transaction. */
-			e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK;
-			txn_complete_fail(e->txn);
-		} else {
-			/*
-			 * Rollback the transaction, but don't free it yet. It
-			 * will be freed after its WAL write is completed.
-			 *
-			 * XXX: Normally at this point all transactions covered
-			 * by this ROLLBACK should be in WAL already, but there
-			 * is a bug, that replica always processes received
-			 * synchro requests *before* writing them to WAL. So it
-			 * can happen, that a ROLLBACK is 'read', but the
-			 * transaction is not written yet. Should be fixed when
-			 * the replica will behave properly, and then this
-			 * branch won't exist.
-			 */
-			e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK;
-			struct fiber *fiber = e->txn->fiber;
-			e->txn->fiber = fiber();
-			txn_complete_fail(e->txn);
-			e->txn->fiber = fiber;
-		}
+		/*
+		 * Should be written to WAL by now. Rollback is always written
+		 * after the affected transactions.
+		 */
+		assert(e->txn->signature >= 0);
+		e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK;
+		txn_complete_fail(e->txn);
 		if (e == last_rollback)
 			break;
 	}
diff --git a/test/replication/gh-5213-qsync-applier-order-3.result b/test/replication/gh-5213-qsync-applier-order-3.result
new file mode 100644
index 000000000..bcb18b5c0
--- /dev/null
+++ b/test/replication/gh-5213-qsync-applier-order-3.result
@@ -0,0 +1,263 @@
+-- test-run result file version 2
+--
+-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
+-- As a result it could happen that the transactions became visible on CONFIRM,
+-- then somehow weren't written to WAL, and on restart the data might not be
+-- visible again. Which means rollback of confirmed data and is not acceptable
+-- (on the contrary with commit after rollback).
+--
+-- To fix that there was a patch making synchro rows processing after WAL write.
+-- As a result, the following situation could happen. Instance 1 owns the limbo,
+-- instances 2 and 3 know about that. The limbo is not empty. Now instance 1
+-- writes CONFIRM and sends it to the instances 2 and 3. Both start a WAL write
+-- for the CONFIRM. Now instance 3 finishes it, creates a new synchro
+-- transaction, owns the local limbo, and sends the transaction to the instance
+-- 2. Here the CONFIRM WAL write is not done yet. It should not happen, that
+-- the instance 3's new transaction is rejected. Because firstly instance 3 will
+-- send the same instance 1's CONFIRM to the instance 2 due to it being earlier
+-- in WAL of instance 3. Then on instance 2 it will block on a latch with
+-- replica_id 1 until the original CONFIRM received from the instance 1 is
+-- finished. Afterwards the instance 3's transaction is applied just fine - the
+-- limbo on instance 2 is empty now.
+--
+-- It is not related to read-views, but could break the replication.
+--
+test_run = require('test_run').new()
+ | ---
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+old_synchro_quorum = box.cfg.replication_synchro_quorum
+ | ---
+ | ...
+old_synchro_timeout = box.cfg.replication_synchro_timeout
+ | ---
+ | ...
+
+box.schema.user.grant('guest', 'super')
+ | ---
+ | ...
+
+s = box.schema.space.create('test', {is_sync = true})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+
+test_run:cmd('create server replica1 with rpl_master=default,\
+              script="replication/replica1.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica1')
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('create server replica2 with rpl_master=default,\
+              script="replication/replica2.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica2')
+ | ---
+ | - true
+ | ...
+
+-- Build mutual replication between replica1 and replica2.
+test_run:switch('replica1')
+ | ---
+ | - true
+ | ...
+replication = box.cfg.replication
+ | ---
+ | ...
+table.insert(replication, test_run:eval('replica2', 'return box.cfg.listen')[1])
+ | ---
+ | ...
+box.cfg{replication = {}}
+ | ---
+ | ...
+box.cfg{replication = replication}
+ | ---
+ | ...
+
+test_run:switch('replica2')
+ | ---
+ | - true
+ | ...
+replication = box.cfg.replication
+ | ---
+ | ...
+table.insert(replication, test_run:eval('replica1', 'return box.cfg.listen')[1])
+ | ---
+ | ...
+box.cfg{replication = {}}
+ | ---
+ | ...
+box.cfg{replication = replication}
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+box.cfg{                                                                        \
+    replication_synchro_quorum = 4,                                             \
+    replication_synchro_timeout = 1000,                                         \
+}
+ | ---
+ | ...
+-- Send a transaction to all 3 nodes. The limbo is owned by the default node
+-- everywhere.
+f = fiber.new(function() s:replace{1} end)
+ | ---
+ | ...
+test_run:wait_lsn('replica1', 'default')
+ | ---
+ | ...
+test_run:wait_lsn('replica2', 'default')
+ | ---
+ | ...
+
+-- Make so the replica1 will apply CONFIRM from the default instance for a long
+-- time.
+test_run:switch('replica1')
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
+ | ---
+ | - ok
+ | ...
+
+-- Emit the CONFIRM.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum = 3}
+ | ---
+ | ...
+test_run:wait_cond(function() return f:status() == 'dead' end)
+ | ---
+ | - true
+ | ...
+
+-- It hangs on the replica1.
+test_run:switch('replica1')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function()                                                   \
+    return box.error.injection.get('ERRINJ_WAL_DELAY')                          \
+end)
+ | ---
+ | - true
+ | ...
+
+-- But is applied on the replica2. The limbo is empty here now.
+test_run:switch('replica2')
+ | ---
+ | - true
+ | ...
+test_run:wait_lsn('replica2', 'default')
+ | ---
+ | ...
+box.cfg{                                                                        \
+    replication_synchro_quorum = 2,                                             \
+    replication_synchro_timeout = 1000,                                         \
+}
+ | ---
+ | ...
+-- Replica2 takes the limbo ownership and sends the transaction to the replica1.
+-- Along with the CONFIRM from the default node, which is still not applied
+-- on the replica1.
+fiber = require('fiber')
+ | ---
+ | ...
+f = fiber.new(function() box.space.test:replace{2} end)
+ | ---
+ | ...
+
+test_run:switch('replica1')
+ | ---
+ | - true
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+-- WAL write of the CONFIRM from the default node still is not done. Give it
+-- some time to get the new rows from the replica2 and block on the latch.
+-- Can't catch it anyhow via conds, so the only way is to sleep a bit.
+fiber.sleep(0.1)
+ | ---
+ | ...
+-- Let the WAL writes finish. Firstly CONFIRM is finished, the limbo is emptied
+-- and the replica_id 1 latch is unlocked. Now the replica2 transaction is
+-- applied and persisted.
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+ | ---
+ | - ok
+ | ...
+test_run:wait_lsn('replica1', 'replica2')
+ | ---
+ | ...
+box.space.test:get({2})
+ | ---
+ | - [2]
+ | ...
+
+-- Ensure the replication works fine, nothing is broken.
+test_run:wait_upstream(test_run:get_server_id('replica2'), {status = 'follow'})
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica2')
+ | ---
+ | - true
+ | ...
+test_run:wait_upstream(test_run:get_server_id('replica1'), {status = 'follow'})
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica1')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica1')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica2')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica2')
+ | ---
+ | - true
+ | ...
+s:drop()
+ | ---
+ | ...
+box.schema.user.revoke('guest', 'super')
+ | ---
+ | ...
+box.cfg{                                                                        \
+    replication_synchro_quorum = old_synchro_quorum,                            \
+    replication_synchro_timeout = old_synchro_timeout,                          \
+}
+ | ---
+ | ...
diff --git a/test/replication/gh-5213-qsync-applier-order-3.test.lua b/test/replication/gh-5213-qsync-applier-order-3.test.lua
new file mode 100644
index 000000000..37b569da7
--- /dev/null
+++ b/test/replication/gh-5213-qsync-applier-order-3.test.lua
@@ -0,0 +1,125 @@
+--
+-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
+-- As a result it could happen that the transactions became visible on CONFIRM,
+-- then somehow weren't written to WAL, and on restart the data might not be
+-- visible again. Which means rollback of confirmed data and is not acceptable
+-- (on the contrary with commit after rollback).
+--
+-- To fix that there was a patch making synchro rows processing after WAL write.
+-- As a result, the following situation could happen. Instance 1 owns the limbo,
+-- instances 2 and 3 know about that. The limbo is not empty. Now instance 1
+-- writes CONFIRM and sends it to the instances 2 and 3. Both start a WAL write
+-- for the CONFIRM. Now instance 3 finishes it, creates a new synchro
+-- transaction, owns the local limbo, and sends the transaction to the instance
+-- 2. Here the CONFIRM WAL write is not done yet. It should not happen, that
+-- the instance 3's new transaction is rejected. Because firstly instance 3 will
+-- send the same instance 1's CONFIRM to the instance 2 due to it being earlier
+-- in WAL of instance 3. Then on instance 2 it will block on a latch with
+-- replica_id 1 until the original CONFIRM received from the instance 1 is
+-- finished. Afterwards the instance 3's transaction is applied just fine - the
+-- limbo on instance 2 is empty now.
+--
+-- It is not related to read-views, but could break the replication.
+--
+test_run = require('test_run').new()
+fiber = require('fiber')
+old_synchro_quorum = box.cfg.replication_synchro_quorum
+old_synchro_timeout = box.cfg.replication_synchro_timeout
+
+box.schema.user.grant('guest', 'super')
+
+s = box.schema.space.create('test', {is_sync = true})
+_ = s:create_index('pk')
+
+test_run:cmd('create server replica1 with rpl_master=default,\
+              script="replication/replica1.lua"')
+test_run:cmd('start server replica1')
+
+test_run:cmd('create server replica2 with rpl_master=default,\
+              script="replication/replica2.lua"')
+test_run:cmd('start server replica2')
+
+-- Build mutual replication between replica1 and replica2.
+test_run:switch('replica1')
+replication = box.cfg.replication
+table.insert(replication, test_run:eval('replica2', 'return box.cfg.listen')[1])
+box.cfg{replication = {}}
+box.cfg{replication = replication}
+
+test_run:switch('replica2')
+replication = box.cfg.replication
+table.insert(replication, test_run:eval('replica1', 'return box.cfg.listen')[1])
+box.cfg{replication = {}}
+box.cfg{replication = replication}
+
+test_run:switch('default')
+fiber = require('fiber')
+box.cfg{                                                                        \
+    replication_synchro_quorum = 4,                                             \
+    replication_synchro_timeout = 1000,                                         \
+}
+-- Send a transaction to all 3 nodes. The limbo is owned by the default node
+-- everywhere.
+f = fiber.new(function() s:replace{1} end)
+test_run:wait_lsn('replica1', 'default')
+test_run:wait_lsn('replica2', 'default')
+
+-- Make so the replica1 will apply CONFIRM from the default instance for a long
+-- time.
+test_run:switch('replica1')
+box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
+
+-- Emit the CONFIRM.
+test_run:switch('default')
+box.cfg{replication_synchro_quorum = 3}
+test_run:wait_cond(function() return f:status() == 'dead' end)
+
+-- It hangs on the replica1.
+test_run:switch('replica1')
+test_run:wait_cond(function()                                                   \
+    return box.error.injection.get('ERRINJ_WAL_DELAY')                          \
+end)
+
+-- But is applied on the replica2. The limbo is empty here now.
+test_run:switch('replica2')
+test_run:wait_lsn('replica2', 'default')
+box.cfg{                                                                        \
+    replication_synchro_quorum = 2,                                             \
+    replication_synchro_timeout = 1000,                                         \
+}
+-- Replica2 takes the limbo ownership and sends the transaction to the replica1.
+-- Along with the CONFIRM from the default node, which is still not applied
+-- on the replica1.
+fiber = require('fiber')
+f = fiber.new(function() box.space.test:replace{2} end)
+
+test_run:switch('replica1')
+fiber = require('fiber')
+-- WAL write of the CONFIRM from the default node still is not done. Give it
+-- some time to get the new rows from the replica2 and block on the latch.
+-- Can't catch it anyhow via conds, so the only way is to sleep a bit.
+fiber.sleep(0.1)
+-- Let the WAL writes finish. Firstly CONFIRM is finished, the limbo is emptied
+-- and the replica_id 1 latch is unlocked. Now the replica2 transaction is
+-- applied and persisted.
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+test_run:wait_lsn('replica1', 'replica2')
+box.space.test:get({2})
+
+-- Ensure the replication works fine, nothing is broken.
+test_run:wait_upstream(test_run:get_server_id('replica2'), {status = 'follow'})
+
+test_run:switch('replica2')
+test_run:wait_upstream(test_run:get_server_id('replica1'), {status = 'follow'})
+
+test_run:switch('default')
+test_run:cmd('stop server replica1')
+test_run:cmd('delete server replica1')
+test_run:cmd('stop server replica2')
+test_run:cmd('delete server replica2')
+s:drop()
+box.schema.user.revoke('guest', 'super')
+box.cfg{                                                                        \
+    replication_synchro_quorum = old_synchro_quorum,                            \
+    replication_synchro_timeout = old_synchro_timeout,                          \
+}
diff --git a/test/replication/gh-5213-qsync-applier-order.result b/test/replication/gh-5213-qsync-applier-order.result
new file mode 100644
index 000000000..a8c24c289
--- /dev/null
+++ b/test/replication/gh-5213-qsync-applier-order.result
@@ -0,0 +1,302 @@
+-- test-run result file version 2
+--
+-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
+-- As a result it could happen that the transactions became visible on CONFIRM,
+-- then somehow weren't written to WAL, and on restart the data might not be
+-- visible again. Which means rollback of confirmed data and is not acceptable
+-- (on the contrary with commit after rollback).
+--
+test_run = require('test_run').new()
+ | ---
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+old_synchro_quorum = box.cfg.replication_synchro_quorum
+ | ---
+ | ...
+old_synchro_timeout = box.cfg.replication_synchro_timeout
+ | ---
+ | ...
+
+box.schema.user.grant('guest', 'super')
+ | ---
+ | ...
+
+s = box.schema.space.create('test', {is_sync = true})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+
+test_run:cmd('create server replica with rpl_master=default,\
+              script="replication/gh-5213-replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica')
+ | ---
+ | - true
+ | ...
+
+fiber = require('fiber')
+ | ---
+ | ...
+box.cfg{                                                                        \
+    replication_synchro_quorum = 3,                                             \
+    replication_synchro_timeout = 1000,                                         \
+}
+ | ---
+ | ...
+f = fiber.new(function() s:replace{1} end)
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+-- Wait when the transaction is written to WAL.
+test_run:wait_lsn('replica', 'default')
+ | ---
+ | ...
+s = box.space.test
+ | ---
+ | ...
+-- But not visible yet. Because MVCC is on, and no CONFIRM yet.
+assert(s:get({1}) == nil)
+ | ---
+ | - true
+ | ...
+-- Block the incoming CONFIRM to be able to ensure the data is not visible until
+-- WAL write ends.
+box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
+ | ---
+ | - ok
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum = 2}
+ | ---
+ | ...
+test_run:wait_cond(function() return f:status() == 'dead' end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+-- Wait when CONFIRM is in the WAL thread.
+test_run:wait_cond(function()                                                   \
+    return box.error.injection.get('ERRINJ_WAL_DELAY')                          \
+end)
+ | ---
+ | - true
+ | ...
+assert(s:get({1}) == nil)
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+ | ---
+ | - ok
+ | ...
+-- After CONFIRM is in WAL, the transaction is committed and its data is
+-- visible.
+test_run:wait_cond(function() return s:get({1}) ~= nil end)
+ | ---
+ | - true
+ | ...
+
+--
+-- Ensure CONFIRM WAL write fail also works fine when couldn't even start a WAL
+-- write. The data remains not confirmed until WAL write succeeds.
+--
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum = 3}
+ | ---
+ | ...
+f = fiber.new(function() s:replace{2} end)
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_lsn('replica', 'default')
+ | ---
+ | ...
+assert(s:get({2}) == nil)
+ | ---
+ | - true
+ | ...
+-- Make journal write fail immediately.
+box.error.injection.set('ERRINJ_WAL_IO', true)
+ | ---
+ | - ok
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum = 2}
+ | ---
+ | ...
+test_run:wait_cond(function() return f:status() == 'dead' end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_upstream(1, {status = 'stopped'})
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_WAL_IO', false)
+ | ---
+ | - ok
+ | ...
+assert(s:get({2}) == nil)
+ | ---
+ | - true
+ | ...
+-- Re-subscribe.
+replication = box.cfg.replication
+ | ---
+ | ...
+box.cfg{replication = {}}
+ | ---
+ | ...
+box.cfg{replication = replication}
+ | ---
+ | ...
+test_run:wait_lsn('replica', 'default')
+ | ---
+ | ...
+test_run:wait_upstream(1, {status = 'follow'})
+ | ---
+ | - true
+ | ...
+assert(s:get({2}) ~= nil)
+ | ---
+ | - true
+ | ...
+
+--
+-- Ensure CONFIRM WAL write fail works fine when an error happens inside WAL
+-- thread. The data remains not confirmed until WAL write succeeds.
+--
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum = 3}
+ | ---
+ | ...
+f = fiber.new(function() s:replace{3} end)
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_lsn('replica', 'default')
+ | ---
+ | ...
+assert(s:get({3}) == nil)
+ | ---
+ | - true
+ | ...
+-- Journal write start is going to succeed, but it will fail later on return
+-- from the WAL thread.
+box.error.injection.set('ERRINJ_WAL_ROTATE', true)
+ | ---
+ | - ok
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum = 2}
+ | ---
+ | ...
+test_run:wait_cond(function() return f:status() == 'dead' end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_upstream(1, {status = 'stopped'})
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_WAL_ROTATE', false)
+ | ---
+ | - ok
+ | ...
+assert(s:get({3}) == nil)
+ | ---
+ | - true
+ | ...
+box.cfg{replication = {}}
+ | ---
+ | ...
+box.cfg{replication = replication}
+ | ---
+ | ...
+test_run:wait_lsn('replica', 'default')
+ | ---
+ | ...
+test_run:wait_upstream(1, {status = 'follow'})
+ | ---
+ | - true
+ | ...
+assert(s:get({3}) ~= nil)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+
+s:drop()
+ | ---
+ | ...
+box.schema.user.revoke('guest', 'super')
+ | ---
+ | ...
+box.cfg{                                                                        \
+    replication_synchro_quorum = old_synchro_quorum,                            \
+    replication_synchro_timeout = old_synchro_timeout,                          \
+}
+ | ---
+ | ...
diff --git a/test/replication/gh-5213-qsync-applier-order.test.lua b/test/replication/gh-5213-qsync-applier-order.test.lua
new file mode 100644
index 000000000..f1eccfa84
--- /dev/null
+++ b/test/replication/gh-5213-qsync-applier-order.test.lua
@@ -0,0 +1,122 @@
+--
+-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
+-- As a result it could happen that the transactions became visible on CONFIRM,
+-- then somehow weren't written to WAL, and on restart the data might not be
+-- visible again. Which means rollback of confirmed data and is not acceptable
+-- (on the contrary with commit after rollback).
+--
+test_run = require('test_run').new()
+fiber = require('fiber')
+old_synchro_quorum = box.cfg.replication_synchro_quorum
+old_synchro_timeout = box.cfg.replication_synchro_timeout
+
+box.schema.user.grant('guest', 'super')
+
+s = box.schema.space.create('test', {is_sync = true})
+_ = s:create_index('pk')
+
+test_run:cmd('create server replica with rpl_master=default,\
+              script="replication/gh-5213-replica.lua"')
+test_run:cmd('start server replica')
+
+fiber = require('fiber')
+box.cfg{                                                                        \
+    replication_synchro_quorum = 3,                                             \
+    replication_synchro_timeout = 1000,                                         \
+}
+f = fiber.new(function() s:replace{1} end)
+
+test_run:switch('replica')
+-- Wait when the transaction is written to WAL.
+test_run:wait_lsn('replica', 'default')
+s = box.space.test
+-- But not visible yet. Because MVCC is on, and no CONFIRM yet.
+assert(s:get({1}) == nil)
+-- Block the incoming CONFIRM to be able to ensure the data is not visible until
+-- WAL write ends.
+box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
+
+test_run:switch('default')
+box.cfg{replication_synchro_quorum = 2}
+test_run:wait_cond(function() return f:status() == 'dead' end)
+
+test_run:switch('replica')
+-- Wait when CONFIRM is in the WAL thread.
+test_run:wait_cond(function()                                                   \
+    return box.error.injection.get('ERRINJ_WAL_DELAY')                          \
+end)
+assert(s:get({1}) == nil)
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+-- After CONFIRM is in WAL, the transaction is committed and its data is
+-- visible.
+test_run:wait_cond(function() return s:get({1}) ~= nil end)
+
+--
+-- Ensure CONFIRM WAL write fail also works fine when couldn't even start a WAL
+-- write. The data remains not confirmed until WAL write succeeds.
+--
+test_run:switch('default')
+box.cfg{replication_synchro_quorum = 3}
+f = fiber.new(function() s:replace{2} end)
+
+test_run:switch('replica')
+test_run:wait_lsn('replica', 'default')
+assert(s:get({2}) == nil)
+-- Make journal write fail immediately.
+box.error.injection.set('ERRINJ_WAL_IO', true)
+
+test_run:switch('default')
+box.cfg{replication_synchro_quorum = 2}
+test_run:wait_cond(function() return f:status() == 'dead' end)
+
+test_run:switch('replica')
+test_run:wait_upstream(1, {status = 'stopped'})
+box.error.injection.set('ERRINJ_WAL_IO', false)
+assert(s:get({2}) == nil)
+-- Re-subscribe.
+replication = box.cfg.replication
+box.cfg{replication = {}}
+box.cfg{replication = replication}
+test_run:wait_lsn('replica', 'default')
+test_run:wait_upstream(1, {status = 'follow'})
+assert(s:get({2}) ~= nil)
+
+--
+-- Ensure CONFIRM WAL write fail works fine when an error happens inside WAL
+-- thread. The data remains not confirmed until WAL write succeeds.
+--
+test_run:switch('default')
+box.cfg{replication_synchro_quorum = 3}
+f = fiber.new(function() s:replace{3} end)
+
+test_run:switch('replica')
+test_run:wait_lsn('replica', 'default')
+assert(s:get({3}) == nil)
+-- Journal write start is going to succeed, but it will fail later on return
+-- from the WAL thread.
+box.error.injection.set('ERRINJ_WAL_ROTATE', true)
+
+test_run:switch('default')
+box.cfg{replication_synchro_quorum = 2}
+test_run:wait_cond(function() return f:status() == 'dead' end)
+
+test_run:switch('replica')
+test_run:wait_upstream(1, {status = 'stopped'})
+box.error.injection.set('ERRINJ_WAL_ROTATE', false)
+assert(s:get({3}) == nil)
+box.cfg{replication = {}}
+box.cfg{replication = replication}
+test_run:wait_lsn('replica', 'default')
+test_run:wait_upstream(1, {status = 'follow'})
+assert(s:get({3}) ~= nil)
+
+test_run:switch('default')
+test_run:cmd('stop server replica')
+test_run:cmd('delete server replica')
+
+s:drop()
+box.schema.user.revoke('guest', 'super')
+box.cfg{                                                                        \
+    replication_synchro_quorum = old_synchro_quorum,                            \
+    replication_synchro_timeout = old_synchro_timeout,                          \
+}
diff --git a/test/replication/gh-5213-replica.lua b/test/replication/gh-5213-replica.lua
new file mode 100644
index 000000000..3ba90d7ab
--- /dev/null
+++ b/test/replication/gh-5213-replica.lua
@@ -0,0 +1,10 @@
+#!/usr/bin/env tarantool
+
+box.cfg({
+    listen = os.getenv("LISTEN"),
+    replication = os.getenv("MASTER"),
+    read_only = true,
+    memtx_use_mvcc_engine = true,
+})
+
+require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index aff5fda26..4a9ca0a46 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -15,6 +15,8 @@
     "gh-3760-misc-return-on-quorum-0.test.lua": {},
     "gh-4399-misc-no-failure-on-error-reading-wal.test.lua": {},
     "gh-4424-misc-orphan-on-reconfiguration-error.test.lua": {},
+    "gh-5213-qsync-applier-order.test.lua": {},
+    "gh-5213-qsync-applier-order-3.test.lua": {},
     "gh-5426-election-on-off.test.lua": {},
     "gh-5433-election-restart-recovery.test.lua": {},
     "gh-5506-election-on-off.test.lua": {},
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index a9e44e8cf..1d9c0a4ae 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -3,7 +3,7 @@ core = tarantool
 script =  master.lua
 description = tarantool/box, replication
 disabled = consistent.test.lua
-release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua
+release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua
 config = suite.cfg
 lua_libs = lua/fast_replica.lua lua/rlimit.lua
 use_unix_sockets = True

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/1] applier: process synchro rows after WAL write
  2021-04-08 22:57       ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-09  8:25         ` Serge Petrenko via Tarantool-patches
  2021-04-09 21:32           ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 10+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-09  8:25 UTC (permalink / raw)
  To: Vladislav Shpilevoy, Cyrill Gorcunov; +Cc: tarantool-patches



09.04.2021 01:57, Vladislav Shpilevoy wrote:
> On 08.04.2021 12:32, Serge Petrenko via Tarantool-patches wrote:
>> 08.04.2021 13:19, Cyrill Gorcunov wrote:
>>> On Thu, Apr 08, 2021 at 11:39:03AM +0300, Serge Petrenko wrote:
>>>> Thanks for the patch!
>>>>
>>>> I'm a bit worried about two different synchro rows coming from two
>>>> appliers. Is everything ok in this case?
>>> Serge, you mean the scenario when some instances in replicaset
>>> have the patch applied and some are not?
>> No. Let's suppose everyone has this patch applied.
>> Now look at one particular instance. It may happen that while one of
>> its appliers is writing this synchro row (either CONFIRM or ROLLBACK,
>> doesn't matter), some other applier may still apply requests coming
>> from other replicaset members.
>>
>> I was wondering what would happen if someone else sent this instance
>> another synchro row. Looks like nothing bad but I just wanted to
>> double-check.
>>
>> And looks like there's a bug, which I'm speaking of below. It's about
>> someone sending us normal rows (either synchronous transactions or
>> asynchronous, but not CONFIRM/ROLLBACK entries) while we're waiting for
>> synchro row's write to end.
>>
>> Say, limbo was owned by instance 1, and instance 2 has written CONFIRM
>> for everything there was. While we wait for 2's CONFIRM to be written to
>> WAL, we may receive some rows from instance 3, who has already applied 2's
>> CONFIRM. Since we haven't written the CONFIRM yet, we haven't applied it,
>> and the limbo on our instance still isn't empty. All the rows coming from
>> 3 will get rejected and replication between 3 and us will be broken.
> ## Regarding the described issue
>
> This is a super good question, I am glad you noticed. At first I thought
> it was a bug, but then I realized it is not possible.
>
> If the instance 3 already wrote that CONFIRM from instance_id 1, it would
> send it to us too before its own rows, and it would block on the latch of the
> struct replica with instance_id 1 until the CONFIRM is finally written. Then
> the second attempt to write it would be skipped and the rows of instance_id 3
> would be applied fine.
>
> I made a test which replayed the described situation and all worked fine. I
> added it to the patch, and also I provide it below for easy reproduction in
> 3 consoles.

Yes, indeed. Thanks for the explanation!

>
> ## A problem not related to the patch
>
> However, a similar issue might happen, not related to the fixed bug.
> Assume the limbo belongs to instance 1. Then instance 1 is gone, and
> instance 2 tries to write CONFIRM on behalf of instance 1 using
> box.ctl.clear_synchro_queue().
>
> While the CONFIRM is in progress, it goes to WAL, from WAL it goes to
> replication, from there it goes to instance 3. This one applies it, the limbo
> becomes empty on the instance 3. It writes a new row, which is written to
> its WAL, and is sent back to instance 2. Instance 2 still hasn't finished the
> original CONFIRM WAL write, and its TX thread hasn't seen it yet. So its limbo
> still has rows from instance 1, and the latch is not locked. Now the
> replication between instances 2 and 3 is broken.

Yes, that's possible. I've also thought of a lock on CONFIRM, etc.

>
> I didn't try to reproduce this craziness though. However I have an idea how
> we could fix it "for good". Introduce a fiber read-write lock. All the
> transactions take read-lock on the limbo. All CONFIRM/ROLLBACK/PROMOTE take
> a write-lock on the limbo. Then all the limbo owner transitions would be
> fixed.

Sounds good. Isn't it enough to only have a write lock though?

>
> I also thought more about how to evolve my idea about waiting for WAL write
> only if we get a conflicting transaction. We need to know a transaction's
> origin ID before calling txn_commit/txn_commit_try_async. To make it fast we
> need to calculate it while creating txn_stmt objects so it is already known
> right before commit. And here it is tricky, because locally-created
> transactions might roll back individual statements using savepoints. That
> makes the on-the-fly calculation of the origin_id not feasible. So far I
> couldn't find what to do with that.
>
> ## The console test
>
> It reproduces your scenario and all works fine, because the latch saves
> our asses. You need 3 instances in different consoles and need to do the
> steps in the order they are marked in the comments.

The test's fine. Thanks for checking this!

>
> --
> -- Instance 1
> --
> -- Step 1
> box.cfg{
>      listen = 3313,
>      replication_synchro_quorum = 4,
>      replication_synchro_timeout = 1000,
> }
> box.schema.user.grant('guest', 'super')
> s = box.schema.create_space('test', {is_sync = true})
> _ = s:create_index('pk')
>
> -- Step 4
> fiber = require('fiber')
> f = fiber.new(function() s:replace{1} end)
>
> -- Step 8
> box.cfg{replication_synchro_quorum = 3}
> while f:status() ~= 'dead' do fiber.sleep(0.1) end
>
>
>
> --
> -- Instance 2
> --
> -- Step 2
> box.cfg{
>      listen = 3314,
>      replication = {'localhost:3313', 'localhost:3314', 'localhost:3315'},
>      read_only = true,
> }
>
> -- Step 5
> -- Wait vclock.
>
> -- Step 7
> box.error.injection.set("ERRINJ_WAL_DELAY", true)
>
> -- Step 10
> box.error.injection.set("ERRINJ_WAL_DELAY", false)
>
>
>
> --
> -- Instance 3
> --
> -- Step 3
> box.cfg{
>      listen = 3315,
>      replication = {'localhost:3313', 'localhost:3314', 'localhost:3315'},
>      read_only = true,
> }
>
> -- Step 6
> -- Wait vclock.
>
> -- Step 9
> box.cfg{read_only = false}
> s = box.space.test
> box.cfg{
>      replication_synchro_quorum = 2,
>      replication_synchro_timeout = 1000,
> }
> s:replace{2}
>
> ## The patch
>
> Here is a full diff of the patch. It includes the test above as a
> new .test.lua file, and all review fixes.
>
> ====================
>
>      applier: process synchro rows after WAL write
>      
>      Applier used to process synchronous rows CONFIRM and ROLLBACK
>      right after receipt before they are written to WAL.
>      
>      That led to a bug that the confirmed data became visible, might be
>      accessed by user requests, then the node restarted before CONFIRM
>      finished its WAL write, and the data was not visible again. That
>      is just like if it would be rolled back, which is not acceptable.
>      
>      Another case - CONFIRM WAL write could simply fail due to any
>      reason (no disk space, OOM), but the transactions would remain
>      confirmed anyway.
>      
>      Also that produced some hacks in the limbo's code to support the
>      confirmation and rollback of transactions not yet written to WAL.
>      
>      The patch makes the synchro rows processed only after they are
>      written to WAL. Although the 'rollback' case above might still
>      happen if the xlogs were in the kernel caches, and the machine was
>      powered off before they were flushed to disk. But that is not
>      related to qsync specifically.
>      
>      To handle the synchro rows after WAL write the patch makes them go
>      to WAL in a blocking way (journal_write() instead of
>      journal_write_try_async()). Otherwise it could happen that a
>      CONFIRM/ROLLBACK is being written to WAL and would clear the limbo
>      afterwards, but a new transaction arrives with a different owner,
>      and it conflicts with the current limbo owner.
>      
>      Closes #5213
>
> diff --git a/changelogs/unreleased/applier-qsync-write-order.md b/changelogs/unreleased/applier-qsync-write-order.md
> new file mode 100644
> index 000000000..10b9ade80
> --- /dev/null
> +++ b/changelogs/unreleased/applier-qsync-write-order.md
> @@ -0,0 +1,6 @@
> +## bugfix/replication
> +
> +* Fix the bug when a synchronous transaction could be confirmed and visible on
> +  a replica, but then not confirmed / invisible again after restart. Could
> +  happen more likely on memtx spaces with `memtx_use_mvcc_engine` enabled
> +  (gh-5213).
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index 971b2e64c..4898f9f7b 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -762,26 +762,21 @@ applier_txn_wal_write_cb(struct trigger *trigger, void *event)
>   }
>   
>   struct synchro_entry {
> -	/** Encoded form of a synchro record. */
> -	struct synchro_body_bin	body_bin;
> -
> -	/** xrow to write, used by the journal engine. */
> -	struct xrow_header row;
> -
> +	/** Request to process when WAL write is done. */
> +	struct synchro_request *req;
> +	/** Fiber created the entry. To wakeup when WAL write is done. */
> +	struct fiber *owner;
>   	/**
> -	 * The journal entry itself. Note since
> -	 * it has unsized array it must be the
> -	 * last entry in the structure.
> +	 * The base journal entry. It has unsized array and then must be the
> +	 * last entry in the structure. But can workaround it via a union
> +	 * adding the needed tail as char[].
>   	 */
> -	struct journal_entry journal_entry;
> +	union {
> +		struct journal_entry base;
> +		char base_buf[sizeof(base) + sizeof(base.rows[0])];
> +	};
>   };
>   
> -static void
> -synchro_entry_delete(struct synchro_entry *entry)
> -{
> -	free(entry);
> -}
> -
>   /**
>    * Async write journal completion.
>    */
> @@ -791,50 +786,15 @@ apply_synchro_row_cb(struct journal_entry *entry)
>   	assert(entry->complete_data != NULL);
>   	struct synchro_entry *synchro_entry =
>   		(struct synchro_entry *)entry->complete_data;
> -	if (entry->res < 0)
> +	if (entry->res < 0) {
>   		applier_rollback_by_wal_io();
> -	else
> +	} else {
> +		txn_limbo_process(&txn_limbo, synchro_entry->req);
>   		trigger_run(&replicaset.applier.on_wal_write, NULL);
> -
> -	synchro_entry_delete(synchro_entry);
> -}
> -
> -/**
> - * Allocate a new synchro_entry to be passed to
> - * the journal engine in async write way.
> - */
> -static struct synchro_entry *
> -synchro_entry_new(struct xrow_header *applier_row,
> -		  struct synchro_request *req)
> -{
> -	struct synchro_entry *entry;
> -	size_t size = sizeof(*entry) + sizeof(struct xrow_header *);
> -
> -	/*
> -	 * For simplicity we use malloc here but
> -	 * probably should provide some cache similar
> -	 * to txn cache.
> -	 */
> -	entry = (struct synchro_entry *)malloc(size);
> -	if (entry == NULL) {
> -		diag_set(OutOfMemory, size, "malloc", "synchro_entry");
> -		return NULL;
>   	}
> -
> -	struct journal_entry *journal_entry = &entry->journal_entry;
> -	struct synchro_body_bin *body_bin = &entry->body_bin;
> -	struct xrow_header *row = &entry->row;
> -
> -	journal_entry->rows[0] = row;
> -
> -	xrow_encode_synchro(row, body_bin, req);
> -
> -	row->lsn = applier_row->lsn;
> -	row->replica_id = applier_row->replica_id;
> -
> -	journal_entry_create(journal_entry, 1, xrow_approx_len(row),
> -			     apply_synchro_row_cb, entry);
> -	return entry;
> +	/* The fiber is the same on final join. */
> +	if (synchro_entry->owner != fiber())
> +		fiber_wakeup(synchro_entry->owner);
>   }
>   
>   /** Process a synchro request. */
> @@ -847,14 +807,37 @@ apply_synchro_row(struct xrow_header *row)
>   	if (xrow_decode_synchro(row, &req) != 0)
>   		goto err;
>   
> -	txn_limbo_process(&txn_limbo, &req);
> -
> -	struct synchro_entry *entry;
> -	entry = synchro_entry_new(row, &req);
> -	if (entry == NULL)
> -		goto err;
> -
> -	if (journal_write_try_async(&entry->journal_entry) != 0) {
> +	struct synchro_entry entry;
> +	/*
> +	 * Rows array is cast from *[] to **, because otherwise g++ complains
> +	 * about out of array bounds access.
> +	 */
> +	struct xrow_header **rows;
> +	rows = entry.base.rows;
> +	rows[0] = row;
> ====================
> This is how I fixed the compilation issue. I couldn't reproduce it
> locally in my GCC installed via brew, but it was failing in CI in
> one job. A simple cast didn't help. Only creation of a new variable
> did help. Also it is not assigned right away - the declaration is
> separate. Otherwise 'goto's before that line do not work.
> ====================
> +	journal_entry_create(&entry.base, 1, xrow_approx_len(row),
> +			     apply_synchro_row_cb, &entry);
> +	entry.req = &req;
> +	entry.owner = fiber();
> +	/*
> +	 * The WAL write is blocking. Otherwise it might happen that a CONFIRM
> +	 * or ROLLBACK is sent to WAL, and it would empty the limbo, but before
> +	 * it is written, more transactions arrive with a different owner. They
> +	 * won't be able to enter the limbo due to owner ID mismatch. Hence the
> +	 * synchro rows must block receipt of new transactions.
> +	 *
> +	 * Don't forget to return -1 both if the journal write failed right
> +	 * away, and if it failed inside of WAL thread (res < 0). Otherwise the
> +	 * caller would propagate committed vclock to this row thinking it was
> +	 * a success.
> +	 *
> +	 * XXX: in theory it could be done vice-versa. The write could be made
> +	 * non-blocking, and instead the potentially conflicting transactions
> +	 * could try to wait for all the current synchro WAL writes to end
> +	 * before trying to commit. But that requires extra steps from the
> +	 * transactions side, including the async ones.
> +	 */
> +	if (journal_write(&entry.base) != 0 || entry.base.res < 0) {
>   		diag_set(ClientError, ER_WAL_IO);
>   		goto err;
>   	}
> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
> index cf0ad9350..c63211191 100644
> --- a/src/box/txn_limbo.c
> +++ b/src/box/txn_limbo.c
> @@ -400,20 +400,11 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
>   		txn_limbo_remove(limbo, e);
>   		txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK);
>   		/*
> -		 * If already written to WAL by now, finish tx processing.
> -		 * Otherwise just clear the sync flags. Tx procesing will finish
> -		 * automatically once the tx is written to WAL.
> -		 *
> -		 * XXX: Normally at this point all transactions covered by this
> -		 * CONFIRM should be in WAL already, but there is a bug, that
> -		 * replica always processes received synchro requests *before*
> -		 * writing them to WAL. So it can happen, that a CONFIRM is
> -		 * 'read', but the transaction is not written yet. Should be
> -		 * fixed when the replica will behave properly, and then this
> -		 * branch won't exist.
> +		 * Should be written to WAL by now. Confirm is always written
> +		 * after the affected transactions.
>   		 */
> -		if (e->txn->signature >= 0)
> -			txn_complete_success(e->txn);
> +		assert(e->txn->signature >= 0);
> +		txn_complete_success(e->txn);
>   	}
>   	/* Update is_ro once the limbo is clear. */
>   	if (txn_limbo_is_empty(limbo))
> @@ -455,30 +446,13 @@ txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
>   	rlist_foreach_entry_safe_reverse(e, &limbo->queue, in_queue, tmp) {
>   		txn_limbo_abort(limbo, e);
>   		txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK);
> -		if (e->txn->signature >= 0) {
> -			/* Rollback the transaction. */
> -			e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK;
> -			txn_complete_fail(e->txn);
> -		} else {
> -			/*
> -			 * Rollback the transaction, but don't free it yet. It
> -			 * will be freed after its WAL write is completed.
> -			 *
> -			 * XXX: Normally at this point all transactions covered
> -			 * by this ROLLBACK should be in WAL already, but there
> -			 * is a bug, that replica always processes received
> -			 * synchro requests *before* writing them to WAL. So it
> -			 * can happen, that a ROLLBACK is 'read', but the
> -			 * transaction is not written yet. Should be fixed when
> -			 * the replica will behave properly, and then this
> -			 * branch won't exist.
> -			 */
> -			e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK;
> -			struct fiber *fiber = e->txn->fiber;
> -			e->txn->fiber = fiber();
> -			txn_complete_fail(e->txn);
> -			e->txn->fiber = fiber;
> -		}
> +		/*
> +		 * Should be written to WAL by now. Rollback is always written
> +		 * after the affected transactions.
> +		 */
> +		assert(e->txn->signature >= 0);
> +		e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK;
> +		txn_complete_fail(e->txn);
>   		if (e == last_rollback)
>   			break;
>   	}
> diff --git a/test/replication/gh-5213-qsync-applier-order-3.result b/test/replication/gh-5213-qsync-applier-order-3.result
> new file mode 100644
> index 000000000..bcb18b5c0
> --- /dev/null
> +++ b/test/replication/gh-5213-qsync-applier-order-3.result
> @@ -0,0 +1,263 @@
> +-- test-run result file version 2
> +--
> +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
> +-- As a result it could happen that the transactions became visible on CONFIRM,
> +-- then somehow weren't written to WAL, and on restart the data might not be
> +-- visible again. That means a rollback of confirmed data, which is not
> +-- acceptable (whereas a commit after a rollback would be tolerable).
> +--
> +-- To fix that, synchro rows are now processed only after their WAL write.
> +-- As a result, the following situation could happen. Instance 1 owns the limbo,
> +-- instances 2 and 3 know about that. The limbo is not empty. Now instance 1
> +-- writes CONFIRM and sends it to the instances 2 and 3. Both start a WAL write
> +-- for the CONFIRM. Now instance 3 finishes it, creates a new synchro
> +-- transaction, owns the local limbo, and sends the transaction to the instance
> +-- 2. Here the CONFIRM WAL write is not done yet. It should not happen, that
> +-- the instance 3's new transaction is rejected. Because firstly instance 3 will
> +-- send the same instance 1's CONFIRM to the instance 2 due to it being earlier
> +-- in WAL of instance 3. Then on instance 2 it will block on a latch with
> +-- replica_id 1 until the original CONFIRM received from the instance 1 is
> +-- finished. Afterwards the instance 3's transaction is applied just fine - the
> +-- limbo on instance 2 is empty now.
> +--
> +-- It is not related to read-views, but could break the replication.
> +--
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +fiber = require('fiber')
> + | ---
> + | ...
> +old_synchro_quorum = box.cfg.replication_synchro_quorum
> + | ---
> + | ...
> +old_synchro_timeout = box.cfg.replication_synchro_timeout
> + | ---
> + | ...
> +
> +box.schema.user.grant('guest', 'super')
> + | ---
> + | ...
> +
> +s = box.schema.space.create('test', {is_sync = true})
> + | ---
> + | ...
> +_ = s:create_index('pk')
> + | ---
> + | ...
> +
> +test_run:cmd('create server replica1 with rpl_master=default,\
> +              script="replication/replica1.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server replica1')
> + | ---
> + | - true
> + | ...
> +
> +test_run:cmd('create server replica2 with rpl_master=default,\
> +              script="replication/replica2.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server replica2')
> + | ---
> + | - true
> + | ...
> +
> +-- Build mutual replication between replica1 and replica2.
> +test_run:switch('replica1')
> + | ---
> + | - true
> + | ...
> +replication = box.cfg.replication
> + | ---
> + | ...
> +table.insert(replication, test_run:eval('replica2', 'return box.cfg.listen')[1])
> + | ---
> + | ...
> +box.cfg{replication = {}}
> + | ---
> + | ...
> +box.cfg{replication = replication}
> + | ---
> + | ...
> +
> +test_run:switch('replica2')
> + | ---
> + | - true
> + | ...
> +replication = box.cfg.replication
> + | ---
> + | ...
> +table.insert(replication, test_run:eval('replica1', 'return box.cfg.listen')[1])
> + | ---
> + | ...
> +box.cfg{replication = {}}
> + | ---
> + | ...
> +box.cfg{replication = replication}
> + | ---
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +fiber = require('fiber')
> + | ---
> + | ...
> +box.cfg{                                                                        \
> +    replication_synchro_quorum = 4,                                             \
> +    replication_synchro_timeout = 1000,                                         \
> +}
> + | ---
> + | ...
> +-- Send a transaction to all 3 nodes. The limbo is owned by the default node
> +-- everywhere.
> +f = fiber.new(function() s:replace{1} end)
> + | ---
> + | ...
> +test_run:wait_lsn('replica1', 'default')
> + | ---
> + | ...
> +test_run:wait_lsn('replica2', 'default')
> + | ---
> + | ...
> +
> +-- Make replica1 take a long time to apply the CONFIRM from the default
> +-- instance.
> +test_run:switch('replica1')
> + | ---
> + | - true
> + | ...
> +box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
> + | ---
> + | - ok
> + | ...
> +
> +-- Emit the CONFIRM.
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +box.cfg{replication_synchro_quorum = 3}
> + | ---
> + | ...
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> + | ---
> + | - true
> + | ...
> +
> +-- It hangs on the replica1.
> +test_run:switch('replica1')
> + | ---
> + | - true
> + | ...
> +test_run:wait_cond(function()                                                   \
> +    return box.error.injection.get('ERRINJ_WAL_DELAY')                          \
> +end)
> + | ---
> + | - true
> + | ...
> +
> +-- But is applied on the replica2. The limbo is empty here now.
> +test_run:switch('replica2')
> + | ---
> + | - true
> + | ...
> +test_run:wait_lsn('replica2', 'default')
> + | ---
> + | ...
> +box.cfg{                                                                        \
> +    replication_synchro_quorum = 2,                                             \
> +    replication_synchro_timeout = 1000,                                         \
> +}
> + | ---
> + | ...
> +-- Replica2 takes the limbo ownership and sends the transaction to the replica1.
> +-- Along with the CONFIRM from the default node, which is still not applied
> +-- on the replica1.
> +fiber = require('fiber')
> + | ---
> + | ...
> +f = fiber.new(function() box.space.test:replace{2} end)
> + | ---
> + | ...
> +
> +test_run:switch('replica1')
> + | ---
> + | - true
> + | ...
> +fiber = require('fiber')
> + | ---
> + | ...
> +-- WAL write of the CONFIRM from the default node still is not done. Give it
> +-- some time to get the new rows from the replica2 and block on the latch.
> +-- Can't catch it anyhow via conds, so the only way is to sleep a bit.
> +fiber.sleep(0.1)
> + | ---
> + | ...
> +-- Let the WAL writes finish. Firstly CONFIRM is finished, the limbo is emptied
> +-- and the replica_id 1 latch is unlocked. Now the replica2 transaction is
> +-- applied and persisted.
> +box.error.injection.set('ERRINJ_WAL_DELAY', false)
> + | ---
> + | - ok
> + | ...
> +test_run:wait_lsn('replica1', 'replica2')
> + | ---
> + | ...
> +box.space.test:get({2})
> + | ---
> + | - [2]
> + | ...
> +
> +-- Ensure the replication works fine, nothing is broken.
> +test_run:wait_upstream(test_run:get_server_id('replica2'), {status = 'follow'})
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('replica2')
> + | ---
> + | - true
> + | ...
> +test_run:wait_upstream(test_run:get_server_id('replica1'), {status = 'follow'})
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server replica1')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server replica1')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server replica2')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server replica2')
> + | ---
> + | - true
> + | ...
> +s:drop()
> + | ---
> + | ...
> +box.schema.user.revoke('guest', 'super')
> + | ---
> + | ...
> +box.cfg{                                                                        \
> +    replication_synchro_quorum = old_synchro_quorum,                            \
> +    replication_synchro_timeout = old_synchro_timeout,                          \
> +}
> + | ---
> + | ...
> diff --git a/test/replication/gh-5213-qsync-applier-order-3.test.lua b/test/replication/gh-5213-qsync-applier-order-3.test.lua
> new file mode 100644
> index 000000000..37b569da7
> --- /dev/null
> +++ b/test/replication/gh-5213-qsync-applier-order-3.test.lua
> @@ -0,0 +1,125 @@
> +--
> +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
> +-- As a result it could happen that the transactions became visible on CONFIRM,
> +-- then somehow weren't written to WAL, and on restart the data might not be
> +-- visible again. That means a rollback of confirmed data, which is not
> +-- acceptable (whereas a commit after a rollback would be tolerable).
> +--
> +-- To fix that, synchro rows are now processed only after their WAL write.
> +-- As a result, the following situation could happen. Instance 1 owns the limbo,
> +-- instances 2 and 3 know about that. The limbo is not empty. Now instance 1
> +-- writes CONFIRM and sends it to the instances 2 and 3. Both start a WAL write
> +-- for the CONFIRM. Now instance 3 finishes it, creates a new synchro
> +-- transaction, owns the local limbo, and sends the transaction to the instance
> +-- 2. Here the CONFIRM WAL write is not done yet. It should not happen, that
> +-- the instance 3's new transaction is rejected. Because firstly instance 3 will
> +-- send the same instance 1's CONFIRM to the instance 2 due to it being earlier
> +-- in WAL of instance 3. Then on instance 2 it will block on a latch with
> +-- replica_id 1 until the original CONFIRM received from the instance 1 is
> +-- finished. Afterwards the instance 3's transaction is applied just fine - the
> +-- limbo on instance 2 is empty now.
> +--
> +-- It is not related to read-views, but could break the replication.
> +--
> +test_run = require('test_run').new()
> +fiber = require('fiber')
> +old_synchro_quorum = box.cfg.replication_synchro_quorum
> +old_synchro_timeout = box.cfg.replication_synchro_timeout
> +
> +box.schema.user.grant('guest', 'super')
> +
> +s = box.schema.space.create('test', {is_sync = true})
> +_ = s:create_index('pk')
> +
> +test_run:cmd('create server replica1 with rpl_master=default,\
> +              script="replication/replica1.lua"')
> +test_run:cmd('start server replica1')
> +
> +test_run:cmd('create server replica2 with rpl_master=default,\
> +              script="replication/replica2.lua"')
> +test_run:cmd('start server replica2')
> +
> +-- Build mutual replication between replica1 and replica2.
> +test_run:switch('replica1')
> +replication = box.cfg.replication
> +table.insert(replication, test_run:eval('replica2', 'return box.cfg.listen')[1])
> +box.cfg{replication = {}}
> +box.cfg{replication = replication}
> +
> +test_run:switch('replica2')
> +replication = box.cfg.replication
> +table.insert(replication, test_run:eval('replica1', 'return box.cfg.listen')[1])
> +box.cfg{replication = {}}
> +box.cfg{replication = replication}
> +
> +test_run:switch('default')
> +fiber = require('fiber')
> +box.cfg{                                                                        \
> +    replication_synchro_quorum = 4,                                             \
> +    replication_synchro_timeout = 1000,                                         \
> +}
> +-- Send a transaction to all 3 nodes. The limbo is owned by the default node
> +-- everywhere.
> +f = fiber.new(function() s:replace{1} end)
> +test_run:wait_lsn('replica1', 'default')
> +test_run:wait_lsn('replica2', 'default')
> +
> +-- Make replica1 take a long time to apply the CONFIRM from the default
> +-- instance.
> +test_run:switch('replica1')
> +box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
> +
> +-- Emit the CONFIRM.
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 3}
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> +
> +-- It hangs on the replica1.
> +test_run:switch('replica1')
> +test_run:wait_cond(function()                                                   \
> +    return box.error.injection.get('ERRINJ_WAL_DELAY')                          \
> +end)
> +
> +-- But is applied on the replica2. The limbo is empty here now.
> +test_run:switch('replica2')
> +test_run:wait_lsn('replica2', 'default')
> +box.cfg{                                                                        \
> +    replication_synchro_quorum = 2,                                             \
> +    replication_synchro_timeout = 1000,                                         \
> +}
> +-- Replica2 takes the limbo ownership and sends the transaction to the replica1.
> +-- Along with the CONFIRM from the default node, which is still not applied
> +-- on the replica1.
> +fiber = require('fiber')
> +f = fiber.new(function() box.space.test:replace{2} end)
> +
> +test_run:switch('replica1')
> +fiber = require('fiber')
> +-- WAL write of the CONFIRM from the default node still is not done. Give it
> +-- some time to get the new rows from the replica2 and block on the latch.
> +-- Can't catch it anyhow via conds, so the only way is to sleep a bit.
> +fiber.sleep(0.1)
> +-- Let the WAL writes finish. Firstly CONFIRM is finished, the limbo is emptied
> +-- and the replica_id 1 latch is unlocked. Now the replica2 transaction is
> +-- applied and persisted.
> +box.error.injection.set('ERRINJ_WAL_DELAY', false)
> +test_run:wait_lsn('replica1', 'replica2')
> +box.space.test:get({2})
> +
> +-- Ensure the replication works fine, nothing is broken.
> +test_run:wait_upstream(test_run:get_server_id('replica2'), {status = 'follow'})
> +
> +test_run:switch('replica2')
> +test_run:wait_upstream(test_run:get_server_id('replica1'), {status = 'follow'})
> +
> +test_run:switch('default')
> +test_run:cmd('stop server replica1')
> +test_run:cmd('delete server replica1')
> +test_run:cmd('stop server replica2')
> +test_run:cmd('delete server replica2')
> +s:drop()
> +box.schema.user.revoke('guest', 'super')
> +box.cfg{                                                                        \
> +    replication_synchro_quorum = old_synchro_quorum,                            \
> +    replication_synchro_timeout = old_synchro_timeout,                          \
> +}
> diff --git a/test/replication/gh-5213-qsync-applier-order.result b/test/replication/gh-5213-qsync-applier-order.result
> new file mode 100644
> index 000000000..a8c24c289
> --- /dev/null
> +++ b/test/replication/gh-5213-qsync-applier-order.result
> @@ -0,0 +1,302 @@
> +-- test-run result file version 2
> +--
> +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
> +-- As a result it could happen that the transactions became visible on CONFIRM,
> +-- then somehow weren't written to WAL, and on restart the data might not be
> +-- visible again. That means a rollback of confirmed data, which is not
> +-- acceptable (whereas a commit after a rollback would be tolerable).
> +--
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +fiber = require('fiber')
> + | ---
> + | ...
> +old_synchro_quorum = box.cfg.replication_synchro_quorum
> + | ---
> + | ...
> +old_synchro_timeout = box.cfg.replication_synchro_timeout
> + | ---
> + | ...
> +
> +box.schema.user.grant('guest', 'super')
> + | ---
> + | ...
> +
> +s = box.schema.space.create('test', {is_sync = true})
> + | ---
> + | ...
> +_ = s:create_index('pk')
> + | ---
> + | ...
> +
> +test_run:cmd('create server replica with rpl_master=default,\
> +              script="replication/gh-5213-replica.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server replica')
> + | ---
> + | - true
> + | ...
> +
> +fiber = require('fiber')
> + | ---
> + | ...
> +box.cfg{                                                                        \
> +    replication_synchro_quorum = 3,                                             \
> +    replication_synchro_timeout = 1000,                                         \
> +}
> + | ---
> + | ...
> +f = fiber.new(function() s:replace{1} end)
> + | ---
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +-- Wait when the transaction is written to WAL.
> +test_run:wait_lsn('replica', 'default')
> + | ---
> + | ...
> +s = box.space.test
> + | ---
> + | ...
> +-- But not visible yet. Because MVCC is on, and no CONFIRM yet.
> +assert(s:get({1}) == nil)
> + | ---
> + | - true
> + | ...
> +-- Block the incoming CONFIRM to be able to ensure the data is not visible until
> +-- WAL write ends.
> +box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
> + | ---
> + | - ok
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +box.cfg{replication_synchro_quorum = 2}
> + | ---
> + | ...
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +-- Wait when CONFIRM is in the WAL thread.
> +test_run:wait_cond(function()                                                   \
> +    return box.error.injection.get('ERRINJ_WAL_DELAY')                          \
> +end)
> + | ---
> + | - true
> + | ...
> +assert(s:get({1}) == nil)
> + | ---
> + | - true
> + | ...
> +box.error.injection.set('ERRINJ_WAL_DELAY', false)
> + | ---
> + | - ok
> + | ...
> +-- After CONFIRM is in WAL, the transaction is committed and its data is
> +-- visible.
> +test_run:wait_cond(function() return s:get({1}) ~= nil end)
> + | ---
> + | - true
> + | ...
> +
> +--
> +-- Ensure a CONFIRM WAL write failure is also handled fine when the WAL write
> +-- couldn't even be started. The data remains unconfirmed until a WAL write
> +-- succeeds.
> +--
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +box.cfg{replication_synchro_quorum = 3}
> + | ---
> + | ...
> +f = fiber.new(function() s:replace{2} end)
> + | ---
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +test_run:wait_lsn('replica', 'default')
> + | ---
> + | ...
> +assert(s:get({2}) == nil)
> + | ---
> + | - true
> + | ...
> +-- Make journal write fail immediately.
> +box.error.injection.set('ERRINJ_WAL_IO', true)
> + | ---
> + | - ok
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +box.cfg{replication_synchro_quorum = 2}
> + | ---
> + | ...
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +test_run:wait_upstream(1, {status = 'stopped'})
> + | ---
> + | - true
> + | ...
> +box.error.injection.set('ERRINJ_WAL_IO', false)
> + | ---
> + | - ok
> + | ...
> +assert(s:get({2}) == nil)
> + | ---
> + | - true
> + | ...
> +-- Re-subscribe.
> +replication = box.cfg.replication
> + | ---
> + | ...
> +box.cfg{replication = {}}
> + | ---
> + | ...
> +box.cfg{replication = replication}
> + | ---
> + | ...
> +test_run:wait_lsn('replica', 'default')
> + | ---
> + | ...
> +test_run:wait_upstream(1, {status = 'follow'})
> + | ---
> + | - true
> + | ...
> +assert(s:get({2}) ~= nil)
> + | ---
> + | - true
> + | ...
> +
> +--
> +-- Ensure a CONFIRM WAL write failure is handled fine when an error happens
> +-- inside the WAL thread. The data remains unconfirmed until a WAL write
> +-- succeeds.
> +--
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +box.cfg{replication_synchro_quorum = 3}
> + | ---
> + | ...
> +f = fiber.new(function() s:replace{3} end)
> + | ---
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +test_run:wait_lsn('replica', 'default')
> + | ---
> + | ...
> +assert(s:get({3}) == nil)
> + | ---
> + | - true
> + | ...
> +-- Journal write start is going to succeed, but it will fail later on return
> +-- from the WAL thread.
> +box.error.injection.set('ERRINJ_WAL_ROTATE', true)
> + | ---
> + | - ok
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +box.cfg{replication_synchro_quorum = 2}
> + | ---
> + | ...
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +test_run:wait_upstream(1, {status = 'stopped'})
> + | ---
> + | - true
> + | ...
> +box.error.injection.set('ERRINJ_WAL_ROTATE', false)
> + | ---
> + | - ok
> + | ...
> +assert(s:get({3}) == nil)
> + | ---
> + | - true
> + | ...
> +box.cfg{replication = {}}
> + | ---
> + | ...
> +box.cfg{replication = replication}
> + | ---
> + | ...
> +test_run:wait_lsn('replica', 'default')
> + | ---
> + | ...
> +test_run:wait_upstream(1, {status = 'follow'})
> + | ---
> + | - true
> + | ...
> +assert(s:get({3}) ~= nil)
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server replica')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server replica')
> + | ---
> + | - true
> + | ...
> +
> +s:drop()
> + | ---
> + | ...
> +box.schema.user.revoke('guest', 'super')
> + | ---
> + | ...
> +box.cfg{                                                                        \
> +    replication_synchro_quorum = old_synchro_quorum,                            \
> +    replication_synchro_timeout = old_synchro_timeout,                          \
> +}
> + | ---
> + | ...
> diff --git a/test/replication/gh-5213-qsync-applier-order.test.lua b/test/replication/gh-5213-qsync-applier-order.test.lua
> new file mode 100644
> index 000000000..f1eccfa84
> --- /dev/null
> +++ b/test/replication/gh-5213-qsync-applier-order.test.lua
> @@ -0,0 +1,122 @@
> +--
> +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
> +-- As a result it could happen that the transactions became visible on CONFIRM,
> +-- then somehow weren't written to WAL, and on restart the data might not be
> +-- visible again. That means a rollback of confirmed data, which is not
> +-- acceptable (whereas a commit after a rollback would be tolerable).
> +--
> +test_run = require('test_run').new()
> +fiber = require('fiber')
> +old_synchro_quorum = box.cfg.replication_synchro_quorum
> +old_synchro_timeout = box.cfg.replication_synchro_timeout
> +
> +box.schema.user.grant('guest', 'super')
> +
> +s = box.schema.space.create('test', {is_sync = true})
> +_ = s:create_index('pk')
> +
> +test_run:cmd('create server replica with rpl_master=default,\
> +              script="replication/gh-5213-replica.lua"')
> +test_run:cmd('start server replica')
> +
> +fiber = require('fiber')
> +box.cfg{                                                                        \
> +    replication_synchro_quorum = 3,                                             \
> +    replication_synchro_timeout = 1000,                                         \
> +}
> +f = fiber.new(function() s:replace{1} end)
> +
> +test_run:switch('replica')
> +-- Wait when the transaction is written to WAL.
> +test_run:wait_lsn('replica', 'default')
> +s = box.space.test
> +-- But not visible yet. Because MVCC is on, and no CONFIRM yet.
> +assert(s:get({1}) == nil)
> +-- Block the incoming CONFIRM to be able to ensure the data is not visible until
> +-- WAL write ends.
> +box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
> +
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 2}
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> +
> +test_run:switch('replica')
> +-- Wait when CONFIRM is in the WAL thread.
> +test_run:wait_cond(function()                                                   \
> +    return box.error.injection.get('ERRINJ_WAL_DELAY')                          \
> +end)
> +assert(s:get({1}) == nil)
> +box.error.injection.set('ERRINJ_WAL_DELAY', false)
> +-- After CONFIRM is in WAL, the transaction is committed and its data is
> +-- visible.
> +test_run:wait_cond(function() return s:get({1}) ~= nil end)
> +
> +--
> +-- Ensure a CONFIRM WAL write failure is also handled fine when the WAL write
> +-- couldn't even be started. The data remains unconfirmed until a WAL write
> +-- succeeds.
> +--
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 3}
> +f = fiber.new(function() s:replace{2} end)
> +
> +test_run:switch('replica')
> +test_run:wait_lsn('replica', 'default')
> +assert(s:get({2}) == nil)
> +-- Make journal write fail immediately.
> +box.error.injection.set('ERRINJ_WAL_IO', true)
> +
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 2}
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> +
> +test_run:switch('replica')
> +test_run:wait_upstream(1, {status = 'stopped'})
> +box.error.injection.set('ERRINJ_WAL_IO', false)
> +assert(s:get({2}) == nil)
> +-- Re-subscribe.
> +replication = box.cfg.replication
> +box.cfg{replication = {}}
> +box.cfg{replication = replication}
> +test_run:wait_lsn('replica', 'default')
> +test_run:wait_upstream(1, {status = 'follow'})
> +assert(s:get({2}) ~= nil)
> +
> +--
> +-- Ensure a CONFIRM WAL write failure is handled fine when an error happens
> +-- inside the WAL thread. The data remains unconfirmed until a WAL write
> +-- succeeds.
> +--
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 3}
> +f = fiber.new(function() s:replace{3} end)
> +
> +test_run:switch('replica')
> +test_run:wait_lsn('replica', 'default')
> +assert(s:get({3}) == nil)
> +-- Journal write start is going to succeed, but it will fail later on return
> +-- from the WAL thread.
> +box.error.injection.set('ERRINJ_WAL_ROTATE', true)
> +
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 2}
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> +
> +test_run:switch('replica')
> +test_run:wait_upstream(1, {status = 'stopped'})
> +box.error.injection.set('ERRINJ_WAL_ROTATE', false)
> +assert(s:get({3}) == nil)
> +box.cfg{replication = {}}
> +box.cfg{replication = replication}
> +test_run:wait_lsn('replica', 'default')
> +test_run:wait_upstream(1, {status = 'follow'})
> +assert(s:get({3}) ~= nil)
> +
> +test_run:switch('default')
> +test_run:cmd('stop server replica')
> +test_run:cmd('delete server replica')
> +
> +s:drop()
> +box.schema.user.revoke('guest', 'super')
> +box.cfg{                                                                        \
> +    replication_synchro_quorum = old_synchro_quorum,                            \
> +    replication_synchro_timeout = old_synchro_timeout,                          \
> +}
> diff --git a/test/replication/gh-5213-replica.lua b/test/replication/gh-5213-replica.lua
> new file mode 100644
> index 000000000..3ba90d7ab
> --- /dev/null
> +++ b/test/replication/gh-5213-replica.lua
> @@ -0,0 +1,10 @@
> +#!/usr/bin/env tarantool
> +
> +box.cfg({
> +    listen = os.getenv("LISTEN"),
> +    replication = os.getenv("MASTER"),
> +    read_only = true,
> +    memtx_use_mvcc_engine = true,
> +})
> +
> +require('console').listen(os.getenv('ADMIN'))
> diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
> index aff5fda26..4a9ca0a46 100644
> --- a/test/replication/suite.cfg
> +++ b/test/replication/suite.cfg
> @@ -15,6 +15,8 @@
>       "gh-3760-misc-return-on-quorum-0.test.lua": {},
>       "gh-4399-misc-no-failure-on-error-reading-wal.test.lua": {},
>       "gh-4424-misc-orphan-on-reconfiguration-error.test.lua": {},
> +    "gh-5213-qsync-applier-order.test.lua": {},
> +    "gh-5213-qsync-applier-order-3.test.lua": {},
>       "gh-5426-election-on-off.test.lua": {},
>       "gh-5433-election-restart-recovery.test.lua": {},
>       "gh-5506-election-on-off.test.lua": {},
> diff --git a/test/replication/suite.ini b/test/replication/suite.ini
> index a9e44e8cf..1d9c0a4ae 100644
> --- a/test/replication/suite.ini
> +++ b/test/replication/suite.ini
> @@ -3,7 +3,7 @@ core = tarantool
>   script =  master.lua
>   description = tarantool/box, replication
>   disabled = consistent.test.lua
> -release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua
> +release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua
>   config = suite.cfg
>   lua_libs = lua/fast_replica.lua lua/rlimit.lua
>   use_unix_sockets = True

-- 
Serge Petrenko



* Re: [Tarantool-patches] [PATCH 1/1] applier: process synchro rows after WAL write
  2021-04-08 22:56   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-09  8:26     ` Serge Petrenko via Tarantool-patches
  0 siblings, 0 replies; 10+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-09  8:26 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov



09.04.2021 01:56, Vladislav Shpilevoy пишет:
> Hi! Thanks for the review!
>
>> Thanks for the patch!
>>
>> I'm a bit worried about two different synchro rows coming from two
>> appliers. Is everything ok in this case?
>> Or even normal rows coming from other appliers. Say some other replica
>> has already applied this synchro row and even has written some rows on
>> top of it. Its applier won't block on the replica_id latch, and may fail to
>> apply some txs following this synchro row, because it's not yet written to
>> WAL and thus not applied (the limbo is still not empty or belongs to another
>> instance).
> Should not happen, if I understood you correctly. See my response in the
> other email + a test for test-run + a console test.

Yes, it's fine, thanks for the explanation & the test!

>
>>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>>> index 971b2e64c..f7c82548c 100644
>>> --- a/src/box/applier.cc
>>> +++ b/src/box/applier.cc
>>> @@ -762,26 +762,21 @@ applier_txn_wal_write_cb(struct trigger *trigger, void *event)
>>>    }
>>>      struct synchro_entry {
>>> -    /** Encoded form of a synchro record. */
>>> -    struct synchro_body_bin    body_bin;
>>> -
>>> -    /** xrow to write, used by the journal engine. */
>>> -    struct xrow_header row;
>>> -
>>> +    /** Request to process when WAL write is done. */
>>> +    struct synchro_request *req;
>>> +    /** Fiber created the entry. To wakeup when WAL write is done. */
>>> +    struct fiber *owner;
>>>        /**
>>> -     * The journal entry itself. Note since
>>> -     * it has unsized array it must be the
>>> -     * last entry in the structure.
>>> +     * The base journal entry. It has unsized array and then must be the
>>> +     * last entry in the structure. But can workaround it via a union
>>> +     * adding the needed tail as char[].
>>>         */
>>> -    struct journal_entry journal_entry;
>>> +    union {
>>> +        struct journal_entry base;
>>> +        char base_buf[sizeof(base) + sizeof(base.rows[0])];
>>> +    };
>>>    };
>> I don't understand this union stuff.
>> The journal_entry is the last entry in synchro_entry anyway.
>> Is this a hack to allow allocating synchro_entry on the stack?
> Yes. Otherwise I would need a more ugly hack:
>
> 	struct synchro_entry *e;
> 	alignas(alignof(*e)) char e_buf[sizeof(*e) + sizeof(e->base.rows[0])];
> 	e = (struct synchro_entry *)e_buf;
>
> Then I managed to create the union hack, which I find
> less ugly, and it seems to allow declaring it on the stack.
> The interesting thing though is that I still can't make that
> union anything but the last member.
>
> I also see now there is a problem in one of the CI jobs:
> https://github.com/tarantool/tarantool/runs/2291961909
>
> It complains about an out-of-bounds array access, which has nothing to
> do with the union - just a C++ issue, I suppose. I fixed it in the new
> version, see the other email.

Ok, I see now. Thanks!
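
For the record, the trick can be shown standalone. This is just an
illustration with made-up type names (row/entry/wrapper), not the real
xrow_header/journal_entry/synchro_entry definitions; like the real code it
leans on the GCC/Clang extension that allows embedding a struct with a
flexible array member:

/*
 * Illustration only; invented names, compile with gcc/clang.
 */
#include <stdio.h>
#include <string.h>

struct row {
	int lsn;
};

/* A journal-entry-like struct ending with a flexible array member. */
struct entry {
	int res;
	int n_rows;
	struct row *rows[];
};

struct wrapper {
	void *req;
	/*
	 * The union reserves space for the flexible array tail (one row
	 * pointer here), so the whole wrapper can be declared on the
	 * stack. Embedding a struct with a flexible array member is a
	 * GCC/Clang extension, and the union has to stay the last member
	 * of the wrapper.
	 */
	union {
		struct entry base;
		char base_buf[sizeof(struct entry) + sizeof(struct row *)];
	};
};

int
main(void)
{
	struct row r = {.lsn = 42};
	struct wrapper w;
	memset(&w, 0, sizeof(w));
	w.base.n_rows = 1;
	/* rows[0] lands in the space covered by base_buf. */
	w.base.rows[0] = &r;
	printf("lsn = %d\n", w.base.rows[0]->lsn);
	return 0;
}

The cast-a-char-buffer alternative quoted above does the same thing, just
with the size bookkeeping exposed at the declaration site.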

>
>>> diff --git a/test/replication/gh-5213-qsync-applier-order.test.lua b/test/replication/gh-5213-qsync-applier-order.test.lua
>>> new file mode 100644
>>> index 000000000..3a32626f2
>>> --- /dev/null
>>> +++ b/test/replication/gh-5213-qsync-applier-order.test.lua
>>> @@ -0,0 +1,129 @@
>>> +--
>>> +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
>>> +-- As a result it could happen that the transactions became visible on CONFIRM,
>>> +-- then somehow weren't written to WAL, and on restart the data might not be
>>> +-- visible again. That means a rollback of confirmed data, which is not
>>> +-- acceptable (whereas a commit after a rollback would be tolerable).
>>> +--
>>> +test_run = require('test_run').new()
>>> +fiber = require('fiber')
>>> +old_synchro_quorum = box.cfg.replication_synchro_quorum
>>> +old_synchro_timeout = box.cfg.replication_synchro_timeout
>>> +
>>> +box.schema.user.grant('guest', 'super')
>>> +
>>> +s = box.schema.space.create('test', {is_sync = true})
>>> +_ = s:create_index('pk')
>>> +
>>> +test_run:cmd('create server replica with rpl_master=default,\
>>> +              script="replication/gh-5213-replica.lua"')
>>> +test_run:cmd('start server replica')
>>> +
>>> +test_run:switch('replica')
>>> +assert(box.info.id == 2)
>>> +lsn = box.info.vclock[1]
>>> +
>>> +test_run:switch('default')
>>> +fiber = require('fiber')
>>> +box.cfg{                                                                        \
>>> +    replication_synchro_quorum = 3,                                             \
>>> +    replication_synchro_timeout = 1000,                                         \
>>> +}
>>> +f = fiber.new(function() s:replace{1} end)
>>> +
>>> +test_run:switch('replica')
>>> +-- Wait when the transaction is written to WAL.
>>> +test_run:wait_cond(function() return box.info.vclock[1] == lsn + 1 end)
>> This shouldn't go wrong, but I find
>> test_run:wait_lsn('replica', 'default') more robust.
>> And you wouldn't need to save lsn above then.
>>
>> Same for other usages of lsn = ... and wait_cond(box.info.vclock[1] == lsn + ...)
>>
>> Up to you though.
> I totally forgot about wait_lsn(), thanks for noticing.
> I applied it. See the full diff in the second email.

Thanks for the fixes, LGTM.

-- 
Serge Petrenko



* Re: [Tarantool-patches] [PATCH 1/1] applier: process synchro rows after WAL write
  2021-04-09  8:25         ` Serge Petrenko via Tarantool-patches
@ 2021-04-09 21:32           ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 0 replies; 10+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-09 21:32 UTC (permalink / raw)
  To: Serge Petrenko, Cyrill Gorcunov; +Cc: tarantool-patches

>> I didn't try to reproduce this craziness though. However I have an idea how
>> we could fix it "for good". Introduce a fiber read-write lock. All the
>> transactions take read-lock on the limbo. All CONFIRM/ROLLBACK/PROMOTE take
>> a write-lock on the limbo. Then all the limbo owner transitions would be
>> fixed.
> 
> Sounds good. Isn't it enough to only have a write lock though?

Perhaps. I didn't give it much thought yet. Maybe after this patch is pushed,
and after you introduce PROMOTE.
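
To make the idea a bit more concrete, here is a rough sketch of the locking
discipline. It uses a plain pthread rwlock only as a stand-in, since there is
no fiber-level read-write lock yet (hence "introduce"); the real thing would
have to be a fiber lock living in the tx thread, and the limbo struct below
is a simplified placeholder, not the real txn_limbo:

/*
 * Sketch only: pthread_rwlock_t stands in for a hypothetical fiber
 * read-write lock.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

struct limbo {
	pthread_rwlock_t lock;
	uint32_t owner_id;
};

/* An ordinary synchronous transaction: shared access to the limbo. */
static void
limbo_txn_begin(struct limbo *l)
{
	pthread_rwlock_rdlock(&l->lock);
}

static void
limbo_txn_end(struct limbo *l)
{
	pthread_rwlock_unlock(&l->lock);
}

/* CONFIRM/ROLLBACK/PROMOTE: exclusive access, may change the owner. */
static void
limbo_apply_synchro(struct limbo *l, uint32_t new_owner)
{
	pthread_rwlock_wrlock(&l->lock);
	l->owner_id = new_owner;
	pthread_rwlock_unlock(&l->lock);
}

int
main(void)
{
	struct limbo l;
	l.owner_id = 1;
	pthread_rwlock_init(&l.lock, NULL);
	limbo_txn_begin(&l);
	limbo_txn_end(&l);
	limbo_apply_synchro(&l, 2);
	printf("limbo owner is now %u\n", (unsigned)l.owner_id);
	pthread_rwlock_destroy(&l.lock);
	return 0;
}

Whether a single exclusive lock would be enough, as suggested above, depends
on whether plain transactions ever need to exclude each other; the sketch
keeps the read/write split only to mirror the original proposal.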


end of thread

Thread overview: 10+ messages
2021-04-07 22:47 [Tarantool-patches] [PATCH 1/1] applier: process synchro rows after WAL write Vladislav Shpilevoy via Tarantool-patches
2021-04-08  8:39 ` Serge Petrenko via Tarantool-patches
2021-04-08 10:19   ` Cyrill Gorcunov via Tarantool-patches
2021-04-08 10:32     ` Serge Petrenko via Tarantool-patches
2021-04-08 10:46       ` Cyrill Gorcunov via Tarantool-patches
2021-04-08 22:57       ` Vladislav Shpilevoy via Tarantool-patches
2021-04-09  8:25         ` Serge Petrenko via Tarantool-patches
2021-04-09 21:32           ` Vladislav Shpilevoy via Tarantool-patches
2021-04-08 22:56   ` Vladislav Shpilevoy via Tarantool-patches
2021-04-09  8:26     ` Serge Petrenko via Tarantool-patches
