[Tarantool-patches] [PATCH 1/1] applier: process synchro rows after WAL write

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Fri Apr 9 01:57:00 MSK 2021


On 08.04.2021 12:32, Serge Petrenko via Tarantool-patches wrote:
> 08.04.2021 13:19, Cyrill Gorcunov wrote:
>> On Thu, Apr 08, 2021 at 11:39:03AM +0300, Serge Petrenko wrote:
>>> Thanks for the patch!
>>>
>>> I'm a bit worried about two different synchro rows coming from two
>>> appliers. Is everything ok in this case?
>> Serge, you mean the scenario when some instances in replicaset
>> have the patch applied and some are not?
> 
> No. Let's suppose everyone has this patch applied.
> Now look at one particular instance. It may happen that while one of
> its appliers is writing this synchro row (either CONFIRM or ROLLBACK,
> doesn't matter), some other applier may still apply requests coming
> from other replicaset members.
> 
> I was wondering what would happen if someone else sent this instance
> another synchro row. Looks like nothing bad but I just wanted to
> double-check.
> 
> And looks like there's a bug, which I'm speaking of below. It's about
> someone sending us normal rows (either synchronous transactions or
> asynchronous, but not CONFIRM/ROLLBACK entries) while we're waiting for
> syncro row's write to end.
> 
> Say, limbo was owned by instance 1, and instance 2 has written CONFIRM
> for everything there was. While we wait for 2's CONFIRM to be written to
> WAL, we may receive some rows from instance 3, who has already applied 2's
> CONFIRM. Since we haven't written the CONFIRM yet, we haven't applied it,
> and the limbo on our instance still isn't empty. All the rows coming from
> 3 will get rejected and replication between 3 and us will be broken.

## Regarding the described issue

This is a super good question, I am glad you noticed it. At first I
thought it was a bug, but then I realized it is not possible.

If instance 3 had already written that CONFIRM from instance_id 1, it
would send it to us too, before its own rows, and it would block on the
latch of the struct replica with instance_id 1 until the CONFIRM is
finally written. Then the second attempt to write it would be skipped,
and the rows of instance_id 3 would be applied fine.
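
To make it concrete, here is a minimal model of why the latch saves us,
with made-up names (in the real code the applier serializes rows under a
latch stored in the struct replica of the origin instance):

#include <stdint.h>

/* Fiber mutex, one per origin instance (latch-like API assumed). */
struct latch;
void latch_lock(struct latch *l);
void latch_unlock(struct latch *l);

struct row {
	uint32_t replica_id;	/* Origin instance of the row. */
	int64_t lsn;
};

int apply_row(const struct row *row);

/* Highest applied LSN and a latch per origin instance. */
extern int64_t applied_lsn[32];
extern struct latch *origin_latch[32];

/* Apply a batch of rows which all originate from one instance (count > 0). */
static int
apply_rows_from_one_origin(const struct row *rows, int count)
{
	uint32_t origin = rows[0].replica_id;
	/*
	 * The CONFIRM of instance 1 relayed by instance 3 waits here
	 * until the same CONFIRM received directly from instance 1
	 * finishes its WAL write under the same latch ...
	 */
	latch_lock(origin_latch[origin]);
	int rc = 0;
	for (int i = 0; i < count && rc == 0; i++) {
		/* ... and then is skipped as already applied. */
		if (rows[i].lsn <= applied_lsn[origin])
			continue;
		rc = apply_row(&rows[i]);
	}
	latch_unlock(origin_latch[origin]);
	return rc;
}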

I made a test which replays the described situation, and everything
works fine. I added it to the patch, and I also provide it below for
easy reproduction in 3 consoles.

## A problem not related to the patch

However, a similar issue might happen, not related to the fixed bug.
Assume the limbo belongs to instance 1. Then instance 1 goes away, and
instance 2 tries to write a CONFIRM on behalf of instance 1 using
box.ctl.clear_synchro_queue().

While the CONFIRM is in progress, it goes to WAL, from WAL it goes to
replication, and from there it goes to instance 3. Instance 3 applies
it, and its limbo becomes empty. It writes a new row, which goes to its
own WAL and is sent back to instance 2. Instance 2 still hasn't finished
the original CONFIRM WAL write, so its TX thread hasn't seen it yet. Its
limbo still has rows from instance 1, and the latch is not locked. Now
the replication between instances 2 and 3 is broken.

I didn't try to reproduce this craziness though. However, I have an idea
how we could fix it "for good": introduce a fiber read-write lock. All
transactions take a read-lock on the limbo, and CONFIRM/ROLLBACK/PROMOTE
requests take a write-lock on it. Then all the limbo owner transitions
would be handled properly.
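
Roughly what I mean, as a sketch only, with made-up names (limbo_rwlock
and friends do not exist); it assumes the existing fiber_cond primitives:

#include <stdbool.h>

/* Declarations of the assumed fiber condition variable API. */
struct fiber_cond;
int fiber_cond_wait(struct fiber_cond *cond);
void fiber_cond_broadcast(struct fiber_cond *cond);

/* Hypothetical fiber read-write lock guarding the limbo owner. */
struct limbo_rwlock {
	int readers;		/* Transactions being committed. */
	bool has_writer;	/* CONFIRM/ROLLBACK/PROMOTE in progress. */
	struct fiber_cond *cond;
};

/* Plain transactions - shared access, blocked only while a synchro row
 * is being written. */
static void
limbo_read_lock(struct limbo_rwlock *l)
{
	while (l->has_writer)
		(void)fiber_cond_wait(l->cond);
	l->readers++;
}

static void
limbo_read_unlock(struct limbo_rwlock *l)
{
	if (--l->readers == 0)
		fiber_cond_broadcast(l->cond);
}

/* CONFIRM/ROLLBACK/PROMOTE - exclusive access, so the limbo owner can
 * not change under the feet of a transaction being committed. */
static void
limbo_write_lock(struct limbo_rwlock *l)
{
	while (l->has_writer || l->readers > 0)
		(void)fiber_cond_wait(l->cond);
	l->has_writer = true;
}

static void
limbo_write_unlock(struct limbo_rwlock *l)
{
	l->has_writer = false;
	fiber_cond_broadcast(l->cond);
}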

I also thought more about how to evolve my idea of waiting for the WAL
write only when we get a conflicting transaction. We need to know the
transaction's origin ID before calling txn_commit/txn_commit_try_async.
To make it fast we would need to calculate it while creating the
txn_stmt objects, so it is already known right before commit. And here
it gets tricky, because locally-created transactions might roll back
individual statements using savepoints. That makes the on-the-fly
calculation of the origin_id not feasible. So far I couldn't figure out
what to do with that.
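
To show why, a hypothetical sketch (the names are made up): the cheap
part is maintaining the origin incrementally as statements are added;
the problem is that a savepoint rollback can drop an arbitrary tail of
statements and invalidate the cached value.

#include <stdint.h>

#define ORIGIN_NONE	0
#define ORIGIN_MIXED	UINT32_MAX

/* Hypothetical incremental tracking of a transaction's origin. */
struct txn_origin {
	uint32_t origin_id;	/* Single origin, ORIGIN_NONE, or ORIGIN_MIXED. */
};

/* Cheap: update when a statement with this replica_id is added. */
static void
txn_origin_on_new_stmt(struct txn_origin *o, uint32_t row_replica_id)
{
	if (o->origin_id == ORIGIN_NONE)
		o->origin_id = row_replica_id;
	else if (o->origin_id != row_replica_id)
		o->origin_id = ORIGIN_MIXED;
}

/*
 * Not cheap: a rollback to a savepoint removes statements. Once
 * origin_id has degraded to ORIGIN_MIXED, restoring the correct value
 * requires rescanning all surviving statements, which is exactly the
 * per-commit cost the on-the-fly tracking was supposed to avoid.
 */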

## The console test

It reproduces your scenario and everything works fine, because the latch
saves our asses. You need 3 instances in different consoles and have to
do the steps in the order they are marked in the comments.

--
-- Instance 1
--
-- Step 1
box.cfg{
    listen = 3313,
    replication_synchro_quorum = 4,
    replication_synchro_timeout = 1000,
}
box.schema.user.grant('guest', 'super')
s = box.schema.create_space('test', {is_sync = true})
_ = s:create_index('pk')

-- Step 4
fiber = require('fiber')
f = fiber.new(function() s:replace{1} end)

-- Step 8
box.cfg{replication_synchro_quorum = 3}
while f:status() ~= 'dead' do fiber.sleep(0.1) end



--
-- Instance 2
--
-- Step 2
box.cfg{
    listen = 3314,
    replication = {'localhost:3313', 'localhost:3314', 'localhost:3315'},
    read_only = true,
}

-- Step 5
-- Wait vclock.

-- Step 7
box.error.injection.set("ERRINJ_WAL_DELAY", true)

-- Step 10
box.error.injection.set("ERRINJ_WAL_DELAY", false)



--
-- Instance 3
--
-- Step 3
box.cfg{
    listen = 3315,
    replication = {'localhost:3313', 'localhost:3314', 'localhost:3315'},
    read_only = true,
}

-- Step 6
-- Wait vclock.

-- Step 9
box.cfg{read_only = false}
s = box.space.test
box.cfg{
    replication_synchro_quorum = 2,
    replication_synchro_timeout = 1000,
}
s:replace{2}

## The patch

Here is a full diff of the patch. It includes the test above as a
new .test.lua file, and all review fixes.

====================

    applier: process synchro rows after WAL write
    
    The applier used to process the synchronous rows CONFIRM and
    ROLLBACK right after receipt, before they were written to WAL.
    
    That led to a bug: the confirmed data became visible and could be
    accessed by user requests, then the node restarted before the
    CONFIRM finished its WAL write, and the data was not visible
    again. That is just as if it had been rolled back, which is not
    acceptable.
    
    Another case: the CONFIRM WAL write could simply fail for any
    reason (no disk space, OOM), but the transactions would remain
    confirmed anyway.
    
    Also that produced some hacks in the limbo's code to support the
    confirmation and rollback of transactions not yet written to WAL.
    
    The patch makes the synchro rows processed only after they are
    written to WAL. The 'rollback' case above might still happen if
    the xlogs were in the kernel caches and the machine was powered
    off before they were flushed to disk, but that is not related to
    qsync specifically.
    
    To handle the synchro rows after WAL write the patch makes them go
    to WAL in a blocking way (journal_write() instead of
    journal_write_try_async()). Otherwise it could happen that a
    CONFIRM/ROLLBACK is being written to WAL and would clear the limbo
    afterwards, but a new transaction arrives with a different owner,
    and it conflicts with the current limbo owner.
    
    Closes #5213

diff --git a/changelogs/unreleased/applier-qsync-write-order.md b/changelogs/unreleased/applier-qsync-write-order.md
new file mode 100644
index 000000000..10b9ade80
--- /dev/null
+++ b/changelogs/unreleased/applier-qsync-write-order.md
@@ -0,0 +1,6 @@
+## bugfix/replication
+
+* Fix the bug when a synchronous transaction could be confirmed and visible on
+  a replica, but then not confirmed / invisible again after restart. It was
+  more likely to happen on memtx spaces with `memtx_use_mvcc_engine` enabled
+  (gh-5213).
diff --git a/src/box/applier.cc b/src/box/applier.cc
index 971b2e64c..4898f9f7b 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -762,26 +762,21 @@ applier_txn_wal_write_cb(struct trigger *trigger, void *event)
 }
 
 struct synchro_entry {
-	/** Encoded form of a synchro record. */
-	struct synchro_body_bin	body_bin;
-
-	/** xrow to write, used by the journal engine. */
-	struct xrow_header row;
-
+	/** Request to process when WAL write is done. */
+	struct synchro_request *req;
+	/** Fiber that created the entry. To wake up when WAL write is done. */
+	struct fiber *owner;
 	/**
-	 * The journal entry itself. Note since
-	 * it has unsized array it must be the
-	 * last entry in the structure.
+	 * The base journal entry. It has an unsized array and hence must be
+	 * the last member of the structure. That can be worked around via a
+	 * union adding the needed tail as char[].
 	 */
-	struct journal_entry journal_entry;
+	union {
+		struct journal_entry base;
+		char base_buf[sizeof(base) + sizeof(base.rows[0])];
+	};
 };
 
-static void
-synchro_entry_delete(struct synchro_entry *entry)
-{
-	free(entry);
-}
-
 /**
  * Async write journal completion.
  */
@@ -791,50 +786,15 @@ apply_synchro_row_cb(struct journal_entry *entry)
 	assert(entry->complete_data != NULL);
 	struct synchro_entry *synchro_entry =
 		(struct synchro_entry *)entry->complete_data;
-	if (entry->res < 0)
+	if (entry->res < 0) {
 		applier_rollback_by_wal_io();
-	else
+	} else {
+		txn_limbo_process(&txn_limbo, synchro_entry->req);
 		trigger_run(&replicaset.applier.on_wal_write, NULL);
-
-	synchro_entry_delete(synchro_entry);
-}
-
-/**
- * Allocate a new synchro_entry to be passed to
- * the journal engine in async write way.
- */
-static struct synchro_entry *
-synchro_entry_new(struct xrow_header *applier_row,
-		  struct synchro_request *req)
-{
-	struct synchro_entry *entry;
-	size_t size = sizeof(*entry) + sizeof(struct xrow_header *);
-
-	/*
-	 * For simplicity we use malloc here but
-	 * probably should provide some cache similar
-	 * to txn cache.
-	 */
-	entry = (struct synchro_entry *)malloc(size);
-	if (entry == NULL) {
-		diag_set(OutOfMemory, size, "malloc", "synchro_entry");
-		return NULL;
 	}
-
-	struct journal_entry *journal_entry = &entry->journal_entry;
-	struct synchro_body_bin *body_bin = &entry->body_bin;
-	struct xrow_header *row = &entry->row;
-
-	journal_entry->rows[0] = row;
-
-	xrow_encode_synchro(row, body_bin, req);
-
-	row->lsn = applier_row->lsn;
-	row->replica_id = applier_row->replica_id;
-
-	journal_entry_create(journal_entry, 1, xrow_approx_len(row),
-			     apply_synchro_row_cb, entry);
-	return entry;
+	/* The fiber is the same on final join. */
+	if (synchro_entry->owner != fiber())
+		fiber_wakeup(synchro_entry->owner);
 }
 
 /** Process a synchro request. */
@@ -847,14 +807,37 @@ apply_synchro_row(struct xrow_header *row)
 	if (xrow_decode_synchro(row, &req) != 0)
 		goto err;
 
-	txn_limbo_process(&txn_limbo, &req);
-
-	struct synchro_entry *entry;
-	entry = synchro_entry_new(row, &req);
-	if (entry == NULL)
-		goto err;
-
-	if (journal_write_try_async(&entry->journal_entry) != 0) {
+	struct synchro_entry entry;
+	/*
+	 * Rows array is cast from *[] to **, because otherwise g++ complains
+	 * about out of array bounds access.
+	 */
+	struct xrow_header **rows;
+	rows = entry.base.rows;
+	rows[0] = row;
====================
This is how I fixed the compilation issue. I couldn't reproduce it
locally with my GCC installed via brew, but it was failing in one CI
job. A simple cast didn't help; only creating a new variable did.
Also, the variable is not assigned right away - the declaration is
separate. Otherwise the 'goto's before that line would not work.
====================
+	journal_entry_create(&entry.base, 1, xrow_approx_len(row),
+			     apply_synchro_row_cb, &entry);
+	entry.req = &req;
+	entry.owner = fiber();
+	/*
+	 * The WAL write is blocking. Otherwise it might happen that a CONFIRM
+	 * or ROLLBACK is sent to WAL, and it would empty the limbo, but before
+	 * it is written, more transactions arrive with a different owner. They
+	 * won't be able to enter the limbo due to owner ID mismatch. Hence the
+	 * synchro rows must block receipt of new transactions.
+	 *
+	 * Don't forget to return -1 both if the journal write failed right
+	 * away, and if it failed inside of WAL thread (res < 0). Otherwise the
+	 * caller would propagate committed vclock to this row thinking it was
+	 * a success.
+	 *
+	 * XXX: in theory it could be done vice-versa. The write could be made
+	 * non-blocking, and instead the potentially conflicting transactions
+	 * could try to wait for all the current synchro WAL writes to end
+	 * before trying to commit. But that requires extra steps from the
+	 * transactions side, including the async ones.
+	 */
+	if (journal_write(&entry.base) != 0 || entry.base.res < 0) {
 		diag_set(ClientError, ER_WAL_IO);
 		goto err;
 	}
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index cf0ad9350..c63211191 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -400,20 +400,11 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
 		txn_limbo_remove(limbo, e);
 		txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK);
 		/*
-		 * If already written to WAL by now, finish tx processing.
-		 * Otherwise just clear the sync flags. Tx procesing will finish
-		 * automatically once the tx is written to WAL.
-		 *
-		 * XXX: Normally at this point all transactions covered by this
-		 * CONFIRM should be in WAL already, but there is a bug, that
-		 * replica always processes received synchro requests *before*
-		 * writing them to WAL. So it can happen, that a CONFIRM is
-		 * 'read', but the transaction is not written yet. Should be
-		 * fixed when the replica will behave properly, and then this
-		 * branch won't exist.
+		 * Should be written to WAL by now. Confirm is always written
+		 * after the affected transactions.
 		 */
-		if (e->txn->signature >= 0)
-			txn_complete_success(e->txn);
+		assert(e->txn->signature >= 0);
+		txn_complete_success(e->txn);
 	}
 	/* Update is_ro once the limbo is clear. */
 	if (txn_limbo_is_empty(limbo))
@@ -455,30 +446,13 @@ txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
 	rlist_foreach_entry_safe_reverse(e, &limbo->queue, in_queue, tmp) {
 		txn_limbo_abort(limbo, e);
 		txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK);
-		if (e->txn->signature >= 0) {
-			/* Rollback the transaction. */
-			e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK;
-			txn_complete_fail(e->txn);
-		} else {
-			/*
-			 * Rollback the transaction, but don't free it yet. It
-			 * will be freed after its WAL write is completed.
-			 *
-			 * XXX: Normally at this point all transactions covered
-			 * by this ROLLBACK should be in WAL already, but there
-			 * is a bug, that replica always processes received
-			 * synchro requests *before* writing them to WAL. So it
-			 * can happen, that a ROLLBACK is 'read', but the
-			 * transaction is not written yet. Should be fixed when
-			 * the replica will behave properly, and then this
-			 * branch won't exist.
-			 */
-			e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK;
-			struct fiber *fiber = e->txn->fiber;
-			e->txn->fiber = fiber();
-			txn_complete_fail(e->txn);
-			e->txn->fiber = fiber;
-		}
+		/*
+		 * Should be written to WAL by now. Rollback is always written
+		 * after the affected transactions.
+		 */
+		assert(e->txn->signature >= 0);
+		e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK;
+		txn_complete_fail(e->txn);
 		if (e == last_rollback)
 			break;
 	}
diff --git a/test/replication/gh-5213-qsync-applier-order-3.result b/test/replication/gh-5213-qsync-applier-order-3.result
new file mode 100644
index 000000000..bcb18b5c0
--- /dev/null
+++ b/test/replication/gh-5213-qsync-applier-order-3.result
@@ -0,0 +1,263 @@
+-- test-run result file version 2
+--
+-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
+-- As a result it could happen that the transactions became visible on CONFIRM,
+-- then somehow weren't written to WAL, and on restart the data might not be
+-- visible again. Which means rollback of confirmed data and is not acceptable
+-- (on the contrary with commit after rollback).
+--
+-- To fix that there was a patch making synchro rows processing after WAL write.
+-- As a result, the following situation could happen. Instance 1 owns the limbo,
+-- instances 2 and 3 know about that. The limbo is not empty. Now instance 1
+-- writes CONFIRM and sends it to the instances 2 and 3. Both start a WAL write
+-- for the CONFIRM. Now instance 3 finishes it, creates a new synchro
+-- transaction, owns the local limbo, and sends the transaction to the instance
+-- 2. Here the CONFIRM WAL write is not done yet. It should not happen, that
+-- the instance 3's new transaction is rejected. Because firstly instance 3 will
+-- send the same instance 1's CONFIRM to the instance 2 due to it being earlier
+-- in WAL of instance 3. Then on instance 2 it will block on a latch with
+-- replica_id 1 until the original CONFIRM received from the instance 1 is
+-- finished. Afterwards the instance 3's transaction is applied just fine - the
+-- limbo on instance 2 is empty now.
+--
+-- It is not related to read-views, but could break the replication.
+--
+test_run = require('test_run').new()
+ | ---
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+old_synchro_quorum = box.cfg.replication_synchro_quorum
+ | ---
+ | ...
+old_synchro_timeout = box.cfg.replication_synchro_timeout
+ | ---
+ | ...
+
+box.schema.user.grant('guest', 'super')
+ | ---
+ | ...
+
+s = box.schema.space.create('test', {is_sync = true})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+
+test_run:cmd('create server replica1 with rpl_master=default,\
+              script="replication/replica1.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica1')
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('create server replica2 with rpl_master=default,\
+              script="replication/replica2.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica2')
+ | ---
+ | - true
+ | ...
+
+-- Build mutual replication between replica1 and replica2.
+test_run:switch('replica1')
+ | ---
+ | - true
+ | ...
+replication = box.cfg.replication
+ | ---
+ | ...
+table.insert(replication, test_run:eval('replica2', 'return box.cfg.listen')[1])
+ | ---
+ | ...
+box.cfg{replication = {}}
+ | ---
+ | ...
+box.cfg{replication = replication}
+ | ---
+ | ...
+
+test_run:switch('replica2')
+ | ---
+ | - true
+ | ...
+replication = box.cfg.replication
+ | ---
+ | ...
+table.insert(replication, test_run:eval('replica1', 'return box.cfg.listen')[1])
+ | ---
+ | ...
+box.cfg{replication = {}}
+ | ---
+ | ...
+box.cfg{replication = replication}
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+box.cfg{                                                                        \
+    replication_synchro_quorum = 4,                                             \
+    replication_synchro_timeout = 1000,                                         \
+}
+ | ---
+ | ...
+-- Send a transaction to all 3 nodes. The limbo is owned by the default node
+-- everywhere.
+f = fiber.new(function() s:replace{1} end)
+ | ---
+ | ...
+test_run:wait_lsn('replica1', 'default')
+ | ---
+ | ...
+test_run:wait_lsn('replica2', 'default')
+ | ---
+ | ...
+
+-- Make so the replica1 will apply CONFIRM from the default instance for a long
+-- time.
+test_run:switch('replica1')
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
+ | ---
+ | - ok
+ | ...
+
+-- Emit the CONFIRM.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum = 3}
+ | ---
+ | ...
+test_run:wait_cond(function() return f:status() == 'dead' end)
+ | ---
+ | - true
+ | ...
+
+-- It hangs on the replica1.
+test_run:switch('replica1')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function()                                                   \
+    return box.error.injection.get('ERRINJ_WAL_DELAY')                          \
+end)
+ | ---
+ | - true
+ | ...
+
+-- But is applied on the replica2. The limbo is empty here now.
+test_run:switch('replica2')
+ | ---
+ | - true
+ | ...
+test_run:wait_lsn('replica2', 'default')
+ | ---
+ | ...
+box.cfg{                                                                        \
+    replication_synchro_quorum = 2,                                             \
+    replication_synchro_timeout = 1000,                                         \
+}
+ | ---
+ | ...
+-- Replica2 takes the limbo ownership and sends the transaction to the replica1.
+-- Along with the CONFIRM from the default node, which is still not applied
+-- on the replica1.
+fiber = require('fiber')
+ | ---
+ | ...
+f = fiber.new(function() box.space.test:replace{2} end)
+ | ---
+ | ...
+
+test_run:switch('replica1')
+ | ---
+ | - true
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+-- WAL write of the CONFIRM from the default node still is not done. Give it
+-- some time to get the new rows from the replica2 and block on the latch.
+-- Can't catch it anyhow via conds, so the only way is to sleep a bit.
+fiber.sleep(0.1)
+ | ---
+ | ...
+-- Let the WAL writes finish. Firstly CONFIRM is finished, the limbo is emptied
+-- and the replica_id 1 latch is unlocked. Now the replica2 transaction is
+-- applied and persisted.
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+ | ---
+ | - ok
+ | ...
+test_run:wait_lsn('replica1', 'replica2')
+ | ---
+ | ...
+box.space.test:get({2})
+ | ---
+ | - [2]
+ | ...
+
+-- Ensure the replication works fine, nothing is broken.
+test_run:wait_upstream(test_run:get_server_id('replica2'), {status = 'follow'})
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica2')
+ | ---
+ | - true
+ | ...
+test_run:wait_upstream(test_run:get_server_id('replica1'), {status = 'follow'})
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica1')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica1')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica2')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica2')
+ | ---
+ | - true
+ | ...
+s:drop()
+ | ---
+ | ...
+box.schema.user.revoke('guest', 'super')
+ | ---
+ | ...
+box.cfg{                                                                        \
+    replication_synchro_quorum = old_synchro_quorum,                            \
+    replication_synchro_timeout = old_synchro_timeout,                          \
+}
+ | ---
+ | ...
diff --git a/test/replication/gh-5213-qsync-applier-order-3.test.lua b/test/replication/gh-5213-qsync-applier-order-3.test.lua
new file mode 100644
index 000000000..37b569da7
--- /dev/null
+++ b/test/replication/gh-5213-qsync-applier-order-3.test.lua
@@ -0,0 +1,125 @@
+--
+-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
+-- As a result it could happen that the transactions became visible on CONFIRM,
+-- then somehow weren't written to WAL, and on restart the data might not be
+-- visible again. Which means rollback of confirmed data and is not acceptable
+-- (on the contrary with commit after rollback).
+--
+-- To fix that there was a patch making synchro rows processing after WAL write.
+-- As a result, the following situation could happen. Instance 1 owns the limbo,
+-- instances 2 and 3 know about that. The limbo is not empty. Now instance 1
+-- writes CONFIRM and sends it to the instances 2 and 3. Both start a WAL write
+-- for the CONFIRM. Now instance 3 finishes it, creates a new synchro
+-- transaction, owns the local limbo, and sends the transaction to the instance
+-- 2. Here the CONFIRM WAL write is not done yet. It should not happen, that
+-- the instance 3's new transaction is rejected. Because firstly instance 3 will
+-- send the same instance 1's CONFIRM to the instance 2 due to it being earlier
+-- in WAL of instance 3. Then on instance 2 it will block on a latch with
+-- replica_id 1 until the original CONFIRM received from the instance 1 is
+-- finished. Afterwards the instance 3's transaction is applied just fine - the
+-- limbo on instance 2 is empty now.
+--
+-- It is not related to read-views, but could break the replication.
+--
+test_run = require('test_run').new()
+fiber = require('fiber')
+old_synchro_quorum = box.cfg.replication_synchro_quorum
+old_synchro_timeout = box.cfg.replication_synchro_timeout
+
+box.schema.user.grant('guest', 'super')
+
+s = box.schema.space.create('test', {is_sync = true})
+_ = s:create_index('pk')
+
+test_run:cmd('create server replica1 with rpl_master=default,\
+              script="replication/replica1.lua"')
+test_run:cmd('start server replica1')
+
+test_run:cmd('create server replica2 with rpl_master=default,\
+              script="replication/replica2.lua"')
+test_run:cmd('start server replica2')
+
+-- Build mutual replication between replica1 and replica2.
+test_run:switch('replica1')
+replication = box.cfg.replication
+table.insert(replication, test_run:eval('replica2', 'return box.cfg.listen')[1])
+box.cfg{replication = {}}
+box.cfg{replication = replication}
+
+test_run:switch('replica2')
+replication = box.cfg.replication
+table.insert(replication, test_run:eval('replica1', 'return box.cfg.listen')[1])
+box.cfg{replication = {}}
+box.cfg{replication = replication}
+
+test_run:switch('default')
+fiber = require('fiber')
+box.cfg{                                                                        \
+    replication_synchro_quorum = 4,                                             \
+    replication_synchro_timeout = 1000,                                         \
+}
+-- Send a transaction to all 3 nodes. The limbo is owned by the default node
+-- everywhere.
+f = fiber.new(function() s:replace{1} end)
+test_run:wait_lsn('replica1', 'default')
+test_run:wait_lsn('replica2', 'default')
+
+-- Make so the replica1 will apply CONFIRM from the default instance for a long
+-- time.
+test_run:switch('replica1')
+box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
+
+-- Emit the CONFIRM.
+test_run:switch('default')
+box.cfg{replication_synchro_quorum = 3}
+test_run:wait_cond(function() return f:status() == 'dead' end)
+
+-- It hangs on the replica1.
+test_run:switch('replica1')
+test_run:wait_cond(function()                                                   \
+    return box.error.injection.get('ERRINJ_WAL_DELAY')                          \
+end)
+
+-- But is applied on the replica2. The limbo is empty here now.
+test_run:switch('replica2')
+test_run:wait_lsn('replica2', 'default')
+box.cfg{                                                                        \
+    replication_synchro_quorum = 2,                                             \
+    replication_synchro_timeout = 1000,                                         \
+}
+-- Replica2 takes the limbo ownership and sends the transaction to the replica1.
+-- Along with the CONFIRM from the default node, which is still not applied
+-- on the replica1.
+fiber = require('fiber')
+f = fiber.new(function() box.space.test:replace{2} end)
+
+test_run:switch('replica1')
+fiber = require('fiber')
+-- WAL write of the CONFIRM from the default node still is not done. Give it
+-- some time to get the new rows from the replica2 and block on the latch.
+-- Can't catch it anyhow via conds, so the only way is to sleep a bit.
+fiber.sleep(0.1)
+-- Let the WAL writes finish. Firstly CONFIRM is finished, the limbo is emptied
+-- and the replica_id 1 latch is unlocked. Now the replica2 transaction is
+-- applied and persisted.
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+test_run:wait_lsn('replica1', 'replica2')
+box.space.test:get({2})
+
+-- Ensure the replication works fine, nothing is broken.
+test_run:wait_upstream(test_run:get_server_id('replica2'), {status = 'follow'})
+
+test_run:switch('replica2')
+test_run:wait_upstream(test_run:get_server_id('replica1'), {status = 'follow'})
+
+test_run:switch('default')
+test_run:cmd('stop server replica1')
+test_run:cmd('delete server replica1')
+test_run:cmd('stop server replica2')
+test_run:cmd('delete server replica2')
+s:drop()
+box.schema.user.revoke('guest', 'super')
+box.cfg{                                                                        \
+    replication_synchro_quorum = old_synchro_quorum,                            \
+    replication_synchro_timeout = old_synchro_timeout,                          \
+}
diff --git a/test/replication/gh-5213-qsync-applier-order.result b/test/replication/gh-5213-qsync-applier-order.result
new file mode 100644
index 000000000..a8c24c289
--- /dev/null
+++ b/test/replication/gh-5213-qsync-applier-order.result
@@ -0,0 +1,302 @@
+-- test-run result file version 2
+--
+-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
+-- As a result it could happen that the transactions became visible on CONFIRM,
+-- then somehow weren't written to WAL, and on restart the data might not be
+-- visible again. Which means rollback of confirmed data and is not acceptable
+-- (on the contrary with commit after rollback).
+--
+test_run = require('test_run').new()
+ | ---
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+old_synchro_quorum = box.cfg.replication_synchro_quorum
+ | ---
+ | ...
+old_synchro_timeout = box.cfg.replication_synchro_timeout
+ | ---
+ | ...
+
+box.schema.user.grant('guest', 'super')
+ | ---
+ | ...
+
+s = box.schema.space.create('test', {is_sync = true})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+
+test_run:cmd('create server replica with rpl_master=default,\
+              script="replication/gh-5213-replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica')
+ | ---
+ | - true
+ | ...
+
+fiber = require('fiber')
+ | ---
+ | ...
+box.cfg{                                                                        \
+    replication_synchro_quorum = 3,                                             \
+    replication_synchro_timeout = 1000,                                         \
+}
+ | ---
+ | ...
+f = fiber.new(function() s:replace{1} end)
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+-- Wait when the transaction is written to WAL.
+test_run:wait_lsn('replica', 'default')
+ | ---
+ | ...
+s = box.space.test
+ | ---
+ | ...
+-- But not visible yet. Because MVCC is on, and no CONFIRM yet.
+assert(s:get({1}) == nil)
+ | ---
+ | - true
+ | ...
+-- Block the incoming CONFIRM to be able to ensure the data is not visible until
+-- WAL write ends.
+box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
+ | ---
+ | - ok
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum = 2}
+ | ---
+ | ...
+test_run:wait_cond(function() return f:status() == 'dead' end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+-- Wait when CONFIRM is in the WAL thread.
+test_run:wait_cond(function()                                                   \
+    return box.error.injection.get('ERRINJ_WAL_DELAY')                          \
+end)
+ | ---
+ | - true
+ | ...
+assert(s:get({1}) == nil)
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+ | ---
+ | - ok
+ | ...
+-- After CONFIRM is in WAL, the transaction is committed and its data is
+-- visible.
+test_run:wait_cond(function() return s:get({1}) ~= nil end)
+ | ---
+ | - true
+ | ...
+
+--
+-- Ensure CONFIRM WAL write fail also works fine when couldn't even start a WAL
+-- write. The data remains not confirmed until WAL write succeeds.
+--
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum = 3}
+ | ---
+ | ...
+f = fiber.new(function() s:replace{2} end)
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_lsn('replica', 'default')
+ | ---
+ | ...
+assert(s:get({2}) == nil)
+ | ---
+ | - true
+ | ...
+-- Make journal write fail immediately.
+box.error.injection.set('ERRINJ_WAL_IO', true)
+ | ---
+ | - ok
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum = 2}
+ | ---
+ | ...
+test_run:wait_cond(function() return f:status() == 'dead' end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_upstream(1, {status = 'stopped'})
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_WAL_IO', false)
+ | ---
+ | - ok
+ | ...
+assert(s:get({2}) == nil)
+ | ---
+ | - true
+ | ...
+-- Re-subscribe.
+replication = box.cfg.replication
+ | ---
+ | ...
+box.cfg{replication = {}}
+ | ---
+ | ...
+box.cfg{replication = replication}
+ | ---
+ | ...
+test_run:wait_lsn('replica', 'default')
+ | ---
+ | ...
+test_run:wait_upstream(1, {status = 'follow'})
+ | ---
+ | - true
+ | ...
+assert(s:get({2}) ~= nil)
+ | ---
+ | - true
+ | ...
+
+--
+-- Ensure CONFIRM WAL write fail works fine when an error happens inside WAL
+-- thread. The data remains not confirmed until WAL write succeeds.
+--
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum = 3}
+ | ---
+ | ...
+f = fiber.new(function() s:replace{3} end)
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_lsn('replica', 'default')
+ | ---
+ | ...
+assert(s:get({3}) == nil)
+ | ---
+ | - true
+ | ...
+-- Journal write start is going to succeed, but it will fail later on return
+-- from the WAL thread.
+box.error.injection.set('ERRINJ_WAL_ROTATE', true)
+ | ---
+ | - ok
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum = 2}
+ | ---
+ | ...
+test_run:wait_cond(function() return f:status() == 'dead' end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_upstream(1, {status = 'stopped'})
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_WAL_ROTATE', false)
+ | ---
+ | - ok
+ | ...
+assert(s:get({3}) == nil)
+ | ---
+ | - true
+ | ...
+box.cfg{replication = {}}
+ | ---
+ | ...
+box.cfg{replication = replication}
+ | ---
+ | ...
+test_run:wait_lsn('replica', 'default')
+ | ---
+ | ...
+test_run:wait_upstream(1, {status = 'follow'})
+ | ---
+ | - true
+ | ...
+assert(s:get({3}) ~= nil)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+
+s:drop()
+ | ---
+ | ...
+box.schema.user.revoke('guest', 'super')
+ | ---
+ | ...
+box.cfg{                                                                        \
+    replication_synchro_quorum = old_synchro_quorum,                            \
+    replication_synchro_timeout = old_synchro_timeout,                          \
+}
+ | ---
+ | ...
diff --git a/test/replication/gh-5213-qsync-applier-order.test.lua b/test/replication/gh-5213-qsync-applier-order.test.lua
new file mode 100644
index 000000000..f1eccfa84
--- /dev/null
+++ b/test/replication/gh-5213-qsync-applier-order.test.lua
@@ -0,0 +1,122 @@
+--
+-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
+-- As a result it could happen that the transactions became visible on CONFIRM,
+-- then somehow weren't written to WAL, and on restart the data might not be
+-- visible again. Which means rollback of confirmed data and is not acceptable
+-- (on the contrary with commit after rollback).
+--
+test_run = require('test_run').new()
+fiber = require('fiber')
+old_synchro_quorum = box.cfg.replication_synchro_quorum
+old_synchro_timeout = box.cfg.replication_synchro_timeout
+
+box.schema.user.grant('guest', 'super')
+
+s = box.schema.space.create('test', {is_sync = true})
+_ = s:create_index('pk')
+
+test_run:cmd('create server replica with rpl_master=default,\
+              script="replication/gh-5213-replica.lua"')
+test_run:cmd('start server replica')
+
+fiber = require('fiber')
+box.cfg{                                                                        \
+    replication_synchro_quorum = 3,                                             \
+    replication_synchro_timeout = 1000,                                         \
+}
+f = fiber.new(function() s:replace{1} end)
+
+test_run:switch('replica')
+-- Wait when the transaction is written to WAL.
+test_run:wait_lsn('replica', 'default')
+s = box.space.test
+-- But not visible yet. Because MVCC is on, and no CONFIRM yet.
+assert(s:get({1}) == nil)
+-- Block the incoming CONFIRM to be able to ensure the data is not visible until
+-- WAL write ends.
+box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
+
+test_run:switch('default')
+box.cfg{replication_synchro_quorum = 2}
+test_run:wait_cond(function() return f:status() == 'dead' end)
+
+test_run:switch('replica')
+-- Wait when CONFIRM is in the WAL thread.
+test_run:wait_cond(function()                                                   \
+    return box.error.injection.get('ERRINJ_WAL_DELAY')                          \
+end)
+assert(s:get({1}) == nil)
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+-- After CONFIRM is in WAL, the transaction is committed and its data is
+-- visible.
+test_run:wait_cond(function() return s:get({1}) ~= nil end)
+
+--
+-- Ensure CONFIRM WAL write fail also works fine when couldn't even start a WAL
+-- write. The data remains not confirmed until WAL write succeeds.
+--
+test_run:switch('default')
+box.cfg{replication_synchro_quorum = 3}
+f = fiber.new(function() s:replace{2} end)
+
+test_run:switch('replica')
+test_run:wait_lsn('replica', 'default')
+assert(s:get({2}) == nil)
+-- Make journal write fail immediately.
+box.error.injection.set('ERRINJ_WAL_IO', true)
+
+test_run:switch('default')
+box.cfg{replication_synchro_quorum = 2}
+test_run:wait_cond(function() return f:status() == 'dead' end)
+
+test_run:switch('replica')
+test_run:wait_upstream(1, {status = 'stopped'})
+box.error.injection.set('ERRINJ_WAL_IO', false)
+assert(s:get({2}) == nil)
+-- Re-subscribe.
+replication = box.cfg.replication
+box.cfg{replication = {}}
+box.cfg{replication = replication}
+test_run:wait_lsn('replica', 'default')
+test_run:wait_upstream(1, {status = 'follow'})
+assert(s:get({2}) ~= nil)
+
+--
+-- Ensure CONFIRM WAL write fail works fine when an error happens inside WAL
+-- thread. The data remains not confirmed until WAL write succeeds.
+--
+test_run:switch('default')
+box.cfg{replication_synchro_quorum = 3}
+f = fiber.new(function() s:replace{3} end)
+
+test_run:switch('replica')
+test_run:wait_lsn('replica', 'default')
+assert(s:get({3}) == nil)
+-- Journal write start is going to succeed, but it will fail later on return
+-- from the WAL thread.
+box.error.injection.set('ERRINJ_WAL_ROTATE', true)
+
+test_run:switch('default')
+box.cfg{replication_synchro_quorum = 2}
+test_run:wait_cond(function() return f:status() == 'dead' end)
+
+test_run:switch('replica')
+test_run:wait_upstream(1, {status = 'stopped'})
+box.error.injection.set('ERRINJ_WAL_ROTATE', false)
+assert(s:get({3}) == nil)
+box.cfg{replication = {}}
+box.cfg{replication = replication}
+test_run:wait_lsn('replica', 'default')
+test_run:wait_upstream(1, {status = 'follow'})
+assert(s:get({3}) ~= nil)
+
+test_run:switch('default')
+test_run:cmd('stop server replica')
+test_run:cmd('delete server replica')
+
+s:drop()
+box.schema.user.revoke('guest', 'super')
+box.cfg{                                                                        \
+    replication_synchro_quorum = old_synchro_quorum,                            \
+    replication_synchro_timeout = old_synchro_timeout,                          \
+}
diff --git a/test/replication/gh-5213-replica.lua b/test/replication/gh-5213-replica.lua
new file mode 100644
index 000000000..3ba90d7ab
--- /dev/null
+++ b/test/replication/gh-5213-replica.lua
@@ -0,0 +1,10 @@
+#!/usr/bin/env tarantool
+
+box.cfg({
+    listen = os.getenv("LISTEN"),
+    replication = os.getenv("MASTER"),
+    read_only = true,
+    memtx_use_mvcc_engine = true,
+})
+
+require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index aff5fda26..4a9ca0a46 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -15,6 +15,8 @@
     "gh-3760-misc-return-on-quorum-0.test.lua": {},
     "gh-4399-misc-no-failure-on-error-reading-wal.test.lua": {},
     "gh-4424-misc-orphan-on-reconfiguration-error.test.lua": {},
+    "gh-5213-qsync-applier-order.test.lua": {},
+    "gh-5213-qsync-applier-order-3.test.lua": {},
     "gh-5426-election-on-off.test.lua": {},
     "gh-5433-election-restart-recovery.test.lua": {},
     "gh-5506-election-on-off.test.lua": {},
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index a9e44e8cf..1d9c0a4ae 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -3,7 +3,7 @@ core = tarantool
 script =  master.lua
 description = tarantool/box, replication
 disabled = consistent.test.lua
-release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua
+release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua
 config = suite.cfg
 lua_libs = lua/fast_replica.lua lua/rlimit.lua
 use_unix_sockets = True

