[Tarantool-patches] [PATCH 1/1] applier: process synchro rows after WAL write
Serge Petrenko
sergepetrenko at tarantool.org
Fri Apr 9 11:25:18 MSK 2021
09.04.2021 01:57, Vladislav Shpilevoy wrote:
> On 08.04.2021 12:32, Serge Petrenko via Tarantool-patches wrote:
>> 08.04.2021 13:19, Cyrill Gorcunov wrote:
>>> On Thu, Apr 08, 2021 at 11:39:03AM +0300, Serge Petrenko wrote:
>>>> Thanks for the patch!
>>>>
>>>> I'm a bit worried about two different synchro rows coming from two
>>>> appliers. Is everything ok in this case?
>>> Serge, you mean the scenario when some instances in replicaset
>>> have the patch applied and some are not?
>> No. Let's suppose everyone has this patch applied.
>> Now look at one particular instance. It may happen that while one of
>> its appliers is writing this synchro row (either CONFIRM or ROLLBACK,
>> doesn't matter), some other applier may still apply requests coming
>> from other replicaset members.
>>
>> I was wondering what would happen if someone else sent this instance
>> another synchro row. Looks like nothing bad but I just wanted to
>> double-check.
>>
>> And it looks like there's a bug, which I'm speaking of below. It's about
>> someone sending us normal rows (either synchronous transactions or
>> asynchronous ones, but not CONFIRM/ROLLBACK entries) while we're waiting
>> for the synchro row's write to end.
>>
>> Say, the limbo was owned by instance 1, and instance 2 has written CONFIRM
>> for everything there was. While we wait for 2's CONFIRM to be written to
>> WAL, we may receive some rows from instance 3, which has already applied 2's
>> CONFIRM. Since we haven't written the CONFIRM yet, we haven't applied it,
>> and the limbo on our instance still isn't empty. All the rows coming from
>> 3 will get rejected and replication between 3 and us will be broken.
> ## Regarding the described issue
>
> This is a super good question, I am glad you noticed it. At first I thought
> it was a bug, but then I realized it is not possible.
>
> If instance 3 already wrote that CONFIRM from instance_id 1, it would
> send it to us too, before its own rows, and it would block on the latch of
> the struct replica with instance_id 1 until the CONFIRM is finally written.
> Then the second attempt to write it would be skipped and the rows of
> instance_id 3 would be applied just fine.
>
> I made a test which replays the described situation, and all works fine. I
> added it to the patch, and I also provide it below for easy reproduction in
> 3 consoles.
Yes, indeed. Thanks for the explanation!
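
For the record, this is roughly the serialization point I re-checked in
applier.cc while thinking about it. It is a simplified sketch from memory,
not a verbatim quote of the tree; the names (order_latch,
replicaset.applier.vclock, the sketch function itself) are approximate:

/*
 * Approximate sketch of the per-origin ordering in the applier,
 * modeled after applier_apply_tx(); names are from memory.
 */
static int
applier_apply_tx_sketch(struct xrow_header *first_row)
{
	struct replica *replica = replica_by_id(first_row->replica_id);
	struct latch *latch = (replica != NULL ? &replica->order_latch :
			       &replicaset.applier.order_latch);
	latch_lock(latch);
	/*
	 * While one applier holds the latch writing instance 1's CONFIRM,
	 * the same CONFIRM re-sent by instance 3 blocks right here. Once
	 * the first write is done and the limbo is cleared, the duplicate
	 * is filtered out by the vclock check below, and instance 3's own
	 * rows are applied to an already empty limbo.
	 */
	if (vclock_get(&replicaset.applier.vclock, first_row->replica_id) >=
	    first_row->lsn) {
		latch_unlock(latch);
		return 0;
	}
	/* ... apply the transaction and advance the applier vclock ... */
	latch_unlock(latch);
	return 0;
}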
>
> ## A problem not related to the patch
>
> However, a similar issue might happen, not related to the fixed bug.
> Assume the limbo belongs to instance 1. Then instance 1 is gone, and
> instance 2 tries to write a CONFIRM on behalf of instance 1 using
> box.ctl.clear_synchro_queue().
>
> While the CONFIRM is in progress, it goes to WAL, from WAL it goes to
> replication, and from there it goes to instance 3. That one applies it, and
> the limbo becomes empty on instance 3. It writes a new row, which goes to
> its WAL and is sent back to instance 2. Instance 2 still hasn't finished
> the original CONFIRM WAL write, so the TX thread hasn't seen it yet. Its
> limbo still has rows from instance 1, and the latch is not locked. Now the
> replication between instances 2 and 3 is broken.
Yes, that's possible. I've also thought of a lock on CONFIRM, etc.
>
> I didn't try to reproduce this craziness though. However I have an idea how
> we could fix it "for good". Introduce a fiber read-write lock. All the
> transactions take a read lock on the limbo. All CONFIRM/ROLLBACK/PROMOTE
> requests take a write lock on the limbo. Then all the limbo owner
> transitions would be fixed.
Sounds good. Isn't it enough to only have a write lock though?
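
In case it helps the discussion, here is a minimal sketch of the lock I
imagine, built on top of fiber_cond. None of these names exist in the tree;
it is purely an illustration of the idea, not a proposed implementation:

/* Assumes #include "fiber_cond.h" from lib/core. Illustrative only. */
struct fiber_rwlock {
	/** Number of active read-lock holders (transactions). */
	int readers;
	/** Set while a CONFIRM/ROLLBACK/PROMOTE holds the write lock. */
	bool is_write_locked;
	/** Signaled whenever the lock state changes. */
	struct fiber_cond cond;
};

static inline void
fiber_rwlock_create(struct fiber_rwlock *l)
{
	l->readers = 0;
	l->is_write_locked = false;
	fiber_cond_create(&l->cond);
}

static inline void
fiber_rwlock_rdlock(struct fiber_rwlock *l)
{
	while (l->is_write_locked)
		fiber_cond_wait(&l->cond);
	l->readers++;
}

static inline void
fiber_rwlock_wrlock(struct fiber_rwlock *l)
{
	while (l->is_write_locked || l->readers > 0)
		fiber_cond_wait(&l->cond);
	l->is_write_locked = true;
}

static inline void
fiber_rwlock_unlock(struct fiber_rwlock *l)
{
	if (l->is_write_locked)
		l->is_write_locked = false;
	else
		l->readers--;
	fiber_cond_broadcast(&l->cond);
}

Transactions would take the read lock for the whole commit path, and
CONFIRM/ROLLBACK/PROMOTE processing would take the write lock, so an owner
transition never interleaves with a half-applied transaction. If a plain
write lock turns out to be enough, the readers counter simply goes away.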
>
> I also thought more about how to evolve my idea of waiting for the WAL write
> only if we get a conflicting transaction. We need to know the transaction's
> origin ID before calling txn_commit/txn_commit_try_async. To make it fast we
> need to calculate it while creating the txn_stmt objects, so it is already
> known right before commit. And here it is tricky, because locally-created
> transactions might roll back individual statements using savepoints. That
> makes the on-the-fly calculation of the origin_id not feasible. So far I
> couldn't find what to do with that.
>
> ## The console test
>
> It reproduces your scenario and all works fine, because the latch saves
> our asses. You need 3 instances in different consoles and need to do the
> steps in the order they are marked in the comments.
The test's fine. Thanks for checking this!
>
> --
> -- Instance 1
> --
> -- Step 1
> box.cfg{
> listen = 3313,
> replication_synchro_quorum = 4,
> replication_synchro_timeout = 1000,
> }
> box.schema.user.grant('guest', 'super')
> s = box.schema.create_space('test', {is_sync = true})
> _ = s:create_index('pk')
>
> -- Step 4
> fiber = require('fiber')
> f = fiber.new(function() s:replace{1} end)
>
> -- Step 8
> box.cfg{replication_synchro_quorum = 3}
> while f:status() ~= 'dead' do fiber.sleep(0.1) end
>
>
>
> --
> -- Instance 2
> --
> -- Step 2
> box.cfg{
> listen = 3314,
> replication = {'localhost:3313', 'localhost:3314', 'localhost:3315'},
> read_only = true,
> }
>
> -- Step 5
> -- Wait vclock.
>
> -- Step 7
> box.error.injection.set("ERRINJ_WAL_DELAY", true)
>
> -- Step 10
> box.error.injection.set("ERRINJ_WAL_DELAY", false)
>
>
>
> --
> -- Instance 3
> --
> -- Step 3
> box.cfg{
> listen = 3315,
> replication = {'localhost:3313', 'localhost:3314', 'localhost:3315'},
> read_only = true,
> }
>
> -- Step 6
> -- Wait vclock.
>
> -- Step 9
> box.cfg{read_only = false}
> s = box.space.test
> box.cfg{
> replication_synchro_quorum = 2,
> replication_synchro_timeout = 1000,
> }
> s:replace{2}
>
> ## The patch
>
> Here is a full diff of the patch. It includes the test above as a
> new .test.lua file, and all review fixes.
>
> ====================
>
> applier: process synchro rows after WAL write
>
> Applier used to process the synchro rows CONFIRM and ROLLBACK
> right after receipt, before they were written to WAL.
>
> That led to a bug: the confirmed data became visible and might be
> accessed by user requests, then the node restarted before CONFIRM
> finished its WAL write, and the data was not visible again. That
> is just as if it were rolled back, which is not acceptable.
>
> Another case - a CONFIRM WAL write could simply fail for any
> reason (no disk space, OOM), but the transactions would remain
> confirmed anyway.
>
> Also that produced some hacks in the limbo's code to support the
> confirmation and rollback of transactions not yet written to WAL.
>
> The patch makes the synchro rows processed only after they are
> written to WAL. The 'rollback' case above might still happen if
> the xlogs were in the kernel caches and the machine was powered
> off before they were flushed to disk, but that is not related to
> qsync specifically.
>
> To handle the synchro rows after WAL write the patch makes them go
> to WAL in a blocking way (journal_write() instead of
> journal_write_try_async()). Otherwise it could happen that a
> CONFIRM/ROLLBACK is being written to WAL and would clear the limbo
> afterwards, but a new transaction arrives with a different owner,
> and it conflicts with the current limbo owner.
>
> Closes #5213
>
> diff --git a/changelogs/unreleased/applier-qsync-write-order.md b/changelogs/unreleased/applier-qsync-write-order.md
> new file mode 100644
> index 000000000..10b9ade80
> --- /dev/null
> +++ b/changelogs/unreleased/applier-qsync-write-order.md
> @@ -0,0 +1,6 @@
> +## bugfix/replication
> +
> +* Fix the bug when a synchronous transaction could be confirmed and visible on
> + a replica, but then not confirmed / invisible again after restart. Could
> + happen more likely on memtx spaces with `memtx_use_mvcc_engine` enabled
> + (gh-5213).
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index 971b2e64c..4898f9f7b 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -762,26 +762,21 @@ applier_txn_wal_write_cb(struct trigger *trigger, void *event)
> }
>
> struct synchro_entry {
> - /** Encoded form of a synchro record. */
> - struct synchro_body_bin body_bin;
> -
> - /** xrow to write, used by the journal engine. */
> - struct xrow_header row;
> -
> + /** Request to process when WAL write is done. */
> + struct synchro_request *req;
> + /** Fiber created the entry. To wakeup when WAL write is done. */
> + struct fiber *owner;
> /**
> - * The journal entry itself. Note since
> - * it has unsized array it must be the
> - * last entry in the structure.
> + * The base journal entry. It has unsized array and then must be the
> + * last entry in the structure. But can workaround it via a union
> + * adding the needed tail as char[].
> */
> - struct journal_entry journal_entry;
> + union {
> + struct journal_entry base;
> + char base_buf[sizeof(base) + sizeof(base.rows[0])];
> + };
> };
>
> -static void
> -synchro_entry_delete(struct synchro_entry *entry)
> -{
> - free(entry);
> -}
> -
> /**
> * Async write journal completion.
> */
> @@ -791,50 +786,15 @@ apply_synchro_row_cb(struct journal_entry *entry)
> assert(entry->complete_data != NULL);
> struct synchro_entry *synchro_entry =
> (struct synchro_entry *)entry->complete_data;
> - if (entry->res < 0)
> + if (entry->res < 0) {
> applier_rollback_by_wal_io();
> - else
> + } else {
> + txn_limbo_process(&txn_limbo, synchro_entry->req);
> trigger_run(&replicaset.applier.on_wal_write, NULL);
> -
> - synchro_entry_delete(synchro_entry);
> -}
> -
> -/**
> - * Allocate a new synchro_entry to be passed to
> - * the journal engine in async write way.
> - */
> -static struct synchro_entry *
> -synchro_entry_new(struct xrow_header *applier_row,
> - struct synchro_request *req)
> -{
> - struct synchro_entry *entry;
> - size_t size = sizeof(*entry) + sizeof(struct xrow_header *);
> -
> - /*
> - * For simplicity we use malloc here but
> - * probably should provide some cache similar
> - * to txn cache.
> - */
> - entry = (struct synchro_entry *)malloc(size);
> - if (entry == NULL) {
> - diag_set(OutOfMemory, size, "malloc", "synchro_entry");
> - return NULL;
> }
> -
> - struct journal_entry *journal_entry = &entry->journal_entry;
> - struct synchro_body_bin *body_bin = &entry->body_bin;
> - struct xrow_header *row = &entry->row;
> -
> - journal_entry->rows[0] = row;
> -
> - xrow_encode_synchro(row, body_bin, req);
> -
> - row->lsn = applier_row->lsn;
> - row->replica_id = applier_row->replica_id;
> -
> - journal_entry_create(journal_entry, 1, xrow_approx_len(row),
> - apply_synchro_row_cb, entry);
> - return entry;
> + /* The fiber is the same on final join. */
> + if (synchro_entry->owner != fiber())
> + fiber_wakeup(synchro_entry->owner);
> }
>
> /** Process a synchro request. */
> @@ -847,14 +807,37 @@ apply_synchro_row(struct xrow_header *row)
> if (xrow_decode_synchro(row, &req) != 0)
> goto err;
>
> - txn_limbo_process(&txn_limbo, &req);
> -
> - struct synchro_entry *entry;
> - entry = synchro_entry_new(row, &req);
> - if (entry == NULL)
> - goto err;
> -
> - if (journal_write_try_async(&entry->journal_entry) != 0) {
> + struct synchro_entry entry;
> + /*
> + * Rows array is cast from *[] to **, because otherwise g++ complains
> + * about out of array bounds access.
> + */
> + struct xrow_header **rows;
> + rows = entry.base.rows;
> + rows[0] = row;
> ====================
> This is how I fixed the compilation issue. I couldn't reproduce it
> locally with my GCC installed via brew, but it was failing in CI in
> one job. A simple cast didn't help; only creating a new variable
> helped. Also, it is not assigned right away - the declaration is
> separate. Otherwise the 'goto's before that line would not work.
> ====================
> + journal_entry_create(&entry.base, 1, xrow_approx_len(row),
> + apply_synchro_row_cb, &entry);
> + entry.req = &req;
> + entry.owner = fiber();
> + /*
> + * The WAL write is blocking. Otherwise it might happen that a CONFIRM
> + * or ROLLBACK is sent to WAL, and it would empty the limbo, but before
> + * it is written, more transactions arrive with a different owner. They
> + * won't be able to enter the limbo due to owner ID mismatch. Hence the
> + * synchro rows must block receipt of new transactions.
> + *
> + * Don't forget to return -1 both if the journal write failed right
> + * away, and if it failed inside of WAL thread (res < 0). Otherwise the
> + * caller would propagate committed vclock to this row thinking it was
> + * a success.
> + *
> + * XXX: in theory it could be done vice-versa. The write could be made
> + * non-blocking, and instead the potentially conflicting transactions
> + * could try to wait for all the current synchro WAL writes to end
> + * before trying to commit. But that requires extra steps from the
> + * transactions side, including the async ones.
> + */
> + if (journal_write(&entry.base) != 0 || entry.base.res < 0) {
> diag_set(ClientError, ER_WAL_IO);
> goto err;
> }
> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
> index cf0ad9350..c63211191 100644
> --- a/src/box/txn_limbo.c
> +++ b/src/box/txn_limbo.c
> @@ -400,20 +400,11 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
> txn_limbo_remove(limbo, e);
> txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK);
> /*
> - * If already written to WAL by now, finish tx processing.
> - * Otherwise just clear the sync flags. Tx procesing will finish
> - * automatically once the tx is written to WAL.
> - *
> - * XXX: Normally at this point all transactions covered by this
> - * CONFIRM should be in WAL already, but there is a bug, that
> - * replica always processes received synchro requests *before*
> - * writing them to WAL. So it can happen, that a CONFIRM is
> - * 'read', but the transaction is not written yet. Should be
> - * fixed when the replica will behave properly, and then this
> - * branch won't exist.
> + * Should be written to WAL by now. Confirm is always written
> + * after the affected transactions.
> */
> - if (e->txn->signature >= 0)
> - txn_complete_success(e->txn);
> + assert(e->txn->signature >= 0);
> + txn_complete_success(e->txn);
> }
> /* Update is_ro once the limbo is clear. */
> if (txn_limbo_is_empty(limbo))
> @@ -455,30 +446,13 @@ txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
> rlist_foreach_entry_safe_reverse(e, &limbo->queue, in_queue, tmp) {
> txn_limbo_abort(limbo, e);
> txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK);
> - if (e->txn->signature >= 0) {
> - /* Rollback the transaction. */
> - e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK;
> - txn_complete_fail(e->txn);
> - } else {
> - /*
> - * Rollback the transaction, but don't free it yet. It
> - * will be freed after its WAL write is completed.
> - *
> - * XXX: Normally at this point all transactions covered
> - * by this ROLLBACK should be in WAL already, but there
> - * is a bug, that replica always processes received
> - * synchro requests *before* writing them to WAL. So it
> - * can happen, that a ROLLBACK is 'read', but the
> - * transaction is not written yet. Should be fixed when
> - * the replica will behave properly, and then this
> - * branch won't exist.
> - */
> - e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK;
> - struct fiber *fiber = e->txn->fiber;
> - e->txn->fiber = fiber();
> - txn_complete_fail(e->txn);
> - e->txn->fiber = fiber;
> - }
> + /*
> + * Should be written to WAL by now. Rollback is always written
> + * after the affected transactions.
> + */
> + assert(e->txn->signature >= 0);
> + e->txn->signature = TXN_SIGNATURE_SYNC_ROLLBACK;
> + txn_complete_fail(e->txn);
> if (e == last_rollback)
> break;
> }
> diff --git a/test/replication/gh-5213-qsync-applier-order-3.result b/test/replication/gh-5213-qsync-applier-order-3.result
> new file mode 100644
> index 000000000..bcb18b5c0
> --- /dev/null
> +++ b/test/replication/gh-5213-qsync-applier-order-3.result
> @@ -0,0 +1,263 @@
> +-- test-run result file version 2
> +--
> +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
> +-- As a result it could happen that the transactions became visible on CONFIRM,
> +-- then somehow weren't written to WAL, and on restart the data might not be
> +-- visible again. Which means rollback of confirmed data and is not acceptable
> +-- (on the contrary with commit after rollback).
> +--
> +-- To fix that there was a patch making synchro rows processing after WAL write.
> +-- As a result, the following situation could happen. Instance 1 owns the limbo,
> +-- instances 2 and 3 know about that. The limbo is not empty. Now instance 1
> +-- writes CONFIRM and sends it to the instances 2 and 3. Both start a WAL write
> +-- for the CONFIRM. Now instance 3 finishes it, creates a new synchro
> +-- transaction, owns the local limbo, and sends the transaction to the instance
> +-- 2. Here the CONFIRM WAL write is not done yet. It should not happen, that
> +-- the instance 3's new transaction is rejected. Because firstly instance 3 will
> +-- send the same instance 1's CONFIRM to the instance 2 due to it being earlier
> +-- in WAL of instance 3. Then on instance 2 it will block on a latch with
> +-- replica_id 1 until the original CONFIRM received from the instance 1 is
> +-- finished. Afterwards the instance 3's transaction is applied just fine - the
> +-- limbo on instance 2 is empty now.
> +--
> +-- It is not related to read-views, but could break the replication.
> +--
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +fiber = require('fiber')
> + | ---
> + | ...
> +old_synchro_quorum = box.cfg.replication_synchro_quorum
> + | ---
> + | ...
> +old_synchro_timeout = box.cfg.replication_synchro_timeout
> + | ---
> + | ...
> +
> +box.schema.user.grant('guest', 'super')
> + | ---
> + | ...
> +
> +s = box.schema.space.create('test', {is_sync = true})
> + | ---
> + | ...
> +_ = s:create_index('pk')
> + | ---
> + | ...
> +
> +test_run:cmd('create server replica1 with rpl_master=default,\
> + script="replication/replica1.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server replica1')
> + | ---
> + | - true
> + | ...
> +
> +test_run:cmd('create server replica2 with rpl_master=default,\
> + script="replication/replica2.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server replica2')
> + | ---
> + | - true
> + | ...
> +
> +-- Build mutual replication between replica1 and replica2.
> +test_run:switch('replica1')
> + | ---
> + | - true
> + | ...
> +replication = box.cfg.replication
> + | ---
> + | ...
> +table.insert(replication, test_run:eval('replica2', 'return box.cfg.listen')[1])
> + | ---
> + | ...
> +box.cfg{replication = {}}
> + | ---
> + | ...
> +box.cfg{replication = replication}
> + | ---
> + | ...
> +
> +test_run:switch('replica2')
> + | ---
> + | - true
> + | ...
> +replication = box.cfg.replication
> + | ---
> + | ...
> +table.insert(replication, test_run:eval('replica1', 'return box.cfg.listen')[1])
> + | ---
> + | ...
> +box.cfg{replication = {}}
> + | ---
> + | ...
> +box.cfg{replication = replication}
> + | ---
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +fiber = require('fiber')
> + | ---
> + | ...
> +box.cfg{ \
> + replication_synchro_quorum = 4, \
> + replication_synchro_timeout = 1000, \
> +}
> + | ---
> + | ...
> +-- Send a transaction to all 3 nodes. The limbo is owned by the default node
> +-- everywhere.
> +f = fiber.new(function() s:replace{1} end)
> + | ---
> + | ...
> +test_run:wait_lsn('replica1', 'default')
> + | ---
> + | ...
> +test_run:wait_lsn('replica2', 'default')
> + | ---
> + | ...
> +
> +-- Make so the replica1 will apply CONFIRM from the default instance for a long
> +-- time.
> +test_run:switch('replica1')
> + | ---
> + | - true
> + | ...
> +box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
> + | ---
> + | - ok
> + | ...
> +
> +-- Emit the CONFIRM.
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +box.cfg{replication_synchro_quorum = 3}
> + | ---
> + | ...
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> + | ---
> + | - true
> + | ...
> +
> +-- It hangs on the replica1.
> +test_run:switch('replica1')
> + | ---
> + | - true
> + | ...
> +test_run:wait_cond(function() \
> + return box.error.injection.get('ERRINJ_WAL_DELAY') \
> +end)
> + | ---
> + | - true
> + | ...
> +
> +-- But is applied on the replica2. The limbo is empty here now.
> +test_run:switch('replica2')
> + | ---
> + | - true
> + | ...
> +test_run:wait_lsn('replica2', 'default')
> + | ---
> + | ...
> +box.cfg{ \
> + replication_synchro_quorum = 2, \
> + replication_synchro_timeout = 1000, \
> +}
> + | ---
> + | ...
> +-- Replica2 takes the limbo ownership and sends the transaction to the replica1.
> +-- Along with the CONFIRM from the default node, which is still not applied
> +-- on the replica1.
> +fiber = require('fiber')
> + | ---
> + | ...
> +f = fiber.new(function() box.space.test:replace{2} end)
> + | ---
> + | ...
> +
> +test_run:switch('replica1')
> + | ---
> + | - true
> + | ...
> +fiber = require('fiber')
> + | ---
> + | ...
> +-- WAL write of the CONFIRM from the default node still is not done. Give it
> +-- some time to get the new rows from the replica2 and block on the latch.
> +-- Can't catch it anyhow via conds, so the only way is to sleep a bit.
> +fiber.sleep(0.1)
> + | ---
> + | ...
> +-- Let the WAL writes finish. Firstly CONFIRM is finished, the limbo is emptied
> +-- and the replica_id 1 latch is unlocked. Now the replica2 transaction is
> +-- applied and persisted.
> +box.error.injection.set('ERRINJ_WAL_DELAY', false)
> + | ---
> + | - ok
> + | ...
> +test_run:wait_lsn('replica1', 'replica2')
> + | ---
> + | ...
> +box.space.test:get({2})
> + | ---
> + | - [2]
> + | ...
> +
> +-- Ensure the replication works fine, nothing is broken.
> +test_run:wait_upstream(test_run:get_server_id('replica2'), {status = 'follow'})
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('replica2')
> + | ---
> + | - true
> + | ...
> +test_run:wait_upstream(test_run:get_server_id('replica1'), {status = 'follow'})
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server replica1')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server replica1')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server replica2')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server replica2')
> + | ---
> + | - true
> + | ...
> +s:drop()
> + | ---
> + | ...
> +box.schema.user.revoke('guest', 'super')
> + | ---
> + | ...
> +box.cfg{ \
> + replication_synchro_quorum = old_synchro_quorum, \
> + replication_synchro_timeout = old_synchro_timeout, \
> +}
> + | ---
> + | ...
> diff --git a/test/replication/gh-5213-qsync-applier-order-3.test.lua b/test/replication/gh-5213-qsync-applier-order-3.test.lua
> new file mode 100644
> index 000000000..37b569da7
> --- /dev/null
> +++ b/test/replication/gh-5213-qsync-applier-order-3.test.lua
> @@ -0,0 +1,125 @@
> +--
> +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
> +-- As a result it could happen that the transactions became visible on CONFIRM,
> +-- then somehow weren't written to WAL, and on restart the data might not be
> +-- visible again. Which means rollback of confirmed data and is not acceptable
> +-- (on the contrary with commit after rollback).
> +--
> +-- To fix that there was a patch making synchro rows processing after WAL write.
> +-- As a result, the following situation could happen. Instance 1 owns the limbo,
> +-- instances 2 and 3 know about that. The limbo is not empty. Now instance 1
> +-- writes CONFIRM and sends it to the instances 2 and 3. Both start a WAL write
> +-- for the CONFIRM. Now instance 3 finishes it, creates a new synchro
> +-- transaction, owns the local limbo, and sends the transaction to the instance
> +-- 2. Here the CONFIRM WAL write is not done yet. It should not happen, that
> +-- the instance 3's new transaction is rejected. Because firstly instance 3 will
> +-- send the same instance 1's CONFIRM to the instance 2 due to it being earlier
> +-- in WAL of instance 3. Then on instance 2 it will block on a latch with
> +-- replica_id 1 until the original CONFIRM received from the instance 1 is
> +-- finished. Afterwards the instance 3's transaction is applied just fine - the
> +-- limbo on instance 2 is empty now.
> +--
> +-- It is not related to read-views, but could break the replication.
> +--
> +test_run = require('test_run').new()
> +fiber = require('fiber')
> +old_synchro_quorum = box.cfg.replication_synchro_quorum
> +old_synchro_timeout = box.cfg.replication_synchro_timeout
> +
> +box.schema.user.grant('guest', 'super')
> +
> +s = box.schema.space.create('test', {is_sync = true})
> +_ = s:create_index('pk')
> +
> +test_run:cmd('create server replica1 with rpl_master=default,\
> + script="replication/replica1.lua"')
> +test_run:cmd('start server replica1')
> +
> +test_run:cmd('create server replica2 with rpl_master=default,\
> + script="replication/replica2.lua"')
> +test_run:cmd('start server replica2')
> +
> +-- Build mutual replication between replica1 and replica2.
> +test_run:switch('replica1')
> +replication = box.cfg.replication
> +table.insert(replication, test_run:eval('replica2', 'return box.cfg.listen')[1])
> +box.cfg{replication = {}}
> +box.cfg{replication = replication}
> +
> +test_run:switch('replica2')
> +replication = box.cfg.replication
> +table.insert(replication, test_run:eval('replica1', 'return box.cfg.listen')[1])
> +box.cfg{replication = {}}
> +box.cfg{replication = replication}
> +
> +test_run:switch('default')
> +fiber = require('fiber')
> +box.cfg{ \
> + replication_synchro_quorum = 4, \
> + replication_synchro_timeout = 1000, \
> +}
> +-- Send a transaction to all 3 nodes. The limbo is owned by the default node
> +-- everywhere.
> +f = fiber.new(function() s:replace{1} end)
> +test_run:wait_lsn('replica1', 'default')
> +test_run:wait_lsn('replica2', 'default')
> +
> +-- Make so the replica1 will apply CONFIRM from the default instance for a long
> +-- time.
> +test_run:switch('replica1')
> +box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
> +
> +-- Emit the CONFIRM.
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 3}
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> +
> +-- It hangs on the replica1.
> +test_run:switch('replica1')
> +test_run:wait_cond(function() \
> + return box.error.injection.get('ERRINJ_WAL_DELAY') \
> +end)
> +
> +-- But is applied on the replica2. The limbo is empty here now.
> +test_run:switch('replica2')
> +test_run:wait_lsn('replica2', 'default')
> +box.cfg{ \
> + replication_synchro_quorum = 2, \
> + replication_synchro_timeout = 1000, \
> +}
> +-- Replica2 takes the limbo ownership and sends the transaction to the replica1.
> +-- Along with the CONFIRM from the default node, which is still not applied
> +-- on the replica1.
> +fiber = require('fiber')
> +f = fiber.new(function() box.space.test:replace{2} end)
> +
> +test_run:switch('replica1')
> +fiber = require('fiber')
> +-- WAL write of the CONFIRM from the default node still is not done. Give it
> +-- some time to get the new rows from the replica2 and block on the latch.
> +-- Can't catch it anyhow via conds, so the only way is to sleep a bit.
> +fiber.sleep(0.1)
> +-- Let the WAL writes finish. Firstly CONFIRM is finished, the limbo is emptied
> +-- and the replica_id 1 latch is unlocked. Now the replica2 transaction is
> +-- applied and persisted.
> +box.error.injection.set('ERRINJ_WAL_DELAY', false)
> +test_run:wait_lsn('replica1', 'replica2')
> +box.space.test:get({2})
> +
> +-- Ensure the replication works fine, nothing is broken.
> +test_run:wait_upstream(test_run:get_server_id('replica2'), {status = 'follow'})
> +
> +test_run:switch('replica2')
> +test_run:wait_upstream(test_run:get_server_id('replica1'), {status = 'follow'})
> +
> +test_run:switch('default')
> +test_run:cmd('stop server replica1')
> +test_run:cmd('delete server replica1')
> +test_run:cmd('stop server replica2')
> +test_run:cmd('delete server replica2')
> +s:drop()
> +box.schema.user.revoke('guest', 'super')
> +box.cfg{ \
> + replication_synchro_quorum = old_synchro_quorum, \
> + replication_synchro_timeout = old_synchro_timeout, \
> +}
> diff --git a/test/replication/gh-5213-qsync-applier-order.result b/test/replication/gh-5213-qsync-applier-order.result
> new file mode 100644
> index 000000000..a8c24c289
> --- /dev/null
> +++ b/test/replication/gh-5213-qsync-applier-order.result
> @@ -0,0 +1,302 @@
> +-- test-run result file version 2
> +--
> +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
> +-- As a result it could happen that the transactions became visible on CONFIRM,
> +-- then somehow weren't written to WAL, and on restart the data might not be
> +-- visible again. Which means rollback of confirmed data and is not acceptable
> +-- (on the contrary with commit after rollback).
> +--
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +fiber = require('fiber')
> + | ---
> + | ...
> +old_synchro_quorum = box.cfg.replication_synchro_quorum
> + | ---
> + | ...
> +old_synchro_timeout = box.cfg.replication_synchro_timeout
> + | ---
> + | ...
> +
> +box.schema.user.grant('guest', 'super')
> + | ---
> + | ...
> +
> +s = box.schema.space.create('test', {is_sync = true})
> + | ---
> + | ...
> +_ = s:create_index('pk')
> + | ---
> + | ...
> +
> +test_run:cmd('create server replica with rpl_master=default,\
> + script="replication/gh-5213-replica.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server replica')
> + | ---
> + | - true
> + | ...
> +
> +fiber = require('fiber')
> + | ---
> + | ...
> +box.cfg{ \
> + replication_synchro_quorum = 3, \
> + replication_synchro_timeout = 1000, \
> +}
> + | ---
> + | ...
> +f = fiber.new(function() s:replace{1} end)
> + | ---
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +-- Wait when the transaction is written to WAL.
> +test_run:wait_lsn('replica', 'default')
> + | ---
> + | ...
> +s = box.space.test
> + | ---
> + | ...
> +-- But not visible yet. Because MVCC is on, and no CONFIRM yet.
> +assert(s:get({1}) == nil)
> + | ---
> + | - true
> + | ...
> +-- Block the incoming CONFIRM to be able to ensure the data is not visible until
> +-- WAL write ends.
> +box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
> + | ---
> + | - ok
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +box.cfg{replication_synchro_quorum = 2}
> + | ---
> + | ...
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +-- Wait when CONFIRM is in the WAL thread.
> +test_run:wait_cond(function() \
> + return box.error.injection.get('ERRINJ_WAL_DELAY') \
> +end)
> + | ---
> + | - true
> + | ...
> +assert(s:get({1}) == nil)
> + | ---
> + | - true
> + | ...
> +box.error.injection.set('ERRINJ_WAL_DELAY', false)
> + | ---
> + | - ok
> + | ...
> +-- After CONFIRM is in WAL, the transaction is committed and its data is
> +-- visible.
> +test_run:wait_cond(function() return s:get({1}) ~= nil end)
> + | ---
> + | - true
> + | ...
> +
> +--
> +-- Ensure CONFIRM WAL write fail also works fine when couldn't even start a WAL
> +-- write. The data remains not confirmed until WAL write succeeds.
> +--
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +box.cfg{replication_synchro_quorum = 3}
> + | ---
> + | ...
> +f = fiber.new(function() s:replace{2} end)
> + | ---
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +test_run:wait_lsn('replica', 'default')
> + | ---
> + | ...
> +assert(s:get({2}) == nil)
> + | ---
> + | - true
> + | ...
> +-- Make journal write fail immediately.
> +box.error.injection.set('ERRINJ_WAL_IO', true)
> + | ---
> + | - ok
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +box.cfg{replication_synchro_quorum = 2}
> + | ---
> + | ...
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +test_run:wait_upstream(1, {status = 'stopped'})
> + | ---
> + | - true
> + | ...
> +box.error.injection.set('ERRINJ_WAL_IO', false)
> + | ---
> + | - ok
> + | ...
> +assert(s:get({2}) == nil)
> + | ---
> + | - true
> + | ...
> +-- Re-subscribe.
> +replication = box.cfg.replication
> + | ---
> + | ...
> +box.cfg{replication = {}}
> + | ---
> + | ...
> +box.cfg{replication = replication}
> + | ---
> + | ...
> +test_run:wait_lsn('replica', 'default')
> + | ---
> + | ...
> +test_run:wait_upstream(1, {status = 'follow'})
> + | ---
> + | - true
> + | ...
> +assert(s:get({2}) ~= nil)
> + | ---
> + | - true
> + | ...
> +
> +--
> +-- Ensure CONFIRM WAL write fail works fine when an error happens inside WAL
> +-- thread. The data remains not confirmed until WAL write succeeds.
> +--
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +box.cfg{replication_synchro_quorum = 3}
> + | ---
> + | ...
> +f = fiber.new(function() s:replace{3} end)
> + | ---
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +test_run:wait_lsn('replica', 'default')
> + | ---
> + | ...
> +assert(s:get({3}) == nil)
> + | ---
> + | - true
> + | ...
> +-- Journal write start is going to succeed, but it will fail later on return
> +-- from the WAL thread.
> +box.error.injection.set('ERRINJ_WAL_ROTATE', true)
> + | ---
> + | - ok
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +box.cfg{replication_synchro_quorum = 2}
> + | ---
> + | ...
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +test_run:wait_upstream(1, {status = 'stopped'})
> + | ---
> + | - true
> + | ...
> +box.error.injection.set('ERRINJ_WAL_ROTATE', false)
> + | ---
> + | - ok
> + | ...
> +assert(s:get({3}) == nil)
> + | ---
> + | - true
> + | ...
> +box.cfg{replication = {}}
> + | ---
> + | ...
> +box.cfg{replication = replication}
> + | ---
> + | ...
> +test_run:wait_lsn('replica', 'default')
> + | ---
> + | ...
> +test_run:wait_upstream(1, {status = 'follow'})
> + | ---
> + | - true
> + | ...
> +assert(s:get({3}) ~= nil)
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server replica')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server replica')
> + | ---
> + | - true
> + | ...
> +
> +s:drop()
> + | ---
> + | ...
> +box.schema.user.revoke('guest', 'super')
> + | ---
> + | ...
> +box.cfg{ \
> + replication_synchro_quorum = old_synchro_quorum, \
> + replication_synchro_timeout = old_synchro_timeout, \
> +}
> + | ---
> + | ...
> diff --git a/test/replication/gh-5213-qsync-applier-order.test.lua b/test/replication/gh-5213-qsync-applier-order.test.lua
> new file mode 100644
> index 000000000..f1eccfa84
> --- /dev/null
> +++ b/test/replication/gh-5213-qsync-applier-order.test.lua
> @@ -0,0 +1,122 @@
> +--
> +-- gh-5213: applier used to process CONFIRM/ROLLBACK before writing them to WAL.
> +-- As a result it could happen that the transactions became visible on CONFIRM,
> +-- then somehow weren't written to WAL, and on restart the data might not be
> +-- visible again. Which means rollback of confirmed data and is not acceptable
> +-- (on the contrary with commit after rollback).
> +--
> +test_run = require('test_run').new()
> +fiber = require('fiber')
> +old_synchro_quorum = box.cfg.replication_synchro_quorum
> +old_synchro_timeout = box.cfg.replication_synchro_timeout
> +
> +box.schema.user.grant('guest', 'super')
> +
> +s = box.schema.space.create('test', {is_sync = true})
> +_ = s:create_index('pk')
> +
> +test_run:cmd('create server replica with rpl_master=default,\
> + script="replication/gh-5213-replica.lua"')
> +test_run:cmd('start server replica')
> +
> +fiber = require('fiber')
> +box.cfg{ \
> + replication_synchro_quorum = 3, \
> + replication_synchro_timeout = 1000, \
> +}
> +f = fiber.new(function() s:replace{1} end)
> +
> +test_run:switch('replica')
> +-- Wait when the transaction is written to WAL.
> +test_run:wait_lsn('replica', 'default')
> +s = box.space.test
> +-- But not visible yet. Because MVCC is on, and no CONFIRM yet.
> +assert(s:get({1}) == nil)
> +-- Block the incoming CONFIRM to be able to ensure the data is not visible until
> +-- WAL write ends.
> +box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0)
> +
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 2}
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> +
> +test_run:switch('replica')
> +-- Wait when CONFIRM is in the WAL thread.
> +test_run:wait_cond(function() \
> + return box.error.injection.get('ERRINJ_WAL_DELAY') \
> +end)
> +assert(s:get({1}) == nil)
> +box.error.injection.set('ERRINJ_WAL_DELAY', false)
> +-- After CONFIRM is in WAL, the transaction is committed and its data is
> +-- visible.
> +test_run:wait_cond(function() return s:get({1}) ~= nil end)
> +
> +--
> +-- Ensure CONFIRM WAL write fail also works fine when couldn't even start a WAL
> +-- write. The data remains not confirmed until WAL write succeeds.
> +--
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 3}
> +f = fiber.new(function() s:replace{2} end)
> +
> +test_run:switch('replica')
> +test_run:wait_lsn('replica', 'default')
> +assert(s:get({2}) == nil)
> +-- Make journal write fail immediately.
> +box.error.injection.set('ERRINJ_WAL_IO', true)
> +
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 2}
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> +
> +test_run:switch('replica')
> +test_run:wait_upstream(1, {status = 'stopped'})
> +box.error.injection.set('ERRINJ_WAL_IO', false)
> +assert(s:get({2}) == nil)
> +-- Re-subscribe.
> +replication = box.cfg.replication
> +box.cfg{replication = {}}
> +box.cfg{replication = replication}
> +test_run:wait_lsn('replica', 'default')
> +test_run:wait_upstream(1, {status = 'follow'})
> +assert(s:get({2}) ~= nil)
> +
> +--
> +-- Ensure CONFIRM WAL write fail works fine when an error happens inside WAL
> +-- thread. The data remains not confirmed until WAL write succeeds.
> +--
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 3}
> +f = fiber.new(function() s:replace{3} end)
> +
> +test_run:switch('replica')
> +test_run:wait_lsn('replica', 'default')
> +assert(s:get({3}) == nil)
> +-- Journal write start is going to succeed, but it will fail later on return
> +-- from the WAL thread.
> +box.error.injection.set('ERRINJ_WAL_ROTATE', true)
> +
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 2}
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> +
> +test_run:switch('replica')
> +test_run:wait_upstream(1, {status = 'stopped'})
> +box.error.injection.set('ERRINJ_WAL_ROTATE', false)
> +assert(s:get({3}) == nil)
> +box.cfg{replication = {}}
> +box.cfg{replication = replication}
> +test_run:wait_lsn('replica', 'default')
> +test_run:wait_upstream(1, {status = 'follow'})
> +assert(s:get({3}) ~= nil)
> +
> +test_run:switch('default')
> +test_run:cmd('stop server replica')
> +test_run:cmd('delete server replica')
> +
> +s:drop()
> +box.schema.user.revoke('guest', 'super')
> +box.cfg{ \
> + replication_synchro_quorum = old_synchro_quorum, \
> + replication_synchro_timeout = old_synchro_timeout, \
> +}
> diff --git a/test/replication/gh-5213-replica.lua b/test/replication/gh-5213-replica.lua
> new file mode 100644
> index 000000000..3ba90d7ab
> --- /dev/null
> +++ b/test/replication/gh-5213-replica.lua
> @@ -0,0 +1,10 @@
> +#!/usr/bin/env tarantool
> +
> +box.cfg({
> + listen = os.getenv("LISTEN"),
> + replication = os.getenv("MASTER"),
> + read_only = true,
> + memtx_use_mvcc_engine = true,
> +})
> +
> +require('console').listen(os.getenv('ADMIN'))
> diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
> index aff5fda26..4a9ca0a46 100644
> --- a/test/replication/suite.cfg
> +++ b/test/replication/suite.cfg
> @@ -15,6 +15,8 @@
> "gh-3760-misc-return-on-quorum-0.test.lua": {},
> "gh-4399-misc-no-failure-on-error-reading-wal.test.lua": {},
> "gh-4424-misc-orphan-on-reconfiguration-error.test.lua": {},
> + "gh-5213-qsync-applier-order.test.lua": {},
> + "gh-5213-qsync-applier-order-3.test.lua": {},
> "gh-5426-election-on-off.test.lua": {},
> "gh-5433-election-restart-recovery.test.lua": {},
> "gh-5506-election-on-off.test.lua": {},
> diff --git a/test/replication/suite.ini b/test/replication/suite.ini
> index a9e44e8cf..1d9c0a4ae 100644
> --- a/test/replication/suite.ini
> +++ b/test/replication/suite.ini
> @@ -3,7 +3,7 @@ core = tarantool
> script = master.lua
> description = tarantool/box, replication
> disabled = consistent.test.lua
> -release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua
> +release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua
> config = suite.cfg
> lua_libs = lua/fast_replica.lua lua/rlimit.lua
> use_unix_sockets = True
--
Serge Petrenko