From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <v.shpilevoy@tarantool.org>
Received: from smtp5.mail.ru (smtp5.mail.ru [94.100.179.24])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by dev.tarantool.org (Postfix) with ESMTPS id 6D3DA445320
 for <tarantool-patches@dev.tarantool.org>;
 Sat, 25 Jul 2020 19:08:54 +0300 (MSK)
From: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Date: Sat, 25 Jul 2020 18:08:52 +0200
Message-Id: <5ba9881518d88918e9a7ab267e093ad9ceb50a5e.1595693237.git.v.shpilevoy@tarantool.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Subject: [Tarantool-patches] [PATCH 1/1] txn_limbo: introduce cascading
	rollback
List-Id: Tarantool development patches <tarantool-patches.dev.tarantool.org>
List-Unsubscribe: <https://lists.tarantool.org/mailman/options/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=unsubscribe>
List-Archive: <https://lists.tarantool.org/pipermail/tarantool-patches/>
List-Post: <mailto:tarantool-patches@dev.tarantool.org>
List-Help: <mailto:tarantool-patches-request@dev.tarantool.org?subject=help>
List-Subscribe: <https://lists.tarantool.org/mailman/listinfo/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=subscribe>
To: tarantool-patches@dev.tarantool.org, gorcunov@gmail.com

Cascading rollback is a state when existing transactions are being
rolled back right now, and newer transactions can't be committed
as well. To preserve the 'reversed rollback order' rule.

WAL writer can enter such state when something goes wrong with
writing to disk. Limbo didn't have that feature until now.

Consider an example why limbo should be able to turn on cascading
rollback. Without cascading rollback it can happen that a
transaction is seemingly rolled back, but after restart it is
committed and visible. The scenario:

    * Master writes a sync transaction to WAL with LSN1;

    * It starts waiting for ACKs;

    * No ACKs for timeout - it starts writing to WAL the command
      ROLLBACK(LSN1). To rollback everything with LSN >= LSN1
      but < LSN of the ROLLBACK record itself;

    * Another fiber starts a new transaction, while ROLLBACK is in
      progress;

    * Limbo is not empty, so the new transaction is added there.
      Then it also starts writing itself to WAL;

    * ROLLBACK finishes WAL write. It rolls back all the
      transactions in the limbo to conform with the 'reversed
      rollback order' rule. Including the latest transaction;

    * The latest transaction finished its WAL write with LSN2 and
      sees that it was rolled back by the limbo already.

All seems to be fine, but actually what happened is that
ROLLBACK(LSN1) is written to WAL *before* the latest transaction
with LSN2. Now when restart happens, ROLLBACK(LSN1) is replayed
first, and then the latest LSN2 transaction is replayed second -
it will be committed successfully, and will be visible.

On the summary: transaction canceled its rollback after instance
restart. Expected behaviour is that while ROLLBACK is in progress,
all newer transactions should not even try going to WAL. They
should be rolled back immediately.

The patch implements the cascading rollback for the limbo.

Closes #5140
---
Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5140-qsync-casc-rollback
Issue: https://github.com/tarantool/tarantool/issues/5140

@ChangeLog
* Fixed a bug when a rolled back synchronous transaction could become committed after restart (gh-5140).

 src/box/txn_limbo.c                           |  25 ++++-
 src/box/txn_limbo.h                           |  17 +++
 src/box/wal.c                                 |   6 +
 src/lib/core/errinj.h                         |   1 +
 .../gh-5140-qsync-casc-rollback.test.lua      | 105 ++++++++++++++++++
 5 files changed, 153 insertions(+), 1 deletion(-)
 create mode 100644 test/replication/gh-5140-qsync-casc-rollback.test.lua

diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index a74bfe244..72c8d572e 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -42,12 +42,31 @@ txn_limbo_create(struct txn_limbo *limbo)
 	fiber_cond_create(&limbo->wait_cond);
 	vclock_create(&limbo->vclock);
 	limbo->rollback_count = 0;
+	limbo->is_in_rollback = false;
 }
 
 struct txn_limbo_entry *
 txn_limbo_append(struct txn_limbo *limbo, uint32_t id, struct txn *txn)
 {
 	assert(txn_has_flag(txn, TXN_WAIT_SYNC));
+	/*
+	 * Transactions should be added to the limbo before WAL write. Limbo
+	 * needs that to be able rollback transactions, whose WAL write is in
+	 * progress.
+	 */
+	assert(txn->signature < 0);
+	if (limbo->is_in_rollback) {
+		/*
+		 * Cascading rollback. It is impossible to commit the
+		 * transaction, because if there is an existing rollback in
+		 * progress, it should rollback this one too for the sake of
+		 * 'reversed rollback order' rule. On the other hand the
+		 * rollback can't be postponed until after WAL write as well -
+		 * it should be done right now. See in the limbo comments why.
+		 */
+		diag_set(ClientError, ER_SYNC_ROLLBACK);
+		return NULL;
+	}
 	if (id == 0)
 		id = instance_id;
 	if (limbo->instance_id != id) {
@@ -186,6 +205,7 @@ txn_limbo_wait_complete(struct txn_limbo *limbo, struct txn_limbo_entry *entry)
 do_rollback:
 	assert(!txn_limbo_is_empty(limbo));
 	if (txn_limbo_first_entry(limbo) != entry) {
+		assert(limbo->is_in_rollback);
 		/*
 		 * If this is not a first entry in the limbo, it
 		 * is definitely not a first timed out entry. And
@@ -341,7 +361,10 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
 static int
 txn_limbo_write_rollback(struct txn_limbo *limbo, int64_t lsn)
 {
-	return txn_limbo_write_confirm_rollback(limbo, lsn, false);
+	limbo->is_in_rollback = true;
+	int rc = txn_limbo_write_confirm_rollback(limbo, lsn, false);
+	limbo->is_in_rollback = false;
+	return rc;
 }
 
 void
diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
index 8cfb490c3..d2a689e5c 100644
--- a/src/box/txn_limbo.h
+++ b/src/box/txn_limbo.h
@@ -131,6 +131,23 @@ struct txn_limbo {
 	 * in the end.
 	 */
 	int64_t rollback_count;
+	/**
+	 * Whether the limbo is in rollback mode. The meaning is exactly the
+	 * same as for the similar WAL flag. In theory this should be deleted
+	 * if the limbo will be ever moved to WAL thread. It would reuse the WAL
+	 * flag.
+	 * It is a sign to immediately rollback all new limbo entries, if there
+	 * is an existing rollback in progress. This technique is called
+	 * 'cascading rollback'. Cascading rollback does not allow to write to
+	 * WAL anything new so as not to violate the 'reversed rollback order'
+	 * rule.
+	 * Without cascading rollback it could happen, that the limbo would
+	 * start writing ROLLBACK to WAL, then a new transaction would be added
+	 * to limbo and sent to WAL too. In the result the new transaction would
+	 * be stored in WAL after ROLLBACK, and yet it should be rolled back too
+	 * by the 'reversed rollback order' rule - contradiction.
+	 */
+	bool is_in_rollback;
 };
 
 /**
diff --git a/src/box/wal.c b/src/box/wal.c
index 37a8bd483..220e68245 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -1030,6 +1030,12 @@ wal_write_to_disk(struct cmsg *msg)
 
 	ERROR_INJECT_SLEEP(ERRINJ_WAL_DELAY);
 
+	ERROR_INJECT_COUNTDOWN(ERRINJ_WAL_DELAY_COUNTDOWN, {
+		struct errinj *e = errinj(ERRINJ_WAL_DELAY, ERRINJ_BOOL);
+		e->bparam = true;
+		ERROR_INJECT_SLEEP(ERRINJ_WAL_DELAY);
+	});
+
 	if (writer->is_in_rollback) {
 		/* We're rolling back a failed write. */
 		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
diff --git a/src/lib/core/errinj.h b/src/lib/core/errinj.h
index 76b453003..aace8736f 100644
--- a/src/lib/core/errinj.h
+++ b/src/lib/core/errinj.h
@@ -80,6 +80,7 @@ struct errinj {
 	_(ERRINJ_WAL_WRITE_DISK, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_WAL_WRITE_EOF, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_WAL_DELAY, ERRINJ_BOOL, {.bparam = false}) \
+	_(ERRINJ_WAL_DELAY_COUNTDOWN, ERRINJ_INT, {.iparam = -1}) \
 	_(ERRINJ_WAL_FALLOCATE, ERRINJ_INT, {.iparam = 0}) \
 	_(ERRINJ_INDEX_ALLOC, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_TUPLE_ALLOC, ERRINJ_BOOL, {.bparam = false}) \
diff --git a/test/replication/gh-5140-qsync-casc-rollback.test.lua b/test/replication/gh-5140-qsync-casc-rollback.test.lua
new file mode 100644
index 000000000..69fc9ad02
--- /dev/null
+++ b/test/replication/gh-5140-qsync-casc-rollback.test.lua
@@ -0,0 +1,105 @@
+test_run = require('test_run').new()
+engine = test_run:get_cfg('engine')
+fiber = require('fiber')
+--
+-- gh-5140: qsync cascading rollback. Without cascading rollback it can happen
+-- that a transaction is seemingly rolled back, but after restart it is
+-- committed and visible. This is how it was possible:
+--
+--     * Master writes a sync transaction to WAL with LSN1;
+--
+--     * It starts waiting for ACKs;
+--
+--     * No ACKs for timeout - it starts writing to WAL the command
+--       ROLLBACK(LSN1). To rollback everything with LSN >= LSN1 but < LSN of
+--       the ROLLBACK record itself;
+--
+--     * Another fiber starts a new transaction, while ROLLBACK is in progress;
+--
+--     * Limbo is not empty, so the new transaction is added there. Then it
+--       also starts writing itself to WAL;
+--
+--     * ROLLBACK finishes WAL write. It rolls back all the transactions in the
+--       limbo to conform with the 'reversed rollback order' rule. Including
+--       the latest transaction;
+--
+--     * The latest transaction finished its WAL write with LSN2 and sees that
+--       it was rolled back by the limbo already.
+--
+-- All seems to be fine, but actually what happened is that ROLLBACK(LSN1) is
+-- written to WAL *before* the latest transaction with LSN2. Now when restart
+-- happens, ROLLBACK(LSN1) is replayed first, and then the latest LSN2
+-- transaction is replayed second - it will be committed successfully, and will
+-- be visible.
+-- On the summary: transaction canceled its rollback after instance restart.
+-- Expected behaviour is that while ROLLBACK is in progress, all newer
+-- transactions should not even try going to WAL. They should be rolled back
+-- immediately.
+--
+box.schema.user.grant('guest', 'super')
+
+test_run:cmd('create server replica with rpl_master=default,			\
+             script="replication/replica.lua"')
+test_run:cmd('start server replica with wait=True, wait_load=True')
+
+box.cfg{replication_synchro_quorum = 2, replication_synchro_timeout = 1000}
+
+_ = box.schema.space.create('sync', {is_sync = true, engine = engine})
+_ = _:create_index('pk')
+_ = box.schema.space.create('async', {is_sync=false, engine = engine})
+_ = _:create_index('pk')
+-- Write something to flush the master state to replica.
+box.space.sync:replace{1}
+
+box.cfg{replication_synchro_quorum = 3, replication_synchro_timeout = 0.001}
+-- First WAL write will be fine. Second will be delayed. In this
+-- test first is the transaction itself. Second is the ROLLBACK
+-- record.
+box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 1)
+ok, err = nil
+f = fiber.create(function()                                                     \
+    ok, err = pcall(box.space.sync.replace, box.space.sync, {2})                \
+end)
+while not box.error.injection.get("ERRINJ_WAL_DELAY") do fiber.sleep(0.001) end
+-- ROLLBACK is in progress now. All newer transactions should be rolled back
+-- immediately until the ROLLBACK record is written, and all the older
+-- transactions are rolled back too. This is needed to preserve the 'reversed
+-- rollback order' rule.
+box.space.sync:replace{3}
+box.space.async:replace{3}
+box.error.injection.set("ERRINJ_WAL_DELAY", false)
+test_run:wait_cond(function() return f:status() == 'dead' end)
+ok, err
+
+box.cfg{replication_synchro_quorum = 2, replication_synchro_timeout = 1000}
+box.space.async:replace{4}
+box.space.sync:replace{4}
+box.space.async:select{}
+box.space.sync:select{}
+
+test_run:switch('replica')
+box.space.async:select{}
+box.space.sync:select{}
+
+test_run:switch('default')
+-- Key to reproduce the cascading rollback not done is to restart. On restart
+-- all the records are replayed one be one without yields for WAL writes, and
+-- nothing should change.
+test_run:cmd('restart server default')
+test_run:cmd('restart server replica')
+
+test_run:switch('replica')
+box.space.async:select{}
+box.space.sync:select{}
+
+test_run:switch('default')
+box.space.async:select{}
+box.space.sync:select{}
+
+box.space.sync:drop()
+box.space.async:drop()
+
+test_run:cmd('stop server replica')
+test_run:cmd('delete server replica')
+
+box.schema.user.revoke('guest', 'super')
-- 
2.21.1 (Apple Git-122.3)