Tarantool development patches archive
 help / color / mirror / Atom feed
From: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
To: tarantool-patches@dev.tarantool.org, gorcunov@gmail.com
Subject: [Tarantool-patches] [PATCH 1/3] txn_limbo: handle ROLLBACK during CONFIRM
Date: Fri, 31 Jul 2020 00:37:43 +0200	[thread overview]
Message-ID: <849ba7dc73a0f8372375a4cc6e0ab45548361a75.1596148501.git.v.shpilevoy@tarantool.org> (raw)
In-Reply-To: <cover.1596148501.git.v.shpilevoy@tarantool.org>

Limbo could try to ROLLBACK LSN whose CONFIRM is in progress. This
is how it could happen:

- A synchronous transaction is created, written to WAL;
- The fiber sleeps in the limbo waiting for CONFIRM or timeout;
- Replica receives the transaction, sends ACK;
- Master receives ACK, starts writing CONFIRM;
- The first fiber times out and tries to write ROLLBACK for the
  LSN, whose CONFIRM is in progress right now.

The patch adds more checks to the 'timed out' code path to see if
it isn't too late to write ROLLBACK. If CONFIRM is in progress,
the fiber will wait for its finish.

Part of #5185
---
 src/box/txn_limbo.c                    | 31 +++++++---
 test/replication/qsync_errinj.result   | 85 ++++++++++++++++++++++++++
 test/replication/qsync_errinj.test.lua | 40 ++++++++++++
 3 files changed, 147 insertions(+), 9 deletions(-)

diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 27a7302c0..1602c5d16 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -200,25 +200,33 @@ txn_limbo_wait_complete(struct txn_limbo *limbo, struct txn_limbo_entry *entry)
 		if (txn_limbo_entry_is_complete(entry))
 			goto complete;
 		if (rc != 0)
-			goto do_rollback;
+			break;
 	}
 
-do_rollback:
 	assert(!txn_limbo_is_empty(limbo));
 	if (txn_limbo_first_entry(limbo) != entry) {
-		assert(limbo->is_in_rollback);
 		/*
 		 * If this is not a first entry in the limbo, it
 		 * is definitely not a first timed out entry. And
 		 * since it managed to time out too, it means
 		 * there is currently another fiber writing
-		 * rollback. Wait when it will finish and wake us
-		 * up.
+		 * rollback, or waiting for confirmation WAL
+		 * write. Wait when it will finish and wake us up.
 		 */
-		do {
-			fiber_yield();
-		} while (!txn_limbo_entry_is_complete(entry));
-		goto complete;
+		goto wait;
+	}
+
+	/* First in the queue is always a synchronous transaction. */
+	assert(entry->lsn > 0);
+	if (entry->lsn <= limbo->confirmed_lsn) {
+		/*
+		 * Yes, the wait timed out, but there is an on-going CONFIRM WAL
+		 * write in another fiber covering this LSN. Can't rollback it
+		 * already. All what can be done is waiting. The CONFIRM writer
+		 * will wakeup all the confirmed txns when WAL write will be
+		 * finished.
+		 */
+		goto wait;
 	}
 
 	txn_limbo_write_rollback(limbo, entry->lsn);
@@ -238,6 +246,11 @@ do_rollback:
 	diag_set(ClientError, ER_SYNC_QUORUM_TIMEOUT);
 	return -1;
 
+wait:
+	do {
+		fiber_yield();
+	} while (!txn_limbo_entry_is_complete(entry));
+
 complete:
 	assert(txn_limbo_entry_is_complete(entry));
 	/*
diff --git a/test/replication/qsync_errinj.result b/test/replication/qsync_errinj.result
index 1c6e357f7..4b8977810 100644
--- a/test/replication/qsync_errinj.result
+++ b/test/replication/qsync_errinj.result
@@ -283,6 +283,91 @@ box.space.sync:select{}
  | - - [2]
  | ...
 
+--
+-- See if synchro timeout during CONFIRM WAL write won't try to rollback the
+-- confirmed transaction.
+--
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.space.sync:truncate()
+ | ---
+ | ...
+-- Write something to flush the master's state to the replica.
+_ = box.space.sync:insert({1})
+ | ---
+ | ...
+_ = box.space.sync:delete({1})
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+-- Set a trap for CONFIRM write so as the current txn won't hang, but following
+-- CONFIRM will.
+box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 1)
+ | ---
+ | - ok
+ | ...
+ok, err = nil
+ | ---
+ | ...
+f = fiber.create(function()                                                     \
+    ok, err = pcall(box.space.sync.replace, box.space.sync, {1})                \
+end)
+ | ---
+ | ...
+
+-- Wait when got ACK from replica and started writing CONFIRM.
+test_run:wait_cond(function()                                                   \
+    return box.error.injection.get("ERRINJ_WAL_DELAY")                          \
+end)
+ | ---
+ | - true
+ | ...
+
+-- Let the pending synchro transaction go. It should timeout, notice the
+-- on-going confirmation, and go to sleep instead of doing rollback.
+box.cfg{replication_synchro_timeout = 0.001}
+ | ---
+ | ...
+fiber.yield()
+ | ---
+ | ...
+
+-- Let the confirmation finish.
+box.error.injection.set("ERRINJ_WAL_DELAY", false)
+ | ---
+ | - ok
+ | ...
+
+-- Now the transaction sees the CONFIRM is ok, and it is also finished.
+test_run:wait_cond(function() return f:status() == 'dead' end)
+ | ---
+ | - true
+ | ...
+ok, err
+ | ---
+ | - true
+ | - [1]
+ | ...
+box.space.sync:select{}
+ | ---
+ | - - [1]
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+box.space.sync:select{}
+ | ---
+ | - - [1]
+ | ...
+
 test_run:cmd('switch default')
  | ---
  | - true
diff --git a/test/replication/qsync_errinj.test.lua b/test/replication/qsync_errinj.test.lua
index da554c775..817b08a25 100644
--- a/test/replication/qsync_errinj.test.lua
+++ b/test/replication/qsync_errinj.test.lua
@@ -105,6 +105,46 @@ box.space.sync:replace{2}
 test_run:switch('replica')
 box.space.sync:select{}
 
+--
+-- See if synchro timeout during CONFIRM WAL write won't try to rollback the
+-- confirmed transaction.
+--
+test_run:switch('default')
+box.space.sync:truncate()
+-- Write something to flush the master's state to the replica.
+_ = box.space.sync:insert({1})
+_ = box.space.sync:delete({1})
+
+test_run:switch('default')
+-- Set a trap for CONFIRM write so as the current txn won't hang, but following
+-- CONFIRM will.
+box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 1)
+ok, err = nil
+f = fiber.create(function()                                                     \
+    ok, err = pcall(box.space.sync.replace, box.space.sync, {1})                \
+end)
+
+-- Wait when got ACK from replica and started writing CONFIRM.
+test_run:wait_cond(function()                                                   \
+    return box.error.injection.get("ERRINJ_WAL_DELAY")                          \
+end)
+
+-- Let the pending synchro transaction go. It should timeout, notice the
+-- on-going confirmation, and go to sleep instead of doing rollback.
+box.cfg{replication_synchro_timeout = 0.001}
+fiber.yield()
+
+-- Let the confirmation finish.
+box.error.injection.set("ERRINJ_WAL_DELAY", false)
+
+-- Now the transaction sees the CONFIRM is ok, and it is also finished.
+test_run:wait_cond(function() return f:status() == 'dead' end)
+ok, err
+box.space.sync:select{}
+
+test_run:switch('replica')
+box.space.sync:select{}
+
 test_run:cmd('switch default')
 
 box.cfg{                                                                        \
-- 
2.21.1 (Apple Git-122.3)

  reply	other threads:[~2020-07-30 22:37 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-30 22:37 [Tarantool-patches] [PATCH 0/3] Qsync tricky crashes Vladislav Shpilevoy
2020-07-30 22:37 ` Vladislav Shpilevoy [this message]
2020-07-30 22:37 ` [Tarantool-patches] [PATCH 2/3] txn_limbo: handle CONFIRM during ROLLBACK Vladislav Shpilevoy
2020-07-30 22:37 ` [Tarantool-patches] [PATCH 3/3] txn_limbo: handle duplicate ACKs Vladislav Shpilevoy
2020-07-31 13:59 ` [Tarantool-patches] [PATCH 0/3] Qsync tricky crashes Cyrill Gorcunov
2020-07-31 22:31   ` Vladislav Shpilevoy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=849ba7dc73a0f8372375a4cc6e0ab45548361a75.1596148501.git.v.shpilevoy@tarantool.org \
    --to=v.shpilevoy@tarantool.org \
    --cc=gorcunov@gmail.com \
    --cc=tarantool-patches@dev.tarantool.org \
    --subject='Re: [Tarantool-patches] [PATCH 1/3] txn_limbo: handle ROLLBACK during CONFIRM' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox