Tarantool development patches archive
From: Serge Petrenko via Tarantool-patches <tarantool-patches@dev.tarantool.org>
To: v.shpilevoy@tarantool.org, gorcunov@gmail.com
Cc: tarantool-patches@dev.tarantool.org
Subject: [Tarantool-patches] [PATCH v2 2/2] box: fix an assertion failure in box.ctl.promote()
Date: Tue, 25 May 2021 13:39:29 +0300
Message-ID: <8011f87bb9b5e1f53f5bee3124f3a8e9dbe1917c.1621935783.git.sergepetrenko@tarantool.org>
In-Reply-To: <cover.1621935783.git.sergepetrenko@tarantool.org>

box.ctl.promote() used to assume that the last synchronous entry is
already written to WAL by the time it's called. This is not the case
when promote is executed on the limbo owner. The last synchronous entry
might still be en route to WAL.

In order to fix the issue, wait until all the limbo entries are written
to disk via wal_sync(). After this happens, it's safe to proceed to
gathering quorum in promote.

Closes #6032
---
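A note on the new control flow, for reviewers: below is a minimal,
self-contained sketch of how wait_lsn gets picked after this patch. It is
a toy model, not the code from box.cc. ToyEntry, ToyLimbo, ToyWalSync and
toy_pick_wait_lsn() are hypothetical names made up for this note, the
entry id stands in for the pointer comparison done in the patch, and the
instance is assumed to be the limbo owner.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>

/* Hypothetical stand-in for a limbo entry, not Tarantool's real struct. */
struct ToyEntry {
	uint64_t id;	/* identity; the patch compares entry pointers */
	int64_t lsn;	/* -1 while the WAL write is still in flight */
};

/* Hypothetical stand-in for the limbo itself. */
struct ToyLimbo {
	std::deque<ToyEntry> queue;
	int64_t confirmed_lsn = 0;
};

/*
 * Models wal_sync(): it may yield, so the limbo can change while it runs.
 * Returns false on failure.
 */
using ToyWalSync = std::function<bool(ToyLimbo &)>;

/* Returns 0 and sets *wait_lsn on success, -1 when promote must fail. */
int
toy_pick_wait_lsn(ToyLimbo &limbo, const ToyWalSync &wal_sync,
		  int64_t *wait_lsn)
{
	/* Nothing is pending: the confirmed LSN is enough. */
	if (limbo.queue.empty()) {
		*wait_lsn = limbo.confirmed_lsn;
		return 0;
	}
	const uint64_t last_id = limbo.queue.back().id;
	if (limbo.queue.back().lsn < 0) {
		/* The last entry has no LSN yet: wait for its WAL write. */
		if (!wal_sync(limbo))
			return -1;
		/* Everything got confirmed or rolled back meanwhile. */
		if (limbo.queue.empty()) {
			*wait_lsn = limbo.confirmed_lsn;
			return 0;
		}
		/* New synchronous transactions appeared during the wait. */
		if (limbo.queue.back().id != last_id)
			return -1;
	}
	*wait_lsn = limbo.queue.back().lsn;
	assert(*wait_lsn > 0);
	return 0;
}
```

Passing the WAL sync as a callback is only a way to simulate, in a
single-threaded toy, what may happen to the limbo while the fiber yields
inside wal_sync().
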
 src/box/box.cc                                | 27 ++++++--
 .../gh-6032-promote-wal-write.result          | 69 +++++++++++++++++++
 .../gh-6032-promote-wal-write.test.lua        | 28 ++++++++
 test/replication/suite.cfg                    |  1 +
 test/replication/suite.ini                    |  2 +-
 5 files changed, 120 insertions(+), 7 deletions(-)
 create mode 100644 test/replication/gh-6032-promote-wal-write.result
 create mode 100644 test/replication/gh-6032-promote-wal-write.test.lua

diff --git a/src/box/box.cc b/src/box/box.cc
index 894e3d0f4..3d9cd0e57 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1618,14 +1618,29 @@ box_promote(void)
 				 txn_limbo.owner_id);
 			return -1;
 		}
+		if (txn_limbo_is_empty(&txn_limbo)) {
+			wait_lsn = txn_limbo.confirmed_lsn;
+			goto promote;
+		}
 	}
 
-	/*
-	 * promote() is a no-op on the limbo owner, so all the rows
-	 * in the limbo must've come through the applier meaning they already
-	 * have an lsn assigned, even if their WAL write hasn't finished yet.
-	 */
-	wait_lsn = txn_limbo_last_synchro_entry(&txn_limbo)->lsn;
+	struct txn_limbo_entry *last_entry;
+	last_entry = txn_limbo_last_synchro_entry(&txn_limbo);
+	/* Wait for the last entry's WAL write. */
+	if (last_entry->lsn < 0) {
+		if (wal_sync(NULL) < 0)
+			return -1;
+		if (txn_limbo_is_empty(&txn_limbo)) {
+			wait_lsn = txn_limbo.confirmed_lsn;
+			goto promote;
+		}
+		if (last_entry != txn_limbo_last_synchro_entry(&txn_limbo)) {
+			diag_set(ClientError, ER_QUORUM_WAIT, quorum,
+				 "new synchronous transactions appeared");
+			return -1;
+		}
+	}
+	wait_lsn = last_entry->lsn;
 	assert(wait_lsn > 0);
 
 	rc = box_wait_quorum(former_leader_id, wait_lsn, quorum,
diff --git a/test/replication/gh-6032-promote-wal-write.result b/test/replication/gh-6032-promote-wal-write.result
new file mode 100644
index 000000000..246c7974f
--- /dev/null
+++ b/test/replication/gh-6032-promote-wal-write.result
@@ -0,0 +1,69 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+
+replication_synchro_timeout = box.cfg.replication_synchro_timeout
+ | ---
+ | ...
+box.cfg{\
+    replication_synchro_timeout = 0.001,\
+}
+ | ---
+ | ...
+
+_ = box.schema.create_space('sync', {is_sync = true}):create_index('pk')
+ | ---
+ | ...
+
+box.error.injection.set('ERRINJ_WAL_DELAY', true)
+ | ---
+ | - ok
+ | ...
+_ = fiber.create(function() box.space.sync:replace{1} end)
+ | ---
+ | ...
+ok, err = nil, nil
+ | ---
+ | ...
+
+-- Test that the fiber actually waits for a WAL write to happen.
+f = fiber.create(function() ok, err = pcall(box.ctl.promote) end)
+ | ---
+ | ...
+fiber.sleep(0.1)
+ | ---
+ | ...
+f:status()
+ | ---
+ | - suspended
+ | ...
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+ | ---
+ | - ok
+ | ...
+test_run:wait_cond(function() return f:status() == 'dead' end)
+ | ---
+ | - true
+ | ...
+ok
+ | ---
+ | - true
+ | ...
+err
+ | ---
+ | - null
+ | ...
+
+-- Cleanup.
+box.cfg{\
+    replication_synchro_timeout = replication_synchro_timeout,\
+}
+ | ---
+ | ...
+box.space.sync:drop()
+ | ---
+ | ...
diff --git a/test/replication/gh-6032-promote-wal-write.test.lua b/test/replication/gh-6032-promote-wal-write.test.lua
new file mode 100644
index 000000000..8c1859083
--- /dev/null
+++ b/test/replication/gh-6032-promote-wal-write.test.lua
@@ -0,0 +1,28 @@
+test_run = require('test_run').new()
+fiber = require('fiber')
+
+replication_synchro_timeout = box.cfg.replication_synchro_timeout
+box.cfg{\
+    replication_synchro_timeout = 0.001,\
+}
+
+_ = box.schema.create_space('sync', {is_sync = true}):create_index('pk')
+
+box.error.injection.set('ERRINJ_WAL_DELAY', true)
+_ = fiber.create(function() box.space.sync:replace{1} end)
+ok, err = nil, nil
+
+-- Test that the fiber actually waits for a WAL write to happen.
+f = fiber.create(function() ok, err = pcall(box.ctl.promote) end)
+fiber.sleep(0.1)
+f:status()
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+test_run:wait_cond(function() return f:status() == 'dead' end)
+ok
+err
+
+-- Cleanup.
+box.cfg{\
+    replication_synchro_timeout = replication_synchro_timeout,\
+}
+box.space.sync:drop()
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index dc39e2f74..dfe4be9ae 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -45,6 +45,7 @@
     "gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {},
     "gh-5536-wal-limit.test.lua": {},
     "gh-5566-final-join-synchro.test.lua": {},
+    "gh-6032-promote-wal-write.test.lua": {},
     "*": {
         "memtx": {"engine": "memtx"},
         "vinyl": {"engine": "vinyl"}
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index 1d9c0a4ae..2625c5eea 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -3,7 +3,7 @@ core = tarantool
 script =  master.lua
 description = tarantool/box, replication
 disabled = consistent.test.lua
-release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua
+release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6032-promote-wal-write.test.lua
 config = suite.cfg
 lua_libs = lua/fast_replica.lua lua/rlimit.lua
 use_unix_sockets = True
-- 
2.30.1 (Apple Git-130)

