* [Tarantool-patches] [PATCH 1/1] qsync: handle async txns right during CONFIRM
@ 2021-05-27 21:28 Vladislav Shpilevoy via Tarantool-patches
2021-05-28 7:23 ` Cyrill Gorcunov via Tarantool-patches
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-05-27 21:28 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko, gorcunov
It is possible that a new async transaction is added to the limbo
when there is an in-progress CONFIRM WAL write for all the pending
sync transactions.
Then when CONFIRM WAL write is done, it might see that the limbo
now in the first place contains an async transaction not yet
written to WAL. A suspicious situation - on one hand the async
transaction does not have any blocking sync txns before it and
can be considered complete, on the other hand its WAL write is not
done and it is not complete.
Before this patch it resulted into a crash - limbo didn't consider
the situation possible at all.
Now when CONFIRM covers a not yet written async transactions, they
are removed from the limbo and are turned to plain transactions.
When their WAL write is done, they see they no more have
TXN_WAIT_SYNC flag and don't even need to interact with the limbo.
It is important to remove them from the limbo right when the
CONFIRM is done. Because otherwise their limbo entry may be not
removed at all when it is done on a replica. On a replica the
limbo entries are removed only by CONFIRM/ROLLBACK/PROMOTE. If
there would be an async transaction in the first position in the
limbo queue, it wouldn't be deleted until next sync transaction
appears.
This replica case is not possible now though. Because all synchro
entries on the applier are written in a blocking way. Nonetheless
if it ever becomes non-blocking, the code should handle it ok.
Closes #6057
---
Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6057-confirm-async-no-wal
Issue: https://github.com/tarantool/tarantool/issues/6057
.../gh-6057-qsync-confirm-async-no-wal.md | 5 +
src/box/txn.c | 14 +-
src/box/txn_limbo.c | 21 +++
.../gh-6057-qsync-confirm-async-no-wal.result | 163 ++++++++++++++++++
...h-6057-qsync-confirm-async-no-wal.test.lua | 88 ++++++++++
test/replication/suite.cfg | 1 +
test/replication/suite.ini | 2 +-
7 files changed, 289 insertions(+), 5 deletions(-)
create mode 100644 changelogs/unreleased/gh-6057-qsync-confirm-async-no-wal.md
create mode 100644 test/replication/gh-6057-qsync-confirm-async-no-wal.result
create mode 100644 test/replication/gh-6057-qsync-confirm-async-no-wal.test.lua
diff --git a/changelogs/unreleased/gh-6057-qsync-confirm-async-no-wal.md b/changelogs/unreleased/gh-6057-qsync-confirm-async-no-wal.md
new file mode 100644
index 000000000..1005389d8
--- /dev/null
+++ b/changelogs/unreleased/gh-6057-qsync-confirm-async-no-wal.md
@@ -0,0 +1,5 @@
+## bugfix/replication
+
+* Fixed a possible crash when a synchronous transaction was followed by an
+ asynchronous transaction right when its confirmation was being written
+ (gh-6057).
diff --git a/src/box/txn.c b/src/box/txn.c
index 1d42c9113..3d4d5c397 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -880,8 +880,14 @@ txn_commit(struct txn *txn)
if (req == NULL)
goto rollback;
- bool is_sync = txn_has_flag(txn, TXN_WAIT_SYNC);
- if (is_sync) {
+ /*
+ * Do not cash the flag value in a variable. The flag might be deleted
+ * during WAL write. This can happen for async transactions created
+ * during CONFIRM write, whose all blocking sync transactions get
+ * confirmed. They they turn the async transaction into just a plain
+ * txn not waiting for anything.
+ */
+ if (txn_has_flag(txn, TXN_WAIT_SYNC)) {
/*
* Remote rows, if any, come before local rows, so
* check for originating instance id here.
@@ -900,13 +906,13 @@ txn_commit(struct txn *txn)
fiber_set_txn(fiber(), NULL);
if (journal_write(req) != 0 || req->res < 0) {
- if (is_sync)
+ if (txn_has_flag(txn, TXN_WAIT_SYNC))
txn_limbo_abort(&txn_limbo, limbo_entry);
diag_set(ClientError, ER_WAL_IO);
diag_log();
goto rollback;
}
- if (is_sync) {
+ if (txn_has_flag(txn, TXN_WAIT_SYNC)) {
if (txn_has_flag(txn, TXN_WAIT_ACK)) {
int64_t lsn = req->rows[req->n_rows - 1]->lsn;
/*
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index f287369a2..05f0bf30a 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -389,6 +389,27 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
*/
if (e->lsn == -1)
break;
+ } else if (e->txn->signature < 0) {
+ /*
+ * A transaction might be covered by the CONFIRM even if
+ * it is not written to WAL yet when it is an async
+ * transaction. It could be created just when the
+ * CONFIRM was being written to WAL.
+ */
+ assert(e->txn->status == TXN_PREPARED);
+ /*
+ * Let it complete normally as a plain transaction.
+ */
+ txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK);
+ txn_limbo_remove(limbo, e);
+ /*
+ * The limbo entry now should not be used by the owner
+ * transaction since it just became a plain one. Nullify
+ * the txn to get a crash on any usage attempt instead
+ * of potential undefined behaviour.
+ */
+ e->txn = NULL;
+ continue;
}
e->is_commit = true;
txn_limbo_remove(limbo, e);
diff --git a/test/replication/gh-6057-qsync-confirm-async-no-wal.result b/test/replication/gh-6057-qsync-confirm-async-no-wal.result
new file mode 100644
index 000000000..23c77729b
--- /dev/null
+++ b/test/replication/gh-6057-qsync-confirm-async-no-wal.result
@@ -0,0 +1,163 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+
+--
+-- gh-6057: while CONFIRM is being written, there might appear async
+-- transactions not having an LSN/signature yet. When CONFIRM is done, it covers
+-- these transactions too since they don't need to wait for an ACK, but it
+-- should not try to complete them. Because their WAL write is not done and
+-- it might even fail later. It should simply turn them into plain transactions
+-- not depending on any synchronous ones.
+--
+
+old_synchro_quorum = box.cfg.replication_synchro_quorum
+ | ---
+ | ...
+old_synchro_timeout = box.cfg.replication_synchro_timeout
+ | ---
+ | ...
+box.cfg{ \
+ replication_synchro_quorum = 1, \
+ replication_synchro_timeout = 1000000, \
+}
+ | ---
+ | ...
+s = box.schema.create_space('test', {is_sync = true})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+s2 = box.schema.create_space('test2')
+ | ---
+ | ...
+_ = s2:create_index('pk')
+ | ---
+ | ...
+
+errinj = box.error.injection
+ | ---
+ | ...
+
+function create_hanging_async_after_confirm(sync_key, async_key1, async_key2) \
+-- Let the transaction itself to WAL, but CONFIRM will be blocked. \
+ errinj.set('ERRINJ_WAL_DELAY_COUNTDOWN', 1) \
+ local lsn = box.info.lsn \
+ local f = fiber.create(function() s:replace{sync_key} end) \
+ test_run:wait_cond(function() return box.info.lsn == lsn + 1 end) \
+-- Wait for the CONFIRM block. WAL thread is in busy loop now. \
+ test_run:wait_cond(function() return errinj.get('ERRINJ_WAL_DELAY') end) \
+ \
+-- Create 2 async transactions to ensure multiple of them are handled fine. \
+-- But return only fiber of the second one. It is enough because if it is \
+-- finished, the first one is too. \
+ fiber.new(function() s2:replace{async_key1} end) \
+ local f2 = fiber.new(function() s2:replace{async_key2} end) \
+ fiber.yield() \
+-- When WAL thread would finish CONFIRM, it should be blocked on the async \
+-- transaction so as it wouldn't be completed when CONFIRM is processed. \
+ errinj.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0) \
+-- Let CONFIRM go. \
+ errinj.set('ERRINJ_WAL_DELAY', false) \
+-- And WAL thread should be blocked on the async txn. \
+ test_run:wait_cond(function() return errinj.get('ERRINJ_WAL_DELAY') end) \
+-- Wait CONFIRM finish. \
+ test_run:wait_cond(function() return f:status() == 'dead' end) \
+ return f2 \
+end
+ | ---
+ | ...
+
+async_f = create_hanging_async_after_confirm(1, 2, 3)
+ | ---
+ | ...
+-- Let the async transaction finish its WAL write.
+errinj.set('ERRINJ_WAL_DELAY', false)
+ | ---
+ | - ok
+ | ...
+-- It should see that even though it is in the limbo, it does not have any
+-- synchronous transactions to wait for and can complete right away.
+test_run:wait_cond(function() return async_f:status() == 'dead' end)
+ | ---
+ | - true
+ | ...
+
+assert(s:get({1}) ~= nil)
+ | ---
+ | - true
+ | ...
+assert(s2:get({2}) ~= nil)
+ | ---
+ | - true
+ | ...
+assert(s2:get({3}) ~= nil)
+ | ---
+ | - true
+ | ...
+
+--
+-- Try all the same, but the async transaction fails its WAL write.
+--
+async_f = create_hanging_async_after_confirm(4, 5, 6)
+ | ---
+ | ...
+-- The async transaction will fail to go to WAL when WAL thread is unblocked.
+errinj.set('ERRINJ_WAL_ROTATE', true)
+ | ---
+ | - ok
+ | ...
+errinj.set('ERRINJ_WAL_DELAY', false)
+ | ---
+ | - ok
+ | ...
+test_run:wait_cond(function() return async_f:status() == 'dead' end)
+ | ---
+ | - true
+ | ...
+errinj.set('ERRINJ_WAL_ROTATE', false)
+ | ---
+ | - ok
+ | ...
+
+assert(s:get({4}) ~= nil)
+ | ---
+ | - true
+ | ...
+assert(s2:get({5}) == nil)
+ | ---
+ | - true
+ | ...
+assert(s2:get({6}) == nil)
+ | ---
+ | - true
+ | ...
+
+-- Ensure nothing is broken, works fine.
+s:replace{7}
+ | ---
+ | - [7]
+ | ...
+s2:replace{8}
+ | ---
+ | - [8]
+ | ...
+
+s:drop()
+ | ---
+ | ...
+s2:drop()
+ | ---
+ | ...
+
+box.cfg{ \
+ replication_synchro_quorum = old_synchro_quorum, \
+ replication_synchro_timeout = old_synchro_timeout, \
+}
+ | ---
+ | ...
diff --git a/test/replication/gh-6057-qsync-confirm-async-no-wal.test.lua b/test/replication/gh-6057-qsync-confirm-async-no-wal.test.lua
new file mode 100644
index 000000000..a11ddc042
--- /dev/null
+++ b/test/replication/gh-6057-qsync-confirm-async-no-wal.test.lua
@@ -0,0 +1,88 @@
+test_run = require('test_run').new()
+fiber = require('fiber')
+
+--
+-- gh-6057: while CONFIRM is being written, there might appear async
+-- transactions not having an LSN/signature yet. When CONFIRM is done, it covers
+-- these transactions too since they don't need to wait for an ACK, but it
+-- should not try to complete them. Because their WAL write is not done and
+-- it might even fail later. It should simply turn them into plain transactions
+-- not depending on any synchronous ones.
+--
+
+old_synchro_quorum = box.cfg.replication_synchro_quorum
+old_synchro_timeout = box.cfg.replication_synchro_timeout
+box.cfg{ \
+ replication_synchro_quorum = 1, \
+ replication_synchro_timeout = 1000000, \
+}
+s = box.schema.create_space('test', {is_sync = true})
+_ = s:create_index('pk')
+s2 = box.schema.create_space('test2')
+_ = s2:create_index('pk')
+
+errinj = box.error.injection
+
+function create_hanging_async_after_confirm(sync_key, async_key1, async_key2) \
+-- Let the transaction itself to WAL, but CONFIRM will be blocked. \
+ errinj.set('ERRINJ_WAL_DELAY_COUNTDOWN', 1) \
+ local lsn = box.info.lsn \
+ local f = fiber.create(function() s:replace{sync_key} end) \
+ test_run:wait_cond(function() return box.info.lsn == lsn + 1 end) \
+-- Wait for the CONFIRM block. WAL thread is in busy loop now. \
+ test_run:wait_cond(function() return errinj.get('ERRINJ_WAL_DELAY') end) \
+ \
+-- Create 2 async transactions to ensure multiple of them are handled fine. \
+-- But return only fiber of the second one. It is enough because if it is \
+-- finished, the first one is too. \
+ fiber.new(function() s2:replace{async_key1} end) \
+ local f2 = fiber.new(function() s2:replace{async_key2} end) \
+ fiber.yield() \
+-- When WAL thread would finish CONFIRM, it should be blocked on the async \
+-- transaction so as it wouldn't be completed when CONFIRM is processed. \
+ errinj.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0) \
+-- Let CONFIRM go. \
+ errinj.set('ERRINJ_WAL_DELAY', false) \
+-- And WAL thread should be blocked on the async txn. \
+ test_run:wait_cond(function() return errinj.get('ERRINJ_WAL_DELAY') end) \
+-- Wait CONFIRM finish. \
+ test_run:wait_cond(function() return f:status() == 'dead' end) \
+ return f2 \
+end
+
+async_f = create_hanging_async_after_confirm(1, 2, 3)
+-- Let the async transaction finish its WAL write.
+errinj.set('ERRINJ_WAL_DELAY', false)
+-- It should see that even though it is in the limbo, it does not have any
+-- synchronous transactions to wait for and can complete right away.
+test_run:wait_cond(function() return async_f:status() == 'dead' end)
+
+assert(s:get({1}) ~= nil)
+assert(s2:get({2}) ~= nil)
+assert(s2:get({3}) ~= nil)
+
+--
+-- Try all the same, but the async transaction fails its WAL write.
+--
+async_f = create_hanging_async_after_confirm(4, 5, 6)
+-- The async transaction will fail to go to WAL when WAL thread is unblocked.
+errinj.set('ERRINJ_WAL_ROTATE', true)
+errinj.set('ERRINJ_WAL_DELAY', false)
+test_run:wait_cond(function() return async_f:status() == 'dead' end)
+errinj.set('ERRINJ_WAL_ROTATE', false)
+
+assert(s:get({4}) ~= nil)
+assert(s2:get({5}) == nil)
+assert(s2:get({6}) == nil)
+
+-- Ensure nothing is broken, works fine.
+s:replace{7}
+s2:replace{8}
+
+s:drop()
+s2:drop()
+
+box.cfg{ \
+ replication_synchro_quorum = old_synchro_quorum, \
+ replication_synchro_timeout = old_synchro_timeout, \
+}
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index dc39e2f74..20337b310 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -45,6 +45,7 @@
"gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {},
"gh-5536-wal-limit.test.lua": {},
"gh-5566-final-join-synchro.test.lua": {},
+ "gh-6057-qsync-confirm-async-no-wal.test.lua": {},
"*": {
"memtx": {"engine": "memtx"},
"vinyl": {"engine": "vinyl"}
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index 1d9c0a4ae..7d811d333 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -3,7 +3,7 @@ core = tarantool
script = master.lua
description = tarantool/box, replication
disabled = consistent.test.lua
-release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua
+release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua
config = suite.cfg
lua_libs = lua/fast_replica.lua lua/rlimit.lua
use_unix_sockets = True
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/1] qsync: handle async txns right during CONFIRM
2021-05-27 21:28 [Tarantool-patches] [PATCH 1/1] qsync: handle async txns right during CONFIRM Vladislav Shpilevoy via Tarantool-patches
@ 2021-05-28 7:23 ` Cyrill Gorcunov via Tarantool-patches
2021-05-28 19:13 ` Vladislav Shpilevoy via Tarantool-patches
2021-05-28 9:01 ` Serge Petrenko via Tarantool-patches
2021-06-01 20:59 ` Vladislav Shpilevoy via Tarantool-patches
2 siblings, 1 reply; 7+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-05-28 7:23 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches
On Thu, May 27, 2021 at 11:28:02PM +0200, Vladislav Shpilevoy wrote:
> diff --git a/src/box/txn.c b/src/box/txn.c
> index 1d42c9113..3d4d5c397 100644
> --- a/src/box/txn.c
> +++ b/src/box/txn.c
> @@ -880,8 +880,14 @@ txn_commit(struct txn *txn)
> if (req == NULL)
> goto rollback;
>
> - bool is_sync = txn_has_flag(txn, TXN_WAIT_SYNC);
> - if (is_sync) {
> + /*
> + * Do not cash the flag value in a variable. The flag might be deleted
I suspect you meant "cache"? Obviously flag won't be paying any money :-)
> + * during WAL write. This can happen for async transactions created
> + * during CONFIRM write, whose all blocking sync transactions get
> + * confirmed. They they turn the async transaction into just a plain
"Then they" I guess? No need for diff, just force push an update please.
The rest of patch, looks ok to me, thanks!
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/1] qsync: handle async txns right during CONFIRM
2021-05-28 7:23 ` Cyrill Gorcunov via Tarantool-patches
@ 2021-05-28 19:13 ` Vladislav Shpilevoy via Tarantool-patches
0 siblings, 0 replies; 7+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-05-28 19:13 UTC (permalink / raw)
To: Cyrill Gorcunov; +Cc: tarantool-patches
Hi! Thanks for the review!
On 28.05.2021 09:23, Cyrill Gorcunov via Tarantool-patches wrote:
> On Thu, May 27, 2021 at 11:28:02PM +0200, Vladislav Shpilevoy wrote:
>> diff --git a/src/box/txn.c b/src/box/txn.c
>> index 1d42c9113..3d4d5c397 100644
>> --- a/src/box/txn.c
>> +++ b/src/box/txn.c
>> @@ -880,8 +880,14 @@ txn_commit(struct txn *txn)
>> if (req == NULL)
>> goto rollback;
>>
>> - bool is_sync = txn_has_flag(txn, TXN_WAIT_SYNC);
>> - if (is_sync) {
>> + /*
>> + * Do not cash the flag value in a variable. The flag might be deleted
>
> I suspect you meant "cache"? Obviously flag won't be paying any money :-)
>
>> + * during WAL write. This can happen for async transactions created
>> + * during CONFIRM write, whose all blocking sync transactions get
>> + * confirmed. They they turn the async transaction into just a plain
>
> "Then they" I guess? No need for diff, just force push an update please.
Yes, sorry, I was in a big hurry when sent this patch. Fixed on the
branch.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/1] qsync: handle async txns right during CONFIRM
2021-05-27 21:28 [Tarantool-patches] [PATCH 1/1] qsync: handle async txns right during CONFIRM Vladislav Shpilevoy via Tarantool-patches
2021-05-28 7:23 ` Cyrill Gorcunov via Tarantool-patches
@ 2021-05-28 9:01 ` Serge Petrenko via Tarantool-patches
2021-05-28 19:13 ` Vladislav Shpilevoy via Tarantool-patches
2021-06-01 20:59 ` Vladislav Shpilevoy via Tarantool-patches
2 siblings, 1 reply; 7+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-05-28 9:01 UTC (permalink / raw)
To: Vladislav Shpilevoy, tarantool-patches, gorcunov
28.05.2021 00:28, Vladislav Shpilevoy пишет:
> It is possible that a new async transaction is added to the limbo
> when there is an in-progress CONFIRM WAL write for all the pending
> sync transactions.
>
> Then when CONFIRM WAL write is done, it might see that the limbo
> now in the first place contains an async transaction not yet
> written to WAL. A suspicious situation - on one hand the async
> transaction does not have any blocking sync txns before it and
> can be considered complete, on the other hand its WAL write is not
> done and it is not complete.
>
> Before this patch it resulted into a crash - limbo didn't consider
> the situation possible at all.
>
> Now when CONFIRM covers a not yet written async transactions, they
> are removed from the limbo and are turned to plain transactions.
>
> When their WAL write is done, they see they no more have
> TXN_WAIT_SYNC flag and don't even need to interact with the limbo.
>
> It is important to remove them from the limbo right when the
> CONFIRM is done. Because otherwise their limbo entry may be not
> removed at all when it is done on a replica. On a replica the
> limbo entries are removed only by CONFIRM/ROLLBACK/PROMOTE. If
> there would be an async transaction in the first position in the
> limbo queue, it wouldn't be deleted until next sync transaction
> appears.
>
> This replica case is not possible now though. Because all synchro
> entries on the applier are written in a blocking way. Nonetheless
> if it ever becomes non-blocking, the code should handle it ok.
>
> Closes #6057
Hi! Thanks for working on this and for the fast fix!
Please find one comment below.
> ---
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6057-confirm-async-no-wal
> Issue: https://github.com/tarantool/tarantool/issues/6057
>
> .../gh-6057-qsync-confirm-async-no-wal.md | 5 +
> src/box/txn.c | 14 +-
> src/box/txn_limbo.c | 21 +++
> .../gh-6057-qsync-confirm-async-no-wal.result | 163 ++++++++++++++++++
> ...h-6057-qsync-confirm-async-no-wal.test.lua | 88 ++++++++++
> test/replication/suite.cfg | 1 +
> test/replication/suite.ini | 2 +-
> 7 files changed, 289 insertions(+), 5 deletions(-)
> create mode 100644 changelogs/unreleased/gh-6057-qsync-confirm-async-no-wal.md
> create mode 100644 test/replication/gh-6057-qsync-confirm-async-no-wal.result
> create mode 100644 test/replication/gh-6057-qsync-confirm-async-no-wal.test.lua
>
> diff --git a/changelogs/unreleased/gh-6057-qsync-confirm-async-no-wal.md b/changelogs/unreleased/gh-6057-qsync-confirm-async-no-wal.md
> new file mode 100644
> index 000000000..1005389d8
> --- /dev/null
> +++ b/changelogs/unreleased/gh-6057-qsync-confirm-async-no-wal.md
> @@ -0,0 +1,5 @@
> +## bugfix/replication
> +
> +* Fixed a possible crash when a synchronous transaction was followed by an
> + asynchronous transaction right when its confirmation was being written
> + (gh-6057).
> diff --git a/src/box/txn.c b/src/box/txn.c
> index 1d42c9113..3d4d5c397 100644
> --- a/src/box/txn.c
> +++ b/src/box/txn.c
> @@ -880,8 +880,14 @@ txn_commit(struct txn *txn)
> if (req == NULL)
> goto rollback;
>
> - bool is_sync = txn_has_flag(txn, TXN_WAIT_SYNC);
> - if (is_sync) {
> + /*
> + * Do not cash the flag value in a variable. The flag might be deleted
> + * during WAL write. This can happen for async transactions created
> + * during CONFIRM write, whose all blocking sync transactions get
> + * confirmed. They they turn the async transaction into just a plain
> + * txn not waiting for anything.
> + */
> + if (txn_has_flag(txn, TXN_WAIT_SYNC)) {
> /*
> * Remote rows, if any, come before local rows, so
> * check for originating instance id here.
> @@ -900,13 +906,13 @@ txn_commit(struct txn *txn)
>
> fiber_set_txn(fiber(), NULL);
> if (journal_write(req) != 0 || req->res < 0) {
> - if (is_sync)
> + if (txn_has_flag(txn, TXN_WAIT_SYNC))
> txn_limbo_abort(&txn_limbo, limbo_entry);
> diag_set(ClientError, ER_WAL_IO);
> diag_log();
> goto rollback;
> }
> - if (is_sync) {
> + if (txn_has_flag(txn, TXN_WAIT_SYNC)) {
> if (txn_has_flag(txn, TXN_WAIT_ACK)) {
> int64_t lsn = req->rows[req->n_rows - 1]->lsn;
> /*
> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
> index f287369a2..05f0bf30a 100644
> --- a/src/box/txn_limbo.c
> +++ b/src/box/txn_limbo.c
> @@ -389,6 +389,27 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
> */
> if (e->lsn == -1)
> break;
> + } else if (e->txn->signature < 0) {
> + /*
> + * A transaction might be covered by the CONFIRM even if
> + * it is not written to WAL yet when it is an async
> + * transaction. It could be created just when the
> + * CONFIRM was being written to WAL.
> + */
> + assert(e->txn->status == TXN_PREPARED);
> + /*
> + * Let it complete normally as a plain transaction.
> + */
> + txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK);
AFAICS it's enough to clear WAIT_SYNC here.
Asynchronous transactions never have WAIT_ACK set, do they?
> + txn_limbo_remove(limbo, e);
> + /*
> + * The limbo entry now should not be used by the owner
> + * transaction since it just became a plain one. Nullify
> + * the txn to get a crash on any usage attempt instead
> + * of potential undefined behaviour.
> + */
> + e->txn = NULL;
> + continue;
> }
> e->is_commit = true;
> txn_limbo_remove(limbo, e);
--
Serge Petrenko
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/1] qsync: handle async txns right during CONFIRM
2021-05-28 9:01 ` Serge Petrenko via Tarantool-patches
@ 2021-05-28 19:13 ` Vladislav Shpilevoy via Tarantool-patches
2021-06-01 7:37 ` Serge Petrenko via Tarantool-patches
0 siblings, 1 reply; 7+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-05-28 19:13 UTC (permalink / raw)
To: Serge Petrenko, tarantool-patches, gorcunov
Hi! Thanks for the review!
>> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
>> index f287369a2..05f0bf30a 100644
>> --- a/src/box/txn_limbo.c
>> +++ b/src/box/txn_limbo.c
>> @@ -389,6 +389,27 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
>> */
>> if (e->lsn == -1)
>> break;
>> + } else if (e->txn->signature < 0) {
>> + /*
>> + * A transaction might be covered by the CONFIRM even if
>> + * it is not written to WAL yet when it is an async
>> + * transaction. It could be created just when the
>> + * CONFIRM was being written to WAL.
>> + */
>> + assert(e->txn->status == TXN_PREPARED);
>> + /*
>> + * Let it complete normally as a plain transaction.
>> + */
>> + txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK);
>
> AFAICS it's enough to clear WAIT_SYNC here.
> Asynchronous transactions never have WAIT_ACK set, do they?
Yes, sorry, I was in a big hurry when sent this patch. Fixed:
====================
@@ -400,7 +400,7 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
/*
* Let it complete normally as a plain transaction.
*/
- txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK);
+ txn_clear_flags(e->txn, TXN_WAIT_SYNC);
txn_limbo_remove(limbo, e);
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/1] qsync: handle async txns right during CONFIRM
2021-05-28 19:13 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-01 7:37 ` Serge Petrenko via Tarantool-patches
0 siblings, 0 replies; 7+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-06-01 7:37 UTC (permalink / raw)
To: Vladislav Shpilevoy, tarantool-patches, gorcunov
28.05.2021 22:13, Vladislav Shpilevoy пишет:
> Hi! Thanks for the review!
>
>>> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
>>> index f287369a2..05f0bf30a 100644
>>> --- a/src/box/txn_limbo.c
>>> +++ b/src/box/txn_limbo.c
>>> @@ -389,6 +389,27 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
>>> */
>>> if (e->lsn == -1)
>>> break;
>>> + } else if (e->txn->signature < 0) {
>>> + /*
>>> + * A transaction might be covered by the CONFIRM even if
>>> + * it is not written to WAL yet when it is an async
>>> + * transaction. It could be created just when the
>>> + * CONFIRM was being written to WAL.
>>> + */
>>> + assert(e->txn->status == TXN_PREPARED);
>>> + /*
>>> + * Let it complete normally as a plain transaction.
>>> + */
>>> + txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK);
>> AFAICS it's enough to clear WAIT_SYNC here.
>> Asynchronous transactions never have WAIT_ACK set, do they?
> Yes, sorry, I was in a big hurry when sent this patch. Fixed:
>
> ====================
> @@ -400,7 +400,7 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
> /*
> * Let it complete normally as a plain transaction.
> */
> - txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK);
> + txn_clear_flags(e->txn, TXN_WAIT_SYNC);
> txn_limbo_remove(limbo, e);
Hi! Thanks for the fixes! LGTM.
--
Serge Petrenko
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/1] qsync: handle async txns right during CONFIRM
2021-05-27 21:28 [Tarantool-patches] [PATCH 1/1] qsync: handle async txns right during CONFIRM Vladislav Shpilevoy via Tarantool-patches
2021-05-28 7:23 ` Cyrill Gorcunov via Tarantool-patches
2021-05-28 9:01 ` Serge Petrenko via Tarantool-patches
@ 2021-06-01 20:59 ` Vladislav Shpilevoy via Tarantool-patches
2 siblings, 0 replies; 7+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-01 20:59 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko, gorcunov
Pushed to master, 2.8, 2.7.
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2021-06-01 20:59 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-27 21:28 [Tarantool-patches] [PATCH 1/1] qsync: handle async txns right during CONFIRM Vladislav Shpilevoy via Tarantool-patches
2021-05-28 7:23 ` Cyrill Gorcunov via Tarantool-patches
2021-05-28 19:13 ` Vladislav Shpilevoy via Tarantool-patches
2021-05-28 9:01 ` Serge Petrenko via Tarantool-patches
2021-05-28 19:13 ` Vladislav Shpilevoy via Tarantool-patches
2021-06-01 7:37 ` Serge Petrenko via Tarantool-patches
2021-06-01 20:59 ` Vladislav Shpilevoy via Tarantool-patches
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox