* [Tarantool-patches] [PATCH 1/1] limbo: don't wake self fiber on CONFIRM write
@ 2020-09-10 21:53 Vladislav Shpilevoy
2020-09-12 4:35 ` Alexander V. Tikhonov
2020-09-12 12:13 ` Vladislav Shpilevoy
0 siblings, 2 replies; 3+ messages in thread
From: Vladislav Shpilevoy @ 2020-09-10 21:53 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko, avtikhon
During recovery WAL writes end immediately, without yields.
Therefore WAL write completion callback is executed in the
currently active fiber.
Txn limbo on CONFIRM WAL write wakes up the waiting fiber, which
appears to be the same as the active fiber during recovery.
That breaks the fiber scheduler, because apparently it is not safe
to wake the currently active fiber unless it is going to call
fiber_yield() immediately after. See a comment in fiber_wakeup()
implementation about that way of usage.
The patch simply stops waking the waiting fiber, if it is the
currently active one.
Closes #5288
Closes #5232
---
Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5288-limbo-wakeup
Issue: https://github.com/tarantool/tarantool/issues/5288
Issue: https://github.com/tarantool/tarantool/issues/5232
src/box/txn_limbo.c | 3 ++-
.../replication/gh-5288-qsync-recovery.result | 27 +++++++++++++++++++
.../gh-5288-qsync-recovery.test.lua | 11 ++++++++
3 files changed, 40 insertions(+), 1 deletion(-)
create mode 100644 test/replication/gh-5288-qsync-recovery.result
create mode 100644 test/replication/gh-5288-qsync-recovery.test.lua
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 2ff3bdd05..3655338d8 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -281,7 +281,8 @@ static void
txn_limbo_write_cb(struct journal_entry *entry)
{
assert(entry->complete_data != NULL);
- fiber_wakeup(entry->complete_data);
+ if (fiber() != entry->complete_data)
+ fiber_wakeup(entry->complete_data);
}
static void
diff --git a/test/replication/gh-5288-qsync-recovery.result b/test/replication/gh-5288-qsync-recovery.result
new file mode 100644
index 000000000..dc0babef6
--- /dev/null
+++ b/test/replication/gh-5288-qsync-recovery.result
@@ -0,0 +1,27 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+--
+-- gh-5288: transaction limbo could crash during recovery, because in WAL write
+-- completion callback it woken up the currently active fiber.
+--
+s = box.schema.space.create('sync', {is_sync = true})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+s:insert{1}
+ | ---
+ | - [1]
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+test_run:cmd('restart server default')
+ |
+box.space.sync:drop()
+ | ---
+ | ...
diff --git a/test/replication/gh-5288-qsync-recovery.test.lua b/test/replication/gh-5288-qsync-recovery.test.lua
new file mode 100644
index 000000000..00bff7b87
--- /dev/null
+++ b/test/replication/gh-5288-qsync-recovery.test.lua
@@ -0,0 +1,11 @@
+test_run = require('test_run').new()
+--
+-- gh-5288: transaction limbo could crash during recovery, because in WAL write
+-- completion callback it woken up the currently active fiber.
+--
+s = box.schema.space.create('sync', {is_sync = true})
+_ = s:create_index('pk')
+s:insert{1}
+box.snapshot()
+test_run:cmd('restart server default')
+box.space.sync:drop()
--
2.21.1 (Apple Git-122.3)
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/1] limbo: don't wake self fiber on CONFIRM write
2020-09-10 21:53 [Tarantool-patches] [PATCH 1/1] limbo: don't wake self fiber on CONFIRM write Vladislav Shpilevoy
@ 2020-09-12 4:35 ` Alexander V. Tikhonov
2020-09-12 12:13 ` Vladislav Shpilevoy
1 sibling, 0 replies; 3+ messages in thread
From: Alexander V. Tikhonov @ 2020-09-12 4:35 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches
Hi Vlad, thanks a lot for your patch. I've checked and it really helped
to fix the issue, patch LGTM.
On Thu, Sep 10, 2020 at 11:53:57PM +0200, Vladislav Shpilevoy wrote:
> During recovery WAL writes end immediately, without yields.
> Therefore WAL write completion callback is executed in the
> currently active fiber.
>
> Txn limbo on CONFIRM WAL write wakes up the waiting fiber, which
> appears to be the same as the active fiber during recovery.
>
> That breaks the fiber scheduler, because apparently it is not safe
> to wake the currently active fiber unless it is going to call
> fiber_yield() immediately after. See a comment in fiber_wakeup()
> implementation about that way of usage.
>
> The patch simply stops waking the waiting fiber, if it is the
> currently active one.
>
> Closes #5288
> Closes #5232
> ---
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5288-limbo-wakeup
> Issue: https://github.com/tarantool/tarantool/issues/5288
> Issue: https://github.com/tarantool/tarantool/issues/5232
>
> src/box/txn_limbo.c | 3 ++-
> .../replication/gh-5288-qsync-recovery.result | 27 +++++++++++++++++++
> .../gh-5288-qsync-recovery.test.lua | 11 ++++++++
> 3 files changed, 40 insertions(+), 1 deletion(-)
> create mode 100644 test/replication/gh-5288-qsync-recovery.result
> create mode 100644 test/replication/gh-5288-qsync-recovery.test.lua
>
> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
> index 2ff3bdd05..3655338d8 100644
> --- a/src/box/txn_limbo.c
> +++ b/src/box/txn_limbo.c
> @@ -281,7 +281,8 @@ static void
> txn_limbo_write_cb(struct journal_entry *entry)
> {
> assert(entry->complete_data != NULL);
> - fiber_wakeup(entry->complete_data);
> + if (fiber() != entry->complete_data)
> + fiber_wakeup(entry->complete_data);
> }
>
> static void
> diff --git a/test/replication/gh-5288-qsync-recovery.result b/test/replication/gh-5288-qsync-recovery.result
> new file mode 100644
> index 000000000..dc0babef6
> --- /dev/null
> +++ b/test/replication/gh-5288-qsync-recovery.result
> @@ -0,0 +1,27 @@
> +-- test-run result file version 2
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +--
> +-- gh-5288: transaction limbo could crash during recovery, because in WAL write
> +-- completion callback it woken up the currently active fiber.
> +--
> +s = box.schema.space.create('sync', {is_sync = true})
> + | ---
> + | ...
> +_ = s:create_index('pk')
> + | ---
> + | ...
> +s:insert{1}
> + | ---
> + | - [1]
> + | ...
> +box.snapshot()
> + | ---
> + | - ok
> + | ...
> +test_run:cmd('restart server default')
> + |
> +box.space.sync:drop()
> + | ---
> + | ...
> diff --git a/test/replication/gh-5288-qsync-recovery.test.lua b/test/replication/gh-5288-qsync-recovery.test.lua
> new file mode 100644
> index 000000000..00bff7b87
> --- /dev/null
> +++ b/test/replication/gh-5288-qsync-recovery.test.lua
> @@ -0,0 +1,11 @@
> +test_run = require('test_run').new()
> +--
> +-- gh-5288: transaction limbo could crash during recovery, because in WAL write
> +-- completion callback it woken up the currently active fiber.
> +--
> +s = box.schema.space.create('sync', {is_sync = true})
> +_ = s:create_index('pk')
> +s:insert{1}
> +box.snapshot()
> +test_run:cmd('restart server default')
> +box.space.sync:drop()
> --
> 2.21.1 (Apple Git-122.3)
>
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/1] limbo: don't wake self fiber on CONFIRM write
2020-09-10 21:53 [Tarantool-patches] [PATCH 1/1] limbo: don't wake self fiber on CONFIRM write Vladislav Shpilevoy
2020-09-12 4:35 ` Alexander V. Tikhonov
@ 2020-09-12 12:13 ` Vladislav Shpilevoy
1 sibling, 0 replies; 3+ messages in thread
From: Vladislav Shpilevoy @ 2020-09-12 12:13 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko, avtikhon
Pushed to master and 2.5 as trivial.
@ChangeLog
* During recovery of synchronous changes from snapshot the instance could crash (gh-5288).
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2020-09-12 12:13 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-10 21:53 [Tarantool-patches] [PATCH 1/1] limbo: don't wake self fiber on CONFIRM write Vladislav Shpilevoy
2020-09-12 4:35 ` Alexander V. Tikhonov
2020-09-12 12:13 ` Vladislav Shpilevoy
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox