Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH 1/1] limbo: don't wake self fiber on CONFIRM write
@ 2020-09-10 21:53 Vladislav Shpilevoy
  2020-09-12  4:35 ` Alexander V. Tikhonov
  2020-09-12 12:13 ` Vladislav Shpilevoy
  0 siblings, 2 replies; 3+ messages in thread
From: Vladislav Shpilevoy @ 2020-09-10 21:53 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko, avtikhon

During recovery WAL writes end immediately, without yields.
Therefore WAL write completion callback is executed in the
currently active fiber.

Txn limbo on CONFIRM WAL write wakes up the waiting fiber, which
appears to be the same as the active fiber during recovery.

That breaks the fiber scheduler, because apparently it is not safe
to wake the currently active fiber unless it is going to call
fiber_yield() immediately after. See a comment in fiber_wakeup()
implementation about that way of usage.

The patch simply stops waking the waiting fiber, if it is the
currently active one.

Closes #5288
Closes #5232
---
Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5288-limbo-wakeup
Issue: https://github.com/tarantool/tarantool/issues/5288
Issue: https://github.com/tarantool/tarantool/issues/5232

 src/box/txn_limbo.c                           |  3 ++-
 .../replication/gh-5288-qsync-recovery.result | 27 +++++++++++++++++++
 .../gh-5288-qsync-recovery.test.lua           | 11 ++++++++
 3 files changed, 40 insertions(+), 1 deletion(-)
 create mode 100644 test/replication/gh-5288-qsync-recovery.result
 create mode 100644 test/replication/gh-5288-qsync-recovery.test.lua

diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 2ff3bdd05..3655338d8 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -281,7 +281,8 @@ static void
 txn_limbo_write_cb(struct journal_entry *entry)
 {
 	assert(entry->complete_data != NULL);
-	fiber_wakeup(entry->complete_data);
+	if (fiber() != entry->complete_data)
+		fiber_wakeup(entry->complete_data);
 }
 
 static void
diff --git a/test/replication/gh-5288-qsync-recovery.result b/test/replication/gh-5288-qsync-recovery.result
new file mode 100644
index 000000000..dc0babef6
--- /dev/null
+++ b/test/replication/gh-5288-qsync-recovery.result
@@ -0,0 +1,27 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+--
+-- gh-5288: transaction limbo could crash during recovery, because in WAL write
+-- completion callback it woken up the currently active fiber.
+--
+s = box.schema.space.create('sync', {is_sync = true})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+s:insert{1}
+ | ---
+ | - [1]
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+test_run:cmd('restart server default')
+ | 
+box.space.sync:drop()
+ | ---
+ | ...
diff --git a/test/replication/gh-5288-qsync-recovery.test.lua b/test/replication/gh-5288-qsync-recovery.test.lua
new file mode 100644
index 000000000..00bff7b87
--- /dev/null
+++ b/test/replication/gh-5288-qsync-recovery.test.lua
@@ -0,0 +1,11 @@
+test_run = require('test_run').new()
+--
+-- gh-5288: transaction limbo could crash during recovery, because in WAL write
+-- completion callback it woken up the currently active fiber.
+--
+s = box.schema.space.create('sync', {is_sync = true})
+_ = s:create_index('pk')
+s:insert{1}
+box.snapshot()
+test_run:cmd('restart server default')
+box.space.sync:drop()
-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/1] limbo: don't wake self fiber on CONFIRM write
  2020-09-10 21:53 [Tarantool-patches] [PATCH 1/1] limbo: don't wake self fiber on CONFIRM write Vladislav Shpilevoy
@ 2020-09-12  4:35 ` Alexander V. Tikhonov
  2020-09-12 12:13 ` Vladislav Shpilevoy
  1 sibling, 0 replies; 3+ messages in thread
From: Alexander V. Tikhonov @ 2020-09-12  4:35 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

Hi Vlad, thanks a lot for your patch. I've checked and it really helped
to fix the issue, patch LGTM.

On Thu, Sep 10, 2020 at 11:53:57PM +0200, Vladislav Shpilevoy wrote:
> During recovery WAL writes end immediately, without yields.
> Therefore WAL write completion callback is executed in the
> currently active fiber.
> 
> Txn limbo on CONFIRM WAL write wakes up the waiting fiber, which
> appears to be the same as the active fiber during recovery.
> 
> That breaks the fiber scheduler, because apparently it is not safe
> to wake the currently active fiber unless it is going to call
> fiber_yield() immediately after. See a comment in fiber_wakeup()
> implementation about that way of usage.
> 
> The patch simply stops waking the waiting fiber, if it is the
> currently active one.
> 
> Closes #5288
> Closes #5232
> ---
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5288-limbo-wakeup
> Issue: https://github.com/tarantool/tarantool/issues/5288
> Issue: https://github.com/tarantool/tarantool/issues/5232
> 
>  src/box/txn_limbo.c                           |  3 ++-
>  .../replication/gh-5288-qsync-recovery.result | 27 +++++++++++++++++++
>  .../gh-5288-qsync-recovery.test.lua           | 11 ++++++++
>  3 files changed, 40 insertions(+), 1 deletion(-)
>  create mode 100644 test/replication/gh-5288-qsync-recovery.result
>  create mode 100644 test/replication/gh-5288-qsync-recovery.test.lua
> 
> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
> index 2ff3bdd05..3655338d8 100644
> --- a/src/box/txn_limbo.c
> +++ b/src/box/txn_limbo.c
> @@ -281,7 +281,8 @@ static void
>  txn_limbo_write_cb(struct journal_entry *entry)
>  {
>  	assert(entry->complete_data != NULL);
> -	fiber_wakeup(entry->complete_data);
> +	if (fiber() != entry->complete_data)
> +		fiber_wakeup(entry->complete_data);
>  }
>  
>  static void
> diff --git a/test/replication/gh-5288-qsync-recovery.result b/test/replication/gh-5288-qsync-recovery.result
> new file mode 100644
> index 000000000..dc0babef6
> --- /dev/null
> +++ b/test/replication/gh-5288-qsync-recovery.result
> @@ -0,0 +1,27 @@
> +-- test-run result file version 2
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +--
> +-- gh-5288: transaction limbo could crash during recovery, because in WAL write
> +-- completion callback it woken up the currently active fiber.
> +--
> +s = box.schema.space.create('sync', {is_sync = true})
> + | ---
> + | ...
> +_ = s:create_index('pk')
> + | ---
> + | ...
> +s:insert{1}
> + | ---
> + | - [1]
> + | ...
> +box.snapshot()
> + | ---
> + | - ok
> + | ...
> +test_run:cmd('restart server default')
> + | 
> +box.space.sync:drop()
> + | ---
> + | ...
> diff --git a/test/replication/gh-5288-qsync-recovery.test.lua b/test/replication/gh-5288-qsync-recovery.test.lua
> new file mode 100644
> index 000000000..00bff7b87
> --- /dev/null
> +++ b/test/replication/gh-5288-qsync-recovery.test.lua
> @@ -0,0 +1,11 @@
> +test_run = require('test_run').new()
> +--
> +-- gh-5288: transaction limbo could crash during recovery, because in WAL write
> +-- completion callback it woken up the currently active fiber.
> +--
> +s = box.schema.space.create('sync', {is_sync = true})
> +_ = s:create_index('pk')
> +s:insert{1}
> +box.snapshot()
> +test_run:cmd('restart server default')
> +box.space.sync:drop()
> -- 
> 2.21.1 (Apple Git-122.3)
> 

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/1] limbo: don't wake self fiber on CONFIRM write
  2020-09-10 21:53 [Tarantool-patches] [PATCH 1/1] limbo: don't wake self fiber on CONFIRM write Vladislav Shpilevoy
  2020-09-12  4:35 ` Alexander V. Tikhonov
@ 2020-09-12 12:13 ` Vladislav Shpilevoy
  1 sibling, 0 replies; 3+ messages in thread
From: Vladislav Shpilevoy @ 2020-09-12 12:13 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko, avtikhon

Pushed to master and 2.5 as trivial.

@ChangeLog
* During recovery of synchronous changes from snapshot the instance could crash (gh-5288).

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2020-09-12 12:13 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-10 21:53 [Tarantool-patches] [PATCH 1/1] limbo: don't wake self fiber on CONFIRM write Vladislav Shpilevoy
2020-09-12  4:35 ` Alexander V. Tikhonov
2020-09-12 12:13 ` Vladislav Shpilevoy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox