[PATCH v2] Fix fiber_join() hang in case fiber_cancel() was called

Vladimir Davydov vdavydov.dev at gmail.com
Wed Feb 6 12:55:55 MSK 2019


On Tue, Feb 05, 2019 at 06:01:11PM +0300, Serge Petrenko wrote:
> In case a fiber joining another fiber gets cancelled, it stays suspended
> forever and never finishes joining. This happens because fiber_cancel()
> wakes the fiber and removes it from all execution queues.
> Fix this by handling possible cancellation in fiber_join().
> 
> Closes #3948
> ---
> Changes in v2:
>   - rewrote the test completely.
>   - instead of continuing to join if the fiber
>     is cancelled make the fiber to be joined
>     non-joinable and exit. This solution was
>     discussed verbally.
>   - revert comment changes for fiber_yield().
>     It really isn't a cancellation point.
> 
>  src/fiber.c             | 25 ++++++++++++++++---------
>  src/lua/fiber.c         | 11 ++++++++++-
>  test/app/fiber.result   | 33 +++++++++++++++++++++++++++++++++
>  test/app/fiber.test.lua | 17 +++++++++++++++++
>  4 files changed, 76 insertions(+), 10 deletions(-)
> 
> diff --git a/src/fiber.c b/src/fiber.c
> index 6f3d0ab78..7e9a3d38d 100644
> --- a/src/fiber.c
> +++ b/src/fiber.c
> @@ -396,18 +396,25 @@ fiber_join(struct fiber *fiber)
>  
>  		do {
>  			fiber_yield();
> +			if (fiber_is_cancelled())
> +				break;
>  		} while (! fiber_is_dead(fiber));
>  	}
> -
> -	/* Move exception to the caller */
> -	int ret = fiber->f_ret;
> -	if (ret != 0) {
> -		assert(!diag_is_empty(&fiber->diag));
> -		diag_move(&fiber->diag, &fiber()->diag);
> +	if (! fiber_is_dead(fiber)) {
> +		fiber_set_joinable(fiber, false);
> +		diag_set(FiberIsCancelled);
> +		return -1;
> +	} else {
> +		/* Move exception to the caller */
> +		int ret = fiber->f_ret;
> +		if (ret != 0) {
> +			assert(!diag_is_empty(&fiber->diag));
> +			diag_move(&fiber->diag, &fiber()->diag);
> +		}
> +		/* The fiber is already dead. */
> +		fiber_recycle(fiber);
> +		return ret;
>  	}
> -	/* The fiber is already dead. */
> -	fiber_recycle(fiber);
> -	return ret;
>  }
>  
>  /**
> diff --git a/src/lua/fiber.c b/src/lua/fiber.c
> index 8b17d6475..c655c5258 100644
> --- a/src/lua/fiber.c
> +++ b/src/lua/fiber.c
> @@ -33,6 +33,7 @@
>  #include <fiber.h>
>  #include "lua/utils.h"
>  #include "backtrace.h"
> +#include "exception.h"
>  
>  #include <lua.h>
>  #include <lauxlib.h>
> @@ -676,10 +677,18 @@ lbox_fiber_join(struct lua_State *L)
>  {
>  	struct fiber *fiber = lbox_checkfiber(L, 1);
>  	struct lua_State *child_L = fiber->storage.lua.stack;
> -	fiber_join(fiber);
> +	int f_ret = fiber_join(fiber);
>  	struct error *e = NULL;
>  	int num_ret = 0;
>  	int coro_ref = 0;
> +
> +	if (f_ret != 0 && diag_last_error(&fiber()->diag)->type ==
> +	    &type_FiberIsCancelled) {

The check looks fragile - what if both our fiber and the fiber we've
been waiting for are cancelled? I'm afraid we'd leak the fiber we are
supposed to join then.

> +		/* The fiber was cancelled before we joined. */
> +		luaL_testcancel(L);
> +	}
> +	luaL_testcancel(L);
> +

luaL_testcancel() is called either...

Anyway, now I seem to understand why you wanted to ignore fiber_cancel
in fiber_join - making it cancellable complicates the function protocol,
rendering the function barely usable. Guess I was wrong when asked you
to rework the patch, sorry. Let's revert to v1.

>  	if (child_L != NULL) {
>  		coro_ref = lua_tointeger(child_L, -1);
>  		lua_pop(child_L, 1);
> diff --git a/test/app/fiber.result b/test/app/fiber.result
> index ab7c1941b..f73d32671 100644
> --- a/test/app/fiber.result
> +++ b/test/app/fiber.result
> @@ -1411,6 +1411,39 @@ l = nil
>  l1 = nil
>  ---
>  ...
> +-- gh-3948 fiber.join() blocks if fiber is cancelled.
> +function another_func() fiber.yield() end
> +---
> +...
> +test_run:cmd("setopt delimiter ';'")
> +---
> +- true
> +...
> +function func()
> +    local fib = fiber.create(another_func)
> +    fib:set_joinable(true)
> +    fib:join()
> +end;
> +---
> +...
> +f = fiber.create(func)
> +f:cancel()
> +while f:status() ~= 'dead' do fiber.sleep(0.01) end;

AFAICS the test highly depends on the scheduler algorithm. Let's rewrite
it using fiber.channel please.

> +---
> +...
> +test_run:cmd("setopt delimiter ''");
> +---
> +- true
> +...

> +f = nil
> +---
> +...
> +func = nil
> +---
> +...
> +another_func = nil
> +---
> +...

These assignments aren't necessary.



More information about the Tarantool-patches mailing list