[Tarantool-patches] [PATCH v2] core: handle fiber cancellation for fiber.cond

Leonid Vasiliev lvasiliev at tarantool.org
Wed Nov 4 13:00:36 MSK 2020


Continuation of the review.

On 03.11.2020 13:20, Sergey Ostanevich wrote:
> Hi Oleg!
> 
> I believe the point about 'consistency' is not valid here. I put a
> simple check that if diag is already set, then print it out. For the
> fiber_cond_wait_timeout() it happened multiple times with various
> reports, inlcuding this one:
> 
> 2020-11-03 10:28:01.630 [72411] relay/unix/:(socket)/101/main C> Did not
> set the DIAG to FiberIsCancelled, original diag: Missing .xlog file
> between LSN 5 {1: 5} and 6 {1: 6}
> 
> that is used in the test system:
> 
> test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status =
> 'loading'})
> 
> So, my resolution will be: it is wrong to set a diag in an arbitrary
> place, without clear understanting of the reason. This is the case for
> the cond_wait machinery, since it doesn't know _why_ the fiber is
> cancelled.
> 
> I propose to cover the case with condition on empty diag and call it a
> day.
> 
> regards,
> Sergos
> 
> =====
> 
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 18568df3b..9e824453d 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -305,10 +305,8 @@ box_wait_ro(bool ro, double timeout)
>   {
>   	double deadline = ev_monotonic_now(loop()) + timeout;
>   	while (is_box_configured == false || box_is_ro() != ro) {
> -		if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0)
> -			return -1;
> -		if (fiber_is_cancelled()) {
> -			diag_set(FiberIsCancelled);
> +		if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0) {
> +			if (fiber_is_cancelled()) diag_set(FiberIsCancelled);

7) According to 
https://www.tarantool.io/en/doc/latest/dev_guide/c_style_guide/
it's seems like you are trying to hide something)
Use:
```
if (condition)
     action();
```
Instead of:
```
if (condition) action();
another_action();
```

>   			return -1;
>   		}
>   	}
> diff --git a/src/lib/core/fiber_cond.c b/src/lib/core/fiber_cond.c
> index 904a350d9..0c93c5842 100644
> --- a/src/lib/core/fiber_cond.c
> +++ b/src/lib/core/fiber_cond.c
> @@ -108,6 +108,10 @@ fiber_cond_wait_timeout(struct fiber_cond *c, double timeout)
>   		diag_set(TimedOut);
>   		return -1;
>   	}
> +	if (fiber_is_cancelled()) {
> +		if (diag_is_empty(diag_get())) diag_set(FiberIsCancelled);

8) The same as previously:
```
if (condition)
     action();
```

9) Checking diag on empty looks strange to me. I think we should add an
error to diag without this check. If we want to save the previous error,
I suggest to use the stack diag.

> +		return -1;
> +	}
>   	return 0;
>   }
>   
> diff --git a/test/app-tap/gh-5013-fiber-cancel.test.lua b/test/app-tap/gh-5013-fiber-cancel.test.lua
> new file mode 100755
> index 000000000..ae805c5bf
> --- /dev/null
> +++ b/test/app-tap/gh-5013-fiber-cancel.test.lua
> @@ -0,0 +1,23 @@
> +#!/usr/bin/env tarantool
> +
> +local tap = require('tap')
> +local fiber = require('fiber')
> +local test = tap.test("gh-5013-fiber-cancel")
> +
> +test:plan(2)
> +
> +local result = {}
> +
> +function test_f()
> +    local cond = fiber.cond()
> +    local res, err = pcall(cond.wait, cond)
> +    result.res = res
> +    result.err = err
> +end
> +
> +local f = fiber.create(test_f)
> +f:cancel()
> +fiber.yield()
> +
> +test:ok(result.res == false, tostring(result.res))
> +test:ok(tostring(result.err) == 'fiber is cancelled', tostring(result.err))

10) Use a user-frendly check message (in both checks).

> diff --git a/test/box/net.box_fiber_cancel_gh-4834.result b/test/box/net.box_fiber_cancel_gh-4834.result
> new file mode 100644
> index 000000000..eab0a5e4d
> --- /dev/null
> +++ b/test/box/net.box_fiber_cancel_gh-4834.result
> @@ -0,0 +1,65 @@
> +-- test-run result file version 2
> +remote = require 'net.box'
> + | ---
> + | ...
> +fiber = require 'fiber'
> + | ---
> + | ...
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +
> +-- #4834: Cancelling fiber doesn't interrupt netbox operations
> +function infinite_call() fiber.channel(1):get() end
> + | ---
> + | ...
> +box.schema.func.create('infinite_call')
> + | ---
> + | ...
> +box.schema.user.grant('guest', 'execute', 'function', 'infinite_call')
> + | ---
> + | ...
> +
> +error_msg = nil
> + | ---
> + | ...
> +test_run:cmd("setopt delimiter ';'")
> + | ---
> + | - true
> + | ...
> +function gh4834( > +    local cn = remote.connect(box.cfg.listen)
> +    local f = fiber.new(function()
> +        _, error_msg = pcall(cn.call, cn, 'infinite_call')
> +    end)
> +    f:set_joinable(true)
> +    fiber.yield()
> +    f:cancel()
> +    f:join()
> +    cn:close()
> +end;
> + | ---
> + | ...
> +test_run:cmd("setopt delimiter ''");
> + | ---
> + | - true
> + | ...
> +gh4834()
> + | ---
> + | ...
> +error_msg
> + | ---
> + | - fiber is cancelled
> + | ...
> +box.schema.func.drop('infinite_call')
> + | ---
> + | ...
> +infinite_call = nil
> + | ---
> + | ...
> +channel = nil
> + | ---
> + | ...
> +error_msg = nil
> + | ---
> + | ...
> diff --git a/test/box/net.box_fiber_cancel_gh-4834.test.lua b/test/box/net.box_fiber_cancel_gh-4834.test.lua
> new file mode 100644
> index 000000000..06fb3ceac
> --- /dev/null
> +++ b/test/box/net.box_fiber_cancel_gh-4834.test.lua
> @@ -0,0 +1,29 @@
> +remote = require 'net.box'
> +fiber = require 'fiber'
> +test_run = require('test_run').new()
> +
> +-- #4834: Cancelling fiber doesn't interrupt netbox operations
> +function infinite_call() fiber.channel(1):get() end
> +box.schema.func.create('infinite_call')
> +box.schema.user.grant('guest', 'execute', 'function', 'infinite_call')
> +
> +error_msg = nil
> +test_run:cmd("setopt delimiter ';'")
> +function gh4834()

11) I think this is not a self-documenting name for the function.

> +    local cn = remote.connect(box.cfg.listen)
> +    local f = fiber.new(function()
> +        _, error_msg = pcall(cn.call, cn, 'infinite_call')
> +    end)
> +    f:set_joinable(true)
> +    fiber.yield()
> +    f:cancel()
> +    f:join()
> +    cn:close()
> +end;
> +test_run:cmd("setopt delimiter ''");
> +gh4834()
> +error_msg
> +box.schema.func.drop('infinite_call')
> +infinite_call = nil
> +channel = nil

12) You didn't introduce the variable "channel". In addition, I looked
at a couple of tests and didn't see them cleaning up variables. But I
don't mind.

> +error_msg = nil
> 
> 
> On 01 ноя 13:13, Oleg Babin wrote:
>> Hi! Thanks for changes. See two comments below.
>>
>> On 31/10/2020 19:29, sergos at tarantool.org wrote:
>>> From: Sergey Ostanevich <sergos at tarantool.org>
>>>
>>> Hi!
>>>
>>> Thanks to Oleg Babin's comment I found there's no need to update any lua
>>> interfaces, since the reason was in C implementation. Also, there is one
>>> place the change is played, so after I fixed it I got complete testing
>>> pass.
>>> Force-pushed branch, v2 patch attached.
>>>
>>>
>>>
>>> Before this patch fiber.cond():wait() just returns for cancelled
>>> fiber. In contrast fiber.channel():get() threw "fiber is
>>> canceled" error.
>>> This patch unify behaviour of channels and condvars and also fixes
>>> related net.box module problem - it was impossible to interrupt
>>> net.box call with fiber.cancel because it used fiber.cond under
>>> the hood. Test cases for both bugs are added.
>>>
>>> Closes #4834
>>> Closes #5013
>>>
>>> Co-authored-by: Oleg Babin <olegrok at tarantool.org>
>>>
>>> @TarantoolBot document
>>> Title: fiber.cond():wait() throws if fiber is cancelled
>>>
>>> Currently fiber.cond():wait() throws an error if waiting fiber is
>>> cancelled like in case with fiber.channel():get().
>>> ---
>>>
>>> Github: https://gitlab.com/tarantool/tarantool/-/commits/sergos/gh-5013-fiber-cond
>>> Issue: https://github.com/tarantool/tarantool/issues/5013
>>>
>>>    src/box/box.cc                                |  6 +-
>>>    src/lib/core/fiber_cond.c                     |  1 +
>>>    test/app-tap/gh-5013-fiber-cancel.test.lua    | 23 +++++++
>>>    test/box/net.box_fiber_cancel_gh-4834.result  | 65 +++++++++++++++++++
>>>    .../box/net.box_fiber_cancel_gh-4834.test.lua | 29 +++++++++
>>>    5 files changed, 120 insertions(+), 4 deletions(-)
>>>    create mode 100755 test/app-tap/gh-5013-fiber-cancel.test.lua
>>>    create mode 100644 test/box/net.box_fiber_cancel_gh-4834.result
>>>    create mode 100644 test/box/net.box_fiber_cancel_gh-4834.test.lua
>>>
>>> diff --git a/src/box/box.cc b/src/box/box.cc
>>> index 18568df3b..bfa1051f9 100644
>>> --- a/src/box/box.cc
>>> +++ b/src/box/box.cc
>>> @@ -305,10 +305,8 @@ box_wait_ro(bool ro, double timeout)
>>>    {
>>>    	double deadline = ev_monotonic_now(loop()) + timeout;
>>>    	while (is_box_configured == false || box_is_ro() != ro) {
>>> -		if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0)
>>> -			return -1;
>>> -		if (fiber_is_cancelled()) {
>>> -			diag_set(FiberIsCancelled);
>>> +		if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0) {
>>> +                        if (fiber_is_cancelled()) diag_set(FiberIsCancelled);
>>
>> Here you use spaces instead of tabs.
>>
> 
> Fixed.
> 
>>>    			return -1;
>>>    		}
>>>    	}
>>> diff --git a/src/lib/core/fiber_cond.c b/src/lib/core/fiber_cond.c
>>> index 904a350d9..b0645069e 100644
>>> --- a/src/lib/core/fiber_cond.c
>>> +++ b/src/lib/core/fiber_cond.c
>>> @@ -108,6 +108,7 @@ fiber_cond_wait_timeout(struct fiber_cond *c, double timeout)
>>>    		diag_set(TimedOut);
>>>    		return -1;
>>>    	}
>>> +	if (fiber_is_cancelled()) return -1;
>>
>> It's qute strange to return -1 here but don't set a reason to diagnostic
>> area. Look how it is done for channels
>>
>> (https://github.com/tarantool/tarantool/blob/42c64d06d5d1a3ec937b3c596af083a672a68ad8/src/lib/core/fiber_channel.c#L180).
>>
>> There is some inconsistency without it.
>>
>> I've looked a bit deeper at the failure I reported before. Seems the problem
>> is in "cbus_unpair" function.
>>
>> The problem appears only if FiberIsCancelled is setted to diag area in
>> "fiber_cond_wait" function.
>>
>> This is where my expertise ends, as I'm not familiar with "cbus". However I
>> have some minds how it could be eliminated.
>>
>> Let's declare cbus_unpair fiber as is not cancellable and stop report
>> is_cancellable flag for non-cancellable fibers. See some PoC below:
>>
>>
>> diff --git a/src/lib/core/cbus.c b/src/lib/core/cbus.c
>> index 5d91fb948..4167c756a 100644
>> --- a/src/lib/core/cbus.c
>> +++ b/src/lib/core/cbus.c
>> @@ -630,6 +630,7 @@ cbus_unpair(struct cpipe *dest_pipe, struct cpipe
>> *src_pipe,
>>       msg.unpair_arg = unpair_arg;
>>       msg.src_pipe = src_pipe;
>>       msg.complete = false;
>> +    fiber_set_cancellable(false);
>>       fiber_cond_create(&msg.cond);
>>
>>       cpipe_push(dest_pipe, &msg.cmsg);
>> @@ -643,6 +644,7 @@ cbus_unpair(struct cpipe *dest_pipe, struct cpipe
>> *src_pipe,
>>           fiber_cond_wait(&msg.cond);
>>       }
>>
>> +    fiber_set_cancellable(true);
>>       cpipe_destroy(dest_pipe);
>>   }
>>
>> diff --git a/src/lib/core/fiber.c b/src/lib/core/fiber.c
>> index 483ae3ce1..8100c9da6 100644
>> --- a/src/lib/core/fiber.c
>> +++ b/src/lib/core/fiber.c
>> @@ -553,6 +553,9 @@ fiber_set_cancellable(bool yesno)
>>   bool
>>   fiber_is_cancelled(void)
>>   {
>> +    if ((fiber()->flags & FIBER_IS_CANCELLABLE) == 0) {
>> +        return false;
>> +    }
>>       return fiber()->flags & FIBER_IS_CANCELLED;
>>   }
>>
>>
>> To be honest I've not checked such change carefully and also have segfault
>> at replication/gc.test.lua for "memtx" engine.
>>
>> Finally, feel free to ignore this comment I hope Vlad or Sasha can give you
>> more accurate and correct advices.
>>


More information about the Tarantool-patches mailing list