[Tarantool-patches] [PATCH v2] core: handle fiber cancellation for fiber.cond

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Sun Nov 22 19:01:03 MSK 2020


Hi! Thanks for the changes!

Technically almost good.

>> On 17 Nov 2020, at 01:12, Vladislav Shpilevoy <v.shpilevoy at tarantool.org> wrote:
>>
>> On 03.11.2020 11:20, Sergey Ostanevich wrote:
>>> Hi Oleg!
>>>
>>> I believe the point about 'consistency' is not valid here. I put a
>>> simple check that if diag is already set, then print it out. For the
>>> fiber_cond_wait_timeout() it happened multiple times with various
>>> reports, inlcuding this one:
>>>
>>> 2020-11-03 10:28:01.630 [72411] relay/unix/:(socket)/101/main C> Did not
>>> set the DIAG to FiberIsCancelled, original diag: Missing .xlog file
>>> between LSN 5 {1: 5} and 6 {1: 6}
>>>
>>> that is used in the test system:
>>>
>>> test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status =
>>> 'loading'})
>>>
>>> So, my resolution will be: it is wrong to set a diag in an arbitrary
>>> place, without clear understanting of the reason. This is the case for
>>> the cond_wait machinery, since it doesn't know _why_ the fiber is
>>> cancelled.
>>
>> It is a wrong resolution, IMO. You just hacked cond wait not to change the
>> other places. It is not about tests. Tests only show what is provided by the
>> internal subsystems. And if they depend on fiber cond not setting diag in
>> case of a fail, then it looks wrong.
>>
> 
> Actually I didn’t make fiber_cond not setting a diag, rather preserve the
> original one if it is present.

Yes, by not setting a diag.

>>> Before this patch fiber.cond():wait() just returns for cancelled
>>> fiber. In contrast fiber.channel():get() threw "fiber is
>>> canceled" error.
>>> This patch unify behaviour of channels and condvars and also fixes
>>> related net.box module problem - it was impossible to interrupt
>>> net.box call with fiber.cancel because it used fiber.cond under
>>> the hood. Test cases for both bugs are added.
>>
>> Netbox hangs not because of using fiber.cond. But because it never
>> calls testcancel(). Normally all looped fibers should do that.
>>
> I believe the fix is an indirect result, since all 3 wait() calls in
> net_box.lua aren't covered with a pcall() - it errors out. I reproduced
> it with the following:

I know that your patch fixes it. But I was talking about the statement in
the commit message. You said:

	it was impossible to interrupt net.box call with fiber.cancel
	because it used fiber.cond under the hood

And it still uses the cond, but does not hang anymore. Therefore, the
statement obviously isn't true. The fact of cond usage is not a reason.

See 4 comments below.

> diff --git a/src/box/box.cc b/src/box/box.cc
> index 1f7dec362..aa23dcc13 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -305,10 +305,9 @@ box_wait_ro(bool ro, double timeout)
>  {
>  	double deadline = ev_monotonic_now(loop()) + timeout;
>  	while (is_box_configured == false || box_is_ro() != ro) {
> -		if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0)
> -			return -1;
> -		if (fiber_is_cancelled()) {
> -			diag_set(FiberIsCancelled);
> +		if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0) {
> +			if (fiber_is_cancelled())
> +				diag_set(FiberIsCancelled);

1. Why do you need this diag_set here? The cancellation diag is already
set in fiber_cond_wait_deadline(). Why do you set it again?

>  			return -1;
>  		}
>  	}
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index 1e77e0d9b..cce139c87 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -821,8 +821,18 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync,
>  			      relay_subscribe_f, relay);
>  	if (rc == 0)
>  		rc = cord_cojoin(&relay->cord);
> -	if (rc != 0)
> +	if (rc != 0) {
> +		/*
> +		 * We should always raise a problem from relay itself, not all
> +		 * other modules that are change diag in the current fiber.
> +		 * TODO: investigate why and how we can leave the relay_subscribe_f
> +		 * with diag unset in the relay.
> +		 */
> +		if (diag_last_error(&relay->diag)) {

2. Please, use != NULL, according to our code style.
https://github.com/tarantool/tarantool/wiki/Code-review-procedure#code-style

> +			diag_set_error(diag_get(), diag_last_error(&relay->diag));
> +		}

3. What is the test where diag_last_error() is NULL? I added
an assertion, that it is not NULL, and the tests passed (except a
couple of tests which always fail on my machine due to too long UNIX
socket path).

Also in the end of relay_subscribe_f() there is an existing
assert(!diag_is_empty(&relay->diag)). So it can't be NULL. Or do you
have a test?

The only possible issue I see is that

	diag_set_error(diag_get(), diag_last_error(&relay->diag));

is called too early in relay_subscribe_f(). After that the diag
can be rewritten somehow. For instance, by fiber_join(reader). So
probably you can move this diag move to the end right before 'return -1;',
and remove this hunk entirely. In a separate commit in the same branch,
since it is not related to fiber_cond bug directly.

>  		diag_raise();
> +	}
>  }
> diff --git a/test/box/gh-4834-netbox-fiber-cancel.result b/test/box/gh-4834-netbox-fiber-cancel.result
> new file mode 100644
> index 000000000..4ed04bb61
> --- /dev/null
> +++ b/test/box/gh-4834-netbox-fiber-cancel.result
> @@ -0,0 +1,62 @@
> +-- test-run result file version 2
> +remote = require 'net.box'
> + | ---
> + | ...
> +fiber = require 'fiber'

4. Please, use () for require. You even did it in the other test file,
but here you changed your mind somewhy. Or it is bad copy-paste.


More information about the Tarantool-patches mailing list