Tarantool development patches archive
 help / color / mirror / Atom feed
From: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
To: Sergey Ostanevich <sergos@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org, alexander.turenko@tarantool.org
Subject: Re: [Tarantool-patches] [PATCH v2] core: handle fiber cancellation for fiber.cond
Date: Sun, 22 Nov 2020 17:01:03 +0100	[thread overview]
Message-ID: <4b2ee1cf-babc-4745-7c01-355602b739bc@tarantool.org> (raw)
In-Reply-To: <100FE749-04E9-400D-8F6C-1E45F28B5A63@tarantool.org>

Hi! Thanks for the changes!

Technically almost good.

>> On 17 Nov 2020, at 01:12, Vladislav Shpilevoy <v.shpilevoy@tarantool.org> wrote:
>>
>> On 03.11.2020 11:20, Sergey Ostanevich wrote:
>>> Hi Oleg!
>>>
>>> I believe the point about 'consistency' is not valid here. I put a
>>> simple check that if diag is already set, then print it out. For the
>>> fiber_cond_wait_timeout() it happened multiple times with various
>>> reports, inlcuding this one:
>>>
>>> 2020-11-03 10:28:01.630 [72411] relay/unix/:(socket)/101/main C> Did not
>>> set the DIAG to FiberIsCancelled, original diag: Missing .xlog file
>>> between LSN 5 {1: 5} and 6 {1: 6}
>>>
>>> that is used in the test system:
>>>
>>> test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status =
>>> 'loading'})
>>>
>>> So, my resolution will be: it is wrong to set a diag in an arbitrary
>>> place, without clear understanting of the reason. This is the case for
>>> the cond_wait machinery, since it doesn't know _why_ the fiber is
>>> cancelled.
>>
>> It is a wrong resolution, IMO. You just hacked cond wait not to change the
>> other places. It is not about tests. Tests only show what is provided by the
>> internal subsystems. And if they depend on fiber cond not setting diag in
>> case of a fail, then it looks wrong.
>>
> 
> Actually I didn’t make fiber_cond not setting a diag, rather preserve the
> original one if it is present.

Yes, by not setting a diag.

>>> Before this patch fiber.cond():wait() just returns for cancelled
>>> fiber. In contrast fiber.channel():get() threw "fiber is
>>> canceled" error.
>>> This patch unify behaviour of channels and condvars and also fixes
>>> related net.box module problem - it was impossible to interrupt
>>> net.box call with fiber.cancel because it used fiber.cond under
>>> the hood. Test cases for both bugs are added.
>>
>> Netbox hangs not because of using fiber.cond. But because it never
>> calls testcancel(). Normally all looped fibers should do that.
>>
> I believe the fix is an indirect result, since all 3 wait() calls in
> net_box.lua aren't covered with a pcall() - it errors out. I reproduced
> it with the following:

I know that your patch fixes it. But I was talking about the statement in
the commit message. You said:

	it was impossible to interrupt net.box call with fiber.cancel
	because it used fiber.cond under the hood

And it still uses the cond, but does not hang anymore. Therefore, the
statement obviously isn't true. The fact of cond usage is not a reason.

See 4 comments below.

> diff --git a/src/box/box.cc b/src/box/box.cc
> index 1f7dec362..aa23dcc13 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -305,10 +305,9 @@ box_wait_ro(bool ro, double timeout)
>  {
>  	double deadline = ev_monotonic_now(loop()) + timeout;
>  	while (is_box_configured == false || box_is_ro() != ro) {
> -		if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0)
> -			return -1;
> -		if (fiber_is_cancelled()) {
> -			diag_set(FiberIsCancelled);
> +		if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0) {
> +			if (fiber_is_cancelled())
> +				diag_set(FiberIsCancelled);

1. Why do you need this diag_set here? The cancellation diag is already
set in fiber_cond_wait_deadline(). Why do you set it again?

>  			return -1;
>  		}
>  	}
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index 1e77e0d9b..cce139c87 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -821,8 +821,18 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync,
>  			      relay_subscribe_f, relay);
>  	if (rc == 0)
>  		rc = cord_cojoin(&relay->cord);
> -	if (rc != 0)
> +	if (rc != 0) {
> +		/*
> +		 * We should always raise a problem from relay itself, not all
> +		 * other modules that are change diag in the current fiber.
> +		 * TODO: investigate why and how we can leave the relay_subscribe_f
> +		 * with diag unset in the relay.
> +		 */
> +		if (diag_last_error(&relay->diag)) {

2. Please, use != NULL, according to our code style.
https://github.com/tarantool/tarantool/wiki/Code-review-procedure#code-style

> +			diag_set_error(diag_get(), diag_last_error(&relay->diag));
> +		}

3. What is the test where diag_last_error() is NULL? I added
an assertion, that it is not NULL, and the tests passed (except a
couple of tests which always fail on my machine due to too long UNIX
socket path).

Also in the end of relay_subscribe_f() there is an existing
assert(!diag_is_empty(&relay->diag)). So it can't be NULL. Or do you
have a test?

The only possible issue I see is that

	diag_set_error(diag_get(), diag_last_error(&relay->diag));

is called too early in relay_subscribe_f(). After that the diag
can be rewritten somehow. For instance, by fiber_join(reader). So
probably you can move this diag move to the end right before 'return -1;',
and remove this hunk entirely. In a separate commit in the same branch,
since it is not related to fiber_cond bug directly.

>  		diag_raise();
> +	}
>  }
> diff --git a/test/box/gh-4834-netbox-fiber-cancel.result b/test/box/gh-4834-netbox-fiber-cancel.result
> new file mode 100644
> index 000000000..4ed04bb61
> --- /dev/null
> +++ b/test/box/gh-4834-netbox-fiber-cancel.result
> @@ -0,0 +1,62 @@
> +-- test-run result file version 2
> +remote = require 'net.box'
> + | ---
> + | ...
> +fiber = require 'fiber'

4. Please, use () for require. You even did it in the other test file,
but here you changed your mind somewhy. Or it is bad copy-paste.

  reply	other threads:[~2020-11-22 16:01 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-31 16:29 sergos
2020-11-01 10:13 ` Oleg Babin
2020-11-03 10:20   ` Sergey Ostanevich
2020-11-03 10:27     ` Oleg Babin
2020-11-04 10:00     ` Leonid Vasiliev
2020-11-16 22:12     ` Vladislav Shpilevoy
2020-11-18 22:05       ` Sergey Ostanevich
2020-11-22 16:01         ` Vladislav Shpilevoy [this message]
2020-11-23 21:47           ` Sergey Ostanevich
2020-11-24  7:31           ` Sergey Ostanevich
2020-11-04 10:00 ` Leonid Vasiliev
2020-11-05 20:42   ` Sergey Ostanevich
2020-11-10 21:16   ` Sergey Ostanevich
2020-11-12 20:15     ` Sergey Ostanevich
2020-11-13  8:26       ` Leonid Vasiliev
2020-11-30 22:49         ` Alexander Turenko
2020-11-16 22:12 ` Vladislav Shpilevoy
2020-11-25 21:32 ` Vladislav Shpilevoy
2020-11-29 21:41   ` Sergey Ostanevich
2020-11-30 21:46   ` Alexander Turenko
2020-11-30 21:01 ` Vladislav Shpilevoy
2020-12-02 10:58 ` Alexander V. Tikhonov
2020-12-02 22:18 ` Vladislav Shpilevoy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4b2ee1cf-babc-4745-7c01-355602b739bc@tarantool.org \
    --to=v.shpilevoy@tarantool.org \
    --cc=alexander.turenko@tarantool.org \
    --cc=sergos@tarantool.org \
    --cc=tarantool-patches@dev.tarantool.org \
    --subject='Re: [Tarantool-patches] [PATCH v2] core: handle fiber cancellation for fiber.cond' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox