From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtpng3.m.smailru.net (smtpng3.m.smailru.net [94.100.177.149]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 49B20469710 for ; Sun, 22 Nov 2020 19:01:06 +0300 (MSK) References: <20201031162911.61876-1-sergos@tarantool.org> <20201103102018.GC517@tarantool.org> <100FE749-04E9-400D-8F6C-1E45F28B5A63@tarantool.org> From: Vladislav Shpilevoy Message-ID: <4b2ee1cf-babc-4745-7c01-355602b739bc@tarantool.org> Date: Sun, 22 Nov 2020 17:01:03 +0100 MIME-Version: 1.0 In-Reply-To: <100FE749-04E9-400D-8F6C-1E45F28B5A63@tarantool.org> Content-Type: text/plain; charset="utf-8" Content-Language: en-US Content-Transfer-Encoding: 8bit Subject: Re: [Tarantool-patches] [PATCH v2] core: handle fiber cancellation for fiber.cond List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Sergey Ostanevich Cc: tarantool-patches@dev.tarantool.org, alexander.turenko@tarantool.org Hi! Thanks for the changes! Technically almost good. >> On 17 Nov 2020, at 01:12, Vladislav Shpilevoy wrote: >> >> On 03.11.2020 11:20, Sergey Ostanevich wrote: >>> Hi Oleg! >>> >>> I believe the point about 'consistency' is not valid here. I put a >>> simple check that if diag is already set, then print it out. For the >>> fiber_cond_wait_timeout() it happened multiple times with various >>> reports, inlcuding this one: >>> >>> 2020-11-03 10:28:01.630 [72411] relay/unix/:(socket)/101/main C> Did not >>> set the DIAG to FiberIsCancelled, original diag: Missing .xlog file >>> between LSN 5 {1: 5} and 6 {1: 6} >>> >>> that is used in the test system: >>> >>> test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = >>> 'loading'}) >>> >>> So, my resolution will be: it is wrong to set a diag in an arbitrary >>> place, without clear understanting of the reason. This is the case for >>> the cond_wait machinery, since it doesn't know _why_ the fiber is >>> cancelled. >> >> It is a wrong resolution, IMO. You just hacked cond wait not to change the >> other places. It is not about tests. Tests only show what is provided by the >> internal subsystems. And if they depend on fiber cond not setting diag in >> case of a fail, then it looks wrong. >> > > Actually I didn’t make fiber_cond not setting a diag, rather preserve the > original one if it is present. Yes, by not setting a diag. >>> Before this patch fiber.cond():wait() just returns for cancelled >>> fiber. In contrast fiber.channel():get() threw "fiber is >>> canceled" error. >>> This patch unify behaviour of channels and condvars and also fixes >>> related net.box module problem - it was impossible to interrupt >>> net.box call with fiber.cancel because it used fiber.cond under >>> the hood. Test cases for both bugs are added. >> >> Netbox hangs not because of using fiber.cond. But because it never >> calls testcancel(). Normally all looped fibers should do that. >> > I believe the fix is an indirect result, since all 3 wait() calls in > net_box.lua aren't covered with a pcall() - it errors out. I reproduced > it with the following: I know that your patch fixes it. But I was talking about the statement in the commit message. You said: it was impossible to interrupt net.box call with fiber.cancel because it used fiber.cond under the hood And it still uses the cond, but does not hang anymore. Therefore, the statement obviously isn't true. The fact of cond usage is not a reason. See 4 comments below. > diff --git a/src/box/box.cc b/src/box/box.cc > index 1f7dec362..aa23dcc13 100644 > --- a/src/box/box.cc > +++ b/src/box/box.cc > @@ -305,10 +305,9 @@ box_wait_ro(bool ro, double timeout) > { > double deadline = ev_monotonic_now(loop()) + timeout; > while (is_box_configured == false || box_is_ro() != ro) { > - if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0) > - return -1; > - if (fiber_is_cancelled()) { > - diag_set(FiberIsCancelled); > + if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0) { > + if (fiber_is_cancelled()) > + diag_set(FiberIsCancelled); 1. Why do you need this diag_set here? The cancellation diag is already set in fiber_cond_wait_deadline(). Why do you set it again? > return -1; > } > } > diff --git a/src/box/relay.cc b/src/box/relay.cc > index 1e77e0d9b..cce139c87 100644 > --- a/src/box/relay.cc > +++ b/src/box/relay.cc > @@ -821,8 +821,18 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync, > relay_subscribe_f, relay); > if (rc == 0) > rc = cord_cojoin(&relay->cord); > - if (rc != 0) > + if (rc != 0) { > + /* > + * We should always raise a problem from relay itself, not all > + * other modules that are change diag in the current fiber. > + * TODO: investigate why and how we can leave the relay_subscribe_f > + * with diag unset in the relay. > + */ > + if (diag_last_error(&relay->diag)) { 2. Please, use != NULL, according to our code style. https://github.com/tarantool/tarantool/wiki/Code-review-procedure#code-style > + diag_set_error(diag_get(), diag_last_error(&relay->diag)); > + } 3. What is the test where diag_last_error() is NULL? I added an assertion, that it is not NULL, and the tests passed (except a couple of tests which always fail on my machine due to too long UNIX socket path). Also in the end of relay_subscribe_f() there is an existing assert(!diag_is_empty(&relay->diag)). So it can't be NULL. Or do you have a test? The only possible issue I see is that diag_set_error(diag_get(), diag_last_error(&relay->diag)); is called too early in relay_subscribe_f(). After that the diag can be rewritten somehow. For instance, by fiber_join(reader). So probably you can move this diag move to the end right before 'return -1;', and remove this hunk entirely. In a separate commit in the same branch, since it is not related to fiber_cond bug directly. > diag_raise(); > + } > } > diff --git a/test/box/gh-4834-netbox-fiber-cancel.result b/test/box/gh-4834-netbox-fiber-cancel.result > new file mode 100644 > index 000000000..4ed04bb61 > --- /dev/null > +++ b/test/box/gh-4834-netbox-fiber-cancel.result > @@ -0,0 +1,62 @@ > +-- test-run result file version 2 > +remote = require 'net.box' > + | --- > + | ... > +fiber = require 'fiber' 4. Please, use () for require. You even did it in the other test file, but here you changed your mind somewhy. Or it is bad copy-paste.