From: Sergey Ostanevich
Date: Tue, 24 Nov 2020 10:31:40 +0300
Message-Id: <30AA598E-8418-44FF-8267-4214709A0815@tarantool.org>
In-Reply-To: <4b2ee1cf-babc-4745-7c01-355602b739bc@tarantool.org>
Subject: Re: [Tarantool-patches] [PATCH v2] core: handle fiber cancellation for fiber.cond
To: Vladislav Shpilevoy
Cc: tarantool-patches@dev.tarantool.org, Alexander Turenko

Yet another fix-up because of test update:

--- a/test/box/gh-4834-netbox-fiber-cancel.result
+++ b/test/box/gh-4834-netbox-fiber-cancel.result
@@ -1,8 +1,8 @@
 -- test-run result file version 2
-remote = require 'net.box'
+remote = require('net.box')
  | ---
  | ...
-fiber = require 'fiber'
+fiber = require('fiber')
  | ---
  | ...
 test_run = require('test_run').new()

Force-pushed.

> On 22 Nov 2020, at 19:01, Vladislav Shpilevoy wrote:
>
> Hi! Thanks for the changes!
>
> Technically almost good.
>
>>> On 17 Nov 2020, at 01:12, Vladislav Shpilevoy wrote:
>>>
>>> On 03.11.2020 11:20, Sergey Ostanevich wrote:
>>>> Hi Oleg!
>>>>
>>>> I believe the point about 'consistency' is not valid here. I put a
>>>> simple check that if diag is already set, then print it out.
>>>> For the fiber_cond_wait_timeout() it happened multiple times with various
>>>> reports, including this one:
>>>>
>>>> 2020-11-03 10:28:01.630 [72411] relay/unix/:(socket)/101/main C> Did not
>>>> set the DIAG to FiberIsCancelled, original diag: Missing .xlog file
>>>> between LSN 5 {1: 5} and 6 {1: 6}
>>>>
>>>> that is used in the test system:
>>>>
>>>> test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
>>>>
>>>> So, my resolution will be: it is wrong to set a diag in an arbitrary
>>>> place, without clear understanding of the reason. This is the case for
>>>> the cond_wait machinery, since it doesn't know _why_ the fiber is
>>>> cancelled.
>>>
>>> It is a wrong resolution, IMO. You just hacked cond wait not to change the
>>> other places. It is not about tests. Tests only show what is provided by the
>>> internal subsystems. And if they depend on fiber cond not setting diag in
>>> case of a fail, then it looks wrong.
>>>
>>
>> Actually I didn't make fiber_cond not setting a diag, rather preserve the
>> original one if it is present.
>
> Yes, by not setting a diag.
>
>>>> Before this patch fiber.cond():wait() just returns for cancelled
>>>> fiber. In contrast fiber.channel():get() threw "fiber is
>>>> canceled" error.
>>>> This patch unifies behaviour of channels and condvars and also fixes
>>>> related net.box module problem - it was impossible to interrupt
>>>> net.box call with fiber.cancel because it used fiber.cond under
>>>> the hood. Test cases for both bugs are added.
>>>
>>> Netbox hangs not because of using fiber.cond. But because it never
>>> calls testcancel(). Normally all looped fibers should do that.
>>>
>> I believe the fix is an indirect result, since all 3 wait() calls in
>> net_box.lua aren't covered with a pcall() - it errors out. I reproduced
>> it with the following:
>
> I know that your patch fixes it.
> But I was talking about the statement in the commit message. You said:
>
>     it was impossible to interrupt net.box call with fiber.cancel
>     because it used fiber.cond under the hood
>
> And it still uses the cond, but does not hang anymore. Therefore, the
> statement obviously isn't true. The fact of cond usage is not a reason.
>
> See 4 comments below.
>
>> diff --git a/src/box/box.cc b/src/box/box.cc
>> index 1f7dec362..aa23dcc13 100644
>> --- a/src/box/box.cc
>> +++ b/src/box/box.cc
>> @@ -305,10 +305,9 @@ box_wait_ro(bool ro, double timeout)
>>  {
>>  	double deadline = ev_monotonic_now(loop()) + timeout;
>>  	while (is_box_configured == false || box_is_ro() != ro) {
>> -		if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0)
>> -			return -1;
>> -		if (fiber_is_cancelled()) {
>> -			diag_set(FiberIsCancelled);
>> +		if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0) {
>> +			if (fiber_is_cancelled())
>> +				diag_set(FiberIsCancelled);
>
> 1. Why do you need this diag_set here? The cancellation diag is already
> set in fiber_cond_wait_deadline(). Why do you set it again?
>
>>  			return -1;
>>  		}
>>  	}
>> diff --git a/src/box/relay.cc b/src/box/relay.cc
>> index 1e77e0d9b..cce139c87 100644
>> --- a/src/box/relay.cc
>> +++ b/src/box/relay.cc
>> @@ -821,8 +821,18 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync,
>>  			      relay_subscribe_f, relay);
>>  	if (rc == 0)
>>  		rc = cord_cojoin(&relay->cord);
>> -	if (rc != 0)
>> +	if (rc != 0) {
>> +		/*
>> +		 * We should always raise a problem from relay itself, not all
>> +		 * other modules that are change diag in the current fiber.
>> +		 * TODO: investigate why and how we can leave the relay_subscribe_f
>> +		 * with diag unset in the relay.
>> +		 */
>> +		if (diag_last_error(&relay->diag)) {
>
> 2. Please, use != NULL, according to our code style.
> https://github.com/tarantool/tarantool/wiki/Code-review-procedure#code-style
>
>> +			diag_set_error(diag_get(), diag_last_error(&relay->diag));
>> +		}
>
> 3. What is the test where diag_last_error() is NULL? I added
> an assertion, that it is not NULL, and the tests passed (except a
> couple of tests which always fail on my machine due to too long UNIX
> socket path).
>
> Also in the end of relay_subscribe_f() there is an existing
> assert(!diag_is_empty(&relay->diag)). So it can't be NULL. Or do you
> have a test?
>
> The only possible issue I see is that
>
>     diag_set_error(diag_get(), diag_last_error(&relay->diag));
>
> is called too early in relay_subscribe_f(). After that the diag
> can be rewritten somehow. For instance, by fiber_join(reader). So
> probably you can move this diag move to the end right before 'return -1;',
> and remove this hunk entirely. In a separate commit in the same branch,
> since it is not related to fiber_cond bug directly.
>
>>  		diag_raise();
>> +	}
>>  }
>> diff --git a/test/box/gh-4834-netbox-fiber-cancel.result b/test/box/gh-4834-netbox-fiber-cancel.result
>> new file mode 100644
>> index 000000000..4ed04bb61
>> --- /dev/null
>> +++ b/test/box/gh-4834-netbox-fiber-cancel.result
>> @@ -0,0 +1,62 @@
>> +-- test-run result file version 2
>> +remote = require 'net.box'
>> + | ---
>> + | ...
>> +fiber = require 'fiber'
>
> 4. Please, use () for require. You even did it in the other test file,
> but here you changed your mind somewhy. Or it is bad copy-paste.