From: Sergey Ostanevich
Date: Thu, 19 Nov 2020 01:05:05 +0300
Message-Id: <100FE749-04E9-400D-8F6C-1E45F28B5A63@tarantool.org>
References: <20201031162911.61876-1-sergos@tarantool.org> <20201103102018.GC517@tarantool.org>
Subject: Re: [Tarantool-patches] [PATCH v2] core: handle fiber cancellation for fiber.cond
To: Vladislav Shpilevoy
Cc: tarantool-patches@dev.tarantool.org, alexander.turenko@tarantool.org

Hi Vlad!

I put your comments and my patch from the second mail here to keep one
thread - see below.

Thanks,
Sergos

> On 17 Nov 2020, at 01:12, Vladislav Shpilevoy wrote:
>
> On 03.11.2020 11:20, Sergey Ostanevich wrote:
>> Hi Oleg!
>>
>> I believe the point about 'consistency' is not valid here. I put a
>> simple check that if diag is already set, then print it out. For the
>> fiber_cond_wait_timeout() it happened multiple times with various
>> reports, including this one:
>>
>> 2020-11-03 10:28:01.630 [72411] relay/unix/:(socket)/101/main C> Did not
>> set the DIAG to FiberIsCancelled, original diag: Missing .xlog file
>> between LSN 5 {1: 5} and 6 {1: 6}
>>
>> that is used in the test system:
>>
>> test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status =
>> 'loading'})
>>
>> So, my resolution will be: it is wrong to set a diag in an arbitrary
>> place, without clear understanding of the reason. This is the case for
>> the cond_wait machinery, since it doesn't know _why_ the fiber is
>> cancelled.
>
> It is a wrong resolution, IMO. You just hacked cond wait not to change the
> other places. It is not about tests. Tests only show what is provided by the
> internal subsystems. And if they depend on fiber cond not setting diag in
> case of a fail, then it looks wrong.
>

Actually, I didn't stop fiber_cond from setting a diag; it rather preserves
the original one if it is present. As for whether that is correct - you're
way more savvy in Tarantool sources than I am, so it's hard for me to object.
Still, I'll try.

In the test's case I see that relay_process_wal_event() catches the error
from recover_remaining_wals() and sets the XlogGapError for the relay. Note
that relay_set_error() uses exactly the same semantics: it sets the error
only if it was not set before. Then it happily cancels the fiber and returns,
and control eventually reaches relay_subscribe(), which throws (via
diag_raise) the latest diag of the fiber. It appears that along the way there
are multiple places where the fiber diag can be reset, so the best I can
propose is to use the relay's diag. Also, to enforce this strategy we need a
follow-up ticket to ensure the relay's diag is always set by the time
relay_subscribe() exits.

> I suggest you to fix the usage places, where the caller code thinks that
> cond_wait never sets a diag on cancellation.
>
> If a function fails, we set a diag. It is not a thing we do optionally.
> Otherwise you make it a bit simpler in this patch, but make it harder to
> work with the cond in future.
>
> Talking of your statement:
>
>     I believe the stack diag also is not supported there yet.
>
> It is supported on the level of lib/core, i.e. everywhere. But is not
> present on 1.10. However it is not the point. The point is that it is not
> needed here.
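
To make the two diag policies discussed above concrete, here is a minimal
sketch in plain C. It is a toy model, not Tarantool code: struct toy_diag and
every toy_* function are hypothetical names invented for illustration, and
the error strings are just the ones quoted in this thread. It shows a wait
that always sets "fiber is cancelled" overwriting an earlier error in the
fiber's diag, while a separate relay-owned diag filled with "set only if
empty" semantics still holds the original reason at exit.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for a diagnostics area: holds at most one error message. */
struct toy_diag {
    char msg[128];
};

static int
toy_diag_is_empty(const struct toy_diag *d)
{
    return d->msg[0] == '\0';
}

/* Unconditional set: the latest error always wins. */
static void
toy_diag_set(struct toy_diag *d, const char *msg)
{
    snprintf(d->msg, sizeof(d->msg), "%s", msg);
}

/* "First error wins": keep whatever is already stored. */
static void
toy_diag_set_if_empty(struct toy_diag *d, const char *msg)
{
    if (toy_diag_is_empty(d))
        toy_diag_set(d, msg);
}

/* Model of a cancelled wait that always reports the cancellation. */
static int
toy_cond_wait_cancelled(struct toy_diag *fiber_diag)
{
    toy_diag_set(fiber_diag, "fiber is cancelled");
    return -1;
}

/*
 * Model of the exit path: prefer the relay's own error if there is one,
 * otherwise report whatever is left in the fiber's diag.
 */
static const char *
toy_relay_exit_error(struct toy_diag *fiber_diag,
                     const struct toy_diag *relay_diag)
{
    if (!toy_diag_is_empty(relay_diag))
        toy_diag_set(fiber_diag, relay_diag->msg);
    return fiber_diag->msg;
}
```

In this model, if both diags start with "Missing .xlog file" and the wait is
then cancelled, the fiber's diag reads "fiber is cancelled", but
toy_relay_exit_error() still returns "Missing .xlog file".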

> On 17 Nov 2020, at 01:12, Vladislav Shpilevoy wrote:
>
> Hi! Thanks for the patch!
>
> Please, change subsystem name to 'fiber:'. 'core:' is too general.
> We have tons of stuff in libcore in addition to fibers.
>

Done.

>> Before this patch fiber.cond():wait() just returns for cancelled
>> fiber. In contrast fiber.channel():get() threw "fiber is
>> cancelled" error.
>> This patch unifies behaviour of channels and condvars and also fixes
>> related net.box module problem - it was impossible to interrupt
>> net.box call with fiber.cancel because it used fiber.cond under
>> the hood. Test cases for both bugs are added.
>
> Netbox hangs not because of using fiber.cond. But because it never
> calls testcancel(). Normally all looped fibers should do that.
>

I believe the fix is an indirect result: none of the 3 wait() calls in
net_box.lua is wrapped in a pcall(), so the error propagates. I reproduced
it with the following:

--- a/src/box/lua/net_box.lua
+++ b/src/box/lua/net_box.lua
@@ -414,7 +417,11 @@ local function create_transport(host, port, user, password, callback,
         -- waiting client is waked up prematurely.
         while timeout > 0 and not self:is_ready() do
             local ts = fiber.clock()
-            self.cond:wait(timeout)
+            local ok, err = pcall(self.cond.wait, self.cond, timeout)
+            if (not ok) then
+                print('net_box.lua:423 thrown, err: ' .. tostring(err))
+                error(err)
+            end
             timeout = timeout - (fiber.clock() - ts)
         end
         if not self:is_ready() then

Note that if I don't put the explicit error() there, the test from the
patch hangs.

To me this behaves like testcancel(), since it raises the error the same way.
There could be more places in net_box where testcancel() should be called,
but that is beyond this ticket.

> Talking of compatibility - I think it always was supposed to throw.
> luaT_fiber_cond_wait() calls luaL_testcancel(), but only when
> fiber_cond_wait_timeout() returns not 0, which was never the case for
> cancellation. So it was supposed to throw, but nobody covered it with
> a test.
>
> See 2 comments below.
>
>> Closes #4834
>> Closes #5013
>>
>> Co-authored-by: Oleg Babin
>> diff --git a/src/box/box.cc b/src/box/box.cc
>> index 18568df3b..29f74e94b 100644
>> --- a/src/box/box.cc
>> +++ b/src/box/box.cc
>> @@ -305,10 +305,9 @@ box_wait_ro(bool ro, double timeout)
>>  {
>>      double deadline = ev_monotonic_now(loop()) + timeout;
>>      while (is_box_configured == false || box_is_ro() != ro) {
>> -        if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0)
>> -            return -1;
>> -        if (fiber_is_cancelled()) {
>> -            diag_set(FiberIsCancelled);
>> +        if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0) {
>> +            if (fiber_is_cancelled())
>> +                diag_set(FiberIsCancelled);
>>              return -1;
>>          }
>>      }
>> diff --git a/src/lib/core/fiber_cond.c b/src/lib/core/fiber_cond.c
>> index 904a350d9..cc59eaafb 100644
>> --- a/src/lib/core/fiber_cond.c
>> +++ b/src/lib/core/fiber_cond.c
>> @@ -108,6 +108,11 @@ fiber_cond_wait_timeout(struct fiber_cond *c, double timeout)
>>          diag_set(TimedOut);
>>          return -1;
>>      }
>> +    if (fiber_is_cancelled()) {
>> +        if (diag_is_empty(diag_get()))
>> +            diag_set(FiberIsCancelled);
>> +        return -1;
>
> 1. Wtf? Why don't you set an error, when this is an error? And why do
> you do exactly the same in box_wait_ro() above?
>

Fixed as per first part of review.

>> +    }
>>      return 0;
>>  }
>> diff --git a/test/box/net.box_fiber_cancel_gh-4834.result b/test/box/net.box_fiber_cancel_gh-4834.result
>> new file mode 100644
>> index 000000000..4ed04bb61
>> --- /dev/null
>> +++ b/test/box/net.box_fiber_cancel_gh-4834.result
>
> 2. This does not conform to our test file name pattern. Read this
> carefully: https://github.com/tarantool/tarantool/wiki/Code-review-procedure#testing

Fixed. Although we do have a large legacy of similarly named tests under
test/box/.

Branch is force-pushed, updated patch is below.
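
As an aside on the compatibility remark above, here is a minimal sketch in
plain C of why the cancellation check was never reached. It is a toy model
under stated assumptions, not Tarantool code: all toy_* names are
hypothetical, and returning TOY_THREW_CANCELLED merely stands in for
luaL_testcancel() throwing. The wrapper only checks for cancellation when the
wait fails, so a wait that returns 0 for a cancelled fiber hides the
cancellation.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_outcome {
    TOY_RETURNED_NORMALLY,
    TOY_RETURNED_TIMEOUT,
    TOY_THREW_CANCELLED
};

/* Old behaviour in this model: a cancelled wait still reports success. */
static int
toy_wait_old(bool cancelled, bool timed_out)
{
    (void)cancelled;
    return timed_out ? -1 : 0;
}

/* New behaviour in this model: cancellation fails just like a timeout. */
static int
toy_wait_new(bool cancelled, bool timed_out)
{
    return (timed_out || cancelled) ? -1 : 0;
}

/* Wrapper that tests for cancellation only on the failure path. */
static enum toy_outcome
toy_lua_wait(int (*wait_fn)(bool, bool), bool cancelled, bool timed_out)
{
    if (wait_fn(cancelled, timed_out) != 0) {
        if (cancelled)
            return TOY_THREW_CANCELLED; /* models the throw on cancel */
        return TOY_RETURNED_TIMEOUT;
    }
    return TOY_RETURNED_NORMALLY;
}
```

With toy_wait_old a cancelled fiber gets TOY_RETURNED_NORMALLY from the
wrapper; with toy_wait_new the same call yields TOY_THREW_CANCELLED, while a
plain timeout is still reported as TOY_RETURNED_TIMEOUT.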

===============

From 5e6426340f4f5af8429c4273f1b251f503c6dd9b Mon Sep 17 00:00:00 2001
From: Sergey Ostanevich
Date: Tue, 3 Nov 2020 12:52:26 +0300
Subject: [PATCH] fiber: handle fiber cancellation for fiber.cond

Before this patch fiber.cond():wait() just returned for a cancelled
fiber. In contrast fiber.channel():get() threw a "fiber is
cancelled" error.
This patch unifies the behaviour of channels and condvars and also fixes
a related net.box module problem - it was impossible to interrupt
a net.box call with fiber.cancel because it used fiber.cond under
the hood. Test cases for both bugs are added.

Closes #4834
Closes #5013

Co-authored-by: Oleg Babin

@TarantoolBot document
Title: fiber.cond():wait() throws if fiber is cancelled

fiber.cond():wait() now throws an error if the waiting fiber is
cancelled.
---
Github: https://gitlab.com/tarantool/tarantool/-/commits/sergos/gh-5013-fiber-cond
Issue: https://github.com/tarantool/tarantool/issues/5013

@Changelog
* fiber.cond():wait() now throws if fiber is cancelled

 src/box/box.cc                                |  7 +--
 src/box/relay.cc                              | 12 +++-
 src/lib/core/fiber_cond.c                     |  4 ++
 src/lib/core/fiber_cond.h                     |  2 +-
 test/app-tap/gh-5013-fiber-cancel.test.lua    | 23 +++++++
 test/box/gh-4834-netbox-fiber-cancel.result   | 62 +++++++++++++++++++
 test/box/gh-4834-netbox-fiber-cancel.test.lua | 28 +++++++++
 7 files changed, 132 insertions(+), 6 deletions(-)
 create mode 100755 test/app-tap/gh-5013-fiber-cancel.test.lua
 create mode 100644 test/box/gh-4834-netbox-fiber-cancel.result
 create mode 100644 test/box/gh-4834-netbox-fiber-cancel.test.lua

diff --git a/src/box/box.cc b/src/box/box.cc
index 1f7dec362..aa23dcc13 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -305,10 +305,9 @@ box_wait_ro(bool ro, double timeout)
 {
     double deadline = ev_monotonic_now(loop()) + timeout;
     while (is_box_configured == false || box_is_ro() != ro) {
-        if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0)
-            return -1;
-        if (fiber_is_cancelled()) {
-            diag_set(FiberIsCancelled);
+        if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0) {
+            if (fiber_is_cancelled())
+                diag_set(FiberIsCancelled);
             return -1;
         }
     }
diff --git a/src/box/relay.cc b/src/box/relay.cc
index 1e77e0d9b..cce139c87 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -821,8 +821,18 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync,
                relay_subscribe_f, relay);
     if (rc == 0)
         rc = cord_cojoin(&relay->cord);
-    if (rc != 0)
+    if (rc != 0) {
+        /*
+         * We should always raise the error from the relay itself, not
+         * one set by other modules that change the diag of this fiber.
+         * TODO: investigate why and how we can leave the relay_subscribe_f
+         * with diag unset in the relay.
+         */
+        if (diag_last_error(&relay->diag)) {
+            diag_set_error(diag_get(), diag_last_error(&relay->diag));
+        }
         diag_raise();
+    }
 }
 
 static void
diff --git a/src/lib/core/fiber_cond.c b/src/lib/core/fiber_cond.c
index 904a350d9..71bb2d04d 100644
--- a/src/lib/core/fiber_cond.c
+++ b/src/lib/core/fiber_cond.c
@@ -108,6 +108,10 @@ fiber_cond_wait_timeout(struct fiber_cond *c, double timeout)
         diag_set(TimedOut);
         return -1;
     }
+    if (fiber_is_cancelled()) {
+        diag_set(FiberIsCancelled);
+        return -1;
+    }
     return 0;
 }
 
diff --git a/src/lib/core/fiber_cond.h b/src/lib/core/fiber_cond.h
index 87c6f2ca2..2662e0654 100644
--- a/src/lib/core/fiber_cond.h
+++ b/src/lib/core/fiber_cond.h
@@ -114,7 +114,7 @@ fiber_cond_broadcast(struct fiber_cond *cond);
  * @param cond condition
  * @param timeout timeout in seconds
  * @retval 0 on fiber_cond_signal() call or a spurious wake up
- * @retval -1 on timeout, diag is set to TimedOut
+ * @retval -1 on timeout or fiber cancellation, diag is set
  */
 int
 fiber_cond_wait_timeout(struct fiber_cond *cond, double timeout);
diff --git a/test/app-tap/gh-5013-fiber-cancel.test.lua b/test/app-tap/gh-5013-fiber-cancel.test.lua
new file mode 100755
index 000000000..ca4ca2c90
--- /dev/null
+++ b/test/app-tap/gh-5013-fiber-cancel.test.lua
@@ -0,0 +1,23 @@
+#!/usr/bin/env tarantool
+
+local tap = require('tap')
+local fiber = require('fiber')
+local test = tap.test("gh-5013-fiber-cancel")
+
+test:plan(2)
+
+local result = {}
+
+function test_f()
+    local cond = fiber.cond()
+    local res, err = pcall(cond.wait, cond)
+    result.res = res
+    result.err = err
+end
+
+local f = fiber.create(test_f)
+f:cancel()
+fiber.yield()
+
+test:ok(result.res == false, 'expected result is false')
+test:ok(tostring(result.err) == 'fiber is cancelled', 'fiber cancellation should be reported')
diff --git a/test/box/gh-4834-netbox-fiber-cancel.result b/test/box/gh-4834-netbox-fiber-cancel.result
new file mode 100644
index 000000000..4ed04bb61
--- /dev/null
+++ b/test/box/gh-4834-netbox-fiber-cancel.result
@@ -0,0 +1,62 @@
+-- test-run result file version 2
+remote = require 'net.box'
+ | ---
+ | ...
+fiber = require 'fiber'
+ | ---
+ | ...
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+-- #4834: Cancelling fiber doesn't interrupt netbox operations
+function infinite_call() fiber.channel(1):get() end
+ | ---
+ | ...
+box.schema.func.create('infinite_call')
+ | ---
+ | ...
+box.schema.user.grant('guest', 'execute', 'function', 'infinite_call')
+ | ---
+ | ...
+
+error_msg = nil
+ | ---
+ | ...
+test_run:cmd("setopt delimiter ';'")
+ | ---
+ | - true
+ | ...
+function netbox_runner()
+    local cn = remote.connect(box.cfg.listen)
+    local f = fiber.new(function()
+        _, error_msg = pcall(cn.call, cn, 'infinite_call')
+    end)
+    f:set_joinable(true)
+    fiber.yield()
+    f:cancel()
+    f:join()
+    cn:close()
+end;
+ | ---
+ | ...
+test_run:cmd("setopt delimiter ''");
+ | ---
+ | - true
+ | ...
+netbox_runner()
+ | ---
+ | ...
+error_msg
+ | ---
+ | - fiber is cancelled
+ | ...
+box.schema.func.drop('infinite_call')
+ | ---
+ | ...
+infinite_call = nil
+ | ---
+ | ...
+error_msg = nil
+ | ---
+ | ...
diff --git a/test/box/gh-4834-netbox-fiber-cancel.test.lua b/test/box/gh-4834-netbox-fiber-cancel.test.lua
new file mode 100644
index 000000000..bc0e5af6e
--- /dev/null
+++ b/test/box/gh-4834-netbox-fiber-cancel.test.lua
@@ -0,0 +1,28 @@
+remote = require 'net.box'
+fiber = require 'fiber'
+test_run = require('test_run').new()
+
+-- #4834: Cancelling fiber doesn't interrupt netbox operations
+function infinite_call() fiber.channel(1):get() end
+box.schema.func.create('infinite_call')
+box.schema.user.grant('guest', 'execute', 'function', 'infinite_call')
+
+error_msg = nil
+test_run:cmd("setopt delimiter ';'")
+function netbox_runner()
+    local cn = remote.connect(box.cfg.listen)
+    local f = fiber.new(function()
+        _, error_msg = pcall(cn.call, cn, 'infinite_call')
+    end)
+    f:set_joinable(true)
+    fiber.yield()
+    f:cancel()
+    f:join()
+    cn:close()
+end;
+test_run:cmd("setopt delimiter ''");
+netbox_runner()
+error_msg
+box.schema.func.drop('infinite_call')
+infinite_call = nil
+error_msg = nil
-- 
2.24.3 (Apple Git-128)