From: Sergey Ostanevich <sergos@tarantool.org> To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org> Cc: tarantool-patches@dev.tarantool.org, alexander.turenko@tarantool.org Subject: Re: [Tarantool-patches] [PATCH v2] core: handle fiber cancellation for fiber.cond Date: Tue, 24 Nov 2020 00:47:29 +0300 [thread overview] Message-ID: <99BB1BF3-9503-4F13-BFDC-1B0E6304DB20@tarantool.org> (raw) In-Reply-To: <4b2ee1cf-babc-4745-7c01-355602b739bc@tarantool.org> Hi! Thanks for review! > On 22 Nov 2020, at 19:01, Vladislav Shpilevoy <v.shpilevoy@tarantool.org> wrote: > > Hi! Thanks for the changes! > > Technically almost good. > >>> On 17 Nov 2020, at 01:12, Vladislav Shpilevoy <v.shpilevoy@tarantool.org> wrote: >>> >>> On 03.11.2020 11:20, Sergey Ostanevich wrote: >>>> Hi Oleg! >>>> >>>> I believe the point about 'consistency' is not valid here. I put a >>>> simple check that if diag is already set, then print it out. For the >>>> fiber_cond_wait_timeout() it happened multiple times with various >>>> reports, inlcuding this one: >>>> >>>> 2020-11-03 10:28:01.630 [72411] relay/unix/:(socket)/101/main C> Did not >>>> set the DIAG to FiberIsCancelled, original diag: Missing .xlog file >>>> between LSN 5 {1: 5} and 6 {1: 6} >>>> >>>> that is used in the test system: >>>> >>>> test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = >>>> 'loading'}) >>>> >>>> So, my resolution will be: it is wrong to set a diag in an arbitrary >>>> place, without clear understanting of the reason. This is the case for >>>> the cond_wait machinery, since it doesn't know _why_ the fiber is >>>> cancelled. >>> >>> It is a wrong resolution, IMO. You just hacked cond wait not to change the >>> other places. It is not about tests. Tests only show what is provided by the >>> internal subsystems. And if they depend on fiber cond not setting diag in >>> case of a fail, then it looks wrong. >>> >> >> Actually I didn’t make fiber_cond not setting a diag, rather preserve the >> original one if it is present. > > Yes, by not setting a diag. > >>>> Before this patch fiber.cond():wait() just returns for cancelled >>>> fiber. In contrast fiber.channel():get() threw "fiber is >>>> canceled" error. >>>> This patch unify behaviour of channels and condvars and also fixes >>>> related net.box module problem - it was impossible to interrupt >>>> net.box call with fiber.cancel because it used fiber.cond under >>>> the hood. Test cases for both bugs are added. >>> >>> Netbox hangs not because of using fiber.cond. But because it never >>> calls testcancel(). Normally all looped fibers should do that. >>> >> I believe the fix is an indirect result, since all 3 wait() calls in >> net_box.lua aren't covered with a pcall() - it errors out. I reproduced >> it with the following: > > I know that your patch fixes it. But I was talking about the statement in > the commit message. You said: > > sergos/gh-5013-fiber-cond > because it used fiber.cond under the hood > > And it still uses the cond, but does not hang anymore. Therefore, the > statement obviously isn't true. The fact of cond usage is not a reason. > Well, it might mislead, I see. Let me rephrase it. > See 4 comments below. > >> diff --git a/src/box/box.cc b/src/box/box.cc >> index 1f7dec362..aa23dcc13 100644 >> --- a/src/box/box.cc >> +++ b/src/box/box.cc >> @@ -305,10 +305,9 @@ box_wait_ro(bool ro, double timeout) >> { >> double deadline = ev_monotonic_now(loop()) + timeout; >> while (is_box_configured == false || box_is_ro() != ro) { >> - if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0) >> - return -1; >> - if (fiber_is_cancelled()) { >> - diag_set(FiberIsCancelled); >> + if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0) { >> + if (fiber_is_cancelled()) >> + diag_set(FiberIsCancelled); > > 1. Why do you need this diag_set here? The cancellation diag is already > set in fiber_cond_wait_deadline(). Why do you set it again? It's an artefact appeared during the solution. You’re right, it is not needed, removed. > >> return -1; >> } >> } >> diff --git a/src/box/relay.cc b/src/box/relay.cc >> index 1e77e0d9b..cce139c87 100644 >> --- a/src/box/relay.cc >> +++ b/src/box/relay.cc >> @@ -821,8 +821,18 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync, >> relay_subscribe_f, relay); >> if (rc == 0) >> rc = cord_cojoin(&relay->cord); >> - if (rc != 0) >> + if (rc != 0) { >> + /* >> + * We should always raise a problem from relay itself, not all >> + * other modules that are change diag in the current fiber. >> + * TODO: investigate why and how we can leave the relay_subscribe_f >> + * with diag unset in the relay. >> + */ >> + if (diag_last_error(&relay->diag)) { > > 2. Please, use != NULL, according to our code style. > https://github.com/tarantool/tarantool/wiki/Code-review-procedure#code-style > I prefer to fix this along with the comment #3 >> + diag_set_error(diag_get(), diag_last_error(&relay->diag)); >> + } > > 3. What is the test where diag_last_error() is NULL? I added > an assertion, that it is not NULL, and the tests passed (except a > couple of tests which always fail on my machine due to too long UNIX > socket path). > > Also in the end of relay_subscribe_f() there is an existing > assert(!diag_is_empty(&relay->diag)). So it can't be NULL. Or do you > have a test? > > The only possible issue I see is that > > diag_set_error(diag_get(), diag_last_error(&relay->diag)); > > is called too early inrelay_subscribe_f(). After that the diag > can be rewritten somehow. For instance, by fiber_join(reader). So > probably you can move this diag move to the end right before 'return -1;', > and remove this hunk entirely. In a separate commit in the same branch, > since it is not related to fiber_cond bug directly. > You’re right - the assert was there for long time, so should not happen at all. I move the diag_set_error, still I have a concern - the relay_exit(relay) at the very end of the relay_subscribe_f() can raise down in the call chain at the recovery_stop_local(). It can happen only if fiber from recovery watcher has some error set, which is not clear to me yet. > >> diag_raise(); >> + } >> } >> diff --git a/test/box/gh-4834-netbox-fiber-cancel.result b/test/box/gh-4834-netbox-fiber-cancel.result >> new file mode 100644 >> index 000000000..4ed04bb61 >> --- /dev/null >> +++ b/test/box/gh-4834-netbox-fiber-cancel.result >> @@ -0,0 +1,62 @@ >> +-- test-run result file version 2 >> +remote = require 'net.box' >> + | --- >> + | ... >> +fiber = require 'fiber' > > 4. Please, use () for require. You even did it in the other test file, > but here you changed your mind somewhy. Or it is bad copy-paste. Done. ================ From e8bd26bb26c76e0a722958db1fe4763c027dcb8b Mon Sep 17 00:00:00 2001 From: Sergey Ostanevich <sergos@tarantool.org> Date: Tue, 3 Nov 2020 12:52:26 +0300 Subject: [PATCH] fiber: handle fiber cancellation for fiber.cond Before this patch fiber.cond():wait() just returns for cancelled fiber. In contrast fiber.channel():get() throws "fiber is canceled" error. This patch unifies behaviour of channels and condvars. It also fixes a related net.box module problem #4834 since fiber.cond now performs test for fiber cancellation. Closes #4834 Closes #5013 Co-authored-by: Oleg Babin <olegrok@tarantool.org> @TarantoolBot document Title: fiber.cond():wait() throws if fiber is cancelled Currently fiber.cond():wait() throws an error if waiting fiber is cancelled. --- Github: https://gitlab.com/tarantool/tarantool/-/commits/sergos/gh-5013-fiber-cond Issue: https://github.com/tarantool/tarantool/issues/5013 src/box/box.cc | 4 -- src/box/relay.cc | 19 +++--- src/lib/core/fiber_cond.c | 4 ++ src/lib/core/fiber_cond.h | 2 +- test/app-tap/gh-5013-fiber-cancel.test.lua | 23 +++++++ test/box/gh-4834-netbox-fiber-cancel.result | 62 +++++++++++++++++++ test/box/gh-4834-netbox-fiber-cancel.test.lua | 28 +++++++++ 7 files changed, 128 insertions(+), 14 deletions(-) create mode 100755 test/app-tap/gh-5013-fiber-cancel.test.lua create mode 100644 test/box/gh-4834-netbox-fiber-cancel.result create mode 100644 test/box/gh-4834-netbox-fiber-cancel.test.lua diff --git a/src/box/box.cc b/src/box/box.cc index 18568df3b..111432937 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -307,10 +307,6 @@ box_wait_ro(bool ro, double timeout) while (is_box_configured == false || box_is_ro() != ro) { if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0) return -1; - if (fiber_is_cancelled()) { - diag_set(FiberIsCancelled); - return -1; - } } return 0; } diff --git a/src/box/relay.cc b/src/box/relay.cc index b68b45e00..a7bc2c6f7 100644 --- a/src/box/relay.cc +++ b/src/box/relay.cc @@ -753,15 +753,6 @@ relay_subscribe_f(va_list ap) if (!relay->replica->anon) relay_send_is_raft_enabled(relay, &raft_enabler, false); - /* - * Log the error that caused the relay to break the loop. - * Don't clear the error for status reporting. - */ - assert(!diag_is_empty(&relay->diag)); - diag_set_error(diag_get(), diag_last_error(&relay->diag)); - diag_log(); - say_crit("exiting the relay loop"); - /* * Clear garbage collector trigger and WAL watcher. * trigger_clear() does nothing in case the triggers @@ -780,6 +771,16 @@ relay_subscribe_f(va_list ap) cbus_endpoint_destroy(&relay->endpoint, cbus_process); relay_exit(relay); + + /* + * Log the error that caused the relay to break the loop. + * Don't clear the error for status reporting. + */ + assert(!diag_is_empty(&relay->diag)); + diag_set_error(diag_get(), diag_last_error(&relay->diag)); + diag_log(); + say_crit("exiting the relay loop"); + return -1; } diff --git a/src/lib/core/fiber_cond.c b/src/lib/core/fiber_cond.c index 904a350d9..71bb2d04d 100644 --- a/src/lib/core/fiber_cond.c +++ b/src/lib/core/fiber_cond.c @@ -108,6 +108,10 @@ fiber_cond_wait_timeout(struct fiber_cond *c, double timeout) diag_set(TimedOut); return -1; } + if (fiber_is_cancelled()) { + diag_set(FiberIsCancelled); + return -1; + } return 0; } diff --git a/src/lib/core/fiber_cond.h b/src/lib/core/fiber_cond.h index 87c6f2ca2..2662e0654 100644 --- a/src/lib/core/fiber_cond.h +++ b/src/lib/core/fiber_cond.h @@ -114,7 +114,7 @@ fiber_cond_broadcast(struct fiber_cond *cond); * @param cond condition * @param timeout timeout in seconds * @retval 0 on fiber_cond_signal() call or a spurious wake up - * @retval -1 on timeout, diag is set to TimedOut + * @retval -1 on timeout or fiber cancellation, diag is set */ int fiber_cond_wait_timeout(struct fiber_cond *cond, double timeout); diff --git a/test/app-tap/gh-5013-fiber-cancel.test.lua b/test/app-tap/gh-5013-fiber-cancel.test.lua new file mode 100755 index 000000000..ca4ca2c90 --- /dev/null +++ b/test/app-tap/gh-5013-fiber-cancel.test.lua @@ -0,0 +1,23 @@ +#!/usr/bin/env tarantool + +local tap = require('tap') +local fiber = require('fiber') +local test = tap.test("gh-5013-fiber-cancel") + +test:plan(2) + +local result = {} + +function test_f() + local cond = fiber.cond() + local res, err = pcall(cond.wait, cond) + result.res = res + result.err = err +end + +local f = fiber.create(test_f) +f:cancel() +fiber.yield() + +test:ok(result.res == false, 'expected result is false') +test:ok(tostring(result.err) == 'fiber is cancelled', 'fiber cancellation should be reported') diff --git a/test/box/gh-4834-netbox-fiber-cancel.result b/test/box/gh-4834-netbox-fiber-cancel.result new file mode 100644 index 000000000..4ed04bb61 --- /dev/null +++ b/test/box/gh-4834-netbox-fiber-cancel.result @@ -0,0 +1,62 @@ +-- test-run result file version 2 +remote = require 'net.box' + | --- + | ... +fiber = require 'fiber' + | --- + | ... +test_run = require('test_run').new() + | --- + | ... + +-- #4834: Cancelling fiber doesn't interrupt netbox operations +function infinite_call() fiber.channel(1):get() end + | --- + | ... +box.schema.func.create('infinite_call') + | --- + | ... +box.schema.user.grant('guest', 'execute', 'function', 'infinite_call') + | --- + | ... + +error_msg = nil + | --- + | ... +test_run:cmd("setopt delimiter ';'") + | --- + | - true + | ... +function netbox_runner() + local cn = remote.connect(box.cfg.listen) + local f = fiber.new(function() + _, error_msg = pcall(cn.call, cn, 'infinite_call') + end) + f:set_joinable(true) + fiber.yield() + f:cancel() + f:join() + cn:close() +end; + | --- + | ... +test_run:cmd("setopt delimiter ''"); + | --- + | - true + | ... +netbox_runner() + | --- + | ... +error_msg + | --- + | - fiber is cancelled + | ... +box.schema.func.drop('infinite_call') + | --- + | ... +infinite_call = nil + | --- + | ... +error_msg = nil + | --- + | ... diff --git a/test/box/gh-4834-netbox-fiber-cancel.test.lua b/test/box/gh-4834-netbox-fiber-cancel.test.lua new file mode 100644 index 000000000..dbeff083b --- /dev/null +++ b/test/box/gh-4834-netbox-fiber-cancel.test.lua @@ -0,0 +1,28 @@ +remote = require('net.box') +fiber = require('fiber') +test_run = require('test_run').new() + +-- #4834: Cancelling fiber doesn't interrupt netbox operations +function infinite_call() fiber.channel(1):get() end +box.schema.func.create('infinite_call') +box.schema.user.grant('guest', 'execute', 'function', 'infinite_call') + +error_msg = nil +test_run:cmd("setopt delimiter ';'") +function netbox_runner() + local cn = remote.connect(box.cfg.listen) + local f = fiber.new(function() + _, error_msg = pcall(cn.call, cn, 'infinite_call') + end) + f:set_joinable(true) + fiber.yield() + f:cancel() + f:join() + cn:close() +end; +test_run:cmd("setopt delimiter ''"); +netbox_runner() +error_msg +box.schema.func.drop('infinite_call') +infinite_call = nil +error_msg = nil -- 2.24.3 (Apple Git-128)
next prev parent reply other threads:[~2020-11-23 21:47 UTC|newest] Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-10-31 16:29 sergos 2020-11-01 10:13 ` Oleg Babin 2020-11-03 10:20 ` Sergey Ostanevich 2020-11-03 10:27 ` Oleg Babin 2020-11-04 10:00 ` Leonid Vasiliev 2020-11-16 22:12 ` Vladislav Shpilevoy 2020-11-18 22:05 ` Sergey Ostanevich 2020-11-22 16:01 ` Vladislav Shpilevoy 2020-11-23 21:47 ` Sergey Ostanevich [this message] 2020-11-24 7:31 ` Sergey Ostanevich 2020-11-04 10:00 ` Leonid Vasiliev 2020-11-05 20:42 ` Sergey Ostanevich 2020-11-10 21:16 ` Sergey Ostanevich 2020-11-12 20:15 ` Sergey Ostanevich 2020-11-13 8:26 ` Leonid Vasiliev 2020-11-30 22:49 ` Alexander Turenko 2020-11-16 22:12 ` Vladislav Shpilevoy 2020-11-25 21:32 ` Vladislav Shpilevoy 2020-11-29 21:41 ` Sergey Ostanevich 2020-11-30 21:46 ` Alexander Turenko 2020-11-30 21:01 ` Vladislav Shpilevoy 2020-12-02 10:58 ` Alexander V. Tikhonov 2020-12-02 22:18 ` Vladislav Shpilevoy
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=99BB1BF3-9503-4F13-BFDC-1B0E6304DB20@tarantool.org \ --to=sergos@tarantool.org \ --cc=alexander.turenko@tarantool.org \ --cc=tarantool-patches@dev.tarantool.org \ --cc=v.shpilevoy@tarantool.org \ --subject='Re: [Tarantool-patches] [PATCH v2] core: handle fiber cancellation for fiber.cond' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox