[Tarantool-patches] [PATCH v2] core: handle fiber cancellation for fiber.cond
Sergey Ostanevich
sergos at tarantool.org
Tue Nov 24 10:31:40 MSK 2020
Yet another fix-up because of test update:
--- a/test/box/gh-4834-netbox-fiber-cancel.result
+++ b/test/box/gh-4834-netbox-fiber-cancel.result
@@ -1,8 +1,8 @@
-- test-run result file version 2
-remote = require 'net.box'
+remote = require('net.box')
| ---
| ...
-fiber = require 'fiber'
+fiber = require('fiber')
| ---
| ...
test_run = require('test_run').new()
Force-pushed.
> On 22 Nov 2020, at 19:01, Vladislav Shpilevoy <v.shpilevoy at tarantool.org> wrote:
>
> Hi! Thanks for the changes!
>
> Technically almost good.
>
>>> On 17 Nov 2020, at 01:12, Vladislav Shpilevoy <v.shpilevoy at tarantool.org> wrote:
>>>
>>> On 03.11.2020 11:20, Sergey Ostanevich wrote:
>>>> Hi Oleg!
>>>>
>>>> I believe the point about 'consistency' is not valid here. I put a
>>>> simple check that if diag is already set, then print it out. For the
>>>> fiber_cond_wait_timeout() it happened multiple times with various
>>>> reports, inlcuding this one:
>>>>
>>>> 2020-11-03 10:28:01.630 [72411] relay/unix/:(socket)/101/main C> Did not
>>>> set the DIAG to FiberIsCancelled, original diag: Missing .xlog file
>>>> between LSN 5 {1: 5} and 6 {1: 6}
>>>>
>>>> that is used in the test system:
>>>>
>>>> test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status =
>>>> 'loading'})
>>>>
>>>> So, my resolution will be: it is wrong to set a diag in an arbitrary
>>>> place, without clear understanting of the reason. This is the case for
>>>> the cond_wait machinery, since it doesn't know _why_ the fiber is
>>>> cancelled.
>>>
>>> It is a wrong resolution, IMO. You just hacked cond wait not to change the
>>> other places. It is not about tests. Tests only show what is provided by the
>>> internal subsystems. And if they depend on fiber cond not setting diag in
>>> case of a fail, then it looks wrong.
>>>
>>
>> Actually I didn’t make fiber_cond not setting a diag, rather preserve the
>> original one if it is present.
>
> Yes, by not setting a diag.
>
>>>> Before this patch fiber.cond():wait() just returns for cancelled
>>>> fiber. In contrast fiber.channel():get() threw "fiber is
>>>> canceled" error.
>>>> This patch unify behaviour of channels and condvars and also fixes
>>>> related net.box module problem - it was impossible to interrupt
>>>> net.box call with fiber.cancel because it used fiber.cond under
>>>> the hood. Test cases for both bugs are added.
>>>
>>> Netbox hangs not because of using fiber.cond. But because it never
>>> calls testcancel(). Normally all looped fibers should do that.
>>>
>> I believe the fix is an indirect result, since all 3 wait() calls in
>> net_box.lua aren't covered with a pcall() - it errors out. I reproduced
>> it with the following:
>
> I know that your patch fixes it. But I was talking about the statement in
> the commit message. You said:
>
> it was impossible to interrupt net.box call with fiber.cancel
> because it used fiber.cond under the hood
>
> And it still uses the cond, but does not hang anymore. Therefore, the
> statement obviously isn't true. The fact of cond usage is not a reason.
>
> See 4 comments below.
>
>> diff --git a/src/box/box.cc b/src/box/box.cc
>> index 1f7dec362..aa23dcc13 100644
>> --- a/src/box/box.cc
>> +++ b/src/box/box.cc
>> @@ -305,10 +305,9 @@ box_wait_ro(bool ro, double timeout)
>> {
>> double deadline = ev_monotonic_now(loop()) + timeout;
>> while (is_box_configured == false || box_is_ro() != ro) {
>> - if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0)
>> - return -1;
>> - if (fiber_is_cancelled()) {
>> - diag_set(FiberIsCancelled);
>> + if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0) {
>> + if (fiber_is_cancelled())
>> + diag_set(FiberIsCancelled);
>
> 1. Why do you need this diag_set here? The cancellation diag is already
> set in fiber_cond_wait_deadline(). Why do you set it again?
>
>> return -1;
>> }
>> }
>> diff --git a/src/box/relay.cc b/src/box/relay.cc
>> index 1e77e0d9b..cce139c87 100644
>> --- a/src/box/relay.cc
>> +++ b/src/box/relay.cc
>> @@ -821,8 +821,18 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync,
>> relay_subscribe_f, relay);
>> if (rc == 0)
>> rc = cord_cojoin(&relay->cord);
>> - if (rc != 0)
>> + if (rc != 0) {
>> + /*
>> + * We should always raise a problem from relay itself, not all
>> + * other modules that are change diag in the current fiber.
>> + * TODO: investigate why and how we can leave the relay_subscribe_f
>> + * with diag unset in the relay.
>> + */
>> + if (diag_last_error(&relay->diag)) {
>
> 2. Please, use != NULL, according to our code style.
> https://github.com/tarantool/tarantool/wiki/Code-review-procedure#code-style
>
>> + diag_set_error(diag_get(), diag_last_error(&relay->diag));
>> + }
>
> 3. What is the test where diag_last_error() is NULL? I added
> an assertion, that it is not NULL, and the tests passed (except a
> couple of tests which always fail on my machine due to too long UNIX
> socket path).
>
> Also in the end of relay_subscribe_f() there is an existing
> assert(!diag_is_empty(&relay->diag)). So it can't be NULL. Or do you
> have a test?
>
> The only possible issue I see is that
>
> diag_set_error(diag_get(), diag_last_error(&relay->diag));
>
> is called too early in relay_subscribe_f(). After that the diag
> can be rewritten somehow. For instance, by fiber_join(reader). So
> probably you can move this diag move to the end right before 'return -1;',
> and remove this hunk entirely. In a separate commit in the same branch,
> since it is not related to fiber_cond bug directly.
>
>> diag_raise();
>> + }
>> }
>> diff --git a/test/box/gh-4834-netbox-fiber-cancel.result b/test/box/gh-4834-netbox-fiber-cancel.result
>> new file mode 100644
>> index 000000000..4ed04bb61
>> --- /dev/null
>> +++ b/test/box/gh-4834-netbox-fiber-cancel.result
>> @@ -0,0 +1,62 @@
>> +-- test-run result file version 2
>> +remote = require 'net.box'
>> + | ---
>> + | ...
>> +fiber = require 'fiber'
>
> 4. Please, use () for require. You even did it in the other test file,
> but here you changed your mind somewhy. Or it is bad copy-paste.
More information about the Tarantool-patches
mailing list