[Tarantool-patches] [PATCH v2] core: handle fiber cancellation for fiber.cond

Sergey Ostanevich sergos at tarantool.org
Tue Nov 24 00:47:29 MSK 2020


Hi!

Thanks for review! 

> On 22 Nov 2020, at 19:01, Vladislav Shpilevoy <v.shpilevoy at tarantool.org> wrote:
> 
> Hi! Thanks for the changes!
> 
> Technically almost good.
> 
>>> On 17 Nov 2020, at 01:12, Vladislav Shpilevoy <v.shpilevoy at tarantool.org> wrote:
>>> 
>>> On 03.11.2020 11:20, Sergey Ostanevich wrote:
>>>> Hi Oleg!
>>>> 
>>>> I believe the point about 'consistency' is not valid here. I put a
>>>> simple check that if diag is already set, then print it out. For the
>>>> fiber_cond_wait_timeout() it happened multiple times with various
>>>> reports, inlcuding this one:
>>>> 
>>>> 2020-11-03 10:28:01.630 [72411] relay/unix/:(socket)/101/main C> Did not
>>>> set the DIAG to FiberIsCancelled, original diag: Missing .xlog file
>>>> between LSN 5 {1: 5} and 6 {1: 6}
>>>> 
>>>> that is used in the test system:
>>>> 
>>>> test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status =
>>>> 'loading'})
>>>> 
>>>> So, my resolution will be: it is wrong to set a diag in an arbitrary
>>>> place, without clear understanting of the reason. This is the case for
>>>> the cond_wait machinery, since it doesn't know _why_ the fiber is
>>>> cancelled.
>>> 
>>> It is a wrong resolution, IMO. You just hacked cond wait not to change the
>>> other places. It is not about tests. Tests only show what is provided by the
>>> internal subsystems. And if they depend on fiber cond not setting diag in
>>> case of a fail, then it looks wrong.
>>> 
>> 
>> Actually I didn’t make fiber_cond not setting a diag, rather preserve the
>> original one if it is present.
> 
> Yes, by not setting a diag.
> 
>>>> Before this patch fiber.cond():wait() just returns for cancelled
>>>> fiber. In contrast fiber.channel():get() threw "fiber is
>>>> canceled" error.
>>>> This patch unify behaviour of channels and condvars and also fixes
>>>> related net.box module problem - it was impossible to interrupt
>>>> net.box call with fiber.cancel because it used fiber.cond under
>>>> the hood. Test cases for both bugs are added.
>>> 
>>> Netbox hangs not because of using fiber.cond. But because it never
>>> calls testcancel(). Normally all looped fibers should do that.
>>> 
>> I believe the fix is an indirect result, since all 3 wait() calls in
>> net_box.lua aren't covered with a pcall() - it errors out. I reproduced
>> it with the following:
> 
> I know that your patch fixes it. But I was talking about the statement in
> the commit message. You said:
> 
> 	sergos/gh-5013-fiber-cond 
> 	because it used fiber.cond under the hood
> 
> And it still uses the cond, but does not hang anymore. Therefore, the
> statement obviously isn't true. The fact of cond usage is not a reason.
> 

Well, it might mislead, I see. Let me rephrase it.

> See 4 comments below.
> 
>> diff --git a/src/box/box.cc b/src/box/box.cc
>> index 1f7dec362..aa23dcc13 100644
>> --- a/src/box/box.cc
>> +++ b/src/box/box.cc
>> @@ -305,10 +305,9 @@ box_wait_ro(bool ro, double timeout)
>> {
>> 	double deadline = ev_monotonic_now(loop()) + timeout;
>> 	while (is_box_configured == false || box_is_ro() != ro) {
>> -		if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0)
>> -			return -1;
>> -		if (fiber_is_cancelled()) {
>> -			diag_set(FiberIsCancelled);
>> +		if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0) {
>> +			if (fiber_is_cancelled())
>> +				diag_set(FiberIsCancelled);
> 
> 1. Why do you need this diag_set here? The cancellation diag is already
> set in fiber_cond_wait_deadline(). Why do you set it again?

It's an artefact appeared during the solution. You’re right, it is not 
needed, removed. 

> 
>> 			return -1;
>> 		}
>> 	}
>> diff --git a/src/box/relay.cc b/src/box/relay.cc
>> index 1e77e0d9b..cce139c87 100644
>> --- a/src/box/relay.cc
>> +++ b/src/box/relay.cc
>> @@ -821,8 +821,18 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync,
>> 			      relay_subscribe_f, relay);
>> 	if (rc == 0)
>> 		rc = cord_cojoin(&relay->cord);
>> -	if (rc != 0)
>> +	if (rc != 0) {
>> +		/*
>> +		 * We should always raise a problem from relay itself, not all
>> +		 * other modules that are change diag in the current fiber.
>> +		 * TODO: investigate why and how we can leave the relay_subscribe_f
>> +		 * with diag unset in the relay.
>> +		 */
>> +		if (diag_last_error(&relay->diag)) {
> 
> 2. Please, use != NULL, according to our code style.
> https://github.com/tarantool/tarantool/wiki/Code-review-procedure#code-style
> 

I prefer to fix this along with the comment #3

>> +			diag_set_error(diag_get(), diag_last_error(&relay->diag));
>> +		}
> 
> 3. What is the test where diag_last_error() is NULL? I added
> an assertion, that it is not NULL, and the tests passed (except a
> couple of tests which always fail on my machine due to too long UNIX
> socket path).
> 
> Also in the end of relay_subscribe_f() there is an existing
> assert(!diag_is_empty(&relay->diag)). So it can't be NULL. Or do you
> have a test?
> 
> The only possible issue I see is that
> 
> 	diag_set_error(diag_get(), diag_last_error(&relay->diag));
> 
> is called too early inrelay_subscribe_f(). After that the diag
> can be rewritten somehow. For instance, by fiber_join(reader). So
> probably you can move this diag move to the end right before 'return -1;',
> and remove this hunk entirely. In a separate commit in the same branch,
> since it is not related to fiber_cond bug directly.
> 

You’re right - the assert was there for long time, so should not happen 
at all. I move the diag_set_error, still I have a concern - the
relay_exit(relay) at the very end of the relay_subscribe_f() can raise
down in the call chain at the recovery_stop_local(). It can happen only
if fiber from recovery watcher has some error set, which is not clear
to me yet. 
 
> 
>> 		diag_raise();
>> +	}
>> }
>> diff --git a/test/box/gh-4834-netbox-fiber-cancel.result b/test/box/gh-4834-netbox-fiber-cancel.result
>> new file mode 100644
>> index 000000000..4ed04bb61
>> --- /dev/null
>> +++ b/test/box/gh-4834-netbox-fiber-cancel.result
>> @@ -0,0 +1,62 @@
>> +-- test-run result file version 2
>> +remote = require 'net.box'
>> + | ---
>> + | ...
>> +fiber = require 'fiber'
> 
> 4. Please, use () for require. You even did it in the other test file,
> but here you changed your mind somewhy. Or it is bad copy-paste.
Done. 

================
From e8bd26bb26c76e0a722958db1fe4763c027dcb8b Mon Sep 17 00:00:00 2001
From: Sergey Ostanevich <sergos at tarantool.org>
Date: Tue, 3 Nov 2020 12:52:26 +0300
Subject: [PATCH] fiber: handle fiber cancellation for fiber.cond

Before this patch fiber.cond():wait() just returns for cancelled
fiber. In contrast fiber.channel():get() throws "fiber is canceled"
error. This patch unifies behaviour of channels and condvars.
It also fixes a related net.box module problem #4834 since fiber.cond
now performs test for fiber cancellation.

Closes #4834
Closes #5013

Co-authored-by: Oleg Babin <olegrok at tarantool.org>

@TarantoolBot document
Title: fiber.cond():wait() throws if fiber is cancelled

Currently fiber.cond():wait() throws an error if waiting fiber is
cancelled.
---

Github: https://gitlab.com/tarantool/tarantool/-/commits/sergos/gh-5013-fiber-cond
Issue: https://github.com/tarantool/tarantool/issues/5013

 src/box/box.cc                                |  4 --
 src/box/relay.cc                              | 19 +++---
 src/lib/core/fiber_cond.c                     |  4 ++
 src/lib/core/fiber_cond.h                     |  2 +-
 test/app-tap/gh-5013-fiber-cancel.test.lua    | 23 +++++++
 test/box/gh-4834-netbox-fiber-cancel.result   | 62 +++++++++++++++++++
 test/box/gh-4834-netbox-fiber-cancel.test.lua | 28 +++++++++
 7 files changed, 128 insertions(+), 14 deletions(-)
 create mode 100755 test/app-tap/gh-5013-fiber-cancel.test.lua
 create mode 100644 test/box/gh-4834-netbox-fiber-cancel.result
 create mode 100644 test/box/gh-4834-netbox-fiber-cancel.test.lua

diff --git a/src/box/box.cc b/src/box/box.cc
index 18568df3b..111432937 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -307,10 +307,6 @@ box_wait_ro(bool ro, double timeout)
 	while (is_box_configured == false || box_is_ro() != ro) {
 		if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0)
 			return -1;
-		if (fiber_is_cancelled()) {
-			diag_set(FiberIsCancelled);
-			return -1;
-		}
 	}
 	return 0;
 }
diff --git a/src/box/relay.cc b/src/box/relay.cc
index b68b45e00..a7bc2c6f7 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -753,15 +753,6 @@ relay_subscribe_f(va_list ap)
 	if (!relay->replica->anon)
 		relay_send_is_raft_enabled(relay, &raft_enabler, false);
 
-	/*
-	 * Log the error that caused the relay to break the loop.
-	 * Don't clear the error for status reporting.
-	 */
-	assert(!diag_is_empty(&relay->diag));
-	diag_set_error(diag_get(), diag_last_error(&relay->diag));
-	diag_log();
-	say_crit("exiting the relay loop");
-
 	/*
 	 * Clear garbage collector trigger and WAL watcher.
 	 * trigger_clear() does nothing in case the triggers
@@ -780,6 +771,16 @@ relay_subscribe_f(va_list ap)
 	cbus_endpoint_destroy(&relay->endpoint, cbus_process);
 
 	relay_exit(relay);
+
+	/*
+	 * Log the error that caused the relay to break the loop.
+	 * Don't clear the error for status reporting.
+	 */
+	assert(!diag_is_empty(&relay->diag));
+	diag_set_error(diag_get(), diag_last_error(&relay->diag));
+	diag_log();
+	say_crit("exiting the relay loop");
+
 	return -1;
 }
 
diff --git a/src/lib/core/fiber_cond.c b/src/lib/core/fiber_cond.c
index 904a350d9..71bb2d04d 100644
--- a/src/lib/core/fiber_cond.c
+++ b/src/lib/core/fiber_cond.c
@@ -108,6 +108,10 @@ fiber_cond_wait_timeout(struct fiber_cond *c, double timeout)
 		diag_set(TimedOut);
 		return -1;
 	}
+	if (fiber_is_cancelled()) {
+		diag_set(FiberIsCancelled);
+		return -1;
+	}
 	return 0;
 }
 
diff --git a/src/lib/core/fiber_cond.h b/src/lib/core/fiber_cond.h
index 87c6f2ca2..2662e0654 100644
--- a/src/lib/core/fiber_cond.h
+++ b/src/lib/core/fiber_cond.h
@@ -114,7 +114,7 @@ fiber_cond_broadcast(struct fiber_cond *cond);
  * @param cond condition
  * @param timeout timeout in seconds
  * @retval 0 on fiber_cond_signal() call or a spurious wake up
- * @retval -1 on timeout, diag is set to TimedOut
+ * @retval -1 on timeout or fiber cancellation, diag is set
  */
 int
 fiber_cond_wait_timeout(struct fiber_cond *cond, double timeout);
diff --git a/test/app-tap/gh-5013-fiber-cancel.test.lua b/test/app-tap/gh-5013-fiber-cancel.test.lua
new file mode 100755
index 000000000..ca4ca2c90
--- /dev/null
+++ b/test/app-tap/gh-5013-fiber-cancel.test.lua
@@ -0,0 +1,23 @@
+#!/usr/bin/env tarantool
+
+local tap = require('tap')
+local fiber = require('fiber')
+local test = tap.test("gh-5013-fiber-cancel")
+
+test:plan(2)
+
+local result = {}
+
+function test_f()
+    local cond = fiber.cond()
+    local res, err = pcall(cond.wait, cond)
+    result.res = res
+    result.err = err
+end
+
+local f = fiber.create(test_f)
+f:cancel()
+fiber.yield()
+
+test:ok(result.res == false, 'expected result is false')
+test:ok(tostring(result.err) == 'fiber is cancelled', 'fiber cancellation should be reported')
diff --git a/test/box/gh-4834-netbox-fiber-cancel.result b/test/box/gh-4834-netbox-fiber-cancel.result
new file mode 100644
index 000000000..4ed04bb61
--- /dev/null
+++ b/test/box/gh-4834-netbox-fiber-cancel.result
@@ -0,0 +1,62 @@
+-- test-run result file version 2
+remote = require 'net.box'
+ | ---
+ | ...
+fiber = require 'fiber'
+ | ---
+ | ...
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+-- #4834: Cancelling fiber doesn't interrupt netbox operations
+function infinite_call() fiber.channel(1):get() end
+ | ---
+ | ...
+box.schema.func.create('infinite_call')
+ | ---
+ | ...
+box.schema.user.grant('guest', 'execute', 'function', 'infinite_call')
+ | ---
+ | ...
+
+error_msg = nil
+ | ---
+ | ...
+test_run:cmd("setopt delimiter ';'")
+ | ---
+ | - true
+ | ...
+function netbox_runner()
+    local cn = remote.connect(box.cfg.listen)
+    local f = fiber.new(function()
+        _, error_msg = pcall(cn.call, cn, 'infinite_call')
+    end)
+    f:set_joinable(true)
+    fiber.yield()
+    f:cancel()
+    f:join()
+    cn:close()
+end;
+ | ---
+ | ...
+test_run:cmd("setopt delimiter ''");
+ | ---
+ | - true
+ | ...
+netbox_runner()
+ | ---
+ | ...
+error_msg
+ | ---
+ | - fiber is cancelled
+ | ...
+box.schema.func.drop('infinite_call')
+ | ---
+ | ...
+infinite_call = nil
+ | ---
+ | ...
+error_msg = nil
+ | ---
+ | ...
diff --git a/test/box/gh-4834-netbox-fiber-cancel.test.lua b/test/box/gh-4834-netbox-fiber-cancel.test.lua
new file mode 100644
index 000000000..dbeff083b
--- /dev/null
+++ b/test/box/gh-4834-netbox-fiber-cancel.test.lua
@@ -0,0 +1,28 @@
+remote = require('net.box')
+fiber = require('fiber')
+test_run = require('test_run').new()
+
+-- #4834: Cancelling fiber doesn't interrupt netbox operations
+function infinite_call() fiber.channel(1):get() end
+box.schema.func.create('infinite_call')
+box.schema.user.grant('guest', 'execute', 'function', 'infinite_call')
+
+error_msg = nil
+test_run:cmd("setopt delimiter ';'")
+function netbox_runner()
+    local cn = remote.connect(box.cfg.listen)
+    local f = fiber.new(function()
+        _, error_msg = pcall(cn.call, cn, 'infinite_call')
+    end)
+    f:set_joinable(true)
+    fiber.yield()
+    f:cancel()
+    f:join()
+    cn:close()
+end;
+test_run:cmd("setopt delimiter ''");
+netbox_runner()
+error_msg
+box.schema.func.drop('infinite_call')
+infinite_call = nil
+error_msg = nil
-- 
2.24.3 (Apple Git-128)




More information about the Tarantool-patches mailing list