[Tarantool-patches] [PATCH v2] core: handle fiber cancellation for fiber.cond

Sergey Ostanevich sergos at tarantool.org
Thu Nov 19 01:05:05 MSK 2020


Hi Vlad!

I've combined your comments and my patch from the second mail here to keep
a single thread; see below.

Thanks,
Sergos


> On 17 Nov 2020, at 01:12, Vladislav Shpilevoy <v.shpilevoy at tarantool.org> wrote:
> 
> On 03.11.2020 11:20, Sergey Ostanevich wrote:
>> Hi Oleg!
>> 
>> I believe the point about 'consistency' is not valid here. I put a
>> simple check that if diag is already set, then print it out. For the
>> fiber_cond_wait_timeout() it happened multiple times with various
>> reports, including this one:
>> 
>> 2020-11-03 10:28:01.630 [72411] relay/unix/:(socket)/101/main C> Did not
>> set the DIAG to FiberIsCancelled, original diag: Missing .xlog file
>> between LSN 5 {1: 5} and 6 {1: 6}
>> 
>> that is used in the test system:
>> 
>> test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status =
>> 'loading'})
>> 
>> So, my resolution will be: it is wrong to set a diag in an arbitrary
>> place, without a clear understanding of the reason. This is the case for
>> the cond_wait machinery, since it doesn't know _why_ the fiber is
>> cancelled.
> 
> It is a wrong resolution, IMO. You just hacked cond wait not to change the
> other places. It is not about tests. Tests only show what is provided by the
> internal subsystems. And if they depend on fiber cond not setting diag in
> case of a fail, then it looks wrong.
> 

Actually, I didn’t make fiber_cond skip setting a diag; it preserves the
original one if it is present. As for whether that is correct: you’re far
more familiar with the Tarantool sources than I am, so it’s hard for me to
argue. Still, I’ll try.

In the test case, relay_process_wal_event() catches the error thrown by
recover_remaining_wals() and sets XlogGapError for the relay. Note that
relay_set_error() uses exactly the same semantics: it sets the error only
if none was set before. Then it cancels the fiber and returns, and control
eventually reaches relay_subscribe(), which throws (via diag_raise()) the
latest diag of the fiber. Along the way there are multiple places where the
fiber diag can be reset, so the best I can propose is to use the relay’s
diag. To enforce this strategy we also need a follow-up ticket to ensure
the relay’s diag is always set by the time relay_subscribe() exits.
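
To make the two policies concrete, here is a minimal self-contained sketch
in plain C. It is not Tarantool code: struct toy_diag and all toy_*
functions are hypothetical names invented for this illustration. It only
models "set the error if none was set before" versus "always overwrite",
and the idea of copying the relay's own error over the fiber's one on exit.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for a diagnostics area: holds at most one error message. */
struct toy_diag {
	char msg[128];
	int is_set;
};

/* Unconditional set: the newest error replaces whatever was there. */
static void
toy_diag_set(struct toy_diag *d, const char *msg)
{
	assert(d != NULL && msg != NULL);
	snprintf(d->msg, sizeof(d->msg), "%s", msg);
	d->is_set = 1;
}

/* First-error-wins: keep the original error if one is already recorded. */
static void
toy_diag_set_if_empty(struct toy_diag *d, const char *msg)
{
	if (!d->is_set)
		toy_diag_set(d, msg);
}

/*
 * Model of the exit path: the fiber diag may have been overwritten along
 * the way, so the relay's own error (if any) is copied over it before the
 * error is reported. Returns NULL if no error is recorded at all.
 */
static const char *
toy_relay_exit_error(const struct toy_diag *relay, struct toy_diag *fiber)
{
	if (relay->is_set)
		toy_diag_set(fiber, relay->msg);
	return fiber->is_set ? fiber->msg : NULL;
}
```

With this model, a relay diag that already holds the xlog gap message keeps
it when a later cancellation error arrives, and the exit path reports the
gap message even if the fiber diag was overwritten in between.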

> I suggest you to fix the usage places, where the caller code thinks that
> cond_wait never sets a diag on cancellation.
> 
> If a function fails, we set a diag. It is not a thing we do optionally.
> Otherwise you make it a bit simpler in this patch, but make it harder to
> work with the cond in future.
> 
> Talking of your statement:
> 
> 	I believe the stack diag also is not supported there yet.
> 
> It is supported on the level of lib/core, i.e. everywhere. But is not
> present on 1.10. However it is not the point. The point is that it is not
> needed here.


> On 17 Nov 2020, at 01:12, Vladislav Shpilevoy <v.shpilevoy at tarantool.org> wrote:
> 
> Hi! Thanks for the patch!
> 
> Please, change subsystem name to 'fiber:'. 'core:' is too general.
> We have tons of stuff in libcore in addition to fibers.
> 
Done.

>> Before this patch, fiber.cond():wait() just returned for a cancelled
>> fiber. In contrast, fiber.channel():get() threw a "fiber is
>> cancelled" error.
>> This patch unifies the behaviour of channels and condvars and also fixes
>> a related net.box problem: it was impossible to interrupt a
>> net.box call with fiber.cancel because it used fiber.cond under
>> the hood. Test cases for both bugs are added.
> 
> Netbox hangs not because of using fiber.cond. But because it never
> calls testcancel(). Normally all looped fibers should do that.
> 
I believe the fix is an indirect result: none of the 3 wait() calls in
net_box.lua is wrapped in a pcall(), so the error propagates. I reproduced
it with the following:

--- a/src/box/lua/net_box.lua
+++ b/src/box/lua/net_box.lua
@@ -414,7 +417,11 @@ local function create_transport(host, port, user, password, callback,
             -- waiting client is waked up prematurely.
             while timeout > 0 and not self:is_ready() do
                 local ts = fiber.clock()
-                self.cond:wait(timeout)
+                local ok, err = pcall(self.cond.wait, self.cond, timeout)
+                if (not ok) then
+                    print('net_box.lua:423 thrown, err: ' .. tostring(err))
+                    error(err)
+                end
                 timeout = timeout - (fiber.clock() - ts)
             end
             if not self:is_ready() then

Note that if I don’t put the explicit error() there, the test from the
patch hangs.
To me this looks equivalent to testcancel(), since it raises the error the
same way. There may be more places in net_box where testcancel() could be
called, but that is beyond this ticket.
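
A rough way to see why swallowing the failure keeps the loop spinning is
the following self-contained C model. It is only a sketch under my own
assumptions: struct toy_fiber, toy_cond_wait() and toy_wait_ready() are
hypothetical names, time is counted in whole slices, and the "ready" state
is never reached, which mimics a call that never returns.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy fiber state; not the Tarantool struct fiber. */
struct toy_fiber {
	bool cancelled;
};

/*
 * Toy wait: consumes one time slice and returns -1 if the fiber has been
 * cancelled, 0 otherwise (treated as a spurious wake-up).
 */
static int
toy_cond_wait(const struct toy_fiber *f, int *timeout)
{
	assert(*timeout > 0);
	*timeout -= 1;
	return f->cancelled ? -1 : 0;
}

/*
 * Wait loop in the spirit of the net_box.lua snippet above. Returns the
 * number of iterations spent in the loop. With propagate == false the
 * failure is swallowed, like a pcall() without a following error().
 */
static int
toy_wait_ready(const struct toy_fiber *f, int timeout, bool propagate)
{
	int iterations = 0;
	bool ready = false; /* never becomes true in this model */
	while (timeout > 0 && !ready) {
		iterations++;
		if (toy_cond_wait(f, &timeout) != 0 && propagate)
			break; /* re-raise: leave the loop at once */
	}
	return iterations;
}
```

In this model a cancelled fiber leaves the loop after one iteration when
the failure is propagated, and burns through the whole timeout when it is
swallowed. With an effectively unbounded timeout the second case would look
like a hang.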

> Talking of compatibility - I think it always was supposed to throw.
> luaT_fiber_cond_wait() calls luaL_testcancel(), but only when
> fiber_cond_wait_timeout() returns not 0, which was never the case for
> cancellation. So it was supposed to throw, but nobody covered it with
> a test.
> 
> See 2 comments below.
> 
>> Closes #4834
>> Closes #5013
>> 
>> Co-authored-by: Oleg Babin <olegrok at tarantool.org>
>> diff --git a/src/box/box.cc b/src/box/box.cc
>> index 18568df3b..29f74e94b 100644
>> --- a/src/box/box.cc
>> +++ b/src/box/box.cc
>> @@ -305,10 +305,9 @@ box_wait_ro(bool ro, double timeout)
>> {
>> 	double deadline = ev_monotonic_now(loop()) + timeout;
>> 	while (is_box_configured == false || box_is_ro() != ro) {
>> -		if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0)
>> -			return -1;
>> -		if (fiber_is_cancelled()) {
>> -			diag_set(FiberIsCancelled);
>> +		if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0) {
>> +			if (fiber_is_cancelled())
>> +				diag_set(FiberIsCancelled);
>> 			return -1;
>> 		}
>> 	}
>> diff --git a/src/lib/core/fiber_cond.c b/src/lib/core/fiber_cond.c
>> index 904a350d9..cc59eaafb 100644
>> --- a/src/lib/core/fiber_cond.c
>> +++ b/src/lib/core/fiber_cond.c
>> @@ -108,6 +108,11 @@ fiber_cond_wait_timeout(struct fiber_cond *c, double timeout)
>> 		diag_set(TimedOut);
>> 		return -1;
>> 	}
>> +	if (fiber_is_cancelled()) {
>> +		if (diag_is_empty(diag_get()))
>> +                        diag_set(FiberIsCancelled);
>> +		return -1;
> 
> 1. Wtf? Why don't you set an error, when this is an error? And why do
> you do exactly the same in box_wait_ro() above?
> 
Fixed as per first part of review.

>> +	}
>> 	return 0;
>> }
>> diff --git a/test/box/net.box_fiber_cancel_gh-4834.result b/test/box/net.box_fiber_cancel_gh-4834.result
>> new file mode 100644
>> index 000000000..4ed04bb61
>> --- /dev/null
>> +++ b/test/box/net.box_fiber_cancel_gh-4834.result
> 
> 2. This does not conform to our test file name pattern. Read this
> carefully: https://github.com/tarantool/tarantool/wiki/Code-review-procedure#testing

Fixed, although we have a large legacy of similarly named tests under test/box/.

Branch is force-pushed, updated patch is below.
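
As a reading aid, here is a small self-contained C model of the return
contract I am aiming for. It is a sketch only: enum toy_error, struct
toy_ctx and the toy_* functions are hypothetical stand-ins, not the real
fiber or diag API. The wait checks the timeout first and cancellation
second; the caller, modelled on box_wait_ro(), reports cancellation
whenever the wait fails and the fiber is cancelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy error codes standing in for the contents of a diag. */
enum toy_error {
	TOY_NO_ERROR = 0,
	TOY_TIMED_OUT,
	TOY_FIBER_IS_CANCELLED,
};

/* Toy per-fiber state: cancellation flag plus the last recorded error. */
struct toy_ctx {
	bool cancelled;
	enum toy_error diag;
};

/*
 * Toy model of the wait: -1 with TOY_TIMED_OUT on timeout, -1 with
 * TOY_FIBER_IS_CANCELLED on cancellation, 0 otherwise.
 */
static int
toy_cond_wait_timeout(struct toy_ctx *f, bool timed_out)
{
	assert(f != NULL);
	if (timed_out) {
		f->diag = TOY_TIMED_OUT;
		return -1;
	}
	if (f->cancelled) {
		f->diag = TOY_FIBER_IS_CANCELLED;
		return -1;
	}
	return 0;
}

/*
 * Toy caller: any failed wait is an error, and a cancelled fiber always
 * reports cancellation, even if the wait itself timed out.
 */
static int
toy_wait_ro_step(struct toy_ctx *f, bool timed_out)
{
	if (toy_cond_wait_timeout(f, timed_out) != 0) {
		if (f->cancelled)
			f->diag = TOY_FIBER_IS_CANCELLED;
		return -1;
	}
	return 0;
}
```

The point of the model is that every -1 comes with an error recorded, so a
caller never has to guess whether a diag is present after a failed wait.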

===============
From 5e6426340f4f5af8429c4273f1b251f503c6dd9b Mon Sep 17 00:00:00 2001
From: Sergey Ostanevich <sergos at tarantool.org>
Date: Tue, 3 Nov 2020 12:52:26 +0300
Subject: [PATCH] fiber: handle fiber cancellation for fiber.cond

Before this patch, fiber.cond():wait() just returned for a cancelled
fiber. In contrast, fiber.channel():get() threw a "fiber is
cancelled" error.
This patch unifies the behaviour of channels and condvars and also fixes
a related net.box problem: it was impossible to interrupt a
net.box call with fiber.cancel because it used fiber.cond under
the hood. Test cases for both bugs are added.

Closes #4834
Closes #5013

Co-authored-by: Oleg Babin <olegrok at tarantool.org>

@TarantoolBot document
Title: fiber.cond():wait() throws if fiber is cancelled

fiber.cond():wait() now throws an error if the waiting fiber is
cancelled.
---

Branch: https://gitlab.com/tarantool/tarantool/-/commits/sergos/gh-5013-fiber-cond
Issue: https://github.com/tarantool/tarantool/issues/5013

@Changelog
* fiber.cond():wait() now throws if the fiber is cancelled

 src/box/box.cc                                |  7 +--
 src/box/relay.cc                              | 12 +++-
 src/lib/core/fiber_cond.c                     |  4 ++
 src/lib/core/fiber_cond.h                     |  2 +-
 test/app-tap/gh-5013-fiber-cancel.test.lua    | 23 +++++++
 test/box/gh-4834-netbox-fiber-cancel.result   | 62 +++++++++++++++++++
 test/box/gh-4834-netbox-fiber-cancel.test.lua | 28 +++++++++
 7 files changed, 132 insertions(+), 6 deletions(-)
 create mode 100755 test/app-tap/gh-5013-fiber-cancel.test.lua
 create mode 100644 test/box/gh-4834-netbox-fiber-cancel.result
 create mode 100644 test/box/gh-4834-netbox-fiber-cancel.test.lua

diff --git a/src/box/box.cc b/src/box/box.cc
index 1f7dec362..aa23dcc13 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -305,10 +305,9 @@ box_wait_ro(bool ro, double timeout)
 {
 	double deadline = ev_monotonic_now(loop()) + timeout;
 	while (is_box_configured == false || box_is_ro() != ro) {
-		if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0)
-			return -1;
-		if (fiber_is_cancelled()) {
-			diag_set(FiberIsCancelled);
+		if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0) {
+			if (fiber_is_cancelled())
+				diag_set(FiberIsCancelled);
 			return -1;
 		}
 	}
diff --git a/src/box/relay.cc b/src/box/relay.cc
index 1e77e0d9b..cce139c87 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -821,8 +821,18 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync,
 			      relay_subscribe_f, relay);
 	if (rc == 0)
 		rc = cord_cojoin(&relay->cord);
-	if (rc != 0)
+	if (rc != 0) {
+		/*
+		 * Always raise the error recorded by the relay itself, not one
+		 * set by other modules that changed the current fiber's diag.
+		 * TODO: investigate why and how relay_subscribe_f can exit
+		 * with the diag unset in the relay.
+		 */
+		if (diag_last_error(&relay->diag)) {
+			diag_set_error(diag_get(), diag_last_error(&relay->diag));
+		}
 		diag_raise();
+	}
 }
 
 static void
diff --git a/src/lib/core/fiber_cond.c b/src/lib/core/fiber_cond.c
index 904a350d9..71bb2d04d 100644
--- a/src/lib/core/fiber_cond.c
+++ b/src/lib/core/fiber_cond.c
@@ -108,6 +108,10 @@ fiber_cond_wait_timeout(struct fiber_cond *c, double timeout)
 		diag_set(TimedOut);
 		return -1;
 	}
+	if (fiber_is_cancelled()) {
+		diag_set(FiberIsCancelled);
+		return -1;
+	}
 	return 0;
 }
 
diff --git a/src/lib/core/fiber_cond.h b/src/lib/core/fiber_cond.h
index 87c6f2ca2..2662e0654 100644
--- a/src/lib/core/fiber_cond.h
+++ b/src/lib/core/fiber_cond.h
@@ -114,7 +114,7 @@ fiber_cond_broadcast(struct fiber_cond *cond);
  * @param cond condition
  * @param timeout timeout in seconds
  * @retval 0 on fiber_cond_signal() call or a spurious wake up
- * @retval -1 on timeout, diag is set to TimedOut
+ * @retval -1 on timeout or fiber cancellation, diag is set
  */
 int
 fiber_cond_wait_timeout(struct fiber_cond *cond, double timeout);
diff --git a/test/app-tap/gh-5013-fiber-cancel.test.lua b/test/app-tap/gh-5013-fiber-cancel.test.lua
new file mode 100755
index 000000000..ca4ca2c90
--- /dev/null
+++ b/test/app-tap/gh-5013-fiber-cancel.test.lua
@@ -0,0 +1,23 @@
+#!/usr/bin/env tarantool
+
+local tap = require('tap')
+local fiber = require('fiber')
+local test = tap.test("gh-5013-fiber-cancel")
+
+test:plan(2)
+
+local result = {}
+
+local function test_f()
+    local cond = fiber.cond()
+    local res, err = pcall(cond.wait, cond)
+    result.res = res
+    result.err = err
+end
+
+local f = fiber.create(test_f)
+f:cancel()
+fiber.yield()
+
+test:ok(result.res == false, 'expected result is false')
+test:ok(tostring(result.err) == 'fiber is cancelled', 'fiber cancellation should be reported')
diff --git a/test/box/gh-4834-netbox-fiber-cancel.result b/test/box/gh-4834-netbox-fiber-cancel.result
new file mode 100644
index 000000000..4ed04bb61
--- /dev/null
+++ b/test/box/gh-4834-netbox-fiber-cancel.result
@@ -0,0 +1,62 @@
+-- test-run result file version 2
+remote = require 'net.box'
+ | ---
+ | ...
+fiber = require 'fiber'
+ | ---
+ | ...
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+-- #4834: Cancelling fiber doesn't interrupt netbox operations
+function infinite_call() fiber.channel(1):get() end
+ | ---
+ | ...
+box.schema.func.create('infinite_call')
+ | ---
+ | ...
+box.schema.user.grant('guest', 'execute', 'function', 'infinite_call')
+ | ---
+ | ...
+
+error_msg = nil
+ | ---
+ | ...
+test_run:cmd("setopt delimiter ';'")
+ | ---
+ | - true
+ | ...
+function netbox_runner()
+    local cn = remote.connect(box.cfg.listen)
+    local f = fiber.new(function()
+        _, error_msg = pcall(cn.call, cn, 'infinite_call')
+    end)
+    f:set_joinable(true)
+    fiber.yield()
+    f:cancel()
+    f:join()
+    cn:close()
+end;
+ | ---
+ | ...
+test_run:cmd("setopt delimiter ''");
+ | ---
+ | - true
+ | ...
+netbox_runner()
+ | ---
+ | ...
+error_msg
+ | ---
+ | - fiber is cancelled
+ | ...
+box.schema.func.drop('infinite_call')
+ | ---
+ | ...
+infinite_call = nil
+ | ---
+ | ...
+error_msg = nil
+ | ---
+ | ...
diff --git a/test/box/gh-4834-netbox-fiber-cancel.test.lua b/test/box/gh-4834-netbox-fiber-cancel.test.lua
new file mode 100644
index 000000000..bc0e5af6e
--- /dev/null
+++ b/test/box/gh-4834-netbox-fiber-cancel.test.lua
@@ -0,0 +1,28 @@
+remote = require 'net.box'
+fiber = require 'fiber'
+test_run = require('test_run').new()
+
+-- #4834: Cancelling fiber doesn't interrupt netbox operations
+function infinite_call() fiber.channel(1):get() end
+box.schema.func.create('infinite_call')
+box.schema.user.grant('guest', 'execute', 'function', 'infinite_call')
+
+error_msg = nil
+test_run:cmd("setopt delimiter ';'")
+function netbox_runner()
+    local cn = remote.connect(box.cfg.listen)
+    local f = fiber.new(function()
+        _, error_msg = pcall(cn.call, cn, 'infinite_call')
+    end)
+    f:set_joinable(true)
+    fiber.yield()
+    f:cancel()
+    f:join()
+    cn:close()
+end;
+test_run:cmd("setopt delimiter ''");
+netbox_runner()
+error_msg
+box.schema.func.drop('infinite_call')
+infinite_call = nil
+error_msg = nil
-- 
2.24.3 (Apple Git-128)




