[Tarantool-patches] [PATCH v2 2/2] lua: abort trace recording on fiber yield

sergos at tarantool.org sergos at tarantool.org
Thu Sep 24 16:00:50 MSK 2020


Hi!

Thanks for the patch, see my 2 comments below.


> On 23 Sep 2020, at 22:06, Igor Munkin <imun at tarantool.org> wrote:
> 
> Since Tarantool fibers don't respect Lua coroutine switch mechanism, JIT
> machinery stays unnotified when one lua_State substitutes another one.
> As a result if trace recording hasn't been aborted prior to fiber
> switch, the recording proceeds using the new lua_State and leads to a
> failure either on any further compiler phase or while the compiled trace
> is executed.
> 
> This changeset extends <cord_on_yield> routine aborting trace recording
> when the fiber switches to another one. If the switch-over occurs while
> mcode is being run the platform finishes its execution with EXIT_FAILURE
> code and calls panic routine prior to the exit.
> 
> Closes #1700
> Fixes #4491
> 
> Signed-off-by: Igor Munkin <imun at tarantool.org>
> ---
> 
> Here are also the benchmark results for the Release build:
> * Vanilla (8477b6c) -> Patched (f1026a1) (min, median, mean, max):
> | fibers: 10; iters: 100	0%	0%	0%	0%
> | fibers: 10; iters: 1000	2%	2%	1%	2%
> | fibers: 10; iters: 10000	6%	4%	5%	5%
> | fibers: 10; iters: 100000	4%	3%	4%	1%
> | fibers: 100; iters: 100	1%	0%	1%	10%
> | fibers: 100; iters: 1000	2%	2%	0%	-7%
> | fibers: 100; iters: 10000	2%	1%	2%	1%
> | fibers: 100; iters: 100000	2%	2%	2%	5%
> | fibers: 1000; iters: 100	1%	0%	0%	-2%
> | fibers: 1000; iters: 1000	1%	1%	1%	0%
> | fibers: 1000; iters: 10000	1%	1%	1%	2%
> | fibers: 1000; iters: 100000	1%	1%	1%	1%
> | fibers: 10000; iters: 100	0%	0%	0%	1%
> | fibers: 10000; iters: 1000	0%	2%	3%	3%
> | fibers: 10000; iters: 10000	0%	0%	0%	-2%
> | fibers: 10000; iters: 100000	0%	1%	0%	-4%
> 
> src/lua/utils.c                               | 49 +++++++++++++++++
> ...-4491-coio-wait-leads-to-segfault.test.lua | 53 +++++++++++++++++++
> 2 files changed, 102 insertions(+)
> create mode 100755 test/app-tap/gh-4491-coio-wait-leads-to-segfault.test.lua
> 
> diff --git a/src/lua/utils.c b/src/lua/utils.c
> index 8e98f7607..39fe4e30c 100644
> --- a/src/lua/utils.c
> +++ b/src/lua/utils.c
> @@ -29,6 +29,7 @@
>  * SUCH DAMAGE.
>  */
> #include "lua/utils.h"
> +#include <lj_trace.h>
> 
> #include <assert.h>
> #include <errno.h>
> @@ -1309,10 +1310,57 @@ tarantool_lua_utils_init(struct lua_State *L)
> 	return 0;
> }
> 
> +/*
> + * XXX: There is already defined <panic> macro in say.h header
> + * (included in diag.h). As a result the call below is misexpanded
> + * and compilation fails with the corresponding error. To avoid
> + * this error the macro is temporary renamed and restored later.
> + * Compilation now fails for the following cases:
> + * * temporary name <_panic> is used in this translation unit

I would propose some more sophisitcated name to avoid possible overlap 
with ‘hidden’ _something convention. Since it is Tarantool sources, a
reference to the gh will be just fine, like panic_gh1700.

> + * * <panic> is not define, so this hack can be freely dropped
> + */
> +#if defined(panic) && !defined(_panic)
> +#  define _panic panic
> +#  undef panic
> +#else
> +#  error "Can't redefine <panic> macro"
> +#endif
> +
> /**
>  * This routine encloses the checks and actions to be done when
>  * the running fiber yields the execution.
> + * Since Tarantool fibers don't switch-over the way Lua coroutines
> + * do the platform ought to notify JIT engine when one lua_State
> + * substitutes another one.
>  */
> void cord_on_yield(void)
> {
> +	struct global_State *g = G(tarantool_L);
> +	/*
> +	 * XXX: Switching fibers while running the trace leads to
> +	 * code misbehaviour and failures, so stop its execution.
> +	 */
> +	if (unlikely(tvref(g->jit_base))) {
> +		/*
> +		 * XXX: mcode is executed only in scope of Lua
> +		 * world and one can obtain the corresponding Lua
> +		 * coroutine from the fiber storage.
> +		 */
> +		struct lua_State *L = fiber()->storage.lua.stack;
> +		assert(L != NULL);
> +		lua_pushstring(L, "Fiber is switched on the trace”);

To me it is very much ’Too long WAL write’ style. Everything said,
nothing is clear. Could you please elaborate on the ‘FFI code’ and 
‘led to a yield’ and probably the fiber number.

> +		if (g->panic)
> +			g->panic(L);
> +		exit(EXIT_FAILURE);
> +	}
> +	/*
> +	 * Unconditionally abort trace recording whether fibers
> +	 * switch each other. Otherwise, further compilation may
> +	 * lead to a failure on any next compiler phase.
> +	 */
> +	lj_trace_abort(g);
> }
> +
> +/* Restore <panic> macro back */
> +#define panic _panic
> +#undef _panic
> diff --git a/test/app-tap/gh-4491-coio-wait-leads-to-segfault.test.lua b/test/app-tap/gh-4491-coio-wait-leads-to-segfault.test.lua
> new file mode 100755
> index 000000000..0dd8dfbee
> --- /dev/null
> +++ b/test/app-tap/gh-4491-coio-wait-leads-to-segfault.test.lua
> @@ -0,0 +1,53 @@
> +#!/usr/bin/env tarantool
> +
> +local tap = require('tap')
> +
> +local test = tap.test('gh-4491-coio-wait-leads-to-segfault')
> +
> +-- Test file to demonstrate platform failure due to fiber switch
> +-- while trace recording, details:
> +--     https://github.com/tarantool/tarantool/issues/4491
> +
> +local fiber = require('fiber')
> +local ffi = require('ffi')
> +ffi.cdef('int coio_wait(int fd, int event, double timeout);')
> +
> +local cfg = {
> +  hotloop = arg[1] or 1,
> +  fibers = arg[1] or 2,
> +  timeout = { put = 1, get = 1 },
> +}
> +
> +test:plan(cfg.fibers + 1)
> +
> +local args = {
> +  fd      = 1   , -- STDIN file descriptor
> +  event   = 0x1 , -- COIO_READ event
> +  timeout = 0.05, -- Timeout value
> +}
> +
> +local function run(iterations, channel)
> +  for _ = 1, iterations do
> +    ffi.C.coio_wait(args.fd, args.event, args.timeout)
> +  end
> +  channel:put(true, cfg.timeout.put)
> +end
> +
> +local channels = { }
> +
> +jit.opt.start('3', string.format('hotloop=%d', cfg.hotloop))
> +
> +for _ = 1, cfg.fibers do
> +  channels[_] = fiber.channel(1)
> +  fiber.new(run, cfg.hotloop + 1, channels[_])
> +end
> +
> +-- Finalize the existing fibers
> +for _ = 1, cfg.fibers do
> +  test:ok(channels[_]:get(cfg.timeout.get),
> +          string.format('fiber #%d successfully finished', _))
> +end
> +
> +test:ok(true, 'trace is not recorded due to fiber switch underneath coio_wait')
> +
> +os.exit(test:check() and 0 or 1)
> -- 
> 2.25.0
> 



More information about the Tarantool-patches mailing list