[Tarantool-patches] [PATCH v2 2/2] lua: abort trace recording on fiber yield
sergos at tarantool.org
sergos at tarantool.org
Thu Sep 24 16:00:50 MSK 2020
Hi!
Thanks for the patch, see my 2 comments below.
> On 23 Sep 2020, at 22:06, Igor Munkin <imun at tarantool.org> wrote:
>
> Since Tarantool fibers don't respect Lua coroutine switch mechanism, JIT
> machinery stays unnotified when one lua_State substitutes another one.
> As a result if trace recording hasn't been aborted prior to fiber
> switch, the recording proceeds using the new lua_State and leads to a
> failure either on any further compiler phase or while the compiled trace
> is executed.
>
> This changeset extends <cord_on_yield> routine aborting trace recording
> when the fiber switches to another one. If the switch-over occurs while
> mcode is being run the platform finishes its execution with EXIT_FAILURE
> code and calls panic routine prior to the exit.
>
> Closes #1700
> Fixes #4491
>
> Signed-off-by: Igor Munkin <imun at tarantool.org>
> ---
>
> Here are also the benchmark results for the Release build:
> * Vanilla (8477b6c) -> Patched (f1026a1) (min, median, mean, max):
> | fibers: 10; iters: 100 0% 0% 0% 0%
> | fibers: 10; iters: 1000 2% 2% 1% 2%
> | fibers: 10; iters: 10000 6% 4% 5% 5%
> | fibers: 10; iters: 100000 4% 3% 4% 1%
> | fibers: 100; iters: 100 1% 0% 1% 10%
> | fibers: 100; iters: 1000 2% 2% 0% -7%
> | fibers: 100; iters: 10000 2% 1% 2% 1%
> | fibers: 100; iters: 100000 2% 2% 2% 5%
> | fibers: 1000; iters: 100 1% 0% 0% -2%
> | fibers: 1000; iters: 1000 1% 1% 1% 0%
> | fibers: 1000; iters: 10000 1% 1% 1% 2%
> | fibers: 1000; iters: 100000 1% 1% 1% 1%
> | fibers: 10000; iters: 100 0% 0% 0% 1%
> | fibers: 10000; iters: 1000 0% 2% 3% 3%
> | fibers: 10000; iters: 10000 0% 0% 0% -2%
> | fibers: 10000; iters: 100000 0% 1% 0% -4%
>
> src/lua/utils.c | 49 +++++++++++++++++
> ...-4491-coio-wait-leads-to-segfault.test.lua | 53 +++++++++++++++++++
> 2 files changed, 102 insertions(+)
> create mode 100755 test/app-tap/gh-4491-coio-wait-leads-to-segfault.test.lua
>
> diff --git a/src/lua/utils.c b/src/lua/utils.c
> index 8e98f7607..39fe4e30c 100644
> --- a/src/lua/utils.c
> +++ b/src/lua/utils.c
> @@ -29,6 +29,7 @@
> * SUCH DAMAGE.
> */
> #include "lua/utils.h"
> +#include <lj_trace.h>
>
> #include <assert.h>
> #include <errno.h>
> @@ -1309,10 +1310,57 @@ tarantool_lua_utils_init(struct lua_State *L)
> return 0;
> }
>
> +/*
> + * XXX: There is already defined <panic> macro in say.h header
> + * (included in diag.h). As a result the call below is misexpanded
> + * and compilation fails with the corresponding error. To avoid
> + * this error the macro is temporary renamed and restored later.
> + * Compilation now fails for the following cases:
> + * * temporary name <_panic> is used in this translation unit
I would propose some more sophisitcated name to avoid possible overlap
with ‘hidden’ _something convention. Since it is Tarantool sources, a
reference to the gh will be just fine, like panic_gh1700.
> + * * <panic> is not define, so this hack can be freely dropped
> + */
> +#if defined(panic) && !defined(_panic)
> +# define _panic panic
> +# undef panic
> +#else
> +# error "Can't redefine <panic> macro"
> +#endif
> +
> /**
> * This routine encloses the checks and actions to be done when
> * the running fiber yields the execution.
> + * Since Tarantool fibers don't switch-over the way Lua coroutines
> + * do the platform ought to notify JIT engine when one lua_State
> + * substitutes another one.
> */
> void cord_on_yield(void)
> {
> + struct global_State *g = G(tarantool_L);
> + /*
> + * XXX: Switching fibers while running the trace leads to
> + * code misbehaviour and failures, so stop its execution.
> + */
> + if (unlikely(tvref(g->jit_base))) {
> + /*
> + * XXX: mcode is executed only in scope of Lua
> + * world and one can obtain the corresponding Lua
> + * coroutine from the fiber storage.
> + */
> + struct lua_State *L = fiber()->storage.lua.stack;
> + assert(L != NULL);
> + lua_pushstring(L, "Fiber is switched on the trace”);
To me it is very much ’Too long WAL write’ style. Everything said,
nothing is clear. Could you please elaborate on the ‘FFI code’ and
‘led to a yield’ and probably the fiber number.
> + if (g->panic)
> + g->panic(L);
> + exit(EXIT_FAILURE);
> + }
> + /*
> + * Unconditionally abort trace recording whether fibers
> + * switch each other. Otherwise, further compilation may
> + * lead to a failure on any next compiler phase.
> + */
> + lj_trace_abort(g);
> }
> +
> +/* Restore <panic> macro back */
> +#define panic _panic
> +#undef _panic
> diff --git a/test/app-tap/gh-4491-coio-wait-leads-to-segfault.test.lua b/test/app-tap/gh-4491-coio-wait-leads-to-segfault.test.lua
> new file mode 100755
> index 000000000..0dd8dfbee
> --- /dev/null
> +++ b/test/app-tap/gh-4491-coio-wait-leads-to-segfault.test.lua
> @@ -0,0 +1,53 @@
> +#!/usr/bin/env tarantool
> +
> +local tap = require('tap')
> +
> +local test = tap.test('gh-4491-coio-wait-leads-to-segfault')
> +
> +-- Test file to demonstrate platform failure due to fiber switch
> +-- while trace recording, details:
> +-- https://github.com/tarantool/tarantool/issues/4491
> +
> +local fiber = require('fiber')
> +local ffi = require('ffi')
> +ffi.cdef('int coio_wait(int fd, int event, double timeout);')
> +
> +local cfg = {
> + hotloop = arg[1] or 1,
> + fibers = arg[1] or 2,
> + timeout = { put = 1, get = 1 },
> +}
> +
> +test:plan(cfg.fibers + 1)
> +
> +local args = {
> + fd = 1 , -- STDIN file descriptor
> + event = 0x1 , -- COIO_READ event
> + timeout = 0.05, -- Timeout value
> +}
> +
> +local function run(iterations, channel)
> + for _ = 1, iterations do
> + ffi.C.coio_wait(args.fd, args.event, args.timeout)
> + end
> + channel:put(true, cfg.timeout.put)
> +end
> +
> +local channels = { }
> +
> +jit.opt.start('3', string.format('hotloop=%d', cfg.hotloop))
> +
> +for _ = 1, cfg.fibers do
> + channels[_] = fiber.channel(1)
> + fiber.new(run, cfg.hotloop + 1, channels[_])
> +end
> +
> +-- Finalize the existing fibers
> +for _ = 1, cfg.fibers do
> + test:ok(channels[_]:get(cfg.timeout.get),
> + string.format('fiber #%d successfully finished', _))
> +end
> +
> +test:ok(true, 'trace is not recorded due to fiber switch underneath coio_wait')
> +
> +os.exit(test:check() and 0 or 1)
> --
> 2.25.0
>
More information about the Tarantool-patches
mailing list