[Tarantool-patches] [PATCH luajit 1/2] jit: abort trace recording and execution for C API

Igor Munkin imun at tarantool.org
Wed Apr 8 00:16:01 MSK 2020


Vlad,

Thanks for your review!

On 04.04.20, Vladislav Shpilevoy wrote:
> See 8 comments below.
> 
> > diff --git a/src/lj_api.c b/src/lj_api.c
> > index a5e2465..b543231 100644
> > --- a/src/lj_api.c
> > +++ b/src/lj_api.c
> > @@ -76,6 +76,18 @@ static GCtab *getcurrenv(lua_State *L)
> >    return fn->c.gct == ~LJ_TFUNC ? tabref(fn->c.env) : tabref(L->env);
> >  }
> >  
> > +static void jit_secure_call(lua_State *L, TValue *base, int nres) {
> 
> 1. Many functions in luajit related to jitting have prefix 'lj_'.
> Probably this should be applied here too. I didn't see any other
> starting with 'jit_'.

'lj_' prefix is used for extern functions in various units of internal
API. For static (i.e. private) routines no prefixes are used (consider
<index2adr>, <stkindex2adr>, <getcurrenv> helpers nearby in lj_api.c).
Originally, I named this function just <call>, but Sergos asked for a
more describing name in the neighbour thread and finally we chose this
one. I used underscores for readability, not for a prefix (I'm OK with
jitsecurecall though it much harder to parse :)). I *literally* OK with
any name if it is not misguiding one. Feel free to propose your naming
if you have a better one.

> 
> > +  global_State *g = G(L);
> > +  /* Forbid Lua world re-entrancy while running the trace */
> > +  if (tvref(g->jit_base)) {
> > +    setstrV(L, L->top++, lj_err_str(L, LJ_ERR_JITCALL));
> > +    if (g->panic) g->panic(L);
> > +    exit(EXIT_FAILURE);
> 
> 2. Is it correct, that it is allowed to renter Lua, and the panic
> will happen only if we see that during reentering the trace is
> still being recorded (i.e. if there is a bug in lj)?

Not quite right. there are two cases:

1. Trace is being recorded when re-entrancy occurs. In this case
lj_trace_abort macro asynchroniously aborts (i.e. signals to stop and
purge the result) trace recording prior to re-entering Lua VM.

2. Trace is being run when re-entrancy occurs. In this case jit_base
field is not NULL and contains an address to the guest stack frame base
on the moment trace started its execution. Unfortunately in this case we
can do nothing for now except panic.

> 
> I don't see in the code above where is that check like 'the trace
> is being recorded'. Could you please point at it?

Trace recording is aborted unconditionally via lj_trace_abort macro.

> 
> > +  }
> > +  lj_trace_abort(g);  /* Never record across Lua VM entrance */
> > +  lj_vm_call(L, base, nres);
> > +}
> > @@ -1172,7 +1191,7 @@ LUALIB_API int luaL_callmeta(lua_State *L, int idx, const char *field)
> >      if (LJ_FR2) setnilV(top++);
> >      copyTV(L, top++, index2adr(L, idx));
> >      L->top = top;
> > -    lj_vm_call(L, top-1, 1+1);
> > +    jit_secure_call(L, top-1, 1+1);
> >      return 1;
> >    }
> 
> 3. Why didn't you change lua_cpcall(), luaL_callmeta()?

OK, let's start with luaL_callmeta: I've checked twice and see it is
patched. Here is the corresponding diff chunk:

================================================================================

@@ -1172,7 +1191,7 @@ LUALIB_API int luaL_callmeta(lua_State *L, int idx, const char *field)
     if (LJ_FR2) setnilV(top++);
     copyTV(L, top++, index2adr(L, idx));
     L->top = top;
-    lj_vm_call(L, top-1, 1+1);
+    jit_secure_call(L, top-1, 1+1);
     return 1;
   }
   return 0;

================================================================================

Now I clarify the situation regarding lua_cpcall: Yes, you're *formally*
right it re-enters Lua VM.

However, it just prepares the safe frame, creates a GC object of Lua
function type using the given C function (strictly saying lua_CFunction)
and executes it via BC_FUNCC. Yes, BC_FUNCC is "recordable" for platform
builtins and "fast" functions (i.e. functions with success paths
implemented in DynASM, and fallbacks implemented in C, e.g. tonumber,
rawget). Thereby calling arbitrary "extern" C function aborts recording.

If the call occurs while running already compiled trace it doesn't break
the following:
* it respects caller-callee convention and all callee-safe registers are
  preserved
* it leaves the guest stack unchanged (since lua_cpcall allows no values
  to be returned)
* it doesn't change global fields required by trace (e.g. jit_base)

*BUT* BC_FUNCC semantics changes vmstate field, where currently running
trace number is stored and after C function yields the execution back to
the VM the vmstate is not restored to traceno. This value is used within
vm_exit_handler to obtain trace number, so further execution might
lead to the platform failure.

Long story short: thanks for your preciseness, I re-checked lua_cpcall
sematics while writing this section and here is a fix for lua_cpcall:

================================================================================

diff --git a/src/lj_api.c b/src/lj_api.c
index b543231..c9b5d22 100644
--- a/src/lj_api.c
+++ b/src/lj_api.c
@@ -1179,6 +1179,13 @@ LUA_API int lua_cpcall(lua_State *L, lua_CFunction func, void *ud)
   uint8_t oldh = hook_save(g);
   int status;
   api_check(L, L->status == LUA_OK || L->status == LUA_ERRERR);
+  /* Forbid Lua world re-entrancy while running the trace */
+  if (tvref(g->jit_base)) {
+    setstrV(L, L->top++, lj_err_str(L, LJ_ERR_JITCALL));
+    if (g->panic) g->panic(L);
+    exit(EXIT_FAILURE);
+  }
+  lj_trace_abort(g);  /* Never record across Lua VM entrance */
   status = lj_vm_cpcall(L, func, ud, cpcall);
   if (status) hook_restore(g, oldh);
   return status;

================================================================================

Squashed, force pushed to the upstream branch.

> 
> > diff --git a/test/gh-4427-ffi-sandwich/CMakeLists.txt b/test/gh-4427-ffi-sandwich/CMakeLists.txt
> > new file mode 100644
> > index 0000000..995c6bb
> > --- /dev/null
> > +++ b/test/gh-4427-ffi-sandwich/CMakeLists.txt
> 
> 4. So this is not built by luajit? Does not it have anything
> to be more self sufficient here? This looks strange to add
> a CMake file to a probject, which uses Makefiles.
> 
> Also this cmake does not work. I get error:
> 
> CMake Error at CMakeLists.txt:1 (build_lualib):
>   Unknown CMake command "build_lualib".
> 
> This is a clear cyclic dependency on Tarantool. Can it be
> built by luajit's own means?

No, it can't for now. LuaJIT tests are run by Tarantool testing
machinery and depends on it since[1].

> 
> This dependency becomes even stronger in the next commit,
> which introduces another CMake file.

I'm definitely going to rework LuaJIT testing environment to make it
more standalone and self-sufficient in the nearest future. Here is a
ticket for it[2].

> 
> > @@ -0,0 +1 @@
> > +build_lualib(libsandwich libsandwich.c)
> > diff --git a/test/gh-4427-ffi-sandwich/libsandwich.c b/test/gh-4427-ffi-sandwich/libsandwich.c
> > new file mode 100644
> > index 0000000..2a24fb2
> > --- /dev/null
> > +++ b/test/gh-4427-ffi-sandwich/libsandwich.c
> > @@ -0,0 +1,59 @@
> > +#include <lua.h>
> > +#include <lauxlib.h>
> > +
> > +struct sandwich {
> > +	lua_State *L; /* Coroutine saved for a Lua call */
> > +	int ref;      /* Anchor to the Lua function to be run */
> > +	int trigger;  /* Trigger for switching to Lua call */
> > +};
> > +
> > +int ipp(struct sandwich *state, int i)
> 
> 5. What is 'ipp'?

<ipp> stands for i-plus-plus. Does <increment> name look better?

> 
> > +{
> > +	if (i < state->trigger)
> > +		return i + 1;
> > +
> > +	/* Sandwich is triggered and Lua ipp function is called */
> > +	lua_pushnumber(state->L, state->ref);
> > +	lua_gettable(state->L, LUA_REGISTRYINDEX);
> > +	lua_pushnumber(state->L, i);
> > +	lua_call(state->L, 1, 1);
> > +	return lua_tonumber(state->L, -1);
> > 
> >> These changes look dangerous to perf. So if I had operators ==
> >> and < overloaded for an FFI struct, they are not jitted anymore,
> >> right?
> > 
> > Nope. I guess I need to stop right here and clarify the patch a little.
> > 
> > Let's start with the patch itself: the introduced changes abort the
> > trace recording whether re-entrancy to Lua VM is detected or "panic" if
> > such re-entrancy is faced while running already compiled machine code. A
> > good rationale for such protection is the platform failure Nikita faced
> > in #4427: the C function called via FFI makes the trigger with a Lua-C
> > API call underneath run. However it also might be any Lua-C API function
> > respecting the one of available metamethods (e.g. lua_less, lua_equal,
> > etc). If the correponding metamethod is set, the platform also re-enters
> > to execute it. Such behaviour is also unnacceptable, so metamethod call
> > ought to be invoked via jit_secure_call too. It's not optional and we
> > can't avoid it, since there are no guarantees the situation never
> > occurs.
> 
> 6. Yeah, ok, now I see, the changes functions are from the
> public C Lua API, and are called by external code only. Such
> as from tarantool. I was afraid, that lj uses them for all
> comparisons internally.
> 
> > OK, what about FFI: it's a well isolated machinery, that handles
> > everything its own way: own trace recording semantics, own metamethod
> > handling (since only tables and userdata can have an individual
> > metatable FFI has the one <ffi> "first level" metatable), etc. Nothing
> > FFI-related is changed here.
> 
> 7. If this has nothing to do with ffi, then why it is used
> in the test?

It is just a particular way to make compiler crazy (but I don't know
another one related to the case).

> 
> > If you run the following snippet, you'll see the artefacts for
> > successfully compiled trace in dump.out.
> > | require('jit.dump').start('+tbisrmXaT', 'dump.out')
> > | 
> > | local limit = -1ULL / 2
> > | local value = 1ULL
> > | local tests = { }
> > | 
> > | while value < limit do
> > |   value = value * 2
> > |   table.insert(tests, value < limit)
> > | end
> > | 
> > | jit.off() -- prevent spoiling dump.out with other traces.
> > | print(value, limit, #tests)
> > 
> > Since the patch relates only to Lua-C API, metamethods triggered while
> > running Lua code are also successfully compiled.
> > | require('jit.dump').start('+tbisrmXaT', 'dump.out')
> > | 
> > | local complex
> > | 
> > | local complex_mt = {
> > |   __add = function(a, b) return complex(a.re + b.re, a.im + b.im) end,
> > |   __eq  = function(a, b) return a.re == b.re and a.im == b.im end,
> > |   __tostring = function(o) return ('re: %d, im: %d'):format(o.re, o.im) end
> > | }
> > | 
> > | complex = function(re, im)
> > |   return setmetatable({ re = re, im = im }, complex_mt)
> > | end
> > | 
> > | local start = complex(0, 0)
> > | local finish = complex(64, 64)
> > | local step = complex(1, 1)
> > | 
> > | repeat
> > |   start = start + step
> > | until start == finish
> > 
> > Feel free to point the parts that looks good to be included into commit
> > message, comments, etc.
> > 
> >> Why? Does someone really yield in them?
> > 
> > The main idea is to protect ourselves in the future (we have already
> > discussed it with Kostja here[1]).
> > 
> >> Could you measure how does it affect perf of these calls?
> > 
> > Talking about perf: Sergos also asked to make benchmarks and I provided
> > the sysbench results[2] I obtained in Gitlab. I saw no dramatically
> > changes. I guess the reason the performance stays the same is that
> > all such places has been already handled manually in Tarantool sources
> > (like box_txn_rollback_to_savepoint after Leonid changes in #4427).
> > 
> > However, if you would like to see more benchmarks, please let me know.
> > 
> >> From what I see, these changes basically kill FFI making it not better
> >> than Lua C. It no longer helps JITting anything, does it?
> >>
> >> The same for concat, and other things. Looks like an overkill. We
> >> would need sandwich protection only for a few places. Is it possible
> >> to avoid such drastic changes?
> > 
> > [1]: https://lists.tarantool.org/pipermail/tarantool-patches/2019-November/012732.html
> > [2]: https://lists.tarantool.org/pipermail/tarantool-patches/2020-March/015236.html
> 8. So are you saying, that ffi part is still fast?

Yes. Nothing has changed, the patch just protects one from FFI misuse.

[1]: https://github.com/tarantool/tarantool/issues/4478
[2]: https://github.com/tarantool/tarantool/issues/4862

-- 
Best regards,
IM


More information about the Tarantool-patches mailing list