[Tarantool-patches] [PATCH] tuple: make tuple_bless() compilable
Sergey Kaplun
skaplun at tarantool.org
Wed Oct 27 16:16:43 MSK 2021
Igor,
Thanks for the review!
On 27.10.21, Igor Munkin wrote:
> Sergey,
>
> Thanks for the patch! LGTM with some nits below.
>
> On 22.10.21, Sergey Kaplun wrote:
> > tuple_bless() uses a tail call to ffi.gc() with return to the caller.
> > This tail call replaces the current (tuple_bless) frame with the frame
> > of the callee (ffi.gc). When JIT tries to compile return from `ffi.gc()`
> > to the frame below it aborts the trace recording with the error "NYI:
> > return to lower frame".
>
> Side note: for the root traces the issue is the same, but the error is
> different.
Yep.
>
> >
> > This patch replaces the tail call with using additional local variable
>
> Minor: You do not replace tail call, but rather don't give an option for
> LuaJIT to emit CALLT. Anyway, just being pedantic, feel free to ignore.
So, the CALLT is replaced with a regular call :). Ignoring.
>
> > returned to the caller right after.
> > ---
> >
> > Actually, this patch become possible thanks to Michael Filonenko and his
> > benchmarks of TDG runs with jit.dump() enabled. After analysis of this
> > dump we realize that tuple_bless is not compiled. This uncompiled chunk
> > of code leads to the JIT cancer for all possible workflows that use
> > tuple_bless() (i.e. tuple:update() and tuple:upsert()). This change is
> > really trivial, but adds almost x2 improvement of performance for
> > tuple:update()/upsert() scenario. Hope, that this patch will be a
> > stimulus for including benchmarks of our forward products like TDG to
> > routine performance running with the corresponding profilers dumps.
>
> Kekw, one-liner boosting update/upsert in two times -- nice catch!
> Anyway, please check that your change doesn't affect overall perfomance
> in interpreter mode too.
The new one (without tailcall) is 1% slower:
21.2 sec vs 21.0 sec with jit.off().
This looks like a good trade to me.
>
> The bad thing in this, that we have no regular Lua benchmarks at all
> (even those you provided below), so we can't watch the effect of such
> changes regularly.
It's true.
>
> >
> > Benchmarks:
> >
> > Before patch:
> >
> > Update:
> > | Tarantool 2.10.0-beta1-90-g31594b427
> > | type 'help' for interactive help
> > | tarantool> local t = {}
> > | for i = 1, 1e6 do
> > | table.insert(t, box.tuple.new{'abc', 'def', 'ghi', 'abc'})
> > | end
> > | local clock = require"clock"
> > | local S = clock.proc()
> > | for i = 1, 1e6 do t[i]:update{{"=", 3, "xxx"}} end
> > | return clock.proc() - S;
> > | ---
> > | - 4.208298872
> >
> > Upsert: 4.158661731
> >
> > After patch:
> >
> > Update:
> > | Tarantool 2.10.0-beta1-90-g31594b427
> > | type 'help' for interactive help
> > | tarantool> local t = {}
> > | for i = 1, 1e6 do
> > | table.insert(t, box.tuple.new{'abc', 'def', 'ghi', 'abc'})
> > | end
> > | local clock = require"clock"
> > | local S = clock.proc()
> > | for i = 1, 1e6 do t[i]:update{{"=", 3, "xxx"}} end
> > | return clock.proc() - S;
> > | ---
> > | - 2.357670738
> >
> > Upsert: 2.334134195
> >
> > Branch: https://github.com/tarantool/tarantool/tree/skaplun/gh-noticket-tuple-bless-compile
> >
> > src/box/lua/tuple.lua | 9 ++++++++-
> > 1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/src/box/lua/tuple.lua b/src/box/lua/tuple.lua
> > index fa76f4f7f..73446ab22 100644
> > --- a/src/box/lua/tuple.lua
> > +++ b/src/box/lua/tuple.lua
> > @@ -98,7 +98,14 @@ local tuple_bless = function(tuple)
> > -- overflow checked by tuple_bless() in C
> > builtin.box_tuple_ref(tuple)
> > -- must never fail:
> > - return ffi.gc(ffi.cast(const_tuple_ref_t, tuple), tuple_gc)
> > + -- XXX: If we use tail call (instead creating a new frame for
>
> Typo: s/instead/instead of/.
>
> > + -- a call just replace the top one) here, then JIT tries
>
> Minor: I see "replace" for the second time, but LuaJIT just "use" the
> caller frame for callee. I propose to s/replace/use/g, but this is
> neglible, so feel free to ignore.
>
> > + -- to compile return from `ffi.gc()` to the frame below. This
> > + -- abort the trace recording with the error "NYI: return to
>
> Typo: s/abort/aborts/.
Fixed your comments. See the iterative patch below.
Branch is force-pushed.
===================================================================
diff --git a/src/box/lua/tuple.lua b/src/box/lua/tuple.lua
index 1201c7c34..f47b5926d 100644
--- a/src/box/lua/tuple.lua
+++ b/src/box/lua/tuple.lua
@@ -98,10 +98,10 @@ local tuple_bless = function(tuple)
-- overflow checked by tuple_bless() in C
builtin.box_tuple_ref(tuple)
-- must never fail:
- -- XXX: If we use tail call (instead creating a new frame for
- -- a call just replace the top one) here, then JIT tries
- -- to compile return from `ffi.gc()` to the frame below. This
- -- abort the trace recording with the error "NYI: return to
+ -- XXX: If we use tail call (instead of creating a new frame
+ -- for a call just use the top one) here, then JIT tries to
+ -- compile return from `ffi.gc()` to the frame below. This
+ -- aborts the trace recording with the error "NYI: return to
-- lower frame". So avoid tail call and use additional stack
-- slots (for the local variable and the frame).
local tuple_ref = ffi.gc(ffi.cast(const_tuple_ref_t, tuple), tuple_gc)
===================================================================
And the new commit message (remove "replace" usage):
===================================================================
tuple: make tuple_bless() compilable
tuple_bless() uses a tail call to ffi.gc() with return to the caller.
This tail call uses the current (tuple_bless) frame instead of creating
the frame for the callee (ffi.gc). When JIT tries to compile return from
`ffi.gc()` to the frame below it aborts the trace recording with the
error "NYI: return to lower frame".
This patch replaces the tail call with using additional local variable
returned to the caller right after.
===================================================================
>
> > + -- lower frame". So avoid tail call and use additional stack
> > + -- slots (for the local variable and the frame).
> > + local tuple_ref = ffi.gc(ffi.cast(const_tuple_ref_t, tuple), tuple_gc)
> > + return tuple_ref
>
> Side note: Ugh... I'm sad we're doing things like this one. Complicating
> the code, leaving huge comments with the rationale of such complicating
> to reach the desirable (and what is important, local) performance. I
> propose to spend your innovative time to try solving the problem in the
> JIT engine: it will be more fun and allow us to avoid writing the
> cookbook "How to write super-duper-jittable code in LuaJIT".
Yes, it is ugly workaround. The true way is to resolve problem with
compiling of return to frame below one where trace was started.
>
> Here is the valid question: what about other hot places with CALLT in
> Tarantool? Should they be considered/fixed? I guess a ticket will help
> to not forget about this problem.
I suppose it should be created within the activity of regular testing our
most valuable products and "boxes".
>
> Anyway, for now the fix provides the considerable boost, so feel free to
> proceed with the patch.
Thanks!:)
>
> > end
> >
> > local tuple_check = function(tuple, usage)
> > --
> > 2.31.0
> >
>
> --
> Best regards,
> IM
--
Best regards,
Sergey Kaplun
More information about the Tarantool-patches
mailing list