* [Tarantool-patches] [PATCH] tuple: make tuple_bless() compilable
From: Sergey Kaplun via Tarantool-patches @ 2021-10-22 13:02 UTC
To: Nikita Pettik, Igor Munkin; +Cc: tarantool-patches

tuple_bless() uses a tail call to ffi.gc() with return to the caller.
This tail call replaces the current (tuple_bless) frame with the frame
of the callee (ffi.gc). When JIT tries to compile return from `ffi.gc()`
to the frame below it aborts the trace recording with the error "NYI:
return to lower frame".

This patch replaces the tail call with using additional local variable
returned to the caller right after.
---

Actually, this patch became possible thanks to Michael Filonenko and his
benchmarks of TDG runs with jit.dump() enabled. After analysis of this
dump we realized that tuple_bless is not compiled. This uncompiled chunk
of code leads to the JIT cancer for all possible workflows that use
tuple_bless() (i.e. tuple:update() and tuple:upsert()). This change is
really trivial, but adds almost x2 improvement of performance for
tuple:update()/upsert() scenario. I hope that this patch will be a
stimulus for including benchmarks of our forward products like TDG to
routine performance runs with the corresponding profiler dumps.

Benchmarks:

Before patch:

Update:
| Tarantool 2.10.0-beta1-90-g31594b427
| type 'help' for interactive help
| tarantool> local t = {}
| for i = 1, 1e6 do
|     table.insert(t, box.tuple.new{'abc', 'def', 'ghi', 'abc'})
| end
| local clock = require"clock"
| local S = clock.proc()
| for i = 1, 1e6 do t[i]:update{{"=", 3, "xxx"}} end
| return clock.proc() - S;
| ---
| - 4.208298872

Upsert: 4.158661731

After patch:

Update:
| Tarantool 2.10.0-beta1-90-g31594b427
| type 'help' for interactive help
| tarantool> local t = {}
| for i = 1, 1e6 do
|     table.insert(t, box.tuple.new{'abc', 'def', 'ghi', 'abc'})
| end
| local clock = require"clock"
| local S = clock.proc()
| for i = 1, 1e6 do t[i]:update{{"=", 3, "xxx"}} end
| return clock.proc() - S;
| ---
| - 2.357670738

Upsert: 2.334134195

Branch: https://github.com/tarantool/tarantool/tree/skaplun/gh-noticket-tuple-bless-compile

 src/box/lua/tuple.lua | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/src/box/lua/tuple.lua b/src/box/lua/tuple.lua
index fa76f4f7f..73446ab22 100644
--- a/src/box/lua/tuple.lua
+++ b/src/box/lua/tuple.lua
@@ -98,7 +98,14 @@ local tuple_bless = function(tuple)
     -- overflow checked by tuple_bless() in C
     builtin.box_tuple_ref(tuple)
     -- must never fail:
-    return ffi.gc(ffi.cast(const_tuple_ref_t, tuple), tuple_gc)
+    -- XXX: If we use tail call (instead creating a new frame for
+    -- a call just replace the top one) here, then JIT tries
+    -- to compile return from `ffi.gc()` to the frame below. This
+    -- abort the trace recording with the error "NYI: return to
+    -- lower frame". So avoid tail call and use additional stack
+    -- slots (for the local variable and the frame).
+    local tuple_ref = ffi.gc(ffi.cast(const_tuple_ref_t, tuple), tuple_gc)
+    return tuple_ref
 end

 local tuple_check = function(tuple, usage)
--
2.31.0
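Editorial sketch (not part of the patch): the change above boils down to
the difference between the two function shapes below. This is plain Lua,
and `attach_finalizer`, `bless_tail` and `bless_local` are hypothetical
names used only for illustration; `attach_finalizer` merely stands in for
ffi.gc() and does not register any finalizer.

```lua
-- Hypothetical stand-in for ffi.gc(): returns its first argument as is.
local function attach_finalizer(obj, finalizer)
    return obj
end

-- Tail-call shape (as before the patch): `return f(...)` is a tail call,
-- so the callee reuses the caller's frame.
local function bless_tail(obj)
    return attach_finalizer(obj, print)
end

-- Non-tail shape (as after the patch): the result is stored in a local
-- first, so a regular call is made and the caller's frame stays alive
-- until the explicit return.
local function bless_local(obj)
    local ref = attach_finalizer(obj, print)
    return ref
end

-- Both shapes return the same value; only the call mechanics differ.
local t = {}
assert(bless_tail(t) == t)
assert(bless_local(t) == t)
```

Assuming a stock LuaJIT binary, listing the bytecode with `luajit -bl`
should show CALLT for the first shape and a regular CALL followed by a
return for the second one.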
* Re: [Tarantool-patches] [PATCH] tuple: make tuple_bless() compilable
From: Igor Munkin via Tarantool-patches @ 2021-10-27 11:06 UTC
To: Sergey Kaplun; +Cc: tarantool-patches

Sergey,

Thanks for the patch! LGTM with some nits below.

On 22.10.21, Sergey Kaplun wrote:
> tuple_bless() uses a tail call to ffi.gc() with return to the caller.
> This tail call replaces the current (tuple_bless) frame with the frame
> of the callee (ffi.gc). When JIT tries to compile return from `ffi.gc()`
> to the frame below it aborts the trace recording with the error "NYI:
> return to lower frame".

Side note: for the root traces the issue is the same, but the error is
different.

>
> This patch replaces the tail call with using additional local variable

Minor: You do not replace tail call, but rather don't give an option for
LuaJIT to emit CALLT. Anyway, just being pedantic, feel free to ignore.

> returned to the caller right after.
> ---
>
> Actually, this patch became possible thanks to Michael Filonenko and his
> benchmarks of TDG runs with jit.dump() enabled. After analysis of this
> dump we realized that tuple_bless is not compiled. This uncompiled chunk
> of code leads to the JIT cancer for all possible workflows that use
> tuple_bless() (i.e. tuple:update() and tuple:upsert()). This change is
> really trivial, but adds almost x2 improvement of performance for
> tuple:update()/upsert() scenario. I hope that this patch will be a
> stimulus for including benchmarks of our forward products like TDG to
> routine performance runs with the corresponding profiler dumps.

Kekw, a one-liner boosting update/upsert by a factor of two -- nice
catch! Anyway, please check that your change doesn't affect overall
performance in interpreter mode too.

The bad thing is that we have no regular Lua benchmarks at all (even
those you provided below), so we can't watch the effect of such changes
regularly.

>
> Benchmarks:
>
> Before patch:
>
> Update:
> | Tarantool 2.10.0-beta1-90-g31594b427
> | type 'help' for interactive help
> | tarantool> local t = {}
> | for i = 1, 1e6 do
> |     table.insert(t, box.tuple.new{'abc', 'def', 'ghi', 'abc'})
> | end
> | local clock = require"clock"
> | local S = clock.proc()
> | for i = 1, 1e6 do t[i]:update{{"=", 3, "xxx"}} end
> | return clock.proc() - S;
> | ---
> | - 4.208298872
>
> Upsert: 4.158661731
>
> After patch:
>
> Update:
> | Tarantool 2.10.0-beta1-90-g31594b427
> | type 'help' for interactive help
> | tarantool> local t = {}
> | for i = 1, 1e6 do
> |     table.insert(t, box.tuple.new{'abc', 'def', 'ghi', 'abc'})
> | end
> | local clock = require"clock"
> | local S = clock.proc()
> | for i = 1, 1e6 do t[i]:update{{"=", 3, "xxx"}} end
> | return clock.proc() - S;
> | ---
> | - 2.357670738
>
> Upsert: 2.334134195
>
> Branch: https://github.com/tarantool/tarantool/tree/skaplun/gh-noticket-tuple-bless-compile
>
>  src/box/lua/tuple.lua | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/src/box/lua/tuple.lua b/src/box/lua/tuple.lua
> index fa76f4f7f..73446ab22 100644
> --- a/src/box/lua/tuple.lua
> +++ b/src/box/lua/tuple.lua
> @@ -98,7 +98,14 @@ local tuple_bless = function(tuple)
>      -- overflow checked by tuple_bless() in C
>      builtin.box_tuple_ref(tuple)
>      -- must never fail:
> -    return ffi.gc(ffi.cast(const_tuple_ref_t, tuple), tuple_gc)
> +    -- XXX: If we use tail call (instead creating a new frame for

Typo: s/instead/instead of/.

> +    -- a call just replace the top one) here, then JIT tries

Minor: I see "replace" for the second time, but LuaJIT just "uses" the
caller frame for the callee. I propose to s/replace/use/g, but this is
negligible, so feel free to ignore.

> +    -- to compile return from `ffi.gc()` to the frame below. This
> +    -- abort the trace recording with the error "NYI: return to

Typo: s/abort/aborts/.

> +    -- lower frame". So avoid tail call and use additional stack
> +    -- slots (for the local variable and the frame).
> +    local tuple_ref = ffi.gc(ffi.cast(const_tuple_ref_t, tuple), tuple_gc)
> +    return tuple_ref

Side note: Ugh... I'm sad we're doing things like this one. Complicating
the code, leaving huge comments with the rationale of such complicating
to reach the desirable (and what is important, local) performance. I
propose to spend your innovative time to try solving the problem in the
JIT engine: it will be more fun and allow us to avoid writing the
cookbook "How to write super-duper-jittable code in LuaJIT".

Here is the valid question: what about other hot places with CALLT in
Tarantool? Should they be considered/fixed? I guess a ticket will help
to not forget about this problem.

Anyway, for now the fix provides the considerable boost, so feel free to
proceed with the patch.

>  end
>
>  local tuple_check = function(tuple, usage)
> --
> 2.31.0
>

--
Best regards,
IM
* Re: [Tarantool-patches] [PATCH] tuple: make tuple_bless() compilable
From: Sergey Kaplun via Tarantool-patches @ 2021-10-27 13:16 UTC
To: Igor Munkin; +Cc: tarantool-patches

Igor,

Thanks for the review!

On 27.10.21, Igor Munkin wrote:
> Sergey,
>
> Thanks for the patch! LGTM with some nits below.
>
> On 22.10.21, Sergey Kaplun wrote:
> > tuple_bless() uses a tail call to ffi.gc() with return to the caller.
> > This tail call replaces the current (tuple_bless) frame with the frame
> > of the callee (ffi.gc). When JIT tries to compile return from `ffi.gc()`
> > to the frame below it aborts the trace recording with the error "NYI:
> > return to lower frame".
>
> Side note: for the root traces the issue is the same, but the error is
> different.

Yep.

>
> >
> > This patch replaces the tail call with using additional local variable
>
> Minor: You do not replace tail call, but rather don't give an option for
> LuaJIT to emit CALLT. Anyway, just being pedantic, feel free to ignore.

So, the CALLT is replaced with a regular call :). Ignoring.

>
> > returned to the caller right after.
> > ---
> >
> > Actually, this patch became possible thanks to Michael Filonenko and his
> > benchmarks of TDG runs with jit.dump() enabled. After analysis of this
> > dump we realized that tuple_bless is not compiled. This uncompiled chunk
> > of code leads to the JIT cancer for all possible workflows that use
> > tuple_bless() (i.e. tuple:update() and tuple:upsert()). This change is
> > really trivial, but adds almost x2 improvement of performance for
> > tuple:update()/upsert() scenario. I hope that this patch will be a
> > stimulus for including benchmarks of our forward products like TDG to
> > routine performance runs with the corresponding profiler dumps.
>
> Kekw, a one-liner boosting update/upsert by a factor of two -- nice
> catch! Anyway, please check that your change doesn't affect overall
> performance in interpreter mode too.

The new one (without tailcall) is 1% slower: 21.2 sec vs 21.0 sec with
jit.off(). This looks like a good trade to me.

>
> The bad thing is that we have no regular Lua benchmarks at all (even
> those you provided below), so we can't watch the effect of such changes
> regularly.

It's true.

>
> >
> > Benchmarks:
> >
> > Before patch:
> >
> > Update:
> > | Tarantool 2.10.0-beta1-90-g31594b427
> > | type 'help' for interactive help
> > | tarantool> local t = {}
> > | for i = 1, 1e6 do
> > |     table.insert(t, box.tuple.new{'abc', 'def', 'ghi', 'abc'})
> > | end
> > | local clock = require"clock"
> > | local S = clock.proc()
> > | for i = 1, 1e6 do t[i]:update{{"=", 3, "xxx"}} end
> > | return clock.proc() - S;
> > | ---
> > | - 4.208298872
> >
> > Upsert: 4.158661731
> >
> > After patch:
> >
> > Update:
> > | Tarantool 2.10.0-beta1-90-g31594b427
> > | type 'help' for interactive help
> > | tarantool> local t = {}
> > | for i = 1, 1e6 do
> > |     table.insert(t, box.tuple.new{'abc', 'def', 'ghi', 'abc'})
> > | end
> > | local clock = require"clock"
> > | local S = clock.proc()
> > | for i = 1, 1e6 do t[i]:update{{"=", 3, "xxx"}} end
> > | return clock.proc() - S;
> > | ---
> > | - 2.357670738
> >
> > Upsert: 2.334134195
> >
> > Branch: https://github.com/tarantool/tarantool/tree/skaplun/gh-noticket-tuple-bless-compile
> >
> >  src/box/lua/tuple.lua | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/src/box/lua/tuple.lua b/src/box/lua/tuple.lua
> > index fa76f4f7f..73446ab22 100644
> > --- a/src/box/lua/tuple.lua
> > +++ b/src/box/lua/tuple.lua
> > @@ -98,7 +98,14 @@ local tuple_bless = function(tuple)
> >      -- overflow checked by tuple_bless() in C
> >      builtin.box_tuple_ref(tuple)
> >      -- must never fail:
> > -    return ffi.gc(ffi.cast(const_tuple_ref_t, tuple), tuple_gc)
> > +    -- XXX: If we use tail call (instead creating a new frame for
>
> Typo: s/instead/instead of/.
>
> > +    -- a call just replace the top one) here, then JIT tries
>
> Minor: I see "replace" for the second time, but LuaJIT just "uses" the
> caller frame for the callee. I propose to s/replace/use/g, but this is
> negligible, so feel free to ignore.
>
> > +    -- to compile return from `ffi.gc()` to the frame below. This
> > +    -- abort the trace recording with the error "NYI: return to
>
> Typo: s/abort/aborts/.

Fixed your comments. See the iterative patch below. Branch is
force-pushed.

===================================================================
diff --git a/src/box/lua/tuple.lua b/src/box/lua/tuple.lua
index 1201c7c34..f47b5926d 100644
--- a/src/box/lua/tuple.lua
+++ b/src/box/lua/tuple.lua
@@ -98,10 +98,10 @@ local tuple_bless = function(tuple)
     -- overflow checked by tuple_bless() in C
     builtin.box_tuple_ref(tuple)
     -- must never fail:
-    -- XXX: If we use tail call (instead creating a new frame for
-    -- a call just replace the top one) here, then JIT tries
-    -- to compile return from `ffi.gc()` to the frame below. This
-    -- abort the trace recording with the error "NYI: return to
+    -- XXX: If we use tail call (instead of creating a new frame
+    -- for a call just use the top one) here, then JIT tries to
+    -- compile return from `ffi.gc()` to the frame below. This
+    -- aborts the trace recording with the error "NYI: return to
     -- lower frame". So avoid tail call and use additional stack
     -- slots (for the local variable and the frame).
     local tuple_ref = ffi.gc(ffi.cast(const_tuple_ref_t, tuple), tuple_gc)
===================================================================

And the new commit message (remove "replace" usage):

===================================================================
tuple: make tuple_bless() compilable

tuple_bless() uses a tail call to ffi.gc() with return to the caller.
This tail call uses the current (tuple_bless) frame instead of creating
the frame for the callee (ffi.gc). When JIT tries to compile return from
`ffi.gc()` to the frame below it aborts the trace recording with the
error "NYI: return to lower frame".

This patch replaces the tail call with using additional local variable
returned to the caller right after.
===================================================================

>
> > +    -- lower frame". So avoid tail call and use additional stack
> > +    -- slots (for the local variable and the frame).
> > +    local tuple_ref = ffi.gc(ffi.cast(const_tuple_ref_t, tuple), tuple_gc)
> > +    return tuple_ref
>
> Side note: Ugh... I'm sad we're doing things like this one. Complicating
> the code, leaving huge comments with the rationale of such complicating
> to reach the desirable (and what is important, local) performance. I
> propose to spend your innovative time to try solving the problem in the
> JIT engine: it will be more fun and allow us to avoid writing the
> cookbook "How to write super-duper-jittable code in LuaJIT".

Yes, it is an ugly workaround. The proper way is to resolve the problem
with compiling a return to a frame below the one where the trace was
started.

>
> Here is the valid question: what about other hot places with CALLT in
> Tarantool? Should they be considered/fixed? I guess a ticket will help
> to not forget about this problem.

I suppose it should be created within the activity of regular testing of
our most valuable products and "boxes".

>
> Anyway, for now the fix provides the considerable boost, so feel free to
> proceed with the patch.

Thanks!:)

>
> >  end
> >
> >  local tuple_check = function(tuple, usage)
> > --
> > 2.31.0
> >
>
> --
> Best regards,
> IM

--
Best regards,
Sergey Kaplun
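Editorial sketch (not part of the thread): the interpreter-mode check
requested in the review can be approximated by timing the same workload
with the JIT compiler on and off. The helper `time_it` and the function
`workload` are hypothetical; `workload` is only a placeholder for the
tuple:update() loop from the benchmark, and os.clock() is used instead of
Tarantool's clock.proc() so that the snippet also runs outside Tarantool.
The timings it prints are not comparable to the figures quoted above.

```lua
-- Hypothetical helper: run fn(i) for i = 1..n and return the CPU time
-- spent, as reported by os.clock().
local function time_it(fn, n)
    local start = os.clock()
    for i = 1, n do
        fn(i)
    end
    return os.clock() - start
end

-- Hypothetical workload; a real check would call tuple:update() here.
local function workload(i)
    return i * 2
end

-- The `jit` global exists only under LuaJIT (and therefore Tarantool).
local has_jit = type(jit) == 'table'

local compiled = time_it(workload, 1e6)

if has_jit then jit.off() end   -- interpreter only from here on
local interpreted = time_it(workload, 1e6)
if has_jit then jit.on() end    -- restore the default mode

print(('jit on: %.3f s, jit off: %.3f s'):format(compiled, interpreted))
```

Running the patched and the unpatched tuple.lua through such a harness
with the JIT off gives the kind of comparison reported in the reply
above.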
* Re: [Tarantool-patches] [PATCH] tuple: make tuple_bless() compilable
From: Nikita Pettik via Tarantool-patches @ 2021-11-08 12:35 UTC
To: Igor Munkin; +Cc: tarantool-patches

On 27 Oct 14:06, Igor Munkin wrote:
> > +    -- lower frame". So avoid tail call and use additional stack
> > +    -- slots (for the local variable and the frame).
> > +    local tuple_ref = ffi.gc(ffi.cast(const_tuple_ref_t, tuple), tuple_gc)
> > +    return tuple_ref
>
> Side note: Ugh... I'm sad we're doing things like this one. Complicating
> the code, leaving huge comments with the rationale of such complicating
> to reach the desirable (and what is important, local) performance. I
> propose to spend your innovative time to try solving the problem in the
> JIT engine: it will be more fun and allow us to avoid writing the
> cookbook "How to write super-duper-jittable code in LuaJIT".
>
> Here is the valid question: what about other hot places with CALLT in
> Tarantool? Should they be considered/fixed? I guess a ticket will help
> to not forget about this problem.
>
> Anyway, for now the fix provides the considerable boost, so feel free to
> proceed with the patch.

Said today exactly the same:) LGTM as well.
* Re: [Tarantool-patches] [PATCH] tuple: make tuple_bless() compilable
From: Igor Munkin via Tarantool-patches @ 2021-11-10 12:31 UTC
To: Sergey Kaplun; +Cc: tarantool-patches

Sergey,

I've checked the patch into master and 2.8.

On 22.10.21, Sergey Kaplun wrote:
> tuple_bless() uses a tail call to ffi.gc() with return to the caller.
> This tail call replaces the current (tuple_bless) frame with the frame
> of the callee (ffi.gc). When JIT tries to compile return from `ffi.gc()`
> to the frame below it aborts the trace recording with the error "NYI:
> return to lower frame".
>
> This patch replaces the tail call with using additional local variable
> returned to the caller right after.
> ---
>
> Actually, this patch became possible thanks to Michael Filonenko and his
> benchmarks of TDG runs with jit.dump() enabled. After analysis of this
> dump we realized that tuple_bless is not compiled. This uncompiled chunk
> of code leads to the JIT cancer for all possible workflows that use
> tuple_bless() (i.e. tuple:update() and tuple:upsert()). This change is
> really trivial, but adds almost x2 improvement of performance for
> tuple:update()/upsert() scenario. I hope that this patch will be a
> stimulus for including benchmarks of our forward products like TDG to
> routine performance runs with the corresponding profiler dumps.
>
> Benchmarks:
>
> Before patch:
>
> Update:
> | Tarantool 2.10.0-beta1-90-g31594b427
> | type 'help' for interactive help
> | tarantool> local t = {}
> | for i = 1, 1e6 do
> |     table.insert(t, box.tuple.new{'abc', 'def', 'ghi', 'abc'})
> | end
> | local clock = require"clock"
> | local S = clock.proc()
> | for i = 1, 1e6 do t[i]:update{{"=", 3, "xxx"}} end
> | return clock.proc() - S;
> | ---
> | - 4.208298872
>
> Upsert: 4.158661731
>
> After patch:
>
> Update:
> | Tarantool 2.10.0-beta1-90-g31594b427
> | type 'help' for interactive help
> | tarantool> local t = {}
> | for i = 1, 1e6 do
> |     table.insert(t, box.tuple.new{'abc', 'def', 'ghi', 'abc'})
> | end
> | local clock = require"clock"
> | local S = clock.proc()
> | for i = 1, 1e6 do t[i]:update{{"=", 3, "xxx"}} end
> | return clock.proc() - S;
> | ---
> | - 2.357670738
>
> Upsert: 2.334134195
>
> Branch: https://github.com/tarantool/tarantool/tree/skaplun/gh-noticket-tuple-bless-compile
>
>  src/box/lua/tuple.lua | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>

<snipped>

> --
> 2.31.0
>

--
Best regards,
IM