* [Tarantool-patches] [PATCH] tuple: make tuple_bless() compilable
@ 2021-10-22 13:02 Sergey Kaplun via Tarantool-patches
From: Sergey Kaplun via Tarantool-patches @ 2021-10-22 13:02 UTC
  To: Nikita Pettik, Igor Munkin; +Cc: tarantool-patches
tuple_bless() uses a tail call to ffi.gc() with return to the caller.
This tail call replaces the current (tuple_bless) frame with the frame
of the callee (ffi.gc). When JIT tries to compile return from `ffi.gc()`
to the frame below it aborts the trace recording with the error "NYI:
return to lower frame".
This patch replaces the tail call with using additional local variable
returned to the caller right after.
---
Actually, this patch became possible thanks to Michael Filonenko and his
benchmarks of TDG runs with jit.dump() enabled. After analyzing this
dump, we realized that tuple_bless() is not compiled. This uncompiled chunk
of code leads to JIT cancer for all possible workflows that use
tuple_bless() (i.e. tuple:update() and tuple:upsert()). This change is
really trivial, but gives an almost 2x performance improvement for the
tuple:update()/upsert() scenario. I hope that this patch will be a
stimulus for including benchmarks of our forward products like TDG in
routine performance runs with the corresponding profiler dumps.
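For anyone who wants to repeat this kind of analysis, here is a minimal sketch of capturing a trace dump. It assumes the stock LuaJIT "jit.dump" module is loadable via require(); `workload` and the temporary output file are hypothetical placeholders and have nothing to do with the actual TDG benchmarks.

```lua
-- Sketch only: assumes the stock LuaJIT "jit.dump" module can be required.
-- `workload` is a hypothetical placeholder for the code under investigation.
local function workload()
    local acc = 0
    for i = 1, 100000 do acc = acc + i % 7 end
    return acc
end

local dump_path = os.tmpname()
local ok, dump = pcall(require, "jit.dump")
if ok then
    -- "t" prints one line per started, finished or aborted trace.
    dump.on("t", dump_path)
end

local result = workload()

if ok then
    dump.off()
    -- Aborted traces show up as lines containing "abort" and the reason,
    -- e.g. an "NYI: ..." message.
    for line in io.lines(dump_path) do
        if line:find("abort", 1, true) then print(line) end
    end
end
os.remove(dump_path)
```

On a plain Lua interpreter the require() fails and the sketch just runs the workload; under LuaJIT any aborted trace is printed together with its reason.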
Benchmarks:
Before patch:
Update:
| Tarantool 2.10.0-beta1-90-g31594b427
| type 'help' for interactive help
| tarantool> local t = {}
|            for i = 1, 1e6 do
|                table.insert(t, box.tuple.new{'abc', 'def', 'ghi', 'abc'})
|            end
|            local clock = require"clock"
|            local S = clock.proc()
|            for i = 1, 1e6 do t[i]:update{{"=", 3, "xxx"}} end
|            return clock.proc() - S;
| ---
| - 4.208298872
Upsert: 4.158661731
After patch:
Update:
| Tarantool 2.10.0-beta1-90-g31594b427
| type 'help' for interactive help
| tarantool> local t = {}
|            for i = 1, 1e6 do
|                table.insert(t, box.tuple.new{'abc', 'def', 'ghi', 'abc'})
|            end
|            local clock = require"clock"
|            local S = clock.proc()
|            for i = 1, 1e6 do t[i]:update{{"=", 3, "xxx"}} end
|            return clock.proc() - S;
| ---
| - 2.357670738
Upsert: 2.334134195
Branch: https://github.com/tarantool/tarantool/tree/skaplun/gh-noticket-tuple-bless-compile
 src/box/lua/tuple.lua | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/src/box/lua/tuple.lua b/src/box/lua/tuple.lua
index fa76f4f7f..73446ab22 100644
--- a/src/box/lua/tuple.lua
+++ b/src/box/lua/tuple.lua
@@ -98,7 +98,14 @@ local tuple_bless = function(tuple)
     -- overflow checked by tuple_bless() in C
     builtin.box_tuple_ref(tuple)
     -- must never fail:
-    return ffi.gc(ffi.cast(const_tuple_ref_t, tuple), tuple_gc)
+    -- XXX: If we use tail call (instead creating a new frame for
+    -- a call just replace the top one) here, then JIT tries
+    -- to compile return from `ffi.gc()` to the frame below. This
+    -- abort the trace recording with the error "NYI: return to
+    -- lower frame". So avoid tail call and use additional stack
+    -- slots (for the local variable and the frame).
+    local tuple_ref = ffi.gc(ffi.cast(const_tuple_ref_t, tuple), tuple_gc)
+    return tuple_ref
 end
 
 local tuple_check = function(tuple, usage)
-- 
2.31.0
* Re: [Tarantool-patches] [PATCH] tuple: make tuple_bless() compilable
From: Igor Munkin via Tarantool-patches @ 2021-10-27 11:06 UTC
  To: Sergey Kaplun; +Cc: tarantool-patches
Sergey,
Thanks for the patch! LGTM with some nits below.
On 22.10.21, Sergey Kaplun wrote:
> tuple_bless() uses a tail call to ffi.gc() with return to the caller.
> This tail call replaces the current (tuple_bless) frame with the frame
> of the callee (ffi.gc). When JIT tries to compile return from `ffi.gc()`
> to the frame below it aborts the trace recording with the error "NYI:
> return to lower frame".
Side note: for the root traces the issue is the same, but the error is
different.
> 
> This patch replaces the tail call with using additional local variable
Minor: You do not replace the tail call, but rather give LuaJIT no option
to emit CALLT. Anyway, just being pedantic, feel free to ignore.
> returned to the caller right after.
> ---
> 
> Actually, this patch became possible thanks to Michael Filonenko and his
> benchmarks of TDG runs with jit.dump() enabled. After analyzing this
> dump, we realized that tuple_bless() is not compiled. This uncompiled chunk
> of code leads to JIT cancer for all possible workflows that use
> tuple_bless() (i.e. tuple:update() and tuple:upsert()). This change is
> really trivial, but gives an almost 2x performance improvement for the
> tuple:update()/upsert() scenario. I hope that this patch will be a
> stimulus for including benchmarks of our forward products like TDG in
> routine performance runs with the corresponding profiler dumps.
Kekw, a one-liner boosting update/upsert twofold -- nice catch!
Anyway, please check that your change doesn't affect overall performance
in interpreter mode too.
The bad thing is that we have no regular Lua benchmarks at all
(even those you provided below), so we can't watch the effect of such
changes regularly.
> 
> Benchmarks:
> 
> Before patch:
> 
> Update:
> | Tarantool 2.10.0-beta1-90-g31594b427
> | type 'help' for interactive help
> | tarantool> local t = {}
> |            for i = 1, 1e6 do
> |                table.insert(t, box.tuple.new{'abc', 'def', 'ghi', 'abc'})
> |            end
> |            local clock = require"clock"
> |            local S = clock.proc()
> |            for i = 1, 1e6 do t[i]:update{{"=", 3, "xxx"}} end
> |            return clock.proc() - S;
> | ---
> | - 4.208298872
> 
> Upsert: 4.158661731
> 
> After patch:
> 
> Update:
> | Tarantool 2.10.0-beta1-90-g31594b427
> | type 'help' for interactive help
> | tarantool> local t = {}
> |            for i = 1, 1e6 do
> |                table.insert(t, box.tuple.new{'abc', 'def', 'ghi', 'abc'})
> |            end
> |            local clock = require"clock"
> |            local S = clock.proc()
> |            for i = 1, 1e6 do t[i]:update{{"=", 3, "xxx"}} end
> |            return clock.proc() - S;
> | ---
> | - 2.357670738
> 
> Upsert: 2.334134195
> 
> Branch: https://github.com/tarantool/tarantool/tree/skaplun/gh-noticket-tuple-bless-compile
> 
>  src/box/lua/tuple.lua | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/src/box/lua/tuple.lua b/src/box/lua/tuple.lua
> index fa76f4f7f..73446ab22 100644
> --- a/src/box/lua/tuple.lua
> +++ b/src/box/lua/tuple.lua
> @@ -98,7 +98,14 @@ local tuple_bless = function(tuple)
>      -- overflow checked by tuple_bless() in C
>      builtin.box_tuple_ref(tuple)
>      -- must never fail:
> -    return ffi.gc(ffi.cast(const_tuple_ref_t, tuple), tuple_gc)
> +    -- XXX: If we use tail call (instead creating a new frame for
Typo: s/instead/instead of/.
> +    -- a call just replace the top one) here, then JIT tries
Minor: I see "replace" for the second time, but LuaJIT just "uses" the
caller's frame for the callee. I propose to s/replace/use/g, but this is
negligible, so feel free to ignore.
> +    -- to compile return from `ffi.gc()` to the frame below. This
> +    -- abort the trace recording with the error "NYI: return to
Typo: s/abort/aborts/.
> +    -- lower frame". So avoid tail call and use additional stack
> +    -- slots (for the local variable and the frame).
> +    local tuple_ref = ffi.gc(ffi.cast(const_tuple_ref_t, tuple), tuple_gc)
> +    return tuple_ref
Side note: Ugh... I'm sad we're doing things like this one. Complicating
the code, leaving huge comments with the rationale for such complication
to reach the desired (and, importantly, local) performance. I
propose spending your innovative time on solving the problem in the
JIT engine: it will be more fun and allow us to avoid writing the
cookbook "How to write super-duper-jittable code in LuaJIT".
Here is a valid question: what about other hot places with CALLT in
Tarantool? Should they be considered/fixed? I guess a ticket will help
to not forget about this problem.
Anyway, for now the fix provides a considerable boost, so feel free to
proceed with the patch.
>  end
>  
>  local tuple_check = function(tuple, usage)
> -- 
> 2.31.0
> 
-- 
Best regards,
IM
* Re: [Tarantool-patches] [PATCH] tuple: make tuple_bless() compilable
From: Sergey Kaplun via Tarantool-patches @ 2021-10-27 13:16 UTC
  To: Igor Munkin; +Cc: tarantool-patches
Igor,
Thanks for the review!
On 27.10.21, Igor Munkin wrote:
> Sergey,
> 
> Thanks for the patch! LGTM with some nits below.
> 
> On 22.10.21, Sergey Kaplun wrote:
> > tuple_bless() uses a tail call to ffi.gc() with return to the caller.
> > This tail call replaces the current (tuple_bless) frame with the frame
> > of the callee (ffi.gc). When JIT tries to compile return from `ffi.gc()`
> > to the frame below it aborts the trace recording with the error "NYI:
> > return to lower frame".
> 
> Side note: for the root traces the issue is the same, but the error is
> different.
Yep.
> 
> > 
> > This patch replaces the tail call with using additional local variable
> 
> Minor: You do not replace the tail call, but rather give LuaJIT no option
> to emit CALLT. Anyway, just being pedantic, feel free to ignore.
So, the CALLT is replaced with a regular call :). Ignoring.
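To make the CALLT vs. regular call difference concrete, here is a minimal pure-Lua sketch. All names are hypothetical stand-ins: `wrap` plays the role of ffi.gc(), and the two `bless_*` functions mirror the shape of tuple_bless() before and after the patch. It shows only the call shapes; it does not reproduce the trace abort itself.

```lua
-- Hypothetical stand-in for ffi.gc(); not taken from tuple.lua.
local function wrap(value)
    return { ref = value }
end

-- Before the patch: `return f(...)` is a tail call (CALLT in LuaJIT
-- bytecode), so wrap() reuses the frame of bless_tail() and returns
-- straight to the caller of bless_tail().
local function bless_tail(value)
    return wrap(value)
end

-- After the patch: the result is stored in a local first, so the call is
-- a regular one and bless_local() keeps its own frame until it returns.
local function bless_local(value)
    local ref = wrap(value)
    return ref
end

-- Frame reuse is observable: a tail-recursive function does not grow the
-- stack, so a very deep "recursion" completes normally.
local function count_tail(n)
    if n == 0 then return 0 end
    return count_tail(n - 1)
end

-- Both shapes return the same value; only the call/return path differs.
assert(bless_tail(42).ref == 42)
assert(bless_local(42).ref == 42)
assert(count_tail(1000000) == 0)
```

The result is identical in both variants; the only cost of the second one is the extra frame and stack slot mentioned in the patch comment.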
> 
> > returned to the caller right after.
> > ---
> > 
> > Actually, this patch became possible thanks to Michael Filonenko and his
> > benchmarks of TDG runs with jit.dump() enabled. After analyzing this
> > dump, we realized that tuple_bless() is not compiled. This uncompiled chunk
> > of code leads to JIT cancer for all possible workflows that use
> > tuple_bless() (i.e. tuple:update() and tuple:upsert()). This change is
> > really trivial, but gives an almost 2x performance improvement for the
> > tuple:update()/upsert() scenario. I hope that this patch will be a
> > stimulus for including benchmarks of our forward products like TDG in
> > routine performance runs with the corresponding profiler dumps.
> 
> Kekw, a one-liner boosting update/upsert twofold -- nice catch!
> Anyway, please check that your change doesn't affect overall performance
> in interpreter mode too.
The new one (without tailcall) is 1% slower:
21.2 sec vs 21.0 sec with jit.off().
This looks like a good trade to me.
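A comparison of this kind can be sketched as follows. This is not the script behind the numbers above: `wrap`, `bless_tail`, `bless_local` and `bench` are hypothetical stand-ins (no tuples or FFI involved), and only jit.off(), jit.flush() and jit.on() are real LuaJIT calls.

```lua
-- Hypothetical stand-ins for the two shapes of tuple_bless(); `wrap` plays
-- the role of ffi.gc() and is not a Tarantool function.
local function wrap(v) return { ref = v } end
local function bless_tail(v) return wrap(v) end
local function bless_local(v) local r = wrap(v); return r end

-- Time `iterations` calls of fn using CPU time from os.clock().
local function bench(fn, iterations)
    local start = os.clock()
    for i = 1, iterations do fn(i) end
    return os.clock() - start
end

-- `jit` exists only under LuaJIT; guard it so the sketch loads on plain Lua.
if jit then
    jit.off()    -- disable the JIT compiler: everything below is interpreted
    jit.flush()  -- drop any traces compiled so far
end

local N = 1000000
local t_tail = bench(bless_tail, N)
local t_local = bench(bless_local, N)
print(string.format("tail call: %.3f s, local variable: %.3f s",
                    t_tail, t_local))

if jit then jit.on() end
```

A single run like this is noisy, so a difference around 1% only becomes meaningful after repeating the measurement several times.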
> 
> The bad thing is that we have no regular Lua benchmarks at all
> (even those you provided below), so we can't watch the effect of such
> changes regularly.
It's true.
> 
> > 
> > Benchmarks:
> > 
> > Before patch:
> > 
> > Update:
> > | Tarantool 2.10.0-beta1-90-g31594b427
> > | type 'help' for interactive help
> > | tarantool> local t = {}
> > |            for i = 1, 1e6 do
> > |                table.insert(t, box.tuple.new{'abc', 'def', 'ghi', 'abc'})
> > |            end
> > |            local clock = require"clock"
> > |            local S = clock.proc()
> > |            for i = 1, 1e6 do t[i]:update{{"=", 3, "xxx"}} end
> > |            return clock.proc() - S;
> > | ---
> > | - 4.208298872
> > 
> > Upsert: 4.158661731
> > 
> > After patch:
> > 
> > Update:
> > | Tarantool 2.10.0-beta1-90-g31594b427
> > | type 'help' for interactive help
> > | tarantool> local t = {}
> > |            for i = 1, 1e6 do
> > |                table.insert(t, box.tuple.new{'abc', 'def', 'ghi', 'abc'})
> > |            end
> > |            local clock = require"clock"
> > |            local S = clock.proc()
> > |            for i = 1, 1e6 do t[i]:update{{"=", 3, "xxx"}} end
> > |            return clock.proc() - S;
> > | ---
> > | - 2.357670738
> > 
> > Upsert: 2.334134195
> > 
> > Branch: https://github.com/tarantool/tarantool/tree/skaplun/gh-noticket-tuple-bless-compile
> > 
> >  src/box/lua/tuple.lua | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/src/box/lua/tuple.lua b/src/box/lua/tuple.lua
> > index fa76f4f7f..73446ab22 100644
> > --- a/src/box/lua/tuple.lua
> > +++ b/src/box/lua/tuple.lua
> > @@ -98,7 +98,14 @@ local tuple_bless = function(tuple)
> >      -- overflow checked by tuple_bless() in C
> >      builtin.box_tuple_ref(tuple)
> >      -- must never fail:
> > -    return ffi.gc(ffi.cast(const_tuple_ref_t, tuple), tuple_gc)
> > +    -- XXX: If we use tail call (instead creating a new frame for
> 
> Typo: s/instead/instead of/.
> 
> > +    -- a call just replace the top one) here, then JIT tries
> 
> Minor: I see "replace" for the second time, but LuaJIT just "uses" the
> caller's frame for the callee. I propose to s/replace/use/g, but this is
> negligible, so feel free to ignore.
> 
> > +    -- to compile return from `ffi.gc()` to the frame below. This
> > +    -- abort the trace recording with the error "NYI: return to
> 
> Typo: s/abort/aborts/.
Fixed your comments. See the iterative patch below.
Branch is force-pushed.
===================================================================
diff --git a/src/box/lua/tuple.lua b/src/box/lua/tuple.lua
index 1201c7c34..f47b5926d 100644
--- a/src/box/lua/tuple.lua
+++ b/src/box/lua/tuple.lua
@@ -98,10 +98,10 @@ local tuple_bless = function(tuple)
     -- overflow checked by tuple_bless() in C
     builtin.box_tuple_ref(tuple)
     -- must never fail:
-    -- XXX: If we use tail call (instead creating a new frame for
-    -- a call just replace the top one) here, then JIT tries
-    -- to compile return from `ffi.gc()` to the frame below. This
-    -- abort the trace recording with the error "NYI: return to
+    -- XXX: If we use tail call (instead of creating a new frame
+    -- for a call just use the top one) here, then JIT tries to
+    -- compile return from `ffi.gc()` to the frame below. This
+    -- aborts the trace recording with the error "NYI: return to
     -- lower frame". So avoid tail call and use additional stack
     -- slots (for the local variable and the frame).
     local tuple_ref = ffi.gc(ffi.cast(const_tuple_ref_t, tuple), tuple_gc)
===================================================================
And the new commit message (remove "replace" usage):
===================================================================
tuple: make tuple_bless() compilable
tuple_bless() uses a tail call to ffi.gc() with return to the caller.
This tail call uses the current (tuple_bless) frame instead of creating
the frame for the callee (ffi.gc). When JIT tries to compile return from
`ffi.gc()` to the frame below it aborts the trace recording with the
error "NYI: return to lower frame".
This patch replaces the tail call with using additional local variable
returned to the caller right after.
===================================================================
> 
> > +    -- lower frame". So avoid tail call and use additional stack
> > +    -- slots (for the local variable and the frame).
> > +    local tuple_ref = ffi.gc(ffi.cast(const_tuple_ref_t, tuple), tuple_gc)
> > +    return tuple_ref
> 
> Side note: Ugh... I'm sad we're doing things like this one. Complicating
> the code, leaving huge comments with the rationale for such complication
> to reach the desired (and, importantly, local) performance. I
> propose spending your innovative time on solving the problem in the
> JIT engine: it will be more fun and allow us to avoid writing the
> cookbook "How to write super-duper-jittable code in LuaJIT".
Yes, it is an ugly workaround. The proper way is to resolve the problem with
compiling a return to a frame below the one where the trace was started.
> 
> Here is a valid question: what about other hot places with CALLT in
> Tarantool? Should they be considered/fixed? I guess a ticket will help
> to not forget about this problem.
I suppose it should be created as part of the regular testing of our
most valuable products and "boxes".
> 
> Anyway, for now the fix provides a considerable boost, so feel free to
> proceed with the patch.
Thanks!:)
> 
> >  end
> >  
> >  local tuple_check = function(tuple, usage)
> > -- 
> > 2.31.0
> > 
> 
> -- 
> Best regards,
> IM
-- 
Best regards,
Sergey Kaplun
* Re: [Tarantool-patches] [PATCH] tuple: make tuple_bless() compilable
From: Nikita Pettik via Tarantool-patches @ 2021-11-08 12:35 UTC
  To: Igor Munkin; +Cc: tarantool-patches
On 27 Oct 14:06, Igor Munkin wrote:
> > +    -- lower frame". So avoid tail call and use additional stack
> > +    -- slots (for the local variable and the frame).
> > +    local tuple_ref = ffi.gc(ffi.cast(const_tuple_ref_t, tuple), tuple_gc)
> > +    return tuple_ref
> 
> Side note: Ugh... I'm sad we're doing things like this one. Complicating
> the code, leaving huge comments with the rationale for such complication
> to reach the desired (and, importantly, local) performance. I
> propose spending your innovative time on solving the problem in the
> JIT engine: it will be more fun and allow us to avoid writing the
> cookbook "How to write super-duper-jittable code in LuaJIT".
> 
> Here is a valid question: what about other hot places with CALLT in
> Tarantool? Should they be considered/fixed? I guess a ticket will help
> to not forget about this problem.
> 
> Anyway, for now the fix provides a considerable boost, so feel free to
> proceed with the patch.
Said today exactly the same:) LGTM as well.
 
* Re: [Tarantool-patches] [PATCH] tuple: make tuple_bless() compilable
From: Igor Munkin via Tarantool-patches @ 2021-11-10 12:31 UTC
  To: Sergey Kaplun; +Cc: tarantool-patches
Sergey,
I've checked the patch into master and 2.8.
On 22.10.21, Sergey Kaplun wrote:
> tuple_bless() uses a tail call to ffi.gc() with return to the caller.
> This tail call replaces the current (tuple_bless) frame with the frame
> of the callee (ffi.gc). When JIT tries to compile return from `ffi.gc()`
> to the frame below it aborts the trace recording with the error "NYI:
> return to lower frame".
> 
> This patch replaces the tail call with using additional local variable
> returned to the caller right after.
> ---
> 
<snipped>
> -- 
> 2.31.0
> 
-- 
Best regards,
IM