From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtpng3.m.smailru.net (smtpng3.m.smailru.net [94.100.177.149]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id ADA3B445322 for ; Fri, 24 Jul 2020 17:24:48 +0300 (MSK) Date: Fri, 24 Jul 2020 17:14:29 +0300 From: Igor Munkin Message-ID: <20200724141429.GR18920@tarantool.org> References: <5754acde-5591-791e-c836-c8685d9df208@tarantool.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <5754acde-5591-791e-c836-c8685d9df208@tarantool.org> Subject: Re: [Tarantool-patches] [PATCH] lua/utils: improve luaT_newthread performance List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Vladislav Shpilevoy Cc: tarantool-patches@dev.tarantool.org Vlad, Thanks for your review! On 23.07.20, Vladislav Shpilevoy wrote: > Hi! Thanks for the patch! > > On 20.07.2020 13:28, Igor Munkin wrote: > > created a new GCfunc object for the helper invoked in a > > protected frame (i.e. ) on each > > call. The change introduces a static reference to a GCfunc object for > > to be initialized on Tarantool startup to > > reduce Lua GC memory usage. > > > > Furthermore, since yields nothing on guest stack, the newly > > created Lua coroutine need to be pushed back to prevent its sweep. So > > to reduce guest stack manipulations is replaced with > > and the resulting Lua thread is obtained via guest stack. > > I don't think I understand. Before we had one store into an 8 byte location > with lua_cpcall(). Now we have push and pop on the stack with lua_pcall(). > What is the point? It seems to be worse. No, it doesn't. It seems I need to explain this in a more precise way. Let's start with the old implementation: | (gdb) b luaT_newthread | Breakpoint 1 at 0x156504: luaT_newthread. (5 locations) | (gdb) b luaT_newthread_wrapper | Breakpoint 2 at 0x1564c9: luaT_newthread_wrapper. (5 locations) | (gdb) r | | Breakpoint 1, luaT_newthread (L=0x40000378) at /tarantool/src/lua/utils.h:619 | 619 lua_State *L1 = NULL; | (gdb) s | 620 if (luaT_cpcall(L, luaT_newthread_wrapper, &L1) != 0) { | (gdb) lj-stack L | 0x40023eb0:0x40023ed0 [ ] 5 slots: Red zone | 0x40023ea8 [ M] | 0x40023ac0:0x40023ea0 [ ] 125 slots: Free stack slots | 0x40023ab8 [ T ] | 0x40023ab0 [ B ] VALUE: Lua function @ 0x4003da38, 4 upvalues, "@builtin/fiber.lua":67 | 0x40023aa8 [ ] FRAME: [L] delta=14, C function @ 0x55555577a9a2 | This is the Lua stack state, when is called, but a new Lua coroutine is not created yet. Let's continue: | (gdb) c | Continuing. | | Breakpoint 2, luaT_newthread_wrapper (L=0x40000378) at /tarantool/src/lua/utils.h:607 | 607 *(lua_State **)lua_touserdata(L, 1) = lua_newthread(L); | (gdb) lj-stack L | 0x40023eb0:0x40023ed0 [ ] 5 slots: Red zone | 0x40023ea8 [ M] | 0x40023ad0:0x40023ea0 [ ] 123 slots: Free stack slots | 0x40023ac8 [ T ] | 0x40023ac0 [ B ] VALUE: light userdata @ 0xffffd648 | 0x40023ab8 [ ] FRAME: [CP] delta=2, C function @ 0x555555779291 | 0x40023ab0 [ ] VALUE: Lua function @ 0x4003da38, 4 upvalues, "@builtin/fiber.lua":67 | 0x40023aa8 [ ] FRAME: [L] delta=14, C function @ 0x55555577a9a2 | This is the Lua stack state, when is called, but a new Lua coroutine is still not created yet. NB: you can see a strange address for light userdata payload (0xffffd648), but it's a bug in the gdb extension. The valid address is 0x7fffffffd648 as you can see below. Let's proceed further: | (gdb) s | lua_touserdata (L=0x40000378, idx=1) at lj_api.c:616 | | Run till exit from #0 lua_touserdata (L=0x40000378, idx=1) at lj_api.c:616 | 0x00005555557792af in luaT_newthread_wrapper (L=0x40000378) at /tarantool/src/lua/utils.h:607 | 607 *(lua_State **)lua_touserdata(L, 1) = lua_newthread(L); | Value returned is $1 = (void *) 0x7fffffffd648 | (gdb) s | lua_newthread (L=0x40000378) at lj_api.c:758 | | Run till exit from #0 lua_newthread (L=0x40000378) at lj_api.c:758 | luaT_newthread_wrapper (L=0x40000378) at /tarantool/src/lua/utils.h:607 | 607 *(lua_State **)lua_touserdata(L, 1) = lua_newthread(L); | Value returned is $2 = (lua_State *) 0x4003ee28 | (gdb) s | 608 return 0; | (gdb) lj-stack L | 0x40023eb0:0x40023ed0 [ ] 5 slots: Red zone | 0x40023ea8 [ M] | 0x40023ad8:0x40023ea0 [ ] 122 slots: Free stack slots | 0x40023ad0 [ T ] | 0x40023ac8 [ ] VALUE: thread @ 0x4003ee28 | 0x40023ac0 [ B ] VALUE: light userdata @ 0xffffd648 | 0x40023ab8 [ ] FRAME: [CP] delta=2, C function @ 0x555555779291 | 0x40023ab0 [ ] VALUE: Lua function @ 0x4003da38, 4 upvalues, "@builtin/fiber.lua":67 | 0x40023aa8 [ ] FRAME: [L] delta=14, C function @ 0x55555577a9a2 | | (gdb) p/x *(uintptr_t *)0x7fffffffd648 | $3 = 0x4003ee28 As a result there is the light userdata payload initialized with the created Lua coroutine address and the same coroutine in Lua stack slot at 0x40023ac8. Now, let's see the stack after the execution leaves the protected frame: | luaT_newthread (L=0x40000378) at /tarantool/src/lua/utils.h:620 | 620 if (luaT_cpcall(L, luaT_newthread_wrapper, &L1) != 0) { | (gdb) lj-stack L | 0x40023eb0:0x40023ed0 [ ] 5 slots: Red zone | 0x40023ea8 [ M] | 0x40023ac0:0x40023ea0 [ ] 125 slots: Free stack slots | 0x40023ab8 [ T ] | 0x40023ab0 [ B ] VALUE: Lua function @ 0x4003da38, 4 upvalues, "@builtin/fiber.lua":67 | 0x40023aa8 [ ] FRAME: [L] delta=14, C function @ 0x55555577a9a2 | So, as I mentioned in the patch nothing is returned on the Lua stack. As a result the created Lua coroutine is unreachable for GC machinery. That means it will not be marked and can be released in the nearest GC cycle. Yes, we anchor it to Lua registry later (in the caller) but considering API[1] the object to be anchored also ought to be on the top of the stack. Thereby the Lua coroutine obtained via L1 is put back to Lua stack: | (gdb) n | 624 setthreadV(L, L->top, L1); | (gdb) | 625 incr_top(L); | (gdb) lj-stack L | 0x40023eb0:0x40023ed0 [ ] 5 slots: Red zone | 0x40023ea8 [ M] | 0x40023ac8:0x40023ea0 [ ] 124 slots: Free stack slots | 0x40023ac0 [ T ] | 0x40023ab8 [ ] VALUE: thread @ 0x4003ee28 | 0x40023ab0 [ B ] VALUE: Lua function @ 0x4003da38, 4 upvalues, "@builtin/fiber.lua":67 | 0x40023aa8 [ ] FRAME: [L] delta=14, C function @ 0x55555577a9a2 | My main point is regarding these guest stack manipulations: since we need the created Lua coroutine on the Lua stack, just leave it there as a return value of (i.e. ). The differences are below: * Lua stack state prior to : | 1249 if (luaT_call(L, 0, 1) != 0) | (gdb) lj-stack L | 0x40023dc8:0x40023de8 [ ] 5 slots: Red zone | 0x40023dc0 [ M] | 0x400239e0:0x40023db8 [ ] 124 slots: Free stack slots | 0x400239d8 [ T ] | 0x400239d0 [ ] VALUE: C function @ 0x555555781ff6 | 0x400239c8 [ B ] VALUE: Lua function @ 0x4003da80, 4 upvalues, "@builtin/fiber.lua":67 | 0x400239c0 [ ] FRAME: [L] delta=14, C function @ 0x55555577a486 | * Lua stack state as a result of call: | (gdb) c | Continuing. | | Breakpoint 2, luaT_newthread_wrapper (L=0x40000378) at /tarantool/src/lua/utils.c:1239 | 1239 (void)lua_newthread(L); | (gdb) s | | Run till exit from #0 lua_newthread (L=0x40000378) at lj_api.c:758 | luaT_newthread_wrapper (L=0x40000378) at /tarantool/src/lua/utils.c:1240 | 1240 return 1; | Value returned is $2 = (lua_State *) 0x4003ee48 | (gdb) lj-stack L | 0x40023dc8:0x40023de8 [ ] 5 slots: Red zone | 0x40023dc0 [ M] | 0x400239e8:0x40023db8 [ ] 123 slots: Free stack slots | 0x400239e0 [ T ] | 0x400239d8 [ B ] VALUE: thread @ 0x4003ee48 | 0x400239d0 [ ] FRAME: [CP] delta=2, C function @ 0x555555781ff6 | 0x400239c8 [ ] VALUE: Lua function @ 0x4003da80, 4 upvalues, "@builtin/fiber.lua":67 | 0x400239c0 [ ] FRAME: [L] delta=14, C function @ 0x55555577a486 | * Lua stack state when returned: | luaT_newthread (L=0x40000378) at /tarantool/src/lua/utils.c:1249 | 1249 if (luaT_call(L, 0, 1) != 0) | (gdb) lj-stack L | 0x40023dc8:0x40023de8 [ ] 5 slots: Red zone | 0x40023dc0 [ M] | 0x400239e0:0x40023db8 [ ] 124 slots: Free stack slots | 0x400239d8 [ T ] | 0x400239d0 [ ] VALUE: thread @ 0x4003ee48 | 0x400239c8 [ B ] VALUE: Lua function @ 0x4003da80, 4 upvalues, "@builtin/fiber.lua":67 | 0x400239c0 [ ] FRAME: [L] delta=14, C function @ 0x55555577a486 | As a result, the created Lua coroutine is returned on the guest stack and L1 is initialized via call. If you find something relevant or useful to be included right to the commit message, feel free to point it out. > > > Signed-off-by: Igor Munkin > > This is already the second patch on removal of excessive func pushes. > I suggest you take a look at all the other lua_pushcfunction() calls. > For example, luaT_pushtuple() does it, and it is called on each tuple > push. Can be tens of thousands per seconds. > > The same for luaT_pusherror, lua_field_inspect_ucdata. These should be > called often too. I also thought about such activity while testing this change. I filed an issue[2] for it and referred it with this patch: | Part of #5201 > > > --- > > Recently we discussed with Timur his struggling with linking his binary > > Lua module against Tarantool. The reason is LuaJIT internals usage for > > manipulations with the guest stack that are not provided by the binary. > > I glanced the current implementation and found out two > > another problems related to the platform overall performance (as it is > > proved with the corresponding benchmarks). > > > > The first problem is the similar one had prior to the > > patch[1]: creates an auxiliary GCfunc object for the > > function to be called in protected frame. However this function is the > > same throughout the platform uptime. It can be created on Taranool > > startup and I see no reason to clobber GC that way. > > > > Another problem I found are excess manipulations with the guest stack: > > one need the newly born coroutine on it to prevent it being collected by > > GC, but purges everything left on the stack in scope of the > > invoked function. As a result the obtained Lua coroutine is "pushed" > > back to the guest stack via LuaJIT internal interfaces. It's a bit > > ridiculous, since one can just use public API to get the same results: > > Lua coroutine on the guest stack and the corresponding pointer to it. > > I understand the GC object point. But I don't understand the pcall vs cpcall. > What is the problem with lua_cpcall() removing all from the stack? We don't > need anything on the stack here. We just need to return a pointer, and this > was done fine before. As I mentioned above, we also need to return the new object on the guest stack to anchor it to Lua registry table in the caller. Besides, if we go further we can enclose all registry-related actions in scope of the , but it seems to lead to a huge refactoring: * the reference obtained via need to be saved the complement * function for release need to be implemented * all callers need to be adjusted to the new interface Thoughts? > > > I tested the platform performance with the same benchmark[2] I made for > > the patch and here are the numbers I obtained after > > the 15 runs: > > * Vanilla bleeding master (mean): > > | ===== 2.5.0-267-gbf047ad44 ===== > > | call by ref GC: 921877 Kb > > | call by ref time: 1.340172 sec > > | call GC: 476563.991667 Kb > > | eval GC: 655274.935547 Kb > > * Patched bleeding master (mean): > > | ===== 2.5.0-268-gec0eb12f4 ===== > > | call by ref GC: 859377 Kb > > | call by ref time: 1.215410 sec > > | call GC: 445313 Kb > > | eval GC: 624024 Kb > > * Relative measurements (before -> after): > > | call by ref GC: -6% (-62500 Kb) > > | call by ref time: -9% (-0.124762 sec) > > | call GC: -6% (-31250 Kb) > > | eval GC: -4% (-31250 Kb) > > > > There is one hot path I left unverified -- Lua-born fibers creation, but > > I guess the relative numbers are quite similar to the ones I mentioned > > above. However, if one wonders these results, feel free to ask me. > > > > diff --git a/src/lua/utils.c b/src/lua/utils.c > > index 0b05d7257..23ccbc3c9 100644 > > --- a/src/lua/utils.c > > +++ b/src/lua/utils.c > > @@ -1224,6 +1226,33 @@ void luaL_iterator_delete(struct luaL_iterator *it) > > > > /* }}} */ > > > > +/** > > + * @brief A wrapper for to be called via luaT_call > > + * in luaT_newthread. Whether new Lua coroutine is created it is > > + * returned on the top of the guest stack. > > + * @param L is a Lua state > > + * @sa > > + */ > > +static int > > +luaT_newthread_wrapper(lua_State *L) > > All the other code uses 'struct lua_State' instead of 'lua_State'. > Except old code. Why isn't it so here? At first, it was used this way in the original implementation (as you can see I moved most of the code from lua/utils.h to lua/utils.c). Furthermore, a lot of Lua C API code (at least that I've seen outside the Tarantool) uses provided via lua.h header. Besides, there are many function definitions nearby in lua/utils.c where keyword is omitted. However, I checked our guidelines[3] and fixed usage the way you proposed: ================================================================================ diff --git a/src/lua/utils.c b/src/lua/utils.c index 23ccbc3c9..af114b0a2 100644 --- a/src/lua/utils.c +++ b/src/lua/utils.c @@ -1234,21 +1234,21 @@ void luaL_iterator_delete(struct luaL_iterator *it) * @sa */ static int -luaT_newthread_wrapper(lua_State *L) +luaT_newthread_wrapper(struct lua_State *L) { (void)lua_newthread(L); return 1; } -lua_State * -luaT_newthread(lua_State *L) +struct lua_State * +luaT_newthread(struct lua_State *L) { assert(luaT_newthread_ref != LUA_NOREF); lua_rawgeti(L, LUA_REGISTRYINDEX, luaT_newthread_ref); assert(lua_isfunction(L, -1)); if (luaT_call(L, 0, 1) != 0) return NULL; - lua_State *L1 = lua_tothread(L, -1); + struct lua_State *L1 = lua_tothread(L, -1); assert(L1 != NULL); return L1; } diff --git a/src/lua/utils.h b/src/lua/utils.h index 404eafe9b..7e02a05f2 100644 --- a/src/lua/utils.h +++ b/src/lua/utils.h @@ -605,8 +605,8 @@ luaL_checkfinite(struct lua_State *L, struct luaL_serializer *cfg, * @param L is a Lua state * @sa */ -lua_State * -luaT_newthread(lua_State *L); +struct lua_State * +luaT_newthread(struct lua_State *L); /** * Check if a value on @a L stack by index @a idx is an ibuf ================================================================================ > > > +{ > > + (void)lua_newthread(L); > > + return 1; > > +} > > + > > +lua_State * > > +luaT_newthread(lua_State *L) > > +{ > > + assert(luaT_newthread_ref != LUA_NOREF); > > + lua_rawgeti(L, LUA_REGISTRYINDEX, luaT_newthread_ref); > > + assert(lua_isfunction(L, -1)); > > + if (luaT_call(L, 0, 1) != 0) > > + return NULL; > > + lua_State *L1 = lua_tothread(L, -1); > > + assert(L1 != NULL); > > + return L1; > > +} [1]: https://www.lua.org/manual/5.1/manual.html#luaL_ref [2]: https://github.com/tarantool/tarantool/issues/5201 [3]: https://www.tarantool.io/en/doc/2.3/dev_guide/c_style_guide/#chapter-5-typedefs -- Best regards, IM