[Tarantool-patches] [PATCH 2.X 5/7] module api: luaT_temp_luastate & luaT_release_temp_luastate
Alexander Turenko
alexander.turenko at tarantool.org
Wed Sep 30 00:03:45 MSK 2020
On Tue, Sep 29, 2020 at 06:10:02PM +0300, Igor Munkin wrote:
> Vlad,
>
> I guess there should be a question to be asked in scope of the original
> series: what performance enhancement can we obtain with such hacky
> optimization? Do we have any benchmarks? If the performance impact is
> negligible I believe these interfaces should be left internal.
Why hacky? I'm a module developer and I ask Tarantool: Hey, I know you
have a cache of Lua states. Give me one? Can I yield while the state is
in use? I can, nice. Can I share it between fibers? No, okay.
That's all. Quite simple.
>
> On 29.09.20, Vladislav Shpilevoy wrote:
> > Thanks for the patch!
> >
> > I strongly don't like this export. It seems to be too internal.
> >
> > But I have no better ideas than propose to implement your own
> > lua_State cache in the merger module. It does not seem to be too
> > hard, and I don't think it would occupy much memory.
>
> Well, such exercise looks like fun but it's better to make it once and
> provide user-friendly interfaces by so called "modulekit" to be used in
> third party modules. Anyway, using this routine seems a bit complicated
> than <lua_newstate> and <luaT_newstate> (also fully implemented via Lua
> C standart API + already exported <luaT_call>). Ergo one ought to push
> extern developers using the new one besides exporting it.
>
> >
> > Just a simple list of lua_State objects, created by luaT_newthread
> > on demand, and put back into the list when unused. What is wrong
> > with that?
>
> Again, how much will performance be nerfed if we simply create a new Lua
> coroutine by demand?
acpi-cpufreq, userspace governor, fixed frequency. Hyperthreading. No
taskset.
A: 2.6.0-118-g90108875a RelWithDebInfo.
B: Same, but with the patch:
| diff --git a/src/box/lua/merger.c b/src/box/lua/merger.c
| index 583946c4c..a4332e3d6 100644
| --- a/src/box/lua/merger.c
| +++ b/src/box/lua/merger.c
| @@ -530,13 +530,14 @@ luaL_merge_source_buffer_fetch_impl(struct merge_source_buffer *source,
| static int
| luaL_merge_source_buffer_fetch(struct merge_source_buffer *source)
| {
| - int coro_ref = LUA_NOREF;
| - int top = -1;
| - struct lua_State *L = luaT_temp_luastate(&coro_ref, &top);
| - if (L == NULL)
| + int top = lua_gettop(tarantool_L);
| + struct lua_State *L = luaT_newthread(tarantool_L);
| + if (L == NULL) {
| + lua_settop(tarantool_L, top);
| return -1;
| + }
| int rc = luaL_merge_source_buffer_fetch_impl(source, L);
| - luaT_release_temp_luastate(L, coro_ref, top);
| + lua_settop(tarantool_L, top);
| return rc;
| }
(The patch is stupid, I know, we should not leave an extra element on
tarantool_L and yield, but call luaL_ref / luaL_unref with Lua registry
instead. I forget about this before measurements, but it should not be
much important here.)
Workload (I run it from the console):
| tarantool> clock = require('clock') merger = require('merger') buffer = require('buffer') do local start = clock.monotonic() for i = 1, 10^8 do merger.new_source_frombuffer(buffer.ibuf()):select() end print(clock.monotonic() - start) end
The workload is VERY synthetic: it just don't do anything aside
creating a new buffer source and observing it as the empty one.
A (five runs within one console session):
212.67329062428
213.37809054693
213.94517080393
213.76512717083
215.20638198918
B (measured in the same way):
218.82243523281
219.3976499592
220.46797300316
221.37441124767
220.58953443216
The average performance decrease seems to be around 3% for this case.
`perf record -a -g` results:
| A | B |
| --------------------------------------------------- | -------------------------------------------- |
| - 75.31% 0.00% tarantool | - 76.20% 0.00% tarantool |
| - 0x41fe53b8 | - 0x410a03b8 |
| - 54.56% lj_BC_FUNCC | - 34.88% lj_BC_FUNCC |
| - 36.90% lbox_merge_source_new | - 10.98% lbox_merge_source_new |
| - 20.76% luaL_pushcdata | + 3.31% luaL_setcdatagc |
| + 20.00% lj_gc_step | + 3.17% luaL_merge_source_buffer_new |
| + 5.34% lua_pushcclosure | + 2.04% luaL_iscallable |
| + 4.77% luaL_setcdatagc | 0.77% luaL_pushcdata |
| + 3.69% luaL_merge_source_buffer_new | 0.76% lua_pushcclosure |
| + 1.61% luaL_iscallable | + 8.39% lbox_merge_source_select |
| + 7.65% lbox_merge_source_select | - 7.60% lua_gettable_wrapper |
| + 4.58% lbox_merge_source_gc | - 6.00% lua_pushthread |
| + 2.08% lj_cf_ffi_meta___call | + 4.64% lj_state_new |
| 0.73% lua_gettop | + 1.16% lua_close |
| 0.61% lj_cf_ffi_clib___index | - 1.50% lua_newthread |
| - 11.93% 0x5645f3a0fa76 | + 1.20% lj_gc_step |
| + 11.93% lj_gc_step_jit | + 2.99% lbox_merge_source_gc |
| + 5.84% lj_vm_exit_handler | + 1.83% lj_cf_ffi_meta___call |
| + 1.05% lj_vmeta_call | 0.51% lua_xmove |
| 0.53% lj_vmeta_tgetv | - 32.65% 0x55569f7bfa76 |
| | + 32.61% lj_gc_step_jit |
| | + 5.43% lj_vm_exit_handler |
| | 0.69% lj_vmeta_call |
| | 0.57% lj_vmeta_tgetv |
Duration of samples collecting was a bit different, I don't know how much it matters.
To be honest, I don't feel myself strong enough to interpret this. But
if we'll sum lj_gc_step and lj_gc_step_jit, then it'll be 31.93% for A
and 33.81% for B. And there is lua_gettable_wrapper (7.60%) in B.
WBR, Alexander Turenko.
More information about the Tarantool-patches
mailing list