[Tarantool-patches] [PATCH 2.X 5/7] module api: luaT_temp_luastate & luaT_release_temp_luastate

Wed Sep 30 00:03:45 MSK 2020

On Tue, Sep 29, 2020 at 06:10:02PM +0300, Igor Munkin wrote:
> Vlad,
> 
> I guess there should be a question to be asked in scope of the original
> series: what performance enhancement can we obtain with such hacky
> optimization? Do we have any benchmarks? If the performance impact is
> negligible I believe these interfaces should be left internal.

Why hacky? I'm a module developer and I ask Tarantool: Hey, I know you
have a cache of Lua states. Give me one? Can I yield while the state is
in use? I can, nice. Can I share it between fibers? No, okay.

That's all. Quite simple.

> 
> On 29.09.20, Vladislav Shpilevoy wrote:
> > Thanks for the patch!
> > 
> > I strongly don't like this export. It seems to be too internal.
> > 
> > But I have no better ideas than propose to implement your own
> > lua_State cache in the merger module. It does not seem to be too
> > hard, and I don't think it would occupy much memory.
> 
> Well, such exercise looks like fun but it's better to make it once and
> provide user-friendly interfaces by so called "modulekit" to be used in
> third party modules. Anyway, using this routine seems a bit complicated
> than <lua_newstate> and <luaT_newstate> (also fully implemented via Lua
> C standart API + already exported <luaT_call>). Ergo one ought to push
> extern developers using the new one besides exporting it.
> 
> > 
> > Just a simple list of lua_State objects, created by luaT_newthread
> > on demand, and put back into the list when unused. What is wrong
> > with that?
> 
> Again, how much will performance be nerfed if we simply create a new Lua
> coroutine by demand?

acpi-cpufreq, userspace governor, fixed frequency. Hyperthreading. No
taskset.

A: 2.6.0-118-g90108875a RelWithDebInfo. 
B: Same, but with the patch:

 | diff --git a/src/box/lua/merger.c b/src/box/lua/merger.c
 | index 583946c4c..a4332e3d6 100644
 | --- a/src/box/lua/merger.c
 | +++ b/src/box/lua/merger.c
 | @@ -530,13 +530,14 @@ luaL_merge_source_buffer_fetch_impl(struct merge_source_buffer *source,
 |  static int
 |  luaL_merge_source_buffer_fetch(struct merge_source_buffer *source)
 |  {
 | -       int coro_ref = LUA_NOREF;
 | -       int top = -1;
 | -       struct lua_State *L = luaT_temp_luastate(&coro_ref, &top);
 | -       if (L == NULL)
 | +       int top = lua_gettop(tarantool_L);
 | +       struct lua_State *L = luaT_newthread(tarantool_L);
 | +       if (L == NULL) {
 | +               lua_settop(tarantool_L, top);
 |                 return -1;
 | +       }
 |         int rc = luaL_merge_source_buffer_fetch_impl(source, L);
 | -       luaT_release_temp_luastate(L, coro_ref, top);
 | +       lua_settop(tarantool_L, top);
 |         return rc;
 |  }

(The patch is stupid, I know, we should not leave an extra element on
tarantool_L and yield, but call luaL_ref / luaL_unref with Lua registry
instead. I forget about this before measurements, but it should not be
much important here.)

Workload (I run it from the console):

 | tarantool> clock = require('clock') merger = require('merger') buffer = require('buffer') do local start = clock.monotonic() for i = 1, 10^8 do merger.new_source_frombuffer(buffer.ibuf()):select() end print(clock.monotonic() - start) end

The workload is VERY synthetic: it just don't do anything aside
creating a new buffer source and observing it as the empty one.

A (five runs within one console session):

212.67329062428
213.37809054693
213.94517080393
213.76512717083
215.20638198918

B (measured in the same way):

218.82243523281
219.3976499592
220.46797300316
221.37441124767
220.58953443216

The average performance decrease seems to be around 3% for this case.

`perf record -a -g` results:

 | A                                                   | B                                            |
 | --------------------------------------------------- | -------------------------------------------- |
 | - 75.31%     0.00%  tarantool                       | - 76.20%     0.00%  tarantool                |
 |   - 0x41fe53b8                                      |   - 0x410a03b8                               |
 |     - 54.56% lj_BC_FUNCC                            |     - 34.88% lj_BC_FUNCC                     |
 |       - 36.90% lbox_merge_source_new                |       - 10.98% lbox_merge_source_new         |
 |         - 20.76% luaL_pushcdata                     |         + 3.31% luaL_setcdatagc              |
 |           + 20.00% lj_gc_step                       |         + 3.17% luaL_merge_source_buffer_new |
 |         + 5.34% lua_pushcclosure                    |         + 2.04% luaL_iscallable              |
 |         + 4.77% luaL_setcdatagc                     |           0.77% luaL_pushcdata               |
 |         + 3.69% luaL_merge_source_buffer_new        |           0.76% lua_pushcclosure             |
 |         + 1.61% luaL_iscallable                     |       + 8.39% lbox_merge_source_select       |
 |       + 7.65% lbox_merge_source_select              |       - 7.60% lua_gettable_wrapper           |
 |       + 4.58% lbox_merge_source_gc                  |         - 6.00% lua_pushthread               |
 |       + 2.08% lj_cf_ffi_meta___call                 |           + 4.64% lj_state_new               |
 |         0.73% lua_gettop                            |           + 1.16% lua_close                  |
 |         0.61% lj_cf_ffi_clib___index                |         - 1.50% lua_newthread                |
 |     - 11.93% 0x5645f3a0fa76                         |           + 1.20% lj_gc_step                 |
 |       + 11.93% lj_gc_step_jit                       |       + 2.99% lbox_merge_source_gc           |
 |     + 5.84% lj_vm_exit_handler                      |       + 1.83% lj_cf_ffi_meta___call          |
 |     + 1.05% lj_vmeta_call                           |         0.51% lua_xmove                      |
 |       0.53% lj_vmeta_tgetv                          |     - 32.65% 0x55569f7bfa76                  |
 |                                                     |       + 32.61% lj_gc_step_jit                |
 |                                                     |     + 5.43% lj_vm_exit_handler               |
 |                                                     |       0.69% lj_vmeta_call                    |
 |                                                     |       0.57% lj_vmeta_tgetv                   |

Duration of samples collecting was a bit different, I don't know how much it matters.

To be honest, I don't feel myself strong enough to interpret this. But
if we'll sum lj_gc_step and lj_gc_step_jit, then it'll be 31.93% for A
and 33.81% for B. And there is lua_gettable_wrapper (7.60%) in B.

WBR, Alexander Turenko.