Date: Wed, 30 Sep 2020 00:03:45 +0300
From: Alexander Turenko
To: Igor Munkin
Cc: tarantool-patches@dev.tarantool.org, Vladislav Shpilevoy
Subject: Re: [Tarantool-patches] [PATCH 2.X 5/7] module api: luaT_temp_luastate & luaT_release_temp_luastate
Message-ID: <20200929210345.jseljymlmkgdrp3c@tkn_work_nb>
In-Reply-To: <20200929151002.GY18920@tarantool.org>
References: <6fc62e3a3935978b650fd6d1dcbdef9a5353a1fb.1600955781.git.tsafin@tarantool.org>
 <40b631b2-c660-32f7-c421-9770d5d06de4@tarantool.org>
 <20200929151002.GY18920@tarantool.org>
List-Id: Tarantool development patches

On Tue, Sep 29, 2020 at 06:10:02PM +0300, Igor Munkin wrote:
> Vlad,
>
> I guess there should be a question to be asked in scope of the original
> series: what performance enhancement can we obtain with such a hacky
> optimization? Do we have any benchmarks? If the performance impact is
> negligible, I believe these interfaces should be left internal.

Why hacky?

I'm a module developer and I ask Tarantool: Hey, I know you have a cache
of Lua states. Give me one? Can I yield while the state is in use? I can,
nice. Can I share it between fibers? No, okay.

That's all. Quite simple.

> On 29.09.20, Vladislav Shpilevoy wrote:
> > Thanks for the patch!
> >
> > I strongly don't like this export. It seems to be too internal.
> >
> > But I have no better ideas than to propose implementing your own
> > lua_State cache in the merger module. It does not seem to be too
> > hard, and I don't think it would occupy much memory.
>
> Well, such an exercise looks like fun, but it's better to do it once and
> provide user-friendly interfaces via the so-called "modulekit" to be
> used in third-party modules. Anyway, using this routine seems a bit more
> complicated than the existing way (which is also fully implementable via
> the standard Lua C API plus what is already exported). Ergo, besides
> exporting the new routine, one ought to push external developers to use
> it.
>
> > Just a simple list of lua_State objects, created by luaT_newthread
> > on demand, and put back into the list when unused. What is wrong
> > with that?
>
> Again, how much will performance be nerfed if we simply create a new Lua
> coroutine on demand?
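(Aside, to make the "own lua_State cache in the module" idea above
concrete: a minimal sketch is below. It is untested; the names
merger_acquire_luastate / merger_release_luastate / CACHE_SIZE are made
up for illustration, only luaT_newthread(), tarantool_L and the standard
luaL_ref() / luaL_unref() / lua_settop() are real entry points, and the
cached state is assumed to never be shared between fibers.)

| /* luaT_newthread() and tarantool_L are declared in tarantool headers. */
| #include <lua.h>
| #include <lauxlib.h>
|
| enum { CACHE_SIZE = 8 };
|
| struct cached_state {
| 	struct lua_State *L;
| 	int ref;	/* Keeps the coroutine from being collected. */
| };
|
| static struct cached_state cache[CACHE_SIZE];
| static int cache_len = 0;
|
| static struct lua_State *
| merger_acquire_luastate(int *ref)
| {
| 	if (cache_len > 0) {
| 		/* Reuse an idle coroutine from the cache. */
| 		struct cached_state *e = &cache[--cache_len];
| 		*ref = e->ref;
| 		return e->L;
| 	}
| 	/*
| 	 * The cache is empty: create a new coroutine and anchor it
| 	 * in the registry (luaT_newthread() leaves the new thread
| 	 * on top of tarantool_L, luaL_ref() pops it).
| 	 */
| 	struct lua_State *L = luaT_newthread(tarantool_L);
| 	if (L == NULL)
| 		return NULL;
| 	*ref = luaL_ref(tarantool_L, LUA_REGISTRYINDEX);
| 	return L;
| }
|
| static void
| merger_release_luastate(struct lua_State *L, int ref)
| {
| 	/* Clean the coroutine's own stack before reuse. */
| 	lua_settop(L, 0);
| 	if (cache_len < CACHE_SIZE) {
| 		cache[cache_len].L = L;
| 		cache[cache_len].ref = ref;
| 		cache_len++;
| 	} else {
| 		/* The cache is full: let the GC collect the coroutine. */
| 		luaL_unref(tarantool_L, LUA_REGISTRYINDEX, ref);
| 	}
| }

It is not much code, but every module would have to carry and maintain
it. Now, to the performance question: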
acpi-cpufreq, userspace governor, fixed frequency. Hyperthreading. No
taskset.

A: 2.6.0-118-g90108875a, RelWithDebInfo.
B: Same, but with the patch:

| diff --git a/src/box/lua/merger.c b/src/box/lua/merger.c
| index 583946c4c..a4332e3d6 100644
| --- a/src/box/lua/merger.c
| +++ b/src/box/lua/merger.c
| @@ -530,13 +530,14 @@ luaL_merge_source_buffer_fetch_impl(struct merge_source_buffer *source,
|  static int
|  luaL_merge_source_buffer_fetch(struct merge_source_buffer *source)
|  {
| -	int coro_ref = LUA_NOREF;
| -	int top = -1;
| -	struct lua_State *L = luaT_temp_luastate(&coro_ref, &top);
| -	if (L == NULL)
| +	int top = lua_gettop(tarantool_L);
| +	struct lua_State *L = luaT_newthread(tarantool_L);
| +	if (L == NULL) {
| +		lua_settop(tarantool_L, top);
|  		return -1;
| +	}
|  	int rc = luaL_merge_source_buffer_fetch_impl(source, L);
| -	luaT_release_temp_luastate(L, coro_ref, top);
| +	lua_settop(tarantool_L, top);
|  	return rc;
|  }

(The patch is stupid, I know: we should not leave an extra element on
tarantool_L across a yield, but call luaL_ref() / luaL_unref() with the
Lua registry instead. I forgot about this before the measurements, but it
should not matter much here.)

Workload (I ran it from the console):

| tarantool> clock = require('clock')
| tarantool> merger = require('merger')
| tarantool> buffer = require('buffer')
| tarantool> do
|     local start = clock.monotonic()
|     for i = 1, 10^8 do
|         merger.new_source_frombuffer(buffer.ibuf()):select()
|     end
|     print(clock.monotonic() - start)
| end

The workload is VERY synthetic: it does nothing aside from creating a new
buffer source and observing that it is empty.

A (five runs within one console session):

212.67329062428
213.37809054693
213.94517080393
213.76512717083
215.20638198918

B (measured in the same way):

218.82243523281
219.3976499592
220.46797300316
221.37441124767
220.58953443216

The average performance decrease seems to be around 3% for this case.

`perf record -a -g` results:

| A                                     | B                                     |
| ------------------------------------- | ------------------------------------- |
| - 75.31% 0.00% tarantool              | - 76.20% 0.00% tarantool              |
| - 0x41fe53b8                          | - 0x410a03b8                          |
| - 54.56% lj_BC_FUNCC                  | - 34.88% lj_BC_FUNCC                  |
| - 36.90% lbox_merge_source_new        | - 10.98% lbox_merge_source_new        |
| - 20.76% luaL_pushcdata               | + 3.31% luaL_setcdatagc               |
| + 20.00% lj_gc_step                   | + 3.17% luaL_merge_source_buffer_new  |
| + 5.34% lua_pushcclosure              | + 2.04% luaL_iscallable               |
| + 4.77% luaL_setcdatagc               | 0.77% luaL_pushcdata                  |
| + 3.69% luaL_merge_source_buffer_new  | 0.76% lua_pushcclosure                |
| + 1.61% luaL_iscallable               | + 8.39% lbox_merge_source_select      |
| + 7.65% lbox_merge_source_select      | - 7.60% lua_gettable_wrapper          |
| + 4.58% lbox_merge_source_gc          | - 6.00% lua_pushthread                |
| + 2.08% lj_cf_ffi_meta___call         | + 4.64% lj_state_new                  |
| 0.73% lua_gettop                      | + 1.16% lua_close                     |
| 0.61% lj_cf_ffi_clib___index          | - 1.50% lua_newthread                 |
| - 11.93% 0x5645f3a0fa76               | + 1.20% lj_gc_step                    |
| + 11.93% lj_gc_step_jit               | + 2.99% lbox_merge_source_gc          |
| + 5.84% lj_vm_exit_handler            | + 1.83% lj_cf_ffi_meta___call         |
| + 1.05% lj_vmeta_call                 | 0.51% lua_xmove                       |
| 0.53% lj_vmeta_tgetv                  | - 32.65% 0x55569f7bfa76               |
|                                       | + 32.61% lj_gc_step_jit               |
|                                       | + 5.43% lj_vm_exit_handler            |
|                                       | 0.69% lj_vmeta_call                   |
|                                       | 0.57% lj_vmeta_tgetv                  |

The duration of sample collection was a bit different between A and B; I
don't know how much it matters.

To be honest, I don't feel confident enough to interpret this. But if we
sum lj_gc_step and lj_gc_step_jit, we get 31.93% for A and 33.81% for B.
And there is lua_gettable_wrapper (7.60%) in B.

WBR, Alexander Turenko.
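P.S. The luaL_ref() / luaL_unref() variant mentioned in the parenthesis
above would look roughly like the sketch below (untested), so that
nothing extra is left on tarantool_L across a possible yield:

| static int
| luaL_merge_source_buffer_fetch(struct merge_source_buffer *source)
| {
| 	struct lua_State *L = luaT_newthread(tarantool_L);
| 	if (L == NULL)
| 		return -1;
| 	/*
| 	 * Anchor the new coroutine in the registry and pop it from
| 	 * tarantool_L right away, so the parent stack stays balanced
| 	 * even if the fetch yields.
| 	 */
| 	int coro_ref = luaL_ref(tarantool_L, LUA_REGISTRYINDEX);
| 	int rc = luaL_merge_source_buffer_fetch_impl(source, L);
| 	luaL_unref(tarantool_L, LUA_REGISTRYINDEX, coro_ref);
| 	return rc;
| }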