[Tarantool-patches] [PATCH] lua: refactor port_lua_do_dump and encode_lua_call
Sergey Kaplun
skaplun at tarantool.org
Tue Jun 29 10:07:59 MSK 2021
Thanks for the questions!
On 24.06.21, Vladislav Shpilevoy wrote:
> Thanks for the investigation!
>
> >>> diff --git a/src/box/lua/call.c b/src/box/lua/call.c
> >>> index 0315e720c..3b2572096 100644
> >>> --- a/src/box/lua/call.c
> >>> +++ b/src/box/lua/call.c
> >>> @@ -450,13 +482,20 @@ port_lua_do_dump(struct port *base, struct mpstream *stream,
> >>> struct encode_lua_ctx ctx;
> >>> ctx.port = port;
> >>> ctx.stream = stream;
> >>> - struct lua_State *L = tarantool_L;
> >>> - int top = lua_gettop(L);
> >>> - if (lua_cpcall(L, handler, &ctx) != 0) {
> >>> - luaT_toerror(port->L);
> >>> + lua_State *L = port->L;
> >>> + /*
> >>> + * At the moment Lua stack holds only values to encode.
> >>> + * Insert corresponding encoder to the bottom and push
> >>> + * encode context as lightuserdata to the top.
> >>> + */
> >>> + const int size = lua_gettop(L);
> >>> + lua_rawgeti(L, LUA_REGISTRYINDEX, execute_lua_refs[handler]);
> >>> + assert(lua_isfunction(L, -1) && lua_iscfunction(L, -1));
> >>> + lua_insert(L, 1);
> >>
> >> If I remember correctly, this leads to moving all the existing
> >> stack elements forward. Which might be expensive. I know from
> >> Vlad Grubov's words that they have code with hundreds of values in
> >> multireturn from stored functions. Did you bench what happens when
> >> the Lua coroutine contains a lot of values? In the bench by the
> >> link above I see only 1-element array and map. Size of the array
> >> and map does not matter though. Only multireturn is interesting
> >> here. Like 'return 1, 2, 3, ...'.
> >
> > I've added this benchmark (200 numbers to return) to the gist.
> > Local results for the bench:
> >
> > Master:
> > | Encode mul: 237900953 mcs, 12.6 K ps
> >
> > My branch:
> > | Encode mul: 235735350 mcs, 12.7 K ps
> >
> > `luamp_encode()` has the biggest impact in `port_do_lua_dump()`
> > (`lua_insert()` costs ~0.1% of the whole runtime).
>
> That looks suspicious. Why does it have such a small impact? Did
I've run the benchmark for the return case only with `perf stat` (and 5K
return values) and found three interesting metrics (the others are about
the same, differing by up to 0.07%):
Master:
| 10,255,748,650 cache-misses # 13.686 % of all cache refs (12.49%)
| 1,364,762,310 LLC-load-misses # 15.92% of all LL-cache accesses (12.50%)
| 401,136,504 iTLB-load-misses # 39.68% of all iTLB cache accesses (12.51%)
My branch:
| 9,964,939,197 cache-misses # 13.413 % of all cache refs (12.52%)
| 1,296,545,699 LLC-load-misses # 15.08% of all LL-cache accesses (12.50%)
| 285,852,865 iTLB-load-misses # 29.13% of all iTLB cache accesses (12.49%)
1) LLC (last-level cache, L2 in my case) misses decrease slightly.
I suppose all the element moves done by the insert hit the cache (L1 or
L2) instead of going to memory, which makes them rather cheap.
2) The iTLB (instruction translation look-aside buffer) [1] exists to
avoid the penalty of page walks through the page table. So it looks like
this patch improves code locality (possibly due to using `luaT_call()`
instead of `lua_cpcall()` and `luaT_toerror()`).
> you check that multireturn really made the stack bigger? Shouldn't
Yes, I did.
> it have linear complexity depending on the stack size after the
> target index?
Yes, with 5K items to return it becomes about equal to, or even slightly
slower (up to ~1%) than, the original code. But encoding still takes the
biggest part of the runtime.
Side note: if a user wants to return 5K values (or a table with 5K
values) from Tarantool, we can't stop them; we can only offer our
empathy.
>
> The patch looks fine to me, if the benchmarks show there is even a
> slight perf win. But I think Igor should take another look.
[1]: https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/front-end-bound/itlb-overhead.html
--
Best regards,
Sergey Kaplun