[Tarantool-patches] [PATCH] lua: refactor port_lua_do_dump and encode_lua_call

Tue Jun 29 10:07:59 MSK 2021

Thanks for the questions!

On 24.06.21, Vladislav Shpilevoy wrote:
> Thanks for the investigation!
> 
> >>> diff --git a/src/box/lua/call.c b/src/box/lua/call.c
> >>> index 0315e720c..3b2572096 100644
> >>> --- a/src/box/lua/call.c
> >>> +++ b/src/box/lua/call.c
> >>> @@ -450,13 +482,20 @@ port_lua_do_dump(struct port *base, struct mpstream *stream,
> >>>  	struct encode_lua_ctx ctx;
> >>>  	ctx.port = port;
> >>>  	ctx.stream = stream;
> >>> -	struct lua_State *L = tarantool_L;
> >>> -	int top = lua_gettop(L);
> >>> -	if (lua_cpcall(L, handler, &ctx) != 0) {
> >>> -		luaT_toerror(port->L);
> >>> +	lua_State *L = port->L;
> >>> +	/*
> >>> +	 * At the moment Lua stack holds only values to encode.
> >>> +	 * Insert corresponding encoder to the bottom and push
> >>> +	 * encode context as lightuserdata to the top.
> >>> +	 */
> >>> +	const int size = lua_gettop(L);
> >>> +	lua_rawgeti(L, LUA_REGISTRYINDEX, execute_lua_refs[handler]);
> >>> +	assert(lua_isfunction(L, -1) && lua_iscfunction(L, -1));
> >>> +	lua_insert(L, 1);
> >>
> >> If I remember correctly, this leads to moving all the existing
> >> stack elements forward. Which might be expensive. I know from
> >> Vlad Grubov's words that they have code with hundreds of values in
> >> multireturn from stored functions. Did you bench what happens when
> >> the Lua coroutine contains a lot of values? In the bench by the
> >> link above I see only 1-element array and map. Size of the array
> >> and map does not matter though. Only multireturn is interesting
> >> here. Like 'return 1, 2, 3, ...'.
> > 
> > I've added this benchmark (200 numbers to return) to the gist.
> > Local results for the bench:
> > 
> > Master:
> > | Encode mul: 237900953 mcs, 12.6 K ps
> > 
> > My branch:
> > | Encode mul: 235735350 mcs, 12.7 K ps
> > 
> > `luamp_encode()` has the biggest impact in `port_do_lua_dump()`
> > (`lua_insert()` costs ~0.1% of the whole runtime).
> 
> That looks suspicious. Why does it have such a small impact? Did

I've run the benchmark for return case only with `perf stat` (and 5K
return values) and found three interesting metrics (others are pretty
the same, up to 0.07%):

Master:
| 10,255,748,650 cache-misses      # 13.686 % of all cache refs         (12.49%)
| 1,364,762,310  LLC-load-misses   # 15.92% of all LL-cache accesses    (12.50%)
| 401,136,504    iTLB-load-misses  # 39.68% of all iTLB cache accesses  (12.51%)

My branch:
| 9,964,939,197  cache-misses      # 13.413 % of all cache refs         (12.52%)
| 1,296,545,699  LLC-load-misses   # 15.08% of all LL-cache accesses    (12.50%)
| 285,852,865    iTLB-load-misses  # 29.13% of all iTLB cache accesses  (12.49%)

1) LLC (last-line cache, L2 in my case) misses is slightly decreasing.
I suppose that all insert movement operations use cache (L1 or L2)
instead memory access, that make it rather cheaply.

2) iTLB (instruction translation look-aside buffer) [1] needs to avoid
penalty of page walks to a page table. So looks like within this patch
code locality is improved (may be due to the fact of using
`luaT_call()` instead of 'lua_cpcall()` and 'luaT_toerror()`).

> you check that multireturn really made the stack bigger? Shouldn't

Yes, I did.

> it have linear complexity depending on the stack size after the
> target index?

Yes, with 5K items to return it becomes pretty equal or even a little
bit slower (up to ~1%) than the original code. But encoding is still
taking the biggest part of the runtime.

Side note: if user wants to return 5K values (or a table with 5K values)
from Tarantool we can't stop him, but only to offer him our empathy.

> 
> The patch looks fine to me, if the benchmarks show there is even a
> slight perf win. But I think Igor should take another look.

[1]: https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/front-end-bound/itlb-overhead.html

-- 
Best regards,
Sergey Kaplun