[Tarantool-patches] [PATCH 2/3] box: introduce port_c

Nikita Pettik korablev at tarantool.org
Tue Apr 28 02:24:03 MSK 2020


On 27 Apr 23:28, Vladislav Shpilevoy wrote:
> Hi! Thanks for the review!
> 
> >>>> +int
> >>>> +port_c_add_mp(struct port *base, const char *mp, const char *mp_end)
> >>>> +{
> >>>> +	struct port_c *port = (struct port_c *)base;
> >>>> +	struct port_c_entry *pe;
> >>>> +	uint32_t size = mp_end - mp;
> >>>> +	char *dst;
> >>>> +	if (size <= PORT_ENTRY_SIZE) {
> >>>> +		/*
> >>>> +		 * Alloc on a mempool is several times faster than
> >>>> +		 * on the heap. And it perfectly fits any
> >>>> +		 * MessagePack number, a short string, a boolean.
> >>>> +		 */
> >>>> +		dst = mempool_alloc(&port_entry_pool);
> >>>
> >>> Dubious optimization... I mean, is this code complication really
> >>> worth it?
> >>
> >> mempool_alloc() of this size is 4-5 times faster than malloc()
> >> according to my microbenchmarks on a release build. I decided not
> >> to put the concrete 4-5x numbers in the comment, since they
> >> probably depend on the machine.
> > 
> > Is there at least one real scenario where entry allocation
> > policy can turn out to be a bottleneck in overall performance?
> 
> I guess we have plenty of such scenarios - pick any favorite among
> our GC problems, when the solution/sales teams need to write
> functions in C because of bad memory management in Lua, or because
> of its high memory overuse caused by uncontrolled unpacking of
> everything. C stored functions are supposed to be much faster than
> Lua and to be called hundreds of thousands of times per second
> easily.

I should have said '... entry allocation policy in this particular
case ...', meaning exactly port_c_add_mp(). Surely, the variety of
allocators in Tarantool exists for a reason.
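
To be concrete about what that fast path covers: values that do not
fit into one pool entry fall back to the heap. A sketch of the whole
branch, based on the patch (port_c_new_entry() and the exact
diag_set() arguments are assumptions here, not quoted verbatim):

int
port_c_add_mp(struct port *base, const char *mp, const char *mp_end)
{
	struct port_c *port = (struct port_c *)base;
	struct port_c_entry *pe;
	uint32_t size = mp_end - mp;
	char *dst;
	if (size <= PORT_ENTRY_SIZE) {
		/* Fast path: small values fit into a pooled entry. */
		dst = mempool_alloc(&port_entry_pool);
		if (dst == NULL) {
			diag_set(OutOfMemory, size, "mempool_alloc", "dst");
			return -1;
		}
	} else {
		/* Slow path: bigger values go to the heap. */
		dst = malloc(size);
		if (dst == NULL) {
			diag_set(OutOfMemory, size, "malloc", "dst");
			return -1;
		}
	}
	pe = port_c_new_entry(port);
	if (pe == NULL) {
		/* Free with the same allocator that was used above. */
		if (size <= PORT_ENTRY_SIZE)
			mempool_free(&port_entry_pool, dst);
		else
			free(dst);
		return -1;
	}
	memcpy(dst, mp, size);
	pe->mp = dst;
	pe->mp_size = size;
	return 0;
}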
 
> If we use the heap for each such request, we lose some real perf and
> aggravate heap fragmentation for other requests with lots of small
> allocations. Moreover, the allocations are of about the same size -
> exactly the case mempools are designed to handle.
> 
> I am measuring here with numbers I got on my machine. I was running
> a C function which returned 100 unsigned numbers, each <= 128. The
> function was called just 10k times, not a big number.

100 uints returned by one function sounds like an overestimation.
I mean, for a synthetic test it is perhaps OK, but is there any
profit on a real workload? Let's leave this question as rhetorical,
since our current benchmarks can't tell us. The patch LGTM for sure.
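
For the record, the benchmark function described above would look
roughly like this - a sketch assuming the box_return_mp() module API
wrapper from this patch series and msgpuck for encoding; the function
name and buffer size are made up:

#include "module.h"
#include "msgpuck.h"

/*
 * Returns 100 small unsigned numbers, each <= 128, so every
 * value fits into a single pooled port entry.
 */
int
bench_return(box_function_ctx_t *ctx, const char *args,
	     const char *args_end)
{
	(void)args;
	(void)args_end;
	for (int i = 0; i < 100; ++i) {
		char buf[16];
		char *end = mp_encode_uint(buf, i % 128);
		if (box_return_mp(ctx, buf, end) != 0)
			return -1;
	}
	return 0;
}

Each value encodes into a single MessagePack byte, so every call goes
through the mempool fast path discussed above.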

> With the memory pool the bench finished in 35ms. With the heap it
> finished in 143-144ms. So just by allocating results on the pool
> instead of the heap I won more than 100ms per 1kk mere returns. Not
> selects, gets, or other non-trivial operations. Entirely on returns,
> in pure TX thread time.
> 
> In summary, I consider this low-hanging perf fruit a good enough
> result, worth the code complication.

I just want to make sure that this optimization is really worth it.
"Premature optimization is the root of all evil."

> Talking of memory problems, my entirely subjective guess is that one
> reason we don't see malloc/free among the top results in flamegraphs
> is that we avoid using them whenever possible. All hot data
> structures are pooled - transactions, iterators, port entries (used
> not only for C functions, but for all selects too). All hot
> temporary data is batched on regions (lsregion in the case of
> vinyl). So the absence of malloc among the top functions does not
> mean we can start using it in hot places because it seems to be
> cheap. It is actually not.
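
For reference on the pooling point: with the small library, making an
object type pooled is a one-time setup per thread. A sketch, assuming
the usual small/mempool.h API and the cord()->slabc slab cache:

#include "small/mempool.h"

static struct mempool port_entry_pool;

/* Called once per TX thread, e.g. from port_init(). */
void
port_init(void)
{
	mempool_create(&port_entry_pool, &cord()->slabc,
		       PORT_ENTRY_SIZE);
}

void
port_free(void)
{
	mempool_destroy(&port_entry_pool);
}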


