[Tarantool-patches] [PATCH v8 3/4] box/cbox: implement cbox Lua module

Cyrill Gorcunov gorcunov at gmail.com
Sun Nov 1 00:59:46 MSK 2020

On Sat, Oct 31, 2020 at 01:21:34AM +0100, Vladislav Shpilevoy wrote:
> > 
> > I'll try to repeat. That;s interesting, actually I've been trying
> > to force gc to trigger but without much luck. Letme investigate
> > this.
> It is not about Lua GC. The collectgarbage('collect') works fine
> and the function object is deleted. But the module is not unloaded.
> Because you never call module_sym_unload().

I'll investigate.

> > Here is a paragraph above
> > ---
> >     Once function is loaded it is possible to execute it
> >     in a traditional Lua way, ie to call it as a function.
> > 
> >     ``` Lua
> >     -- call with agruments arg1 and arg2
> >     f(arg1, arg2)
> >     ```
> > ---
> > 
> > I'm not sure what else we could put here?
> This paragraph is false, it seems. Because it does not work like Lua
> functions 1-to-1. It returns nil,err or true,results. Lua functions
> return whatever you return from them so just 'results'. These cbox
> functions change the returned set a bit. And this is good, but you
> didn't document it.

Vlad, frankly I don't understand what else we could put here in the
documentation; let's talk f2f about this.

> > 
> > cbox engine uses rbtree to keep functions in memory and we have
> > to initialize it first and clean up it on exit. The lua init provides
> > Lua's interface over the engine. I can move initialization to box_lua_init
> > but where to clean things up?!
> Option 1 - nowhere like other Lua modules. Option 2 - add box_lua_free
> and add your cbox destruction there.

Then box_lua_free will suit us, I think. I don't like that we might
exit without explicit cleanup (surely the OS will reclaim the memory
on its own, but still).

> >> but AFAIK, according to our benches, when a hash is used as an index,
> >> it is always faster than any tree on lookups. Although I don't
> >> remember benches hash vs tree out of an index. Since your are doing
> >> such dubious "optimizations", is there a bench you can show, that
> >> a tree is faster than a hash?
> > 
> > The main reason for rbtree is the memory.
> Ok. Did you measure how much more efficient it is in terms of memory?
> Even in theory?

There is no need for theory. The cbox_func tree requires only one pointer
for the root and two pointers per node. In turn, our mhash requires
memory for the hash table itself plus the entries. My results show

==546461==    still reachable: 39,168 bytes in 256 blocks
==546461==   total heap usage: 257 allocs, 1 frees, 40,192 bytes allocated

==546471==    still reachable: 51,828 bytes in 260 blocks
==546471==   total heap usage: 273 allocs, 13 frees, 65,124 bytes allocated

almost twice the size. Note that the ratio highly depends on the number
of nodes/cells; in the example above we use 256 functions. I did a number
of measurements and the hash always used more memory, which is expected
because of mhash internals.

Taking into account that I don't expect that many functions to be present
in the system, this might not be critical, but still.

> > Initially I used our mhash
> > engine but I don't expect there be that many functions which allows
> > me to not allocate such big structure as a hash table, instead I only
> > need two pointers for lookup (ie 16 bytes per entry).
> > 
> > Do you really think it worth it to allocate hash table instead? I'm ready
> > to make lookup a bit slower just to eliminate redundant memory overhead.
> Yes, I think the hash is better here. But since you said the tree
> solution is better in terms of speed because of rehash, and in terms of
> memory because of I don't know what, you gotta prove it somehow. I can
> prove my point because Alexanded had benchmarks showing hash is faster
> inside of indexes. Also there is an issue in vshard about hash index being
> better: https://github.com/tarantool/vshard/issues/128. Alexey did a bench,
> AFAIR, even though we didn't post it in the issue.
> We use the hash in a lot of far more critical places in terms of perf.
> So either you are right and we need to fix them too, or we need to fix the
> hash implementation, or you are not right and we need to keep using the
> hash, when all operations are lookup. In either way we can't just say the
> tree is better for lookups and silently use it here.

Guys, you need to understand one important thing: benchmarks show a picture
on particular data. They don't explain what is happening. The heart of the
mhash engine is MurmurHash3, which is known to work well on general data,
as long as the distribution of the incoming data does not cause too many
collisions. I'll push my benchmark to GitHub so you can take a look (maybe
I simply made some mistakes in benchmarking: https://github.com/cyrillos/cbox).

Here is an example with "good" data, i.e. random "good" strings, where
hashing shows way better behaviour

(256 strings; create -- nanoseconds spent inserting all 256 entries,
 lookup -- time spent looking up every one of the 256 strings;
 benchmark built with gcc and -O2).

[cyrill at grain cbox] ./hvsrb --nr 256 --rb --mh
rb create diff :               384353 ns
rb lookup diff :               144715 ns

mh create diff :               258046 ns
mh lookup diff :                92049 ns

Creation time is comparable, while lookup is almost twice as fast. Note,
though, that these are nanoseconds, and we perform a lookup only when
inserting a new function. Adding and removing functions won't be a
frequent case.

Now let's look at "bad" data, where every string causes a collision.

[cyrill at grain cbox] ./hvsrb --rb --mh
rb create diff :               370554 ns
rb lookup diff :               138824 ns

mh create diff :              1473945 ns
mh lookup diff :               940340 ns

Here I created 256 entries, each of which is known to be "bad", and mh
performs 9 times worse. This points to the fact that the hashing
procedure is very sensitive to the incoming data, while plain strcmp
doesn't care much, so the rbtree ends up way faster.

Thus when you use a hash table you must be sure that it produces
acceptable results in the worst case, and the above shows that in the
worst case things are incredibly bad. I'm showing this to prove
that hashes are not a panacea. If I have to choose between an
rbtree and a hash here, I choose the former because 1) it requires
less memory and 2) it guarantees acceptable results even in bad
cases. And since we do not expect millions of functions, it works
acceptably fast in the good cases as well.

Still, the example above is rather synthetic: the incoming data is
chosen to produce as many collisions as possible, and there is no easy
way to generate such symbols in modules (though it is still possible,
and passing them in via a Lua script should be possible as well).

And I think we should consider moving to xxHash instead

git@github.com:Cyan4973/xxHash.git

which shows better results so far. Still, like any non-cryptographic
hash, it has collisions.

Hopefully now you agree with me that hashes are not as simple
as they look.

> > I expect at worst case to have up to 1000 functions which may require up
> > to 10 attempts.  I hate (with a passon) idea that in a middle of inserting
> > new key our hash engine start a rehashing procedure.
> If a user loads and unloads functions often, the rehash issue is going to
> be the least severe one, because much more time will be spent on doing
> mmap/munmap to load/unload the shared files.

Yes, and exactly for this reason I don't understand why we fight for
hashing, when lookup via an rbtree is going to be more than enough
here and saves us a bit of memory.

To be honest, I'm already tired of this conversation; I think I'll
simply switch to our mhash and let's move on.

> > And note that lookup happens only _once_ when you loading a new function,
> > we keep the pointer later and calling function goes directly without any
> > lookups. So I think keeping low memory usage a way more preferred.
> I am going to ask for numbers, what is the memory usage difference we are
> talking about. Otherwise it all sounds sus.

Pointed out above.

> > Also
> > we might consider moving memory allocation for functions in our own "small"
> > instace (where mh hashes simply use system's malloc/calloc calls).
> I didn't understand this. Can you elaborate?

I mean that our mhash internally uses calloc, while we would probably be
better off using our "small" engine there. But I'm not 100% sure. We have
such a good memory manager of our own that why we use so many
malloc/calloc calls is a mystery to me. I thought we could use "small"
everywhere, so a user could set a limit over every bit of Tarantool's
allocations.

> > And no, I didn't measure performance here because it was not a priority
> > when _inserting_ a new function. And I'm very sceptical about "we ran
> > a benchmark and it worked a way more better". We've been using both
> > trees and hashes for decades and open-hashing (like ours) works fine
> > until some moment where collision starts to happen and performace flushed
> > to the toiled because you start walking over the table causing pages
> > to swap in, massive tlb flushed and all this friends. On big data and
> > higly loaded systems it is very fragile area and there is still no
> > answer which structures are better :(
> Please, go to Alexander L. He has concrete numbers how much faster hash is.
> I remember he did it at least in scope of an issue about Lua tables vs spaces
> performance.
> > Sometimes if your data is small
> > enough at sits near in memory old plain linear lookup a way more
> > faster than enything else, because theory is theory but in real life
> > too many components with own limists involved.
> > 
> > Summarizing: I can easily switch back to mhash if you prefer.
> I think at this point we have gone too far to simply choose one solution,
> and we need to do the comparison. Memory comparison is simple to do - insert
> the same functions into a hash, printf its size. Insert into a tree, and
> printf its size. For perf you can either write your own micro-bench or ask
> Alexander L. for existing results.

Already did. And I agree that the conversation is already too long. Let's
switch to mhash and continue. There are more important problems in the
cbox patches you pointed out, and mhash is the least of them.

> >>
> >> 5. Why is it ssize_t? Is there any single place, where we use ssize_t
> >> for reference counting? Why ssize_t, if it is has nothing to do with
> >> size of anything?
> >>
> >> Also it probably would be better to name it 'load_count'. Because
> >> to the end of the patch I was thinking you decided to count Lua refs
> >> too somewhy, and realized it is load count only afterwards, when
> >> started writing the comments.
> > 
> > Strictly speaking plain int should fit here better but signed size
> > just guaranteed to be large enough to cover address space.
> But it is not an address either ...

I must admit you're right here.

> >> 9. We already discussed it a ton of times, literally exactly the
> >> same 'problem' with struct error objects. A 64 bit integer
> >> will not overflow for your lifetime even if it would increment
> >> every nanosecond. How is it possible? Why the code should get
> >> complicated by making a simple 'ref' function be able to fail?
> > 
> > This has nothing to do for fair use. It prevents from errors where
> > you occasionally screwed the counter.
> In the code you always either increment or decrement it. How many
> lifes you will need to get it 'screwed'?

Counters are not always screwed up by correct inc/dec calls; sometimes
it is an indirect error due to addressing bugs and such. IOW,
if we can check counters for saturation, we should. If you really
think it has such a big penalty, we could put it under an #ifdef.

> > I don't know why but for some
> > reason you think that counters are never screwed.
> With that we can get to the point that it is not safe to use TCP,
> and we need to ensure packet order, send checksums and so on. Also
> we can't rely on the main memory stability - some electrons may leak,
> and screw a bit in a number. So we need to check each number before
> we use it.
> What I think is that there is a border of sanity, when the checks
> must be stopped in order to keep the code sane and simple. An overflow
> of a 64 bit number, changed only by increments and decrements, is beyond
> that border.

Sigh, Vlad, I have already seen bugs where counters were screwed up due
to addressing issues, and it took a heck of a lot of time to figure out
why it happened. Surely Tarantool is not kernel code, and in the worst
case it simply crashes, but still...

To stop this controversy (because we are hardly going to persuade each
other) I'll simply drop this saturation check. And surely we are not
going to check every number in memory (thankfully we're not writing
medical software).

> If the counter is somehow broken, it means its memory was overridden,
> and the whole process is compromised. It makes no sense to continue
> the execution. You not only continue, you also expose this error to a
> user. What is he supposed to do with that error?

He should report it to us. This is better than continuing to operate
with code which is already broken. I don't expect it to ever happen,
but since the test for saturation is so cheap I would keep it.
Whatever, I'll drop it.

> > What we really
> > need is a general kind of kref_t and it *must* include saturation
> > check somewhere inside and panic on overflow.
> This is not really applicable in Tarantool. At least not everywhere.
> Struct tuple, for instance, is referenced, but it uses a very special
> reference counter to save every single byte, because tuple count is
> millions and billions.

And I don't propose it for bigrefs, but IIRC we have a number of places
where plain reference counters are used?

> >>> +/**
> >>> + * Allocate a new function instance.
> >>> + */
> >>> +static struct cbox_func *
> >>> +cbox_func_new(const char *name, size_t name_len)
> >>> +{
> >>> +	const ssize_t cf_size = sizeof(struct cbox_func);
> >>> +	ssize_t size = cf_size + name_len + 1;
> >>> +	if (size < 0) {
> >>
> >> 10. If this is intended to check for an overflow, it does not work.
> >> If I will pass name_len = UINT64_MAX (on my machine size_t == 8 bytes),
> >> then UINT64_MAX + 1 = 0, you will allocate only sizeof(), and then
> >> will do memcpy for UINT64_MAX below.
> >>
> >> I would suggest not to get crazy with such checks. You can't cover
> >> everything, and it is not needed.
> > 
> > It is a typo, it should be `if (size <= 0)`.
> It won't work either. Because size won't be 0. Name_len + 1 will be,
> and size will be = sizeof(struct cbox_func).
> > I think we must check
> > what users pass us and don't allow to screw a program.
> Name_len is returned by lua_tolstring(). If there is a string, whose
> size overflows size_t, I am afraid the program is already screwed,
> because there is not enough memory in the world to fit such a string.

That's a good point.

> > So this test
> > is not crazy at all.
> It really is, see above.
