[PATCH v3.1 3/5] box: introduce JSON indexes
Vladimir Davydov
vdavydov.dev at gmail.com
Sat Sep 22 22:15:31 MSK 2018
Hello,
First, the RFC got lost somewhere between reviews. Please resurrect it.
(Personally, I'm against committing RFCs to the repository, but rules
are rules).
Second, there are a few places in the patch where you simultaneously
move and modify huge chunks of code. Please don't do that, because it
makes the patch very difficult to review. Try to avoid moving chunks of
code, but if you absolutely have to, then do it in a separate patch.
Also, don't ever leave stray hunks left from previous versions (there
are a few, too). I won't bother pointing you to exact places in the
patch - you'll have to review your patch and find them by yourself.
Now, about the patch itself. I've a few major comments, which will
probably force you to rework a big part of your patch. I won't give
you a line-by-line review until those are resolved or you convince me
to leave them be:
1. For some reason, you require that the first component of the json
path points to the tuple field, i.e. if a tuple field has index 3,
then the json path specified by the user and stored in the _index
system space must start with '[3]' too. Such a redundancy looks
unnecessary and dubious. I assume that you do that so that you can
maintain a single hash mapping normalized path to tuple_field in
tuple_format. If so, that doesn't justify this decision, as you can
as well keep a hash map in each root tuple_field. Please make the
path point to data within a field, excluding the field itself.
2. AFAIU (I may be wrong, but it looks like it to me) the hash mapping
a normalized path to a tuple field (the one that is in tuple_format)
is a pure optimization. Hence it makes sense to submit it in a
separate patch, applied on top of the main one. Since the sole
purpose of path normalization is this optimization, it should be
factored out and submitted separately, too.
3. Allocation of json paths after key_def and tuple_format structures
looks completely pointless to me. Apart from complicating the code,
this micro optimization achieves nothing. It looks especially
ridiculous in case of tuple_format, which has to maintain a hash
table, which implies making multiple allocations anyway. Please
remove it and allocate the paths in a normal way, with malloc().
BTW, there's another place where you micro manage memory - it's json
path normalization procedure, where you try to reuse a buffer left
from the previous loop iteration. Again, don't do that, it isn't
worth the complexity it brings to the code.
4. All json related methods should be defined in src/lib/json as
independent entities. This concerns json path validation and
normalization, as well as json tree construction and traversal. And
each of them should be submitted in a separate patch accompanied by
a unit test.
Designing the json tree class and the corresponding traversal method
will probably require some mental effort, but it'll be worth it,
because it'll let you remove some duplicate code, as well as split
tuple format validation and field map construction, which are
currently done by the same function depending on the arguments (this
looks bad). Not mentioning that it will also make the code much
easier to follow and review.
Those are the major points, but there are also a few minor ones
concerning the coding technique that I couldn't help noticing:
- Very poor comments - it looks like you write them just to follow the
code style guideline. There are a lot of subtleties in this patch
(to name a few, what the canonical form of a json path is and why it
is needed, what the tuple_field tree looks like and why it exists,
how json paths are allocated and where they are stored), but you
never elaborate in the comments - you only briefly mention what a
function does or what a struct field stores, but never explain why or
what's the catch. I understand that writing comments in English is
difficult (it is for me too! Sometimes I spend more time on writing
a comment than on the code), but it is a lame excuse. Please take
your time, use a dictionary, consult with your colleagues to write
good comments that are easy to read and that shed some light on
implementation details so that one can understand what's going on
without diving deep in the implementation.
- Weird function names - what's tuple_field_bypass_and_init supposed to
mean, for example? If you looked up "bypass" you'd see that it means
"go past/around" or "avoid/evade". This is far from what this
function is about, which is traversing the tuple field tree. Probably
this is related to my previous point... Please be more scrupulous
when it comes to function and variable naming, otherwise your code
will sound hilarious.
- Rather strange way to divide the code into functions. For instance,
take tuple_field_dig_with_parser. Why would you beget such a monster
and let it out in the open? Outside tuple_format.c you only need two
ways of looking up data by path - absolute (i.e. including the field
number) for Lua and relative (i.e. without the field number) for
extracting a key. So introduce two functions with appropriate names,
say tuple_field_by_relative_path and tuple_field_by_absolute_path
(think about the names, please). If we ever need the full power of
tuple_field_dig_with_parser (the name sucks BTW) somewhere else, we
will refactor the code again.
Another example is tuple_field_by_part. I'd expect it to simply call
tuple_field or tuple_field_by_path, depending on whether the part has
a json path attached to it or not. However, this function also does
an optimistic lookup in tuple_format->path_hash. Why can't we do this
lookup in tuple_field_by_path? Don't we want to optimize Lua calls to
tuple_field_by_path too? Oh wait, it does a lookup in the hash...
What's going on there?
Those are the minor points. Again, I won't be pointing to exact places
in the patch, because that would be too tedious. I expect you to go over
your patch and fix them on your own. I'll correct you in the next review
round if you miss anything.
And please don't forget to update the comment to the patch and probably
the RFC when you fix the major points (provided we agree on them, of
course).
Thanks,
Vladimir
More information about the Tarantool-patches
mailing list