[PATCH v3.1 3/5] box: introduce JSON indexes

Sat Sep 22 22:15:31 MSK 2018

Hello,

First, the RFC got lost somewhere between reviews. Please resurrect it.
(Personally, I'm against committing RFCs to the repository, but rules
are rules).

Second, there are a few places in the patch where you simultaneously
move and modify huge chunks of code. Please don't do that, because it
makes the patch very difficult to review. Try to avoid moving chunks of
code, but if you absolutely have to, then do it in a separate patch.
Also, don't ever leave stray hunks left from previous versions (there
are a few, too). I won't bother pointing you to exact places in the
patch - you'll have to review your patch and find them by yourself.

Now, about the patch itself. I've a few major comments, which will
probably force you to rework a big part of your patch. I won't give
you a line-by-line review until those are resolved or you convince me
to leave them be:

 1. For some reason, you require that the first component of the json
    path points to the tuple field, i.e. if a tuple field has index 3,
    then the json path specified by the user and stored in the _index
    system space must start with '[3]' too. Such a redundancy looks
    unnecessary and dubious. I assume that you do that so that you can
    maintain a single hash mapping normalized path to tuple_field in
    tuple_format. If so, that doesn't justify this decision, as you can
    as well keep a hash map in each root tuple_field. Please make the
    path point to data within a field, excluding the field itself.

 2. AFAIU (I may be wrong, but it looks like it to me) the hash mapping
    a normalized path to a tuple field (the one that is in tuple_format)
    is a pure optimization. Hence it makes sense to submit it in a
    separate patch, applied on top of the main one. Since the sole
    purpose of path normalization is this optimization, it should be
    factored out and submitted separately, too.

 3. Allocation of json paths after key_def and tuple_format structures
    looks completely pointless to me. Apart from complicating the code,
    this micro optimization achieves nothing. It looks especially
    ridiculous in case of tuple_format, which has to maintain a hash
    table, which implies making multiple allocations anyway. Please
    remove it and allocate the paths in a normal way, with malloc().

    BTW, there's another place where you micro manage memory - it's json
    path normalization procedure, where you try to reuse a buffer left
    from the previous loop iteration. Again, don't do that, it isn't
    worth the complexity it brings to the code.

 4. All json related methods should be defined in src/lib/json as
    independent entities. This concerns json path validation and
    normalization, as well as json tree construction and traversal. And
    each of them should be submitted in a separate patch accompanied by
    a unit test.

    Designing the json tree class and the corresponding traversal method
    will probably require some mental effort, but it'll be worth it,
    because it'll let you remove some duplicate code, as well as split
    tuple format validation and field map construction, which are
    currently done by the same function depending on the arguments (this
    looks bad). Not mentioning that it will also make the code much
    easier to follow and review.

Those are the major points, but there are also a few minor ones
concerning the coding technique that I couldn't help noticing:

 - Very poor comments - it looks like you write them just to follow the
   code style guideline. There are a lot of subtleties in this patch
   (to name a few, what the canonical form of a json path is and why it
   is needed, what the tuple_field tree looks like and why it exists,
   how json paths are allocated and where they are stored), but you
   never elaborate in the comments - you only briefly mention what a
   function does or what a struct field stores, but never explain why or
   what's the catch. I understand that writing comments in English is
   difficult (it is for me too! Sometimes I spend more time on writing
   a comment than on the code), but it is a lame excuse. Please take
   your time, use a dictionary, consult with your colleagues to write
   good comments that are easy to read and that shed some light on
   implementation details so that one can understand what's going on
   without diving deep in the implementation.

 - Weird function names - what's tuple_field_bypass_and_init supposed to
   mean, for example? If you looked up "bypass" you'd see that it means
   "go past/around" or "avoid/evade". This is far from what this
   function is about, which is traversing the tuple field tree. Probably
   this is related to my previous point... Please be more scrupulous
   when it comes to function and variable naming, otherwise your code
   will sound hilarious.

 - Rather strange way to divide the code into functions. For instance,
   take tuple_field_dig_with_parser. Why would you beget such a monster
   and let it out in the open? Outside tuple_format.c you only need two
   ways of looking up data by path - absolute (i.e. including the field
   number) for Lua and relative (i.e. without the field number) for
   extracting a key. So introduce two functions with appropriate names,
   say tuple_field_by_relative_path and tuple_field_by_absolute_path
   (think about the names, please). If we ever need the full power of
   tuple_field_dig_with_parser (the name sucks BTW) somewhere else, we
   will refactor the code again.

   Another example is tuple_field_by_part. I'd expect it to simply call
   tuple_field or tuple_field_by_path, depending on whether the part has
   a json path attached to it or not. However, this function also does
   an optimistic lookup in tuple_format->path_hash. Why can't we do this
   lookup in tuple_field_by_path? Don't we want to optimize Lua calls to
   tuple_field_by_path too? Oh wait, it does a lookup in the hash...
   What's going on there?

Those are the minor points. Again, I won't be pointing to exact places
in the patch, because that would be too tedious. I expect you to go over
your patch and fix them on your own. I'll correct you in the next review
round if you miss anything.

And please don't forget to update the comment to the patch and probably
the RFC when you fix the major points (provided we agree on them, of
course).

Thanks,
Vladimir