[tarantool-patches] Re: [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes

Mon Jul 30 16:45:10 MSK 2018

Hi! Thanks for the new version! See 12 comments below.

Vova, please, look at my comments and say what do you think.

> 
> # Tarantool JSON Indexes
> 
> * **Status**: In progress
> * **Start date**: 26-07-2018
> * **Authors**: Kirill Shcherbatov @kshcherbatov kshcherbatov at tarantool.org
> * **Issues**: [#1012](https://github.com/tarantool/tarantool/issues/1012)
> 
> ## Summary
> Tarantool *JSON Indexes* is a feature that allows to index space's tuples that have consistent document-type complex structure.
> 
> 
> ## Background and motivation
> The box.index submodule provides read-only access for index definitions and index keys. Indexes are contained in box.space.space-name.index array within each space object. They provide an API for ordered iteration over tuples. This API is a direct binding to corresponding methods of index objects of type box.index in the storage engine.
> ```
> s:create_index('primary', {parts = {1, 'unsigned'}, {2, 'str'}})
> ```
> 
> Sometimes tuple space data have complex but consistent document structure.
> ```
> s:insert{1, {town = 'NY', name = {first = 'Richard', second = 'Feynman'}}}
> s:insert{2, {town = 'NY', name = {first = 'Ayn', second = 'Rand'}}}
> ```
> Building index on document fields would be attractive feature that allows to search and iterate over such data.
> ```
> s:create_index('name'', { parts = {{ "name.first", "str" }, {"name.last", "str"}}})
> ```
> 
> Fields having complex document structure should have 'map' type in format if specified.

1. It is not correct. I can declare JSON index on enclosed arrays like this

     parts = { {'[2][3][1][5]', 'unsigned'}, {'[3][5].field[6].field2[7]', 'string'} }

> 
> 
> ## Detailed design
> All data in Tarantool stored in atomic database objects called *tuples*. They consist of payload(user data represented as *msgpack*) and extra information that describe how does database should operate with it.
> ```
>    [[ Img 0.1 - Tarantool tuple structure ]]
> 
>                             +----------------------+-----------------------+-------------+
>    tuple_begin, ..., raw =  |      extra_size      | offset N ... offset 1 | MessagePack |
>    |                        +----------------------+-----------------------+-------------+
>    |                                                                       ^
>    +-----------------------------------------------------------------------data_offset
> ```
> Indexed information should be accessible really fast. In case of regular indexes on fields it comes possible with *field_map* - data offsets stored before user payload in each tuple. Metadata cells that are used to store those offsets *offset_slot* are not same for different *formats*(aggregated information of all indexes describing specific space). Without loss of generality this mean that tuples constructed at the moment 'A' when only PK index was registered have another *format* than the tuples constructed at the moment 'B' after new Index "name" was added.
> 
> The concept idea with *offset_slot*s could be used for JSON Indexes too.
> 
> JSON-path definitely would be a part of *key_part* structure. As we have already noted, *offset_slots* are different for formats of different generations, so we should find an universal and fast way to obtain it by JSON. As code that access tuple data by index is a really hot, the performance is our priority problem.
> We already have *tuple_field_raw_by_path* that follows JSON path in tuple and returns pointer to requested field. Sometimes fields that nested in JSON-structured document may have coinciding path part:
> e.g.:
> ```
> s:create_index('name'', { parts = {{ "name.first", "str" }, {"name.last", "str"}}})
> s:create_index('name'', { parts = {{ "town", "str" }, {"name.last", "str"}}})
> ```
> We don't like to parse coinciding suffixes multiple time for same tuples.
> 
> ### 1.1. JSON-path tree in tuple format
> Keeping in mind all challenges we suppose to introduce tree-based structure in *tuple_format* that contain JSON-path lexemes and has allocated tuple *offset_slots* as leaves nodes.
> ```
>     [[ Img 1.1 - JSON-path tree in tuple format ]]
> 
>     ----\
>         |------> [name] \
>         |  {off. cache} |----> [first]      <slot1>
>         |               |      {off. cache}
>         |               |
>         |               |----> [last]       <slot2>
>         |                      {off. cache}
>         |
>         \------> [birthday]   <slot3>
>                  {off. cache}
> ```
> 
> During tuple creation we will also fill off_cache on intermediate tree nodes.

2. As I remember from verbal discussion, we've decided to do not store
offsets for intermediate nodes. It is too expensive. You actually purpose
to store an offset for each tuple field, even non-indexed. In such a case
the field_map would become bigger than the tuple payload. Field_map is
very expensive storage and should not store non-needed offsets. So you should
not have an offset on [name], on [birthday]. Only on [first] and [last].

What is more, this example do not match the code above. Firstly you
operate with name/town, then with name/birthday, then again with
name/town. It confuses. Do you have an index on 'birthday'? If not,
then why do you need an offset on it?

> For example, when first index part "name.first" is built, we would already have name offset and would able to start there parsing only "last" lexeme.
> ```
> s:create_index('name'', { parts = {{ "name.first", "str" },
>                                     { "name.last",  "str" }}})
> ```
> The resulting information offset would be stored in slot obtained from tree leaf node.
> 
> ```
>      [[ Img 1.2 - Tuple data stored in database of same format ]]
> 
>                                                                | Description  |
>     +=======================+======+===================+
>     | slot1 | slot2 | slot3 | town |       name        |       | tuple schema |
>     +=======+=======+=======+======+============+======+
>     |   2   |   3   |   0   | NY   | Ayn | Rand |              |    tuple1    |
>     +-----------------------+------+---------+--+------+
>     |   2   |   4   |   0   | NY   | Richard | Feynman |       |    tuple2    |
>     +-----------------------+------+---------+---------+
>                             ^      ^     ^   ^
>                             0      2     3   4                 |    offset    |

                                                             ^
3. The table is malformed. Some problems with white spaces. 3 points
into the middle of 'Richard', additional '+' up to 'Feynman' etc.

4. Offset in tuple_format is number of offset slot in a tuple. Index
in the field_map array actually. So in the format offset slots always
are sequence of natural numbers + 0: 0, 1, 2, 3 ... with no holes. So
I do not understand why do you have 0, 2, 3, 4 in the format instead
of 0, 1, 2, 3.

> ```
> 
> ### 1.2. Access data by JSON-path
> The central API to access information is a routine
> ```
> static inline const char *
> tuple_field_raw(const struct tuple_format *format, const char *tuple,
>                  const uint32_t *field_map, uint32_t field_no)
> ```
> 
> In most scenarios related to indexes it is called with *field_no* extracted from *key_part* - a descriptor of part representing index.
> ```
> field_b = tuple_field_raw(format_b, tuple_b_raw, field_map_b, part->fieldno);
> ```
> With introducing parts containing JSON path we need to rework such calls.
> 
> To avoid parsing JSON path each time we suppose to introduce a cache for *offset_slot* in *key_part* structure:
> ```
> struct key_part {
>    /** Tuple field index for this part */
>    uint32_t fieldno;
>    ...
> 
>    /** Data JSON path. */
>    const char *data_path;
>    /** Epoch of tuple_format cached in offset_slot. */
>    uint32_t slot_epoch;
>    /** Cache of corresponding tuple_format slot_offset. */
>    int32_t slot_cache;

5. Looks like you have a problem with offset_slot/slot_offset naming. Please,
use one of them, not both.

> };
> ```
> same as *tuple_format* where such epoch would be initialized with a new bigger(when source key_parts have changed) value on creation.

6. What is the same as tuple_format? Looks like the first part of the
sentence is lost.

> ```
> struct tuple_format {
>    ...
> 
>    /** Epoch of tuple format. */
>    uint32_t epoch;
>    /** Offset slot tree for JSON indexes. */
>    TREE json_tree;

7. It is not for JSON indexes only and not for offsets only as well. It
is a tree to validate any JSON space format constructed from
space:format and indexes. And not all of the nodes will have offset
slots.

>    /** Formats of the fields. */
>    struct tuple_field fields[0];

8. This is the array of trees. It is not array + tree in a separate
field. You have array of trees where i-th tree describes format of
the i-th field and its internals. Some of tree-nodes have offsets
and some are just to validate the format. Do not forget that these
trees are going to be used for space:format validation. Offset_slot
is a part of tuple_field, even now, and is filled optionally if the
field is a part of an index.

> };
> ```
> 
> In all index-related scenarios with *tuple_field_raw* we would rework with a new function:
> ```
> const char *
> tuple_field_by_part(const struct tuple *tuple, const struct key_def *def,
>                      uint32_t idx);
> 
> ```
> 
> With first argument *tuple* we can obtain tuple_format that contain *epoch* field.
> ```
>    [[ Img 2.1 - Block diagram for tuple_field_by_part routine ]]
> 
>              [[def->epoch == tuple_format(tuple)->epoch]]
>                                   ||
>                    YES            ||             NO
>                   ________________/\________________
>                  |                                  |
> 
>       use offset_slot from             PARSE def->parts[idx]->data_path and
>    def->parts[idx].cached_slot   observe tuple_format(tuple)->json_tree structure,
>                                       UPDATE def->parts[idx].slot_epoch and
>                                             def->parts[idx].slot_cache

9. On 'NO' branch the key_part cache will oscillate. Example: you have a
tuple1 with format1 and key_part synced with format1. Then you add new
index and insert new tuples: tuple[2-N] with format2. Then index compares
the tuples. Each time when you compare tuple1 vs any of tuple[2-N] you
will change key_part cache version to the old one and make slower either
new tuples or old tuples. It is not ok.

Moreover you did not take into account that on comparison of tuples
you have two formats in a common case. Up to which will you update
key_part cache version?

I think that fast fix is to update cache version only to the biggest one.

And the best solution is find a way how to update key_defs on alter and
do not handle this case during comparison. We need Vova's opinion here.

> ```
> 
> PARSE is about parsing JSON path lexeme-by-lexeme and step down the json_tree until leaf with slot offset would reached.
> Then we need to UPDATE key_part cache.
> 
> 
> ## Rationale and alternatives
> 
> ### 2.1 JSON-path hash table instead of tree
> Instead of JSON-path tree in space format described in chapter (1.1) in previous section we could use hash table in format:
> ```
>     HASTABLE<const char *data_path, int32_t offset_slot>

10. HASTABLE? Maybe 'hash', not 'has'?

> 
>                              lookup
>     "name.first" ----> hash1 ------> [hash1, "name.first", offset_slot1]
>     "name.last"  ----> hash2 ------> [hash2, "name.last", offset_slot2]
> ```
> The only drawback of this approach is the inability to speed up the parsing of data on tuple creation based on the information obtained by parsing the previous paths.
> An as tuple insertion is rare frequent operation this makes sense.

11. The real problem of hashes is not linked with the one you described here.
Hash will work for any formats and will be used on comparison only. Not on
insertion. On insertion you need to decode the whole tuple to build offset
map and validate the format going along tuple field trees. Not by JSON path
of an index. On insertion you have no any paths. You have only tuple and the
array of trees of tuple fields. The real problem is that these paths can be
logically identical, but actually have some differences. For example, use
["ident"] in one index and .ident in another one.

> 
> ### 2.2 Re-allocatable slot_cache

12. No such word 're-allocatable'.

> Epoch-based invalidation is looking complicated enough.
> It is possible to allocate slot_cache[] array for each key_part and store slots for all epoch (where epoch would be an index in that array).
> ```
>                           format->epoch
> key_part {                     ||
>     slot_epoch[]:               \/
>          [1: s1, 2: s2, ... epoch: s3, ... max_epoch: sN]
> }
> 
> I.e.:
>      tuple_map[key_part->slot_epoch[format->epoch]]
> ```
> But as formats are created really often, this approach is really resource-consumable.
>