[tarantool-patches] [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes

Fri Jul 27 18:10:13 MSK 2018

On Thu, Jul 26, 2018 at 02:15:08PM +0300, Kirill Shcherbatov wrote:
> Part of #1012
> ---
> Branch: http://github.com/tarantool/tarantool/tree/kshch/gh-1012-json-indexes
> Issue: https://github.com/tarantool/tarantool/issues/1012
> 
>  doc/rfc/1012-json-indexes.md | 119 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 119 insertions(+)
>  create mode 100644 doc/rfc/1012-json-indexes.md
> 
> diff --git a/doc/rfc/1012-json-indexes.md b/doc/rfc/1012-json-indexes.md
> new file mode 100644
> index 0000000..a02daef
> --- /dev/null
> +++ b/doc/rfc/1012-json-indexes.md
> @@ -0,0 +1,119 @@
> +# Tarantool JSON Indexes
> +
> +* **Status**: In progress
> +* **Start date**: 26-07-2018
> +* **Authors**: Kirill Shcherbatov @kshcherbatov kshcherbatov at tarantool.org
> +* **Issues**: [#1012](https://github.com/tarantool/tarantool/issues/1012)
> +
> +## Summary
> +Tarantool JSON Indexes would allow to index a parts of complex objects stored in fields.
> +```
> +box.space.test:create_index('contributer', {parts = {{1, 'number'}, {2, 'str', data_path = "name.last"}}})
> +```

Please improve the summary. Describe what indexes one can create in
tarantool now, and what you're suggesting to add. An example of the new
API usage isn't enough. Please describe the new API like this is a
documentation page.

A couple of questions regarding the new API:
- I don't like 'data_path', why not simply 'path'?
- Are we planning to support the format proposed by a customer in the
  ticket (parts = {{"name.first", "str"}, {"name.last", "str"}})?
- How do json indexes interact with the space format? In other words
  what type should have a field that has a json index over it? I suspect
  it is implied to be a 'map'. Can we restrict the format of a
  particular json path without indexing it (i.e. via space format)?

> +
> +## Background and motivation

The content of this section doesn't match its caption. In "background
and motivation" section you should describe why we need to implement
json path indexes, not how. Description of tuple format should go to
"Detailed design" section, I think.

> +Indexes are consists of key_part(s). We should allow to specify `"data_path"` strings on indexes creation.
> +According to index definition, indexed information should be acceded really fast. Previously, only fields indexes were allowed. 
> +This was achieved with special metadata stored in tuples before main payload.
> +```
> +                           +----------------------+-----------------------+-------------+
> +  tuple_begin, ..., raw =  |      extra_size      | offset N ... offset 1 | MessagePack |
> +  |                        +----------------------+-----------------------+-------------+
> +  |                                                                       ^
> +  +-----------------------------------------------------------------------data_offset
> +```
> +Tuple format field used in index has initialized `offset_slot`: index in metadata data offsets array for this specific tuple.
> +Those offsets are used to access data without unpacking whole tuple on lookup.
> +We can use same approach to gain similar results with JSON indexes.
> +
> +## Detailed design
> +
> +### Concept idea
> +We suppose to extend 'tuple_format' with 'HASH: json_path -> offset_slot'
> +```
> +struct tuple_format {
> +	...
> +	/** Hash table with mapping json_path -> offset_slot. */
> +	struct mh_strni32_t *json_index_hash;
> +}
> +```

Is that it? Is this the whole "Concept idea"?

In the "Detailed design" design you should describe in plain English
what challenges you need to solve and how you're planning to solve them,
using code snippets only to support your words (not instead of!) so that
even someone who's not closely familiar with tarantool internals can
understand what's you're going to do here. And please try not to make
grammatical mistakes - they make the text really difficult to read.

> +### Constructing new tuples
> +1. User specify an index with json-defined structure:
> +```
> +box.space.test:create_index('contributer', {parts = {{1, 'number'}, {2, 'str', data_path = "name.last"}}})
> +```
> +2. `key_part_def` restored from msgpack generates `key_part` for field '2' with 'data_path' = "name.last".  Verification of json validity is also carried.
> +
> +3. tuple_format_create constructs new space format and initialize field included in the indexes with `tuple_field_add_json_index` routine: 
> +```
> +if (part->data_path != NULL) {
> +	if (tuple_field_add_json_index(format,
> +					part->data_path,
> +					--current_slot) != 0)
> +		return -1;
> +}
> +```
> +`tuple_field_add_json_index` constructs a new hash table if it is not exists, and insert `<data_path, hash, slot_offset>`

You could just say that for each indexed json path you allocate a slot
in the offset map instead of writing these 3 bullet points. That would
be much easier to understand...

What if the same json path is indexed twice (by two indexes)? I assume
you allocate just one slot in the offset map. This should be clearly put
in the text.

Where inside the offset map do you allocate index slots for json path
fields? After regular fields or before? Again, this should be described
in the RFC, probably with a diagram, because it's impossible to fathom
from what you wrote.

> +
> +4. Finally, on new tuple creation, `tuple_init_field_map` uses hash to fill tuple map and check that data corresponds data_path restriction:
> +```
> +for (; i < defined_field_count; ++i, ++field) {
> +	struct mh_strni32_t *hash = format->json_index_hash;
> +	if (hash != NULL) {
> +		for (mh_int_t i = mh_first(hash); i != mh_end(hash);
> +		     i = mh_next(hash, i)) {
> +			int32_t offset_slot =
> +				mh_strni32_node(hash, i)->val;
> +			const char *field = NULL;
> +			const char *data_path = mh_strni32_node(hash, i)->str;
> +			if (tuple_field_raw_by_path(format, raw, field_map,
> +						    data_path, strlen(data_path),
> +						    0, &field) != 0 || 
> +			    field == NULL)
> +				return -1;
> +			field_map[offset_slot] =
> +				(uint32_t) (field - raw);
> +		}
> +	}
> +}
> +```

I don't like that if the same json path prefix is used by multiple
indexes, you'll look it up multiple times for the same tuple,
e.g. suppose there are indexes over employee.name.first and
employee.name.last, then you'll lookup employee.name twice for every
tuple you insert into the space. This is sub-optimal.

IMO the whole idea of mapping the whole json path as it was put in by
the user to offset map slot is flawed, because two different strings can
represent the same json path (e.g. because one of them has some spaces
while the other doesn't). Suppose the user creates two indexes that
index the same json path, but address it by different (in terms of
memcpy) strings. What happens then? There will be two offset maps used
instead of one?

That said, I think you should come up with a better implementation for
mapping json path to an offset map rather than simply maintaining json
path => offset slot hash. I guess, you should split json path in nodes
and store those nodes in tuple_format somehow so that no matter how the
user enters a json path, you can find it in the format. This would also
allow you to avoid multiple accesses to the same json path prefix when
constructing a field map for a new tuple: you could decode the msgpack,
looking up indexed paths in the format on the go, i.e. you would decode
the tuple just once, no matter how many indexes are out there.

> +
> +### Speeding up access with offset_slot cache in key_part object
> +Hash lookup could be an expensive operation, so we suppose to cache offset_slots for JSON indexes. 
> +
> +We suppose to extend 'key_part' structure with 'data_path' field initializing on index creation, 'field_epoch' and reallocatable 'offset_slot' int32_t array.
> +```

And how do you expect me to understand what they will be used for?
Describe what you need them for in English (the comments in the code
snippet below don't make it any clearer).

> +struct key_part {
> +	/** Tuple field index for this part */
> +	uint32_t fieldno;
> +	...
> +
> +	/** Data JSON path. */
> +	const char *data_format;

'data_format'? 'data_path' you mean? Why not simply 'path'?

> +	/** Epoch of tuple_format cached in offset_slot. */
> +	uint32_t slot_epoch;
> +	/** Cache of corresponding tuple_format slot_offset. */
> +	int32_t slot_cache;

What are these new fields initialized to for non-json key parts?

> +};
> +```
> +
> +```
> +struct tuple_format {
> +	/** Tuple format epoch. */
> +	uint32_t epoch;
> +};
> +```

So basically, the cached slots of existing indexes will be cleared when
the space format changes (e.g. a new index is created)? Or will you
rebuild them? It's unclear from the text.

What I'm interested in is the following case. Suppose you have a space
with one json index. It has a lot of tuples. One day you decide to
create a new index over a field that wasn't indexed before. Currently
(non-json indexes), the old index will be fast for both old and new
tuples while the new index will be fast only for new tuples and slow for
old tuples, because old tuples don't have the corresponding field in the
field map.

What about json indexes? What will happen with them in such a case? If
you update key parts after the new index is created, then the old index
will be slow for all tuples, which is bad. If you don't, the new index
will be slow for all tuples, which is also bad. Anyway, it's worse than
the behavior we have now.

May be, we could somehow store field map offsets for all formats that
are currently used by tuples stored in the space in each key part, i.e.
turn slot_cache into an array:

  key_part->slot_cache[format->epoch]

> +
> +### Acceding tuple by JSON-path

Acceding?

> +#### tuple_field_raw_by_path
> +```
> +tuple_field_raw_by_path(const struct tuple_format *format, const char *tuple,
> +                        const uint32_t *field_map, const char *path,
> +                        uint32_t path_len, uint32_t path_hash,
> +                        const char **field)
> +```
> +We can trivially test if name is an index looking to the hash and obtain offset to specifying offset to `tuple_field_raw`.

I don't understand this sentence...

> +
> +
> +## Rationale and alternatives
> +...

No rationale, no alternatives?