Tarantool development patches archive
 help / color / mirror / Atom feed
* [tarantool-patches] [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes
@ 2018-07-26 11:15 Kirill Shcherbatov
  2018-07-27 15:10 ` Vladimir Davydov
  0 siblings, 1 reply; 11+ messages in thread
From: Kirill Shcherbatov @ 2018-07-26 11:15 UTC (permalink / raw)
  To: tarantool-patches; +Cc: v.shpilevoy, Kirill Shcherbatov

Part of #1012
---
Branch: http://github.com/tarantool/tarantool/tree/kshch/gh-1012-json-indexes
Issue: https://github.com/tarantool/tarantool/issues/1012

 doc/rfc/1012-json-indexes.md | 119 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 119 insertions(+)
 create mode 100644 doc/rfc/1012-json-indexes.md

diff --git a/doc/rfc/1012-json-indexes.md b/doc/rfc/1012-json-indexes.md
new file mode 100644
index 0000000..a02daef
--- /dev/null
+++ b/doc/rfc/1012-json-indexes.md
@@ -0,0 +1,119 @@
+# Tarantool JSON Indexes
+
+* **Status**: In progress
+* **Start date**: 26-07-2018
+* **Authors**: Kirill Shcherbatov @kshcherbatov kshcherbatov@tarantool.org
+* **Issues**: [#1012](https://github.com/tarantool/tarantool/issues/1012)
+
+## Summary
+Tarantool JSON Indexes would allow to index a parts of complex objects stored in fields.
+```
+box.space.test:create_index('contributer', {parts = {{1, 'number'}, {2, 'str', data_path = "name.last"}}})
+```
+
+## Background and motivation
+Indexes are consists of key_part(s). We should allow to specify `"data_path"` strings on indexes creation.
+According to index definition, indexed information should be acceded really fast. Previously, only fields indexes were allowed. 
+This was achieved with special metadata stored in tuples before main payload.
+```
+                           +----------------------+-----------------------+-------------+
+  tuple_begin, ..., raw =  |      extra_size      | offset N ... offset 1 | MessagePack |
+  |                        +----------------------+-----------------------+-------------+
+  |                                                                       ^
+  +-----------------------------------------------------------------------data_offset
+```
+Tuple format field used in index has initialized `offset_slot`: index in metadata data offsets array for this specific tuple.
+Those offsets are used to access data without unpacking whole tuple on lookup.
+We can use same approach to gain similar results with JSON indexes.
+
+## Detailed design
+
+### Concept idea
+We suppose to extend 'tuple_format' with 'HASH: json_path -> offset_slot'
+```
+struct tuple_format {
+	...
+	/** Hash table with mapping json_path -> offset_slot. */
+	struct mh_strni32_t *json_index_hash;
+}
+```
+### Constructing new tuples
+1. User specify an index with json-defined structure:
+```
+box.space.test:create_index('contributer', {parts = {{1, 'number'}, {2, 'str', data_path = "name.last"}}})
+```
+2. `key_part_def` restored from msgpack generates `key_part` for field '2' with 'data_path' = "name.last".  Verification of json validity is also carried.
+
+3. tuple_format_create constructs new space format and initialize field included in the indexes with `tuple_field_add_json_index` routine: 
+```
+if (part->data_path != NULL) {
+	if (tuple_field_add_json_index(format,
+					part->data_path,
+					--current_slot) != 0)
+		return -1;
+}
+```
+`tuple_field_add_json_index` constructs a new hash table if it is not exists, and insert `<data_path, hash, slot_offset>`
+
+4. Finally, on new tuple creation, `tuple_init_field_map` uses hash to fill tuple map and check that data corresponds data_path restriction:
+```
+for (; i < defined_field_count; ++i, ++field) {
+	struct mh_strni32_t *hash = format->json_index_hash;
+	if (hash != NULL) {
+		for (mh_int_t i = mh_first(hash); i != mh_end(hash);
+		     i = mh_next(hash, i)) {
+			int32_t offset_slot =
+				mh_strni32_node(hash, i)->val;
+			const char *field = NULL;
+			const char *data_path = mh_strni32_node(hash, i)->str;
+			if (tuple_field_raw_by_path(format, raw, field_map,
+						    data_path, strlen(data_path),
+						    0, &field) != 0 || 
+			    field == NULL)
+				return -1;
+			field_map[offset_slot] =
+				(uint32_t) (field - raw);
+		}
+	}
+}
+```
+
+### Speeding up access with offset_slot cache in key_part object
+Hash lookup could be an expensive operation, so we suppose to cache offset_slots for JSON indexes. 
+
+We suppose to extend 'key_part' structure with 'data_path' field initializing on index creation, 'field_epoch' and reallocatable 'offset_slot' int32_t array.
+```
+struct key_part {
+	/** Tuple field index for this part */
+	uint32_t fieldno;
+	...
+
+	/** Data JSON path. */
+	const char *data_format;
+	/** Epoch of tuple_format cached in offset_slot. */
+	uint32_t slot_epoch;
+	/** Cache of corresponding tuple_format slot_offset. */
+	int32_t slot_cache;
+};
+```
+
+```
+struct tuple_format {
+	/** Tuple format epoch. */
+	uint32_t epoch;
+};
+```
+
+### Acceding tuple by JSON-path
+#### tuple_field_raw_by_path
+```
+tuple_field_raw_by_path(const struct tuple_format *format, const char *tuple,
+                        const uint32_t *field_map, const char *path,
+                        uint32_t path_len, uint32_t path_hash,
+                        const char **field)
+```
+We can trivially test if name is an index looking to the hash and obtain offset to specifying offset to `tuple_field_raw`.
+
+
+## Rationale and alternatives
+...
-- 
2.7.4

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [tarantool-patches] [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes
  2018-07-26 11:15 [tarantool-patches] [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes Kirill Shcherbatov
@ 2018-07-27 15:10 ` Vladimir Davydov
  2018-07-30 12:02   ` [tarantool-patches] " Kirill Shcherbatov
  0 siblings, 1 reply; 11+ messages in thread
From: Vladimir Davydov @ 2018-07-27 15:10 UTC (permalink / raw)
  To: Kirill Shcherbatov; +Cc: tarantool-patches, v.shpilevoy

On Thu, Jul 26, 2018 at 02:15:08PM +0300, Kirill Shcherbatov wrote:
> Part of #1012
> ---
> Branch: http://github.com/tarantool/tarantool/tree/kshch/gh-1012-json-indexes
> Issue: https://github.com/tarantool/tarantool/issues/1012
> 
>  doc/rfc/1012-json-indexes.md | 119 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 119 insertions(+)
>  create mode 100644 doc/rfc/1012-json-indexes.md
> 
> diff --git a/doc/rfc/1012-json-indexes.md b/doc/rfc/1012-json-indexes.md
> new file mode 100644
> index 0000000..a02daef
> --- /dev/null
> +++ b/doc/rfc/1012-json-indexes.md
> @@ -0,0 +1,119 @@
> +# Tarantool JSON Indexes
> +
> +* **Status**: In progress
> +* **Start date**: 26-07-2018
> +* **Authors**: Kirill Shcherbatov @kshcherbatov kshcherbatov@tarantool.org
> +* **Issues**: [#1012](https://github.com/tarantool/tarantool/issues/1012)
> +
> +## Summary
> +Tarantool JSON Indexes would allow to index a parts of complex objects stored in fields.
> +```
> +box.space.test:create_index('contributer', {parts = {{1, 'number'}, {2, 'str', data_path = "name.last"}}})
> +```

Please improve the summary. Describe what indexes one can create in
tarantool now, and what you're suggesting to add. An example of the new
API usage isn't enough. Please describe the new API like this is a
documentation page.

A couple of questions regarding the new API:
- I don't like 'data_path', why not simply 'path'?
- Are we planning to support the format proposed by a customer in the
  ticket (parts = {{"name.first", "str"}, {"name.last", "str"}})?
- How do json indexes interact with the space format? In other words
  what type should have a field that has a json index over it? I suspect
  it is implied to be a 'map'. Can we restrict the format of a
  particular json path without indexing it (i.e. via space format)?

> +
> +## Background and motivation

The content of this section doesn't match its caption. In "background
and motivation" section you should describe why we need to implement
json path indexes, not how. Description of tuple format should go to
"Detailed design" section, I think.

> +Indexes are consists of key_part(s). We should allow to specify `"data_path"` strings on indexes creation.
> +According to index definition, indexed information should be acceded really fast. Previously, only fields indexes were allowed. 
> +This was achieved with special metadata stored in tuples before main payload.
> +```
> +                           +----------------------+-----------------------+-------------+
> +  tuple_begin, ..., raw =  |      extra_size      | offset N ... offset 1 | MessagePack |
> +  |                        +----------------------+-----------------------+-------------+
> +  |                                                                       ^
> +  +-----------------------------------------------------------------------data_offset
> +```
> +Tuple format field used in index has initialized `offset_slot`: index in metadata data offsets array for this specific tuple.
> +Those offsets are used to access data without unpacking whole tuple on lookup.
> +We can use same approach to gain similar results with JSON indexes.
> +
> +## Detailed design
> +
> +### Concept idea
> +We suppose to extend 'tuple_format' with 'HASH: json_path -> offset_slot'
> +```
> +struct tuple_format {
> +	...
> +	/** Hash table with mapping json_path -> offset_slot. */
> +	struct mh_strni32_t *json_index_hash;
> +}
> +```

Is that it? Is this the whole "Concept idea"?

In the "Detailed design" design you should describe in plain English
what challenges you need to solve and how you're planning to solve them,
using code snippets only to support your words (not instead of!) so that
even someone who's not closely familiar with tarantool internals can
understand what's you're going to do here. And please try not to make
grammatical mistakes - they make the text really difficult to read.

> +### Constructing new tuples
> +1. User specify an index with json-defined structure:
> +```
> +box.space.test:create_index('contributer', {parts = {{1, 'number'}, {2, 'str', data_path = "name.last"}}})
> +```
> +2. `key_part_def` restored from msgpack generates `key_part` for field '2' with 'data_path' = "name.last".  Verification of json validity is also carried.
> +
> +3. tuple_format_create constructs new space format and initialize field included in the indexes with `tuple_field_add_json_index` routine: 
> +```
> +if (part->data_path != NULL) {
> +	if (tuple_field_add_json_index(format,
> +					part->data_path,
> +					--current_slot) != 0)
> +		return -1;
> +}
> +```
> +`tuple_field_add_json_index` constructs a new hash table if it is not exists, and insert `<data_path, hash, slot_offset>`

You could just say that for each indexed json path you allocate a slot
in the offset map instead of writing these 3 bullet points. That would
be much easier to understand...

What if the same json path is indexed twice (by two indexes)? I assume
you allocate just one slot in the offset map. This should be clearly put
in the text.

Where inside the offset map do you allocate index slots for json path
fields? After regular fields or before? Again, this should be described
in the RFC, probably with a diagram, because it's impossible to fathom
from what you wrote.

> +
> +4. Finally, on new tuple creation, `tuple_init_field_map` uses hash to fill tuple map and check that data corresponds data_path restriction:
> +```
> +for (; i < defined_field_count; ++i, ++field) {
> +	struct mh_strni32_t *hash = format->json_index_hash;
> +	if (hash != NULL) {
> +		for (mh_int_t i = mh_first(hash); i != mh_end(hash);
> +		     i = mh_next(hash, i)) {
> +			int32_t offset_slot =
> +				mh_strni32_node(hash, i)->val;
> +			const char *field = NULL;
> +			const char *data_path = mh_strni32_node(hash, i)->str;
> +			if (tuple_field_raw_by_path(format, raw, field_map,
> +						    data_path, strlen(data_path),
> +						    0, &field) != 0 || 
> +			    field == NULL)
> +				return -1;
> +			field_map[offset_slot] =
> +				(uint32_t) (field - raw);
> +		}
> +	}
> +}
> +```

I don't like that if the same json path prefix is used by multiple
indexes, you'll look it up multiple times for the same tuple,
e.g. suppose there are indexes over employee.name.first and
employee.name.last, then you'll lookup employee.name twice for every
tuple you insert into the space. This is sub-optimal.

IMO the whole idea of mapping the whole json path as it was put in by
the user to offset map slot is flawed, because two different strings can
represent the same json path (e.g. because one of them has some spaces
while the other doesn't). Suppose the user creates two indexes that
index the same json path, but address it by different (in terms of
memcpy) strings. What happens then? There will be two offset maps used
instead of one?

That said, I think you should come up with a better implementation for
mapping json path to an offset map rather than simply maintaining json
path => offset slot hash. I guess, you should split json path in nodes
and store those nodes in tuple_format somehow so that no matter how the
user enters a json path, you can find it in the format. This would also
allow you to avoid multiple accesses to the same json path prefix when
constructing a field map for a new tuple: you could decode the msgpack,
looking up indexed paths in the format on the go, i.e. you would decode
the tuple just once, no matter how many indexes are out there.

> +
> +### Speeding up access with offset_slot cache in key_part object
> +Hash lookup could be an expensive operation, so we suppose to cache offset_slots for JSON indexes. 
> +
> +We suppose to extend 'key_part' structure with 'data_path' field initializing on index creation, 'field_epoch' and reallocatable 'offset_slot' int32_t array.
> +```

And how do you expect me to understand what they will be used for?
Describe what you need them for in English (the comments in the code
snippet below don't make it any clearer).

> +struct key_part {
> +	/** Tuple field index for this part */
> +	uint32_t fieldno;
> +	...
> +
> +	/** Data JSON path. */
> +	const char *data_format;

'data_format'? 'data_path' you mean? Why not simply 'path'?

> +	/** Epoch of tuple_format cached in offset_slot. */
> +	uint32_t slot_epoch;
> +	/** Cache of corresponding tuple_format slot_offset. */
> +	int32_t slot_cache;

What are these new fields initialized to for non-json key parts?

> +};
> +```
> +
> +```
> +struct tuple_format {
> +	/** Tuple format epoch. */
> +	uint32_t epoch;
> +};
> +```

So basically, the cached slots of existing indexes will be cleared when
the space format changes (e.g. a new index is created)? Or will you
rebuild them? It's unclear from the text.

What I'm interested in is the following case. Suppose you have a space
with one json index. It has a lot of tuples. One day you decide to
create a new index over a field that wasn't indexed before. Currently
(non-json indexes), the old index will be fast for both old and new
tuples while the new index will be fast only for new tuples and slow for
old tuples, because old tuples don't have the corresponding field in the
field map.

What about json indexes? What will happen with them in such a case? If
you update key parts after the new index is created, then the old index
will be slow for all tuples, which is bad. If you don't, the new index
will be slow for all tuples, which is also bad. Anyway, it's worse than
the behavior we have now.

May be, we could somehow store field map offsets for all formats that
are currently used by tuples stored in the space in each key part, i.e.
turn slot_cache into an array:

  key_part->slot_cache[format->epoch]

> +
> +### Acceding tuple by JSON-path

Acceding?

> +#### tuple_field_raw_by_path
> +```
> +tuple_field_raw_by_path(const struct tuple_format *format, const char *tuple,
> +                        const uint32_t *field_map, const char *path,
> +                        uint32_t path_len, uint32_t path_hash,
> +                        const char **field)
> +```
> +We can trivially test if name is an index looking to the hash and obtain offset to specifying offset to `tuple_field_raw`.

I don't understand this sentence...

> +
> +
> +## Rationale and alternatives
> +...

No rationale, no alternatives?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes
  2018-07-27 15:10 ` Vladimir Davydov
@ 2018-07-30 12:02   ` Kirill Shcherbatov
  2018-07-30 12:21     ` Vladimir Davydov
  2018-07-30 13:45     ` Vladislav Shpilevoy
  0 siblings, 2 replies; 11+ messages in thread
From: Kirill Shcherbatov @ 2018-07-30 12:02 UTC (permalink / raw)
  To: tarantool-patches, Vladimir Davydov

Hi! Thanks for so detailed and informative criticism!
I've updated RFC. I've tried to be more accurate with details and provide detailed explanations. 
https://github.com/tarantool/tarantool/blob/a78d21282d5cb9033699d3cb50ead91d01dffbb4/doc/rfc/1012-json-indexes.md

=====================================================

# Tarantool JSON Indexes

* **Status**: In progress
* **Start date**: 26-07-2018
* **Authors**: Kirill Shcherbatov @kshcherbatov kshcherbatov@tarantool.org
* **Issues**: [#1012](https://github.com/tarantool/tarantool/issues/1012)

## Summary
Tarantool *JSON Indexes* is a feature that allows to index space's tuples that have consistent document-type complex structure.


## Background and motivation
The box.index submodule provides read-only access for index definitions and index keys. Indexes are contained in box.space.space-name.index array within each space object. They provide an API for ordered iteration over tuples. This API is a direct binding to corresponding methods of index objects of type box.index in the storage engine.
```
s:create_index('primary', {parts = {1, 'unsigned'}, {2, 'str'}})
```

Sometimes tuple space data have complex but consistent document structure.
```
s:insert{1, {town = 'NY', name = {first = 'Richard', second = 'Feynman'}}}
s:insert{2, {town = 'NY', name = {first = 'Ayn', second = 'Rand'}}}
```
Building index on document fields would be attractive feature that allows to search and iterate over such data.
```
s:create_index('name'', { parts = {{ "name.first", "str" }, {"name.last", "str"}}})
```

Fields having complex document structure should have 'map' type in format if specified.


## Detailed design
All data in Tarantool stored in atomic database objects called *tuples*. They consist of payload(user data represented as *msgpack*) and extra information that describe how does database should operate with it.
```
  [[ Img 0.1 - Tarantool tuple structure ]]

                           +----------------------+-----------------------+-------------+
  tuple_begin, ..., raw =  |      extra_size      | offset N ... offset 1 | MessagePack |
  |                        +----------------------+-----------------------+-------------+
  |                                                                       ^
  +-----------------------------------------------------------------------data_offset
```
Indexed information should be accessible really fast. In case of regular indexes on fields it comes possible with *field_map* - data offsets stored before user payload in each tuple. Metadata cells that are used to store those offsets *offset_slot* are not same for different *formats*(aggregated information of all indexes describing specific space). Without loss of generality this mean that tuples constructed at the moment 'A' when only PK index was registered have another *format* than the tuples constructed at the moment 'B' after new Index "name" was added.

The concept idea with *offset_slot*s could be used for JSON Indexes too.

JSON-path definitely would be a part of *key_part* structure. As we have already noted, *offset_slots* are different for formats of different generations, so we should find an universal and fast way to obtain it by JSON. As code that access tuple data by index is a really hot, the performance is our priority problem.
We already have *tuple_field_raw_by_path* that follows JSON path in tuple and returns pointer to requested field. Sometimes fields that nested in JSON-structured document may have coinciding path part:
e.g.:
```
s:create_index('name'', { parts = {{ "name.first", "str" }, {"name.last", "str"}}})
s:create_index('name'', { parts = {{ "town", "str" }, {"name.last", "str"}}})
```
We don't like to parse coinciding suffixes multiple time for same tuples.

### 1.1. JSON-path tree in tuple format
Keeping in mind all challenges we suppose to introduce tree-based structure in *tuple_format* that contain JSON-path lexemes and has allocated tuple *offset_slots* as leaves nodes.
```
   [[ Img 1.1 - JSON-path tree in tuple format ]]

   ----\
       |------> [name] \
       |  {off. cache} |----> [first]      <slot1>
       |               |      {off. cache}
       |               |
       |               |----> [last]       <slot2>
       |                      {off. cache}
       |
       \------> [birthday]   <slot3>
                {off. cache}
```

During tuple creation we will also fill off_cache on intermediate tree nodes.
For example, when first index part "name.first" is built, we would already have name offset and would able to start there parsing only "last" lexeme.
```
s:create_index('name'', { parts = {{ "name.first", "str" },
                                   { "name.last",  "str" }}})
```
The resulting information offset would be stored in slot obtained from tree leaf node.

```
    [[ Img 1.2 - Tuple data stored in database of same format ]]

                                                              | Description  |
   +=======================+======+===================+
   | slot1 | slot2 | slot3 | town |       name        |       | tuple schema |
   +=======+=======+=======+======+============+======+
   |   2   |   3   |   0   | NY   | Ayn | Rand |              |    tuple1    |
   +-----------------------+------+---------+--+------+
   |   2   |   4   |   0   | NY   | Richard | Feynman |       |    tuple2    |
   +-----------------------+------+---------+---------+
                           ^      ^     ^   ^
                           0      2     3   4                 |    offset    |
```

### 1.2. Access data by JSON-path
The central API to access information is a routine
```
static inline const char *
tuple_field_raw(const struct tuple_format *format, const char *tuple,
                const uint32_t *field_map, uint32_t field_no)
```

In most scenarios related to indexes it is called with *field_no* extracted from *key_part* - a descriptor of part representing index.
```
field_b = tuple_field_raw(format_b, tuple_b_raw, field_map_b, part->fieldno);
```
With introducing parts containing JSON path we need to rework such calls.

To avoid parsing JSON path each time we suppose to introduce a cache for *offset_slot* in *key_part* structure:
```
struct key_part {
  /** Tuple field index for this part */
  uint32_t fieldno;
  ...

  /** Data JSON path. */
  const char *data_path;
  /** Epoch of tuple_format cached in offset_slot. */
  uint32_t slot_epoch;
  /** Cache of corresponding tuple_format slot_offset. */
  int32_t slot_cache;
};
```
same as *tuple_format* where such epoch would be initialized with a new bigger(when source key_parts have changed) value on creation.
```
struct tuple_format {
  ...

  /** Epoch of tuple format. */
  uint32_t epoch;
  /** Offset slot tree for JSON indexes. */
  TREE json_tree;
  /** Formats of the fields. */
  struct tuple_field fields[0];
};
```

In all index-related scenarios with *tuple_field_raw* we would rework with a new function:
```
const char *
tuple_field_by_part(const struct tuple *tuple, const struct key_def *def,
                    uint32_t idx);

```

With first argument *tuple* we can obtain tuple_format that contain *epoch* field.
```
  [[ Img 2.1 - Block diagram for tuple_field_by_part routine ]]

            [[def->epoch == tuple_format(tuple)->epoch]]
                                 ||
                  YES            ||             NO
                 ________________/\________________
                |                                  |

     use offset_slot from             PARSE def->parts[idx]->data_path and
  def->parts[idx].cached_slot   observe tuple_format(tuple)->json_tree structure,
                                     UPDATE def->parts[idx].slot_epoch and
                                           def->parts[idx].slot_cache
```

PARSE is about parsing JSON path lexeme-by-lexeme and step down the json_tree until leaf with slot offset would reached.
Then we need to UPDATE key_part cache.


## Rationale and alternatives

### 2.1 JSON-path hash table instead of tree
Instead of JSON-path tree in space format described in chapter (1.1) in previous section we could use hash table in format:
```
   HASTABLE<const char *data_path, int32_t offset_slot>

                            lookup
   "name.first" ----> hash1 ------> [hash1, "name.first", offset_slot1]
   "name.last"  ----> hash2 ------> [hash2, "name.last", offset_slot2]
```
The only drawback of this approach is the inability to speed up the parsing of data on tuple creation based on the information obtained by parsing the previous paths.
An as tuple insertion is rare frequent operation this makes sense.

### 2.2 Re-allocatable slot_cache
Epoch-based invalidation is looking complicated enough.
It is possible to allocate slot_cache[] array for each key_part and store slots for all epoch (where epoch would be an index in that array).
```
                         format->epoch
key_part {                     ||
   slot_epoch[]:               \/
        [1: s1, 2: s2, ... epoch: s3, ... max_epoch: sN]
}

I.e.:
    tuple_map[key_part->slot_epoch[format->epoch]]
```
But as formats are created really often, this approach is really resource-consumable.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes
  2018-07-30 12:02   ` [tarantool-patches] " Kirill Shcherbatov
@ 2018-07-30 12:21     ` Vladimir Davydov
  2018-07-30 13:45     ` Vladislav Shpilevoy
  1 sibling, 0 replies; 11+ messages in thread
From: Vladimir Davydov @ 2018-07-30 12:21 UTC (permalink / raw)
  To: Kirill Shcherbatov, Vladislav Shpilevoy; +Cc: tarantool-patches

On Mon, Jul 30, 2018 at 03:02:32PM +0300, Kirill Shcherbatov wrote:
> Hi! Thanks for so detailed and informative criticism!
> I've updated RFC. I've tried to be more accurate with details and provide detailed explanations. 
> https://github.com/tarantool/tarantool/blob/a78d21282d5cb9033699d3cb50ead91d01dffbb4/doc/rfc/1012-json-indexes.md

Thanks, the new version looks OK to me.

Vladislav, take a look please.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes
  2018-07-30 12:02   ` [tarantool-patches] " Kirill Shcherbatov
  2018-07-30 12:21     ` Vladimir Davydov
@ 2018-07-30 13:45     ` Vladislav Shpilevoy
  2018-07-30 14:09       ` Kirill Shcherbatov
  2018-07-30 16:14       ` Kirill Shcherbatov
  1 sibling, 2 replies; 11+ messages in thread
From: Vladislav Shpilevoy @ 2018-07-30 13:45 UTC (permalink / raw)
  To: tarantool-patches, Kirill Shcherbatov, Vladimir Davydov

Hi! Thanks for the new version! See 12 comments below.

Vova, please, look at my comments and say what do you think.

> 
> # Tarantool JSON Indexes
> 
> * **Status**: In progress
> * **Start date**: 26-07-2018
> * **Authors**: Kirill Shcherbatov @kshcherbatov kshcherbatov@tarantool.org
> * **Issues**: [#1012](https://github.com/tarantool/tarantool/issues/1012)
> 
> ## Summary
> Tarantool *JSON Indexes* is a feature that allows to index space's tuples that have consistent document-type complex structure.
> 
> 
> ## Background and motivation
> The box.index submodule provides read-only access for index definitions and index keys. Indexes are contained in box.space.space-name.index array within each space object. They provide an API for ordered iteration over tuples. This API is a direct binding to corresponding methods of index objects of type box.index in the storage engine.
> ```
> s:create_index('primary', {parts = {1, 'unsigned'}, {2, 'str'}})
> ```
> 
> Sometimes tuple space data have complex but consistent document structure.
> ```
> s:insert{1, {town = 'NY', name = {first = 'Richard', second = 'Feynman'}}}
> s:insert{2, {town = 'NY', name = {first = 'Ayn', second = 'Rand'}}}
> ```
> Building index on document fields would be attractive feature that allows to search and iterate over such data.
> ```
> s:create_index('name'', { parts = {{ "name.first", "str" }, {"name.last", "str"}}})
> ```
> 
> Fields having complex document structure should have 'map' type in format if specified.

1. It is not correct. I can declare JSON index on enclosed arrays like this

     parts = { {'[2][3][1][5]', 'unsigned'}, {'[3][5].field[6].field2[7]', 'string'} }

> 
> 
> ## Detailed design
> All data in Tarantool stored in atomic database objects called *tuples*. They consist of payload(user data represented as *msgpack*) and extra information that describe how does database should operate with it.
> ```
>    [[ Img 0.1 - Tarantool tuple structure ]]
> 
>                             +----------------------+-----------------------+-------------+
>    tuple_begin, ..., raw =  |      extra_size      | offset N ... offset 1 | MessagePack |
>    |                        +----------------------+-----------------------+-------------+
>    |                                                                       ^
>    +-----------------------------------------------------------------------data_offset
> ```
> Indexed information should be accessible really fast. In case of regular indexes on fields it comes possible with *field_map* - data offsets stored before user payload in each tuple. Metadata cells that are used to store those offsets *offset_slot* are not same for different *formats*(aggregated information of all indexes describing specific space). Without loss of generality this mean that tuples constructed at the moment 'A' when only PK index was registered have another *format* than the tuples constructed at the moment 'B' after new Index "name" was added.
> 
> The concept idea with *offset_slot*s could be used for JSON Indexes too.
> 
> JSON-path definitely would be a part of *key_part* structure. As we have already noted, *offset_slots* are different for formats of different generations, so we should find an universal and fast way to obtain it by JSON. As code that access tuple data by index is a really hot, the performance is our priority problem.
> We already have *tuple_field_raw_by_path* that follows JSON path in tuple and returns pointer to requested field. Sometimes fields that nested in JSON-structured document may have coinciding path part:
> e.g.:
> ```
> s:create_index('name'', { parts = {{ "name.first", "str" }, {"name.last", "str"}}})
> s:create_index('name'', { parts = {{ "town", "str" }, {"name.last", "str"}}})
> ```
> We don't like to parse coinciding suffixes multiple time for same tuples.
> 
> ### 1.1. JSON-path tree in tuple format
> Keeping in mind all challenges we suppose to introduce tree-based structure in *tuple_format* that contain JSON-path lexemes and has allocated tuple *offset_slots* as leaves nodes.
> ```
>     [[ Img 1.1 - JSON-path tree in tuple format ]]
> 
>     ----\
>         |------> [name] \
>         |  {off. cache} |----> [first]      <slot1>
>         |               |      {off. cache}
>         |               |
>         |               |----> [last]       <slot2>
>         |                      {off. cache}
>         |
>         \------> [birthday]   <slot3>
>                  {off. cache}
> ```
> 
> During tuple creation we will also fill off_cache on intermediate tree nodes.

2. As I remember from verbal discussion, we've decided to do not store
offsets for intermediate nodes. It is too expensive. You actually purpose
to store an offset for each tuple field, even non-indexed. In such a case
the field_map would become bigger than the tuple payload. Field_map is
very expensive storage and should not store non-needed offsets. So you should
not have an offset on [name], on [birthday]. Only on [first] and [last].

What is more, this example do not match the code above. Firstly you
operate with name/town, then with name/birthday, then again with
name/town. It confuses. Do you have an index on 'birthday'? If not,
then why do you need an offset on it?

> For example, when first index part "name.first" is built, we would already have name offset and would able to start there parsing only "last" lexeme.
> ```
> s:create_index('name'', { parts = {{ "name.first", "str" },
>                                     { "name.last",  "str" }}})
> ```
> The resulting information offset would be stored in slot obtained from tree leaf node.
> 
> ```
>      [[ Img 1.2 - Tuple data stored in database of same format ]]
> 
>                                                                | Description  |
>     +=======================+======+===================+
>     | slot1 | slot2 | slot3 | town |       name        |       | tuple schema |
>     +=======+=======+=======+======+============+======+
>     |   2   |   3   |   0   | NY   | Ayn | Rand |              |    tuple1    |
>     +-----------------------+------+---------+--+------+
>     |   2   |   4   |   0   | NY   | Richard | Feynman |       |    tuple2    |
>     +-----------------------+------+---------+---------+
>                             ^      ^     ^   ^
>                             0      2     3   4                 |    offset    |

                                                             ^
3. The table is malformed. Some problems with white spaces. 3 points
into the middle of 'Richard', additional '+' up to 'Feynman' etc.

4. Offset in tuple_format is number of offset slot in a tuple. Index
in the field_map array actually. So in the format offset slots always
are sequence of natural numbers + 0: 0, 1, 2, 3 ... with no holes. So
I do not understand why do you have 0, 2, 3, 4 in the format instead
of 0, 1, 2, 3.

> ```
> 
> ### 1.2. Access data by JSON-path
> The central API to access information is a routine
> ```
> static inline const char *
> tuple_field_raw(const struct tuple_format *format, const char *tuple,
>                  const uint32_t *field_map, uint32_t field_no)
> ```
> 
> In most scenarios related to indexes it is called with *field_no* extracted from *key_part* - a descriptor of part representing index.
> ```
> field_b = tuple_field_raw(format_b, tuple_b_raw, field_map_b, part->fieldno);
> ```
> With introducing parts containing JSON path we need to rework such calls.
> 
> To avoid parsing JSON path each time we suppose to introduce a cache for *offset_slot* in *key_part* structure:
> ```
> struct key_part {
>    /** Tuple field index for this part */
>    uint32_t fieldno;
>    ...
> 
>    /** Data JSON path. */
>    const char *data_path;
>    /** Epoch of tuple_format cached in offset_slot. */
>    uint32_t slot_epoch;
>    /** Cache of corresponding tuple_format slot_offset. */
>    int32_t slot_cache;

5. Looks like you have a problem with offset_slot/slot_offset naming. Please,
use one of them, not both.

> };
> ```
> same as *tuple_format* where such epoch would be initialized with a new bigger(when source key_parts have changed) value on creation.

6. What is the same as tuple_format? Looks like the first part of the
sentence is lost.

> ```
> struct tuple_format {
>    ...
> 
>    /** Epoch of tuple format. */
>    uint32_t epoch;
>    /** Offset slot tree for JSON indexes. */
>    TREE json_tree;

7. It is not for JSON indexes only and not for offsets only as well. It
is a tree to validate any JSON space format constructed from
space:format and indexes. And not all of the nodes will have offset
slots.

>    /** Formats of the fields. */
>    struct tuple_field fields[0];

8. This is the array of trees. It is not array + tree in a separate
field. You have array of trees where i-th tree describes format of
the i-th field and its internals. Some of tree-nodes have offsets
and some are just to validate the format. Do not forget that these
trees are going to be used for space:format validation. Offset_slot
is a part of tuple_field, even now, and is filled optionally if the
field is a part of an index.

> };
> ```
> 
> In all index-related scenarios with *tuple_field_raw* we would rework with a new function:
> ```
> const char *
> tuple_field_by_part(const struct tuple *tuple, const struct key_def *def,
>                      uint32_t idx);
> 
> ```
> 
> With first argument *tuple* we can obtain tuple_format that contain *epoch* field.
> ```
>    [[ Img 2.1 - Block diagram for tuple_field_by_part routine ]]
> 
>              [[def->epoch == tuple_format(tuple)->epoch]]
>                                   ||
>                    YES            ||             NO
>                   ________________/\________________
>                  |                                  |
> 
>       use offset_slot from             PARSE def->parts[idx]->data_path and
>    def->parts[idx].cached_slot   observe tuple_format(tuple)->json_tree structure,
>                                       UPDATE def->parts[idx].slot_epoch and
>                                             def->parts[idx].slot_cache

9. On 'NO' branch the key_part cache will oscillate. Example: you have a
tuple1 with format1 and key_part synced with format1. Then you add new
index and insert new tuples: tuple[2-N] with format2. Then index compares
the tuples. Each time when you compare tuple1 vs any of tuple[2-N] you
will change key_part cache version to the old one and make slower either
new tuples or old tuples. It is not ok.

Moreover you did not take into account that on comparison of tuples
you have two formats in a common case. Up to which will you update
key_part cache version?

I think that fast fix is to update cache version only to the biggest one.

And the best solution is find a way how to update key_defs on alter and
do not handle this case during comparison. We need Vova's opinion here.

> ```
> 
> PARSE is about parsing JSON path lexeme-by-lexeme and step down the json_tree until leaf with slot offset would reached.
> Then we need to UPDATE key_part cache.
> 
> 
> ## Rationale and alternatives
> 
> ### 2.1 JSON-path hash table instead of tree
> Instead of JSON-path tree in space format described in chapter (1.1) in previous section we could use hash table in format:
> ```
>     HASTABLE<const char *data_path, int32_t offset_slot>

10. HASTABLE? Maybe 'hash', not 'has'?

> 
>                              lookup
>     "name.first" ----> hash1 ------> [hash1, "name.first", offset_slot1]
>     "name.last"  ----> hash2 ------> [hash2, "name.last", offset_slot2]
> ```
> The only drawback of this approach is the inability to speed up the parsing of data on tuple creation based on the information obtained by parsing the previous paths.
> An as tuple insertion is rare frequent operation this makes sense.

11. The real problem of hashes is not linked with the one you described here.
Hash will work for any formats and will be used on comparison only. Not on
insertion. On insertion you need to decode the whole tuple to build offset
map and validate the format going along tuple field trees. Not by JSON path
of an index. On insertion you have no any paths. You have only tuple and the
array of trees of tuple fields. The real problem is that these paths can be
logically identical, but actually have some differences. For example, use
["ident"] in one index and .ident in another one.

> 
> ### 2.2 Re-allocatable slot_cache

12. No such word 're-allocatable'.

> Epoch-based invalidation is looking complicated enough.
> It is possible to allocate slot_cache[] array for each key_part and store slots for all epoch (where epoch would be an index in that array).
> ```
>                           format->epoch
> key_part {                     ||
>     slot_epoch[]:               \/
>          [1: s1, 2: s2, ... epoch: s3, ... max_epoch: sN]
> }
> 
> I.e.:
>      tuple_map[key_part->slot_epoch[format->epoch]]
> ```
> But as formats are created really often, this approach is really resource-consumable.
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes
  2018-07-30 13:45     ` Vladislav Shpilevoy
@ 2018-07-30 14:09       ` Kirill Shcherbatov
  2018-07-30 16:14       ` Kirill Shcherbatov
  1 sibling, 0 replies; 11+ messages in thread
From: Kirill Shcherbatov @ 2018-07-30 14:09 UTC (permalink / raw)
  To: tarantool-patches, Vladislav Shpilevoy, Vladimir Davydov


> 2. As I remember from verbal discussion, we've decided to do not store
> offsets for intermediate nodes. It is too expensive. You actually purpose
> to store an offset for each tuple field, even non-indexed. In such a case
> the field_map would become bigger than the tuple payload. Field_map is
> very expensive storage and should not store non-needed offsets. So you should
> not have an offset on [name], on [birthday]. Only on [first] and [last].
> 
Please, pay attention that we don't allocate offset_slot for intermediate node. 
This is the only cache for tuple_new operation. offset_slot only for leafs.
Without  this feature there is no difference with hash. 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes
  2018-07-30 13:45     ` Vladislav Shpilevoy
  2018-07-30 14:09       ` Kirill Shcherbatov
@ 2018-07-30 16:14       ` Kirill Shcherbatov
  2018-07-30 18:46         ` Vladislav Shpilevoy
  1 sibling, 1 reply; 11+ messages in thread
From: Kirill Shcherbatov @ 2018-07-30 16:14 UTC (permalink / raw)
  To: tarantool-patches, Vladislav Shpilevoy, Vladimir Davydov

Hi! Thank you for review. I've accounted all your suggestions.

On 30.07.2018 16:45, Vladislav Shpilevoy wrote:
> Hi! Thanks for the new version! See 12 comments below.
> 
> Vova, please, look at my comments and say what do you think.
We have verbally discuss all of this with Vova.

> 1. It is not correct. I can declare JSON index on enclosed arrays like this
Fields having complex document structure should have 'map' or 'array' type in format if specified.

> 2. As I remember from verbal discussion, we've decided to do not store
> offsets for intermediate nodes. It is too expensive. You actually purpose
> to store an offset for each tuple field, even non-indexed. In such a case
> the field_map would become bigger than the tuple payload. Field_map is
> very expensive storage and should not store non-needed offsets. So you should
> not have an offset on [name], on [birthday]. Only on [first] and [last].
I've already answered with previous letter that this is not slot_offset that allocated as a part of tuple. 
"off. cache" is only implementation-specific detail that allows start parsing with most relevant offset 
on tree traversal.

> What is more, this example do not match the code above. Firstly you
> operate with name/town, then with name/birthday, then again with
> name/town. It confuses. Do you have an index on 'birthday'? If not,
> then why do you need an offset on it?
I've fixed this example. And have improved all schemas to be more associative.
                                        ^
> 3. The table is malformed. Some problems with white spaces. 3 points
> into the middle of 'Richard', additional '+' up to 'Feynman' etc.
Uguhm, fixed "+"s; But this not a middle of offset; Take a look to a new version

> 4. Offset in tuple_format is number of offset slot in a tuple. Index
> in the field_map array actually. So in the format offset slots always
> are sequence of natural numbers + 0: 0, 1, 2, 3 ... with no holes. So
> I do not understand why do you have 0, 2, 3, 4 in the format instead
> of 0, 1, 2, 3.
This is some sample offsets; Same as (3).
> 5. Looks like you have a problem with offset_slot/slot_offset naming. Please,
> use one of them, not both.
Renamed all to offset_slot.

> 6. What is the same as tuple_format? Looks like the first part of the
> sentence is lost.
Fixed.
And extend *tuple_format* where such epoch would be initialized with a new bigger(when source key_parts have changed) value on creation.

> 7. It is not for JSON indexes only and not for offsets only as well. It
> is a tree to validate any JSON space format constructed from
> space:format and indexes. And not all of the nodes will have offset
> slots.
Fixed.

> 8. This is the array of trees. It is not array + tree in a separate
> field. You have array of trees where i-th tree describes format of
> the i-th field and its internals. Some of tree-nodes have offsets
> and some are just to validate the format. Do not forget that these
> trees are going to be used for space:format validation. Offset_slot
> is a part of tuple_field, even now, and is filled optionally if the
> field is a part of an index.
struct tuple_format {
  ...

  /** Epoch of tuple format. */
  uint32_t epoch;
  /** Array of data_path trees built for indexes. */
  TREE index_tree[0];
};
```

> 9. On 'NO' branch the key_part cache will oscillate. Example: you have a
> tuple1 with format1 and key_part synced with format1. Then you add new
> index and insert new tuples: tuple[2-N] with format2. Then index compares
> the tuples. Each time when you compare tuple1 vs any of tuple[2-N] you
> will change key_part cache version to the old one and make slower either
> new tuples or old tuples. It is not ok.
> 
> Moreover you did not take into account that on comparison of tuples
> you have two formats in a common case. Up to which will you update
> key_part cache version?
> 
> I think that fast fix is to update cache version only to the biggest one.
> 
> And the best solution is find a way how to update key_defs on alter and
> do not handle this case during comparison. We need Vova's opinion here.
Discussed with Vova, fixed.
PARSE def->parts[idx]->data_path and observe tuple_format(tuple)->json_tree structure,  
UPDATE def->parts[idx].slot_epoch and def->parts[idx].slot_cache IF format epoch is bigger

> 10. HASTABLE? Maybe 'hash', not 'has'?
Yep.

> 11. The real problem of hashes is not linked with the one you described here.
> Hash will work for any formats and will be used on comparison only. Not on
> insertion. On insertion you need to decode the whole tuple to build offset
> map and validate the format going along tuple field trees. Not by JSON path
> of an index. On insertion you have no any paths. You have only tuple and the
> array of trees of tuple fields. The real problem is that these paths can be
> logically identical, but actually have some differences. For example, use
> ["ident"] in one index and .ident in another one.
I've used yor explanation in RFC.

> 12. No such word 're-allocatable'.
Reallocatable.


Hm, perhaps it is the time to include you and Vova to RFC authors? Don't know does it matter.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes
  2018-07-30 16:14       ` Kirill Shcherbatov
@ 2018-07-30 18:46         ` Vladislav Shpilevoy
  2018-07-30 19:23           ` Kirill Shcherbatov
  0 siblings, 1 reply; 11+ messages in thread
From: Vladislav Shpilevoy @ 2018-07-30 18:46 UTC (permalink / raw)
  To: Kirill Shcherbatov, tarantool-patches, Vladimir Davydov



On 30/07/2018 19:14, Kirill Shcherbatov wrote:
> Hi! Thank you for review. I've accounted all your suggestions.
> 
> On 30.07.2018 16:45, Vladislav Shpilevoy wrote:
>> Hi! Thanks for the new version! See 12 comments below.
>>
>> Vova, please, look at my comments and say what do you think.
> We have verbally discuss all of this with Vova.

It is cool, but I do not know what you have discussed. I guess
such things should be done either via email, or verbally with
all participants.

> 
>> 2. As I remember from verbal discussion, we've decided to do not store
>> offsets for intermediate nodes. It is too expensive. You actually purpose
>> to store an offset for each tuple field, even non-indexed. In such a case
>> the field_map would become bigger than the tuple payload. Field_map is
>> very expensive storage and should not store non-needed offsets. So you should
>> not have an offset on [name], on [birthday]. Only on [first] and [last].
> I've already answered with previous letter that this is not slot_offset that allocated as a part of tuple.
> "off. cache" is only implementation-specific detail that allows start parsing with most relevant offset
> on tree traversal.

I still do not clearly understand what are you talking about. You can have
different offsets in the same path: 1) [1][2][3].field and 2) [1][2][3]["field"].
Here 'field' has offset 10 in the first case and 11 in the second
one.

If you want to use prefix length in off.cache on comparison to walk the
tuple_field trees along with the path in key_part on mismatch of cache
versions, then you should explain more clear how do you want to use
off.cache.

Lets suppose you have a key_part with a JSON path and a trees array.
To determine into which tree you first go to find offset_slot, you
should parse first path part. Same for each next part - you should
parse it to go down the tree. So you just do not know into which
tuple_field you go until the next part of the path is parsed. And how
does tuple_field.off_cache help here?

And what will you do when you met a format which does not have
an offset for the needed field? For example, I have created an
index, inserted multiple tuples, then created another index. The
format is changed, but the old tuples have the old format that
does not have an offset to the parts of the new index.

>> 8. This is the array of trees. It is not array + tree in a separate
>> field. You have array of trees where i-th tree describes format of
>> the i-th field and its internals. Some of tree-nodes have offsets
>> and some are just to validate the format. Do not forget that these
>> trees are going to be used for space:format validation. Offset_slot
>> is a part of tuple_field, even now, and is filled optionally if the
>> field is a part of an index.
> struct tuple_format {
>    ...
> 
>    /** Epoch of tuple format. */
>    uint32_t epoch;
>    /** Array of data_path trees built for indexes. */
>    TREE index_tree[0];

As I said, it is not special index tree. It can describe a space
format with tens of fields among which only one is indexed.
Index_tree is not correct name. It is tuple_field array as it is
now, where each field is just a struct tuple_field. And struct
tuple_field can contain more tuple_fields inside either as an array
(if the field type is array) or inside a tree/hash if it is map.

> };
> ```
> 
> Hm, perhaps it is the time to include you and Vova to RFC authors? Don't know does it matter.

Up to you %) If you want to be a single author, you are welcome.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes
  2018-07-30 18:46         ` Vladislav Shpilevoy
@ 2018-07-30 19:23           ` Kirill Shcherbatov
  2018-07-30 20:19             ` Vladislav Shpilevoy
  0 siblings, 1 reply; 11+ messages in thread
From: Kirill Shcherbatov @ 2018-07-30 19:23 UTC (permalink / raw)
  To: tarantool-patches, Vladislav Shpilevoy, Vladimir Davydov

Actual link
https://github.com/tarantool/tarantool/blob/db468f172e642d3d830018a20e545eff77c655e3/doc/rfc/1012-json-indexes.md

> I still do not clearly understand what are you talking about. You can have
> different offsets in the same path: 1) [1][2][3].field and 2) [1][2][3]["field"].
> Here 'field' has offset 10 in the first case and 11 in the second
> one.
> 
> If you want to use prefix length in off.cache on comparison to walk the
> tuple_field trees along with the path in key_part on mismatch of cache
> versions, then you should explain more clear how do you want to use
> off.cache.
> 
> Lets suppose you have a key_part with a JSON path and a trees array.
> To determine into which tree you first go to find offset_slot, you
> should parse first path part. Same for each next part - you should
> parse it to go down the tree. So you just do not know into which
> tuple_field you go until the next part of the path is parsed. And how
> does tuple_field.off_cache help here?
> 
> And what will you do when you met a format which does not have
> an offset for the needed field? For example, I have created an
> index, inserted multiple tuples, then created another index. The
> format is changed, but the old tuples have the old format that
> does not have an offset to the parts of the new index.
You are right. This feature won't work.

> As I said, it is not special index tree. It can describe a space
> format with tens of fields among which only one is indexed.
> Index_tree is not correct name. It is tuple_field array as it is
> now, where each field is just a struct tuple_field. And struct
> tuple_field can contain more tuple_fields inside either as an array
> (if the field type is array) or inside a tree/hash if it is map.
Ok, fixed.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes
  2018-07-30 19:23           ` Kirill Shcherbatov
@ 2018-07-30 20:19             ` Vladislav Shpilevoy
  2018-07-30 23:17               ` Vladislav Shpilevoy
  0 siblings, 1 reply; 11+ messages in thread
From: Vladislav Shpilevoy @ 2018-07-30 20:19 UTC (permalink / raw)
  To: Kirill Shcherbatov, tarantool-patches, Vladimir Davydov

Thanks for the fixes! Now I have no any major remark for
the RFC and the implementation method as well.

On 30/07/2018 22:23, Kirill Shcherbatov wrote:
> Actual link
> https://github.com/tarantool/tarantool/blob/db468f172e642d3d830018a20e545eff77c655e3/doc/rfc/1012-json-indexes.md
> 
>> I still do not clearly understand what are you talking about. You can have
>> different offsets in the same path: 1) [1][2][3].field and 2) [1][2][3]["field"].
>> Here 'field' has offset 10 in the first case and 11 in the second
>> one.
>>
>> If you want to use prefix length in off.cache on comparison to walk the
>> tuple_field trees along with the path in key_part on mismatch of cache
>> versions, then you should explain more clear how do you want to use
>> off.cache.
>>
>> Lets suppose you have a key_part with a JSON path and a trees array.
>> To determine into which tree you first go to find offset_slot, you
>> should parse first path part. Same for each next part - you should
>> parse it to go down the tree. So you just do not know into which
>> tuple_field you go until the next part of the path is parsed. And how
>> does tuple_field.off_cache help here?
>>
>> And what will you do when you met a format which does not have
>> an offset for the needed field? For example, I have created an
>> index, inserted multiple tuples, then created another index. The
>> format is changed, but the old tuples have the old format that
>> does not have an offset to the parts of the new index.
> You are right. This feature won't work.
> 
>> As I said, it is not special index tree. It can describe a space
>> format with tens of fields among which only one is indexed.
>> Index_tree is not correct name. It is tuple_field array as it is
>> now, where each field is just a struct tuple_field. And struct
>> tuple_field can contain more tuple_fields inside either as an array
>> (if the field type is array) or inside a tree/hash if it is map.
> Ok, fixed.
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes
  2018-07-30 20:19             ` Vladislav Shpilevoy
@ 2018-07-30 23:17               ` Vladislav Shpilevoy
  0 siblings, 0 replies; 11+ messages in thread
From: Vladislav Shpilevoy @ 2018-07-30 23:17 UTC (permalink / raw)
  To: Kirill Shcherbatov, tarantool-patches, Vladimir Davydov

I think we can resurrect hash of JSON path strings if
will store JSON strings in a canonical form. Hash allows
to improve performance for fields which have offsets in
the both old and new formats. For the old format we will
lookup by hash, for new we will use cache slot. It allows
to avoid going down the tuple_field trees and to decode
JSON on each comparison.

It is easy to convert a path into a canonical form. For
example, you can convert each identifier in the path
to ["ident"] form.

     .ident    -> ["ident"]
     ["ident"] -> unchanged
     [number]  -> unchanged

With these rules we can store all paths in the same
way.

On 30/07/2018 23:19, Vladislav Shpilevoy wrote:
> Thanks for the fixes! Now I have no any major remark for
> the RFC and the implementation method as well.
> 
> On 30/07/2018 22:23, Kirill Shcherbatov wrote:
>> Actual link
>> https://github.com/tarantool/tarantool/blob/db468f172e642d3d830018a20e545eff77c655e3/doc/rfc/1012-json-indexes.md
>>
>>> I still do not clearly understand what are you talking about. You can have
>>> different offsets in the same path: 1) [1][2][3].field and 2) [1][2][3]["field"].
>>> Here 'field' has offset 10 in the first case and 11 in the second
>>> one.
>>>
>>> If you want to use prefix length in off.cache on comparison to walk the
>>> tuple_field trees along with the path in key_part on mismatch of cache
>>> versions, then you should explain more clear how do you want to use
>>> off.cache.
>>>
>>> Lets suppose you have a key_part with a JSON path and a trees array.
>>> To determine into which tree you first go to find offset_slot, you
>>> should parse first path part. Same for each next part - you should
>>> parse it to go down the tree. So you just do not know into which
>>> tuple_field you go until the next part of the path is parsed. And how
>>> does tuple_field.off_cache help here?
>>>
>>> And what will you do when you met a format which does not have
>>> an offset for the needed field? For example, I have created an
>>> index, inserted multiple tuples, then created another index. The
>>> format is changed, but the old tuples have the old format that
>>> does not have an offset to the parts of the new index.
>> You are right. This feature won't work.
>>
>>> As I said, it is not special index tree. It can describe a space
>>> format with tens of fields among which only one is indexed.
>>> Index_tree is not correct name. It is tuple_field array as it is
>>> now, where each field is just a struct tuple_field. And struct
>>> tuple_field can contain more tuple_fields inside either as an array
>>> (if the field type is array) or inside a tree/hash if it is map.
>> Ok, fixed.
>>
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-07-30 23:17 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-26 11:15 [tarantool-patches] [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes Kirill Shcherbatov
2018-07-27 15:10 ` Vladimir Davydov
2018-07-30 12:02   ` [tarantool-patches] " Kirill Shcherbatov
2018-07-30 12:21     ` Vladimir Davydov
2018-07-30 13:45     ` Vladislav Shpilevoy
2018-07-30 14:09       ` Kirill Shcherbatov
2018-07-30 16:14       ` Kirill Shcherbatov
2018-07-30 18:46         ` Vladislav Shpilevoy
2018-07-30 19:23           ` Kirill Shcherbatov
2018-07-30 20:19             ` Vladislav Shpilevoy
2018-07-30 23:17               ` Vladislav Shpilevoy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox