* [tarantool-patches] [PATCH v1 1/5] rfc: describe a Tarantool JSON indexes
2018-08-06 12:26 [tarantool-patches] [PATCH v1 0/5] box: indexes by JSON path Kirill Shcherbatov
@ 2018-08-06 12:26 ` Kirill Shcherbatov
2018-08-06 12:26 ` [tarantool-patches] [PATCH v1 2/5] box: introduce slot_cache in key_part Kirill Shcherbatov
` (4 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: Kirill Shcherbatov @ 2018-08-06 12:26 UTC (permalink / raw)
To: tarantool-patches; +Cc: v.shpilevoy, Kirill Shcherbatov
Part of #1012.
---
doc/rfc/1012-json-indexes.md | 188 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 188 insertions(+)
create mode 100644 doc/rfc/1012-json-indexes.md
diff --git a/doc/rfc/1012-json-indexes.md b/doc/rfc/1012-json-indexes.md
new file mode 100644
index 0000000..a603f9a
--- /dev/null
+++ b/doc/rfc/1012-json-indexes.md
@@ -0,0 +1,188 @@
+# Tarantool JSON Indexes
+
+* **Status**: In progress
+* **Start date**: 26-07-2018
+* **Authors**: Vladimir Davydov @locker vdavydov.dev@gmail.com, Vladislav Shpilevoy @Gerold103 v.shpilevoy@tarantool.org, Kirill Shcherbatov @kshcherbatov kshcherbatov@tarantool.org
+* **Issues**: [#1012](https://github.com/tarantool/tarantool/issues/1012)
+
+## Summary
+Tarantool *JSON Indexes* is a feature that allows indexing tuples of a space that have a consistent, document-like structure.
+
+
+## Background and motivation
+The box.index submodule provides read-only access to index definitions and index keys. Indexes are contained in the box.space.space-name.index array within each space object. They provide an API for ordered iteration over tuples. This API is a direct binding to the corresponding methods of index objects of type box.index in the storage engine.
+```
+s:create_index('primary', {parts = {{1, 'unsigned'}, {2, 'str'}}})
+```
+
+Sometimes space data has a complex but consistent document structure.
+```
+s:insert{1, {town = 'NY', name = {first = 'Richard', last = 'Feynman'}}}
+s:insert{2, {town = 'NY', name = {first = 'Ayn', last = 'Rand'}}}
+```
+Building an index on document fields would be an attractive feature that allows searching and iterating over such data.
+```
+s:create_index('name', { parts = {{ "name.first", "str" }, {"name.last", "str"}}})
+```
+
+Fields with a complex document structure should have the 'map' or 'array' type in the space format, if one is specified.
+
+
+## Detailed design
+All data in Tarantool is stored in atomic database objects called *tuples*. A tuple consists of a payload (user data represented as *MessagePack*) and extra information that describes how the database should operate on it.
+```
+ [[ Img 0.1 - Tarantool tuple structure ]]
+
+ +----------------------+-----------------------+-------------+
+ tuple_begin, ..., raw = | extra_size | offset N ... offset 1 | MessagePack |
+ | +----------------------+-----------------------+-------------+
+ | ^
+ +-----------------------------------------------------------------------data_offset
+```
+Indexed information must be accessible very quickly. For regular indexes on fields this is made possible by the *field_map* - data offsets stored before the user payload in each tuple. The metadata cells used to store those offsets, *offset_slot*s, are not the same for different *formats* (the aggregated information of all indexes describing a specific space). Without loss of generality this means that tuples constructed at moment 'A', when only the PK index was registered, have a different *format* than tuples constructed at moment 'B', after the new index "name" was added.
+
+The same *offset_slot* concept can be used for JSON indexes too.
+
+The JSON path would certainly become a part of the *key_part* structure. As we have already noted, *offset_slot*s differ between formats of different generations, so we need a universal and fast way to obtain a slot by JSON path. As code that accesses tuple data by index is really hot, performance is our priority.
+We already have *tuple_field_raw_by_path*, which follows a JSON path in a tuple and returns a pointer to the requested field. Fields nested in a JSON-structured document may have coinciding path prefixes, e.g.:
+```
+s:create_index('name', { parts = {{ "name.first", "str" }, {"name.last", "str"}}})
+s:create_index('town', { parts = {{ "town", "str" }, {"name.last", "str"}}})
+```
+We would not like to parse coinciding path parts multiple times for the same tuple.
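For illustration, following a JSON path means splitting it into lexemes first; two paths with a common prefix repeat that work for the shared lexemes. A minimal tokenizer sketch (a hypothetical helper, not Tarantool's actual parser; it ignores `[n]` array subscripts):

```cpp
#include <string>
#include <vector>

// Hypothetical helper (illustration only): split a dot-separated JSON
// path into lexemes. A real parser would also handle [n] subscripts.
static std::vector<std::string>
json_path_lexemes(const std::string &path)
{
    std::vector<std::string> lexemes;
    std::string cur;
    for (char c : path) {
        if (c == '.') {
            lexemes.push_back(cur);
            cur.clear();
        } else {
            cur.push_back(c);
        }
    }
    lexemes.push_back(cur);
    return lexemes;
}
```

With this split, "name.first" and "name.last" both begin with the lexeme "name", which is exactly the work we would like to do once.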
+
+### 1.1. JSON-path tree in tuple format
+Keeping all these challenges in mind, we propose to introduce a tree-based structure in *tuple_format* that contains JSON-path lexemes and has allocated tuple *offset_slot*s as leaf nodes.
+```
+ [[ Img 1.1 - JSON-path tree in tuple format ]]
+
+ ----\
+ |-----> [name] \
+ | |----> [first] <slot1 = -3>
+ | |
+ | |----> [last] <slot2 = -2>
+ |
+ \------> [town] <slot3 = -1>
+```
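The tree above can be sketched in a few lines. The structure and names below are illustrative only, not the actual Tarantool types:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Sketch of the JSON-path tree from Img 1.1 (illustrative names).
struct PathNode {
    std::map<std::string, PathNode> children;
    int32_t offset_slot = 0; /* 0 = no slot; set only on leaves */
};

// Insert a path (as a lexeme sequence), reusing nodes for shared
// prefixes; allocate a new negative slot for a new leaf.
static int32_t
path_tree_add(PathNode &root, const std::vector<std::string> &lexemes,
              int32_t &next_slot)
{
    PathNode *node = &root;
    for (const std::string &lex : lexemes)
        node = &node->children[lex]; /* shared prefixes reuse nodes */
    if (node->offset_slot == 0)
        node->offset_slot = next_slot--; /* slots grow down: -1, -2, ... */
    return node->offset_slot;
}
```

Adding "name.first", "name.last" and "town" produces a root with only two children, with the "name" node shared by the two nested leaves.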
+
+For example, when the first index part "name.first" has been processed, we would already have the offset of "name" and would be able to continue from there, parsing only the "last" lexeme.
+```
+s:create_index('name', { parts = {{ "name.first", "str" },
+ { "name.last", "str" }}})
+```
+The resulting field offset would be stored in the slot obtained from the tree leaf node.
+
+```
+ [[ Img 1.2 - Tuple data stored in database of same format ]]
+
+ | Description |
+
+ {-3} {-2} {-1} | slot index |
+ +=======+=======+=======+======+===================+
+ | slot1 | slot2 | slot3 | town | name | | tuple schema |
+ +=======+=======+=======+======+=====+======+======+
+ | 2 | 4 | 0 | NY | Ayn | Rand | | tuple1 data |
+ +-------+-------+-------+------+-----+---+--+------+
+ | 2 | 5 | 0 | NY | Richard | Feynman | | tuple2 data |
+ +-------+-------+-------+------+---------+---------+
+ ^ ^ ^
+ 0 2 4 | tuple1 offset |
+
+ ^ ^ ^
+ 0 2 5 | tuple2 offset |
+```
+
+### 1.2. Access data by JSON-path
+The central API for accessing this information is the routine
+```
+static inline const char *
+tuple_field_raw(const struct tuple_format *format, const char *tuple,
+ const uint32_t *field_map, uint32_t field_no)
+```
+
+In most index-related scenarios it is called with a *field_no* extracted from a *key_part* - the descriptor of one part of an index.
+```
+field_b = tuple_field_raw(format_b, tuple_b_raw, field_map_b, part->fieldno);
+```
+With the introduction of parts containing a JSON path, we need to rework such calls.
+
+To avoid parsing the JSON path each time, we propose to introduce an *offset_slot* cache in the *key_part* structure:
+```
+struct key_part {
+ /** Tuple field index for this part */
+ uint32_t fieldno;
+ ...
+
+ /** Data JSON path. */
+ const char *data_path;
+ /** Epoch of tuple_format cached in offset_slot. */
+ uint32_t slot_epoch;
+ /** Cache of corresponding tuple_format offset_slot. */
+ int32_t slot_cache;
+};
+```
+And extend *tuple_format* with such an epoch, which would be initialized with a new, greater value on creation (when the source key_parts have changed).
+```
+struct tuple_format {
+ ...
+
+ /** Epoch of tuple format. */
+ uint32_t epoch;
+ /** Formats of the fields. */
+ struct tuple_field fields[0];
+};
+```
+We would modify tuple_field to have a tree structure where necessary, representing all intermediate records.
+
+In all index-related scenarios we would replace *tuple_field_raw* with a new function:
+```
+const char *
+tuple_field_by_part(const struct tuple *tuple, const struct key_def *def,
+ uint32_t idx);
+
+```
+
+From the first argument, *tuple*, we can obtain the tuple_format that contains the *epoch* field.
+```
+ [[ Img 2.1 - Block diagram for tuple_field_by_part routine ]]
+
+ [[def->epoch == tuple_format(tuple)->epoch]]
+ ||
+ YES || NO
+ ________________/\________________
+ | |
+
+ use offset_slot from PARSE def->parts[idx]->data_path and
+ def->parts[idx].cached_slot observe tuple_format(tuple)->fields structure,
+ UPDATE def->parts[idx].slot_epoch and
+ def->parts[idx].slot_cache IF format epoch is bigger
+```
+
+PARSE means parsing the JSON path lexeme by lexeme and stepping down the fields tree until the leaf with the offset slot is reached.
+Then we need to UPDATE the key_part cache IF the epoch has increased. This *may* make epoch oscillations less significant.
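The fast and slow paths of the diagram can be sketched as follows (a simplified model with hypothetical names; the real code operates on key_part and tuple_format):

```cpp
#include <cstdint>

// Simplified sketch of the epoch-validated cache from Img 2.1.
// 'parse' stands in for the PARSE step that walks the fields tree.
struct PartCache {
    uint32_t slot_epoch = 0;
    int32_t slot_cache = 0;
};

template <typename ResolveFn>
static int32_t
cached_offset_slot(PartCache &part, uint32_t format_epoch, ResolveFn parse)
{
    if (part.slot_epoch == format_epoch)
        return part.slot_cache; /* fast path: cache is still valid */
    int32_t slot = parse();     /* slow path: PARSE the JSON path */
    if (format_epoch > part.slot_epoch) {
        /* UPDATE the cache only when the format epoch is newer. */
        part.slot_epoch = format_epoch;
        part.slot_cache = slot;
    }
    return slot;
}
```

As long as tuples of the same format are processed in series, only the first access per epoch pays for the parse.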
+
+To speed up data access we propose to use a hashtable with all paths in the format:
+```
+ HASHTABLE<const char *data_path, int32_t offset_slot>
+
+ lookup
+ "[2]name.first" ----> hash1 ------> [hash1, "name.first", offset_slot1]
+ "[2]name.last" ----> hash2 ------> [hash2, "name.last", offset_slot2]
+```
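A minimal sketch of such a path table (illustrative names; the real table would live in tuple_format and use Tarantool's hash utilities):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical path table from the illustration above: the full JSON
// path (field prefix included) maps directly to an offset slot.
struct PathTable {
    std::unordered_map<std::string, int32_t> slots;

    void add(const std::string &path, int32_t slot) { slots[path] = slot; }

    /* Returns the slot, or 0 (no slot) when the path is unknown. */
    int32_t lookup(const std::string &path) const {
        auto it = slots.find(path);
        return it == slots.end() ? 0 : it->second;
    }
};
```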
+
+## Rationale and alternatives
+### 2.1 Reallocatable slot_cache
+Epoch-based invalidation looks rather complicated.
+It is possible to allocate a slot_cache[] array in each key_part and store slots for all epochs (where the epoch would be an index into that array).
+```
+ format->epoch
+key_part { ||
+ slot_epoch[]: \/
+ [1: s1, 2: s2, ... epoch: s3, ... max_epoch: sN]
+}
+
+I.e.:
+ tuple_map[key_part->slot_epoch[format->epoch]]
+```
+But as formats are created very often, this approach consumes too many resources.
--
2.7.4
^ permalink raw reply [flat|nested] 14+ messages in thread
* [tarantool-patches] [PATCH v1 2/5] box: introduce slot_cache in key_part
2018-08-06 12:26 [tarantool-patches] [PATCH v1 0/5] box: indexes by JSON path Kirill Shcherbatov
2018-08-06 12:26 ` [tarantool-patches] [PATCH v1 1/5] rfc: describe a Tarantool JSON indexes Kirill Shcherbatov
@ 2018-08-06 12:26 ` Kirill Shcherbatov
2018-08-08 22:03 ` [tarantool-patches] " Vladislav Shpilevoy
2018-08-06 12:27 ` [tarantool-patches] [PATCH v1 3/5] box: introduce path field " Kirill Shcherbatov
` (3 subsequent siblings)
5 siblings, 1 reply; 14+ messages in thread
From: Kirill Shcherbatov @ 2018-08-06 12:26 UTC (permalink / raw)
To: tarantool-patches; +Cc: v.shpilevoy, Kirill Shcherbatov
The same key_part can be used in multiple different formats,
so different field->offset_slot values would be allocated.
In most scenarios we work with a series of tuples of the same
format, and (in general) a format lookup for a field would be
an expensive operation for JSON paths defined in a key_part.
A new slot_cache field in the key_part structure and an
epoch-based mechanism to validate its actuality should be an
effective approach to improving performance.
A new routine, tuple_field_by_part, uses a tuple and a key_part
to access a field, which allows reworking and speeding up all
scenarios of accessing tuple data by index.
This also allows working with JSON-path key_parts later.
Part of #1012.
---
src/box/key_def.c | 2 ++
src/box/key_def.h | 4 ++++
src/box/memtx_bitset.c | 4 ++--
src/box/memtx_rtree.c | 2 +-
src/box/tuple_bloom.c | 8 ++++----
src/box/tuple_compare.cc | 48 +++++++++++++++++---------------------------
src/box/tuple_extract_key.cc | 13 +++---------
src/box/tuple_format.c | 43 +++++++++++++++++++++++++++++++++++++++
src/box/tuple_format.h | 15 ++++++++++++++
src/box/tuple_hash.cc | 28 ++++++++++++--------------
src/box/tuple_hash.h | 5 +++--
src/box/vy_stmt.h | 2 +-
12 files changed, 109 insertions(+), 65 deletions(-)
diff --git a/src/box/key_def.c b/src/box/key_def.c
index ee09dc9..8a4262b 100644
--- a/src/box/key_def.c
+++ b/src/box/key_def.c
@@ -258,6 +258,8 @@ key_def_set_part(struct key_def *def, uint32_t part_no, uint32_t fieldno,
def->parts[part_no].type = type;
def->parts[part_no].coll = coll;
def->parts[part_no].coll_id = coll_id;
+ def->parts[part_no].slot_cache = TUPLE_OFFSET_SLOT_NIL;
+ def->parts[part_no].format_epoch = 0;
column_mask_set_fieldno(&def->column_mask, fieldno);
/**
* When all parts are set, initialize the tuple
diff --git a/src/box/key_def.h b/src/box/key_def.h
index aecbe03..42c054c 100644
--- a/src/box/key_def.h
+++ b/src/box/key_def.h
@@ -74,6 +74,10 @@ struct key_part {
struct coll *coll;
/** True if a part can store NULLs. */
bool is_nullable;
+ /** Format epoch for slot_cache. */
+ uint64_t format_epoch;
+ /** Cache for corresponding tuple_format slot_offset. */
+ int32_t slot_cache;
};
struct key_def;
diff --git a/src/box/memtx_bitset.c b/src/box/memtx_bitset.c
index a665f1a..e9b5f6a 100644
--- a/src/box/memtx_bitset.c
+++ b/src/box/memtx_bitset.c
@@ -283,8 +283,8 @@ memtx_bitset_index_replace(struct index *base, struct tuple *old_tuple,
}
if (new_tuple != NULL) {
- const char *field;
- field = tuple_field(new_tuple, base->def->key_def->parts[0].fieldno);
+ const char *field =
+ tuple_field_by_part(new_tuple, base->def->key_def, 0);
uint32_t key_len;
const void *key = make_key(field, &key_len);
#ifndef OLD_GOOD_BITSET
diff --git a/src/box/memtx_rtree.c b/src/box/memtx_rtree.c
index 0b12cda..fd6334f 100644
--- a/src/box/memtx_rtree.c
+++ b/src/box/memtx_rtree.c
@@ -112,7 +112,7 @@ extract_rectangle(struct rtree_rect *rect, const struct tuple *tuple,
struct index_def *index_def)
{
assert(index_def->key_def->part_count == 1);
- const char *elems = tuple_field(tuple, index_def->key_def->parts[0].fieldno);
+ const char *elems = tuple_field_by_part(tuple, index_def->key_def, 0);
unsigned dimension = index_def->opts.dimension;
uint32_t count = mp_decode_array(&elems);
return mp_decode_rect(rect, dimension, elems, count, "Field");
diff --git a/src/box/tuple_bloom.c b/src/box/tuple_bloom.c
index ffad151..3de83bc 100644
--- a/src/box/tuple_bloom.c
+++ b/src/box/tuple_bloom.c
@@ -85,8 +85,8 @@ tuple_bloom_builder_add(struct tuple_bloom_builder *builder,
uint32_t total_size = 0;
for (uint32_t i = 0; i < key_def->part_count; i++) {
- total_size += tuple_hash_key_part(&h, &carry, tuple,
- &key_def->parts[i]);
+ total_size +=
+ tuple_hash_key_part(&h, &carry, tuple, key_def, i);
if (i < hashed_parts) {
/*
* This part is already in the bloom, proceed
@@ -183,8 +183,8 @@ tuple_bloom_maybe_has(const struct tuple_bloom *bloom,
uint32_t total_size = 0;
for (uint32_t i = 0; i < key_def->part_count; i++) {
- total_size += tuple_hash_key_part(&h, &carry, tuple,
- &key_def->parts[i]);
+ total_size +=
+ tuple_hash_key_part(&h, &carry, tuple, key_def, i);
uint32_t hash = PMurHash32_Result(h, carry, total_size);
if (!bloom_maybe_has(&bloom->parts[i], hash))
return false;
diff --git a/src/box/tuple_compare.cc b/src/box/tuple_compare.cc
index e53afba..6d34c60 100644
--- a/src/box/tuple_compare.cc
+++ b/src/box/tuple_compare.cc
@@ -432,9 +432,8 @@ tuple_common_key_parts(const struct tuple *tuple_a,
{
uint32_t i;
for (i = 0; i < key_def->part_count; i++) {
- const struct key_part *part = &key_def->parts[i];
- const char *field_a = tuple_field(tuple_a, part->fieldno);
- const char *field_b = tuple_field(tuple_b, part->fieldno);
+ const char *field_a = tuple_field_by_part(tuple_a, key_def, i);
+ const char *field_b = tuple_field_by_part(tuple_b, key_def, i);
enum mp_type a_type = field_a != NULL ?
mp_typeof(*field_a) : MP_NIL;
enum mp_type b_type = field_b != NULL ?
@@ -442,8 +441,9 @@ tuple_common_key_parts(const struct tuple *tuple_a,
if (a_type == MP_NIL && b_type == MP_NIL)
continue;
if (a_type == MP_NIL || b_type == MP_NIL ||
- tuple_compare_field_with_hint(field_a, a_type,
- field_b, b_type, part->type, part->coll) != 0)
+ tuple_compare_field_with_hint(field_a, a_type, field_b,
+ b_type, key_def->parts[i].type,
+ key_def->parts[i].coll) != 0)
break;
}
return i;
@@ -457,7 +457,7 @@ tuple_compare_slowpath(const struct tuple *tuple_a, const struct tuple *tuple_b,
assert(!has_optional_parts || is_nullable);
assert(is_nullable == key_def->is_nullable);
assert(has_optional_parts == key_def->has_optional_parts);
- const struct key_part *part = key_def->parts;
+ struct key_part *part = (struct key_part *)key_def->parts;
const char *tuple_a_raw = tuple_data(tuple_a);
const char *tuple_b_raw = tuple_data(tuple_b);
if (key_def->part_count == 1 && part->fieldno == 0) {
@@ -484,10 +484,6 @@ tuple_compare_slowpath(const struct tuple *tuple_a, const struct tuple *tuple_b,
}
bool was_null_met = false;
- const struct tuple_format *format_a = tuple_format(tuple_a);
- const struct tuple_format *format_b = tuple_format(tuple_b);
- const uint32_t *field_map_a = tuple_field_map(tuple_a);
- const uint32_t *field_map_b = tuple_field_map(tuple_b);
const struct key_part *end;
const char *field_a, *field_b;
enum mp_type a_type, b_type;
@@ -497,11 +493,10 @@ tuple_compare_slowpath(const struct tuple *tuple_a, const struct tuple *tuple_b,
else
end = part + key_def->part_count;
- for (; part < end; part++) {
- field_a = tuple_field_raw(format_a, tuple_a_raw, field_map_a,
- part->fieldno);
- field_b = tuple_field_raw(format_b, tuple_b_raw, field_map_b,
- part->fieldno);
+ uint32_t i = 0;
+ for (; part < end; part++, i++) {
+ field_a = tuple_field_by_part(tuple_a, key_def, i);
+ field_b = tuple_field_by_part(tuple_b, key_def, i);
assert(has_optional_parts ||
(field_a != NULL && field_b != NULL));
if (! is_nullable) {
@@ -547,11 +542,9 @@ tuple_compare_slowpath(const struct tuple *tuple_a, const struct tuple *tuple_b,
* extended parts only.
*/
end = key_def->parts + key_def->part_count;
- for (; part < end; ++part) {
- field_a = tuple_field_raw(format_a, tuple_a_raw, field_map_a,
- part->fieldno);
- field_b = tuple_field_raw(format_b, tuple_b_raw, field_map_b,
- part->fieldno);
+ for (; part < end; ++part, ++i) {
+ field_a = tuple_field_by_part(tuple_a, key_def, i);
+ field_b = tuple_field_by_part(tuple_b, key_def, i);
/*
* Extended parts are primary, and they can not
* be absent or be NULLs.
@@ -576,15 +569,11 @@ tuple_compare_with_key_slowpath(const struct tuple *tuple, const char *key,
assert(has_optional_parts == key_def->has_optional_parts);
assert(key != NULL || part_count == 0);
assert(part_count <= key_def->part_count);
- const struct key_part *part = key_def->parts;
- const struct tuple_format *format = tuple_format(tuple);
- const char *tuple_raw = tuple_data(tuple);
- const uint32_t *field_map = tuple_field_map(tuple);
+ struct key_part *part = (struct key_part *)key_def->parts;
enum mp_type a_type, b_type;
+ uint32_t i = 0;
if (likely(part_count == 1)) {
- const char *field;
- field = tuple_field_raw(format, tuple_raw, field_map,
- part->fieldno);
+ const char *field = tuple_field_by_part(tuple, key_def, i);
if (! is_nullable) {
return tuple_compare_field(field, key, part->type,
part->coll);
@@ -607,10 +596,9 @@ tuple_compare_with_key_slowpath(const struct tuple *tuple, const char *key,
const struct key_part *end = part + part_count;
int rc;
- for (; part < end; ++part, mp_next(&key)) {
+ for (; part < end; ++i, ++part, mp_next(&key)) {
const char *field;
- field = tuple_field_raw(format, tuple_raw, field_map,
- part->fieldno);
+ field = tuple_field_by_part(tuple, key_def, i);
if (! is_nullable) {
rc = tuple_compare_field(field, key, part->type,
part->coll);
diff --git a/src/box/tuple_extract_key.cc b/src/box/tuple_extract_key.cc
index 880abb6..bbd87f5 100644
--- a/src/box/tuple_extract_key.cc
+++ b/src/box/tuple_extract_key.cc
@@ -101,18 +101,13 @@ tuple_extract_key_slowpath(const struct tuple *tuple,
assert(contains_sequential_parts ==
key_def_contains_sequential_parts(key_def));
assert(mp_sizeof_nil() == 1);
- const char *data = tuple_data(tuple);
uint32_t part_count = key_def->part_count;
uint32_t bsize = mp_sizeof_array(part_count);
- const struct tuple_format *format = tuple_format(tuple);
- const uint32_t *field_map = tuple_field_map(tuple);
- const char *tuple_end = data + tuple->bsize;
+ const char *tuple_end = tuple_data(tuple) + tuple->bsize;
/* Calculate the key size. */
for (uint32_t i = 0; i < part_count; ++i) {
- const char *field =
- tuple_field_raw(format, data, field_map,
- key_def->parts[i].fieldno);
+ const char *field = tuple_field_by_part(tuple, key_def, i);
if (has_optional_parts && field == NULL) {
bsize += mp_sizeof_nil();
continue;
@@ -152,9 +147,7 @@ tuple_extract_key_slowpath(const struct tuple *tuple,
}
char *key_buf = mp_encode_array(key, part_count);
for (uint32_t i = 0; i < part_count; ++i) {
- const char *field =
- tuple_field_raw(format, data, field_map,
- key_def->parts[i].fieldno);
+ const char *field = tuple_field_by_part(tuple, key_def, i);
if (has_optional_parts && field == NULL) {
key_buf = mp_encode_nil(key_buf);
continue;
diff --git a/src/box/tuple_format.c b/src/box/tuple_format.c
index 2e19d2e..40cd48a 100644
--- a/src/box/tuple_format.c
+++ b/src/box/tuple_format.c
@@ -29,12 +29,15 @@
* SUCH DAMAGE.
*/
#include "json/path.h"
+#include "tuple.h"
#include "tuple_format.h"
/** Global table of tuple formats */
struct tuple_format **tuple_formats;
static intptr_t recycled_format_ids = FORMAT_ID_NIL;
+static uint64_t format_epoch = 1;
+
static uint32_t formats_size = 0, formats_capacity = 0;
static const struct tuple_field tuple_field_default = {
@@ -69,6 +72,7 @@ tuple_format_create(struct tuple_format *format, struct key_def * const *keys,
format->fields[i] = tuple_field_default;
int current_slot = 0;
+ bool format_epoch_changed = false;
/* extract field type info */
for (uint16_t key_no = 0; key_no < key_count; ++key_no) {
@@ -135,6 +139,8 @@ tuple_format_create(struct tuple_format *format, struct key_def * const *keys,
is_sequential == false && part->fieldno > 0) {
field->offset_slot = --current_slot;
+ if (part->slot_cache != field->offset_slot)
+ format_epoch_changed = true;
}
}
}
@@ -148,6 +154,8 @@ tuple_format_create(struct tuple_format *format, struct key_def * const *keys,
return -1;
}
format->field_map_size = field_map_size;
+ if (format_epoch_changed)
+ format->epoch = ++format_epoch;
return 0;
}
@@ -233,6 +241,7 @@ tuple_format_alloc(struct key_def * const *keys, uint16_t key_count,
format->dict = dict;
tuple_dictionary_ref(dict);
}
+ format->epoch = format_epoch;
format->refs = 0;
format->id = FORMAT_ID_NIL;
format->field_count = field_count;
@@ -542,6 +551,40 @@ tuple_field_go_to_key(const char **field, const char *key, int len)
return -1;
}
+const char *
+tuple_field_by_part(const struct tuple *tuple, const struct key_def *def,
+ uint32_t idx)
+{
+ struct key_part *part = (struct key_part *)&def->parts[idx];
+ const struct tuple_format *format = tuple_format(tuple);
+ const char *data = tuple_data(tuple);
+ const uint32_t *field_map = tuple_field_map(tuple);
+
+ uint32_t field_no = part->fieldno;
+ if (likely(field_no < format->index_field_count)) {
+ assert(format->epoch > 0);
+ int32_t offset_slot = TUPLE_OFFSET_SLOT_NIL;
+ if (part->format_epoch != format->epoch) {
+ offset_slot = format->fields[field_no].offset_slot;
+ if (format->epoch > part->format_epoch) {
+ part->slot_cache = offset_slot;
+ part->format_epoch = format->epoch;
+ }
+ }
+ if (offset_slot != TUPLE_OFFSET_SLOT_NIL) {
+ return field_map[offset_slot] != 0 ?
+ data + field_map[offset_slot] : NULL;
+ }
+ }
+ ERROR_INJECT(ERRINJ_TUPLE_FIELD, return NULL);
+ uint32_t field_count = mp_decode_array(&data);
+ if (unlikely(field_no >= field_count))
+ return NULL;
+ for (uint32_t k = 0; k < field_no; k++)
+ mp_next(&data);
+ return data;
+}
+
int
tuple_field_raw_by_path(struct tuple_format *format, const char *tuple,
const uint32_t *field_map, const char *path,
diff --git a/src/box/tuple_format.h b/src/box/tuple_format.h
index c7dc48f..3cb1284 100644
--- a/src/box/tuple_format.h
+++ b/src/box/tuple_format.h
@@ -115,6 +115,8 @@ struct tuple_field {
* Tuple format describes how tuple is stored and information about its fields
*/
struct tuple_format {
+ /** Tuple format epoch. */
+ uint64_t epoch;
/** Virtual function table */
struct tuple_format_vtab vtab;
/** Pointer to engine-specific data. */
@@ -324,6 +326,19 @@ tuple_init_field_map(const struct tuple_format *format, uint32_t *field_map,
const char *tuple);
/**
+ * Get a field referred to by multipart index @def part @idx in
+ * @tuple.
+ *
+ * @param tuple object containing data to lookup.
+ * @param key_def multipart index to use.
+ * @param idx index of part to use.
+ * @retval field data if field exists or NULL
+ */
+const char *
+tuple_field_by_part(const struct tuple *tuple, const struct key_def *def,
+ uint32_t idx);
+
+/**
* Get a field at the specific position in this MessagePack array.
* Returns a pointer to MessagePack data.
* @param format tuple format
diff --git a/src/box/tuple_hash.cc b/src/box/tuple_hash.cc
index dee9be3..7e4f3e6 100644
--- a/src/box/tuple_hash.cc
+++ b/src/box/tuple_hash.cc
@@ -157,7 +157,7 @@ struct TupleHash
uint32_t h = HASH_SEED;
uint32_t carry = 0;
uint32_t total_size = 0;
- const char *field = tuple_field(tuple, key_def->parts->fieldno);
+ const char *field = tuple_field_by_part(tuple, key_def, 0);
TupleFieldHash<TYPE, MORE_TYPES...>::
hash(&field, &h, &carry, &total_size);
return PMurHash32_Result(h, carry, total_size);
@@ -169,7 +169,7 @@ struct TupleHash<FIELD_TYPE_UNSIGNED> {
static uint32_t hash(const struct tuple *tuple,
const struct key_def *key_def)
{
- const char *field = tuple_field(tuple, key_def->parts->fieldno);
+ const char *field = tuple_field_by_part(tuple, key_def, 0);
uint64_t val = mp_decode_uint(&field);
if (likely(val <= UINT32_MAX))
return val;
@@ -310,12 +310,12 @@ tuple_hash_null(uint32_t *ph1, uint32_t *pcarry)
uint32_t
tuple_hash_key_part(uint32_t *ph1, uint32_t *pcarry,
const struct tuple *tuple,
- const struct key_part *part)
+ const struct key_def *key_def, uint32_t i)
{
- const char *field = tuple_field(tuple, part->fieldno);
+ const char *field = tuple_field_by_part(tuple, key_def, i);
if (field == NULL)
return tuple_hash_null(ph1, pcarry);
- return tuple_hash_field(ph1, pcarry, &field, part->coll);
+ return tuple_hash_field(ph1, pcarry, &field, key_def->parts[i].coll);
}
template <bool has_optional_parts>
@@ -326,31 +326,29 @@ tuple_hash_slowpath(const struct tuple *tuple, const struct key_def *key_def)
uint32_t h = HASH_SEED;
uint32_t carry = 0;
uint32_t total_size = 0;
- uint32_t prev_fieldno = key_def->parts[0].fieldno;
- const char *field = tuple_field(tuple, key_def->parts[0].fieldno);
+ const struct key_part *part = &key_def->parts[0];
+ uint32_t prev_fieldno = part->fieldno;
+ const char *field = tuple_field_by_part(tuple, key_def, 0);
const char *end = (char *)tuple + tuple_size(tuple);
if (has_optional_parts && field == NULL) {
total_size += tuple_hash_null(&h, &carry);
} else {
- total_size += tuple_hash_field(&h, &carry, &field,
- key_def->parts[0].coll);
+ total_size += tuple_hash_field(&h, &carry, &field, part->coll);
}
for (uint32_t part_id = 1; part_id < key_def->part_count; part_id++) {
/* If parts of key_def are not sequential we need to call
* tuple_field. Otherwise, tuple is hashed sequentially without
* need of tuple_field
*/
- if (prev_fieldno + 1 != key_def->parts[part_id].fieldno) {
- field = tuple_field(tuple, key_def->parts[part_id].fieldno);
- }
+ if (prev_fieldno + 1 != part->fieldno)
+ field = tuple_field_by_part(tuple, key_def, part_id);
if (has_optional_parts && (field == NULL || field >= end)) {
total_size += tuple_hash_null(&h, &carry);
} else {
total_size +=
- tuple_hash_field(&h, &carry, &field,
- key_def->parts[part_id].coll);
+ tuple_hash_field(&h, &carry, &field, part->coll);
}
- prev_fieldno = key_def->parts[part_id].fieldno;
+ prev_fieldno = part->fieldno;
}
return PMurHash32_Result(h, carry, total_size);
diff --git a/src/box/tuple_hash.h b/src/box/tuple_hash.h
index aab8f54..90f80a8 100644
--- a/src/box/tuple_hash.h
+++ b/src/box/tuple_hash.h
@@ -64,7 +64,8 @@ tuple_hash_field(uint32_t *ph1, uint32_t *pcarry, const char **field,
* @param ph1 - pointer to running hash
* @param pcarry - pointer to carry
* @param tuple - tuple to hash
- * @param part - key part
+ * @param key_def - key definition
+ * @param i - index of part to hash
* @return size of processed data
*
* This function updates @ph1 and @pcarry.
@@ -72,7 +73,7 @@ tuple_hash_field(uint32_t *ph1, uint32_t *pcarry, const char **field,
uint32_t
tuple_hash_key_part(uint32_t *ph1, uint32_t *pcarry,
const struct tuple *tuple,
- const struct key_part *part);
+ const struct key_def *key_def, uint32_t i);
/**
* Calculates a common hash value for a tuple
diff --git a/src/box/vy_stmt.h b/src/box/vy_stmt.h
index e53f98c..0b485ad 100644
--- a/src/box/vy_stmt.h
+++ b/src/box/vy_stmt.h
@@ -665,7 +665,7 @@ static inline bool
vy_tuple_key_contains_null(const struct tuple *tuple, const struct key_def *def)
{
for (uint32_t i = 0; i < def->part_count; ++i) {
- const char *field = tuple_field(tuple, def->parts[i].fieldno);
+ const char *field = tuple_field_by_part(tuple, def, i);
if (field == NULL || mp_typeof(*field) == MP_NIL)
return true;
}
--
2.7.4
* [tarantool-patches] Re: [PATCH v1 2/5] box: introduce slot_cache in key_part
2018-08-06 12:26 ` [tarantool-patches] [PATCH v1 2/5] box: introduce slot_cache in key_part Kirill Shcherbatov
@ 2018-08-08 22:03 ` Vladislav Shpilevoy
2018-08-15 12:14 ` Kirill Shcherbatov
0 siblings, 1 reply; 14+ messages in thread
From: Vladislav Shpilevoy @ 2018-08-08 22:03 UTC (permalink / raw)
To: tarantool-patches, Kirill Shcherbatov
Thanks for the patch! See 8 comments below.
> diff --git a/src/box/tuple_compare.cc b/src/box/tuple_compare.cc
> index e53afba..6d34c60 100644
> --- a/src/box/tuple_compare.cc
> +++ b/src/box/tuple_compare.cc
> @@ -457,7 +457,7 @@ tuple_compare_slowpath(const struct tuple *tuple_a, const struct tuple *tuple_b,
> assert(!has_optional_parts || is_nullable);
> assert(is_nullable == key_def->is_nullable);
> assert(has_optional_parts == key_def->has_optional_parts);
> - const struct key_part *part = key_def->parts;
> + struct key_part *part = (struct key_part *)key_def->parts;
1. Why?
> const char *tuple_a_raw = tuple_data(tuple_a);
> const char *tuple_b_raw = tuple_data(tuple_b);
> if (key_def->part_count == 1 && part->fieldno == 0) {
> @@ -484,10 +484,6 @@ tuple_compare_slowpath(const struct tuple *tuple_a, const struct tuple *tuple_b,
> }
>
> bool was_null_met = false;
> - const struct tuple_format *format_a = tuple_format(tuple_a);
> - const struct tuple_format *format_b = tuple_format(tuple_b);
> - const uint32_t *field_map_a = tuple_field_map(tuple_a);
> - const uint32_t *field_map_b = tuple_field_map(tuple_b);
2. These things were needed to fetch the formats and tuple data once. Now you
look up the format and raw data for each part in the cycle below. So it is
slower by a factor of part_count. See the main comment below.
> const struct key_part *end;
> const char *field_a, *field_b;
> enum mp_type a_type, b_type;
> @@ -497,11 +493,10 @@ tuple_compare_slowpath(const struct tuple *tuple_a, const struct tuple *tuple_b,
> else
> end = part + key_def->part_count;
>
> - for (; part < end; part++) {
> - field_a = tuple_field_raw(format_a, tuple_a_raw, field_map_a,
> - part->fieldno);
> - field_b = tuple_field_raw(format_b, tuple_b_raw, field_map_b,
> - part->fieldno);
> + uint32_t i = 0;
> + for (; part < end; part++, i++) {
> + field_a = tuple_field_by_part(tuple_a, key_def, i);
> + field_b = tuple_field_by_part(tuple_b, key_def, i);
3. As I said when we discussed verbally how to implement the
comparators, you do not need to use the slot cache, epochs and other
complex JSON things to compare flat index keys with no JSON paths. You
should add a new template parameter to tuple_compare_slowpath, like
'is_flat' or something, and if it is false, then use the regular tuple_field
or tuple_field_raw. If it is true, then try to use the slot cache, epoch, etc.
Comparators are chosen once, when the key_def is created, and a template 'if'
is compile time, so these 'if's will cost nothing.
Now tuple_field_by_part is much slower than tuple_field_raw - it makes
an additional lookup of the format (in a huge array of tuple formats),
of the tuple data (an additional call), and of the field map (one more
call). Further, both tuple_field_by_part and tuple_field_raw do 3 'if's
and a lookup in field_map. But tuple_field_by_part also updates
slot_cache and format_epoch - that is +2 assignments. So tuple_field_raw
is faster. In this cycle, in sum, it is part_count times faster.
So please, do not use tuple_field_by_part for flat indexes. It has no
profit for them.
All the same applies to key extraction and hashing.
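The suggested compile-time dispatch could look roughly like this sketch (placeholder types and stub lookups, not the real Tarantool functions):

```cpp
#include <cstdint>

// Sketch of the reviewer's suggestion: make the field lookup strategy a
// compile-time template parameter, chosen once when the comparator is
// selected for a key_def.
struct Tuple;   // opaque placeholders for illustration
struct KeyDef;

static const char *
flat_lookup(const Tuple *, uint32_t) { return "flat"; }

static const char *
json_lookup(const Tuple *, const KeyDef *, uint32_t) { return "json"; }

template <bool is_flat>
static const char *
lookup_field(const Tuple *tuple, const KeyDef *def, uint32_t part_idx,
             uint32_t fieldno)
{
    if (is_flat) {
        /* Flat part: plain fieldno access, no slot-cache overhead. */
        return flat_lookup(tuple, fieldno);
    } else {
        /* JSON-path part: go through the epoch/slot-cache machinery. */
        return json_lookup(tuple, def, part_idx);
    }
}
```

Since `is_flat` is a template parameter, the branch is resolved at instantiation time and costs nothing at runtime, as the review points out.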
> assert(has_optional_parts ||
> (field_a != NULL && field_b != NULL));
> if (! is_nullable) {
> diff --git a/src/box/tuple_format.c b/src/box/tuple_format.c
> index 2e19d2e..40cd48a 100644
> --- a/src/box/tuple_format.c
> +++ b/src/box/tuple_format.c
> @@ -29,12 +29,15 @@
> * SUCH DAMAGE.
> */
> #include "json/path.h"
> +#include "tuple.h"
4. Tuple format should not depend on tuple, only vice versa. Please,
consult the other tuple_field methods to see how to resolve this
dependency cycle.
> #include "tuple_format.h"
>
> /** Global table of tuple formats */
> struct tuple_format **tuple_formats;
> static intptr_t recycled_format_ids = FORMAT_ID_NIL;
>
> +static uint64_t format_epoch = 1;
5. Looks like you've ignored my point that a global format
epoch will expire the epoch in all formats when just one
space is altered. Please, do not. The epoch should be one per
space.
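The per-space epoch scheme being asked for can be sketched as follows (a standalone illustration with hypothetical, trimmed-down structures - not the patch's actual code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One format (and hence one epoch) per space in this sketch. */
struct tuple_format {
	uint64_t epoch;		/* bumped when this format changes */
};

/* Per-key-part cache of an offset slot, as in the patch. */
struct key_part_cache {
	uint64_t format_epoch;	/* epoch the slot below was cached at */
	int32_t slot_cache;
};

/* Called on space alter: invalidates caches of this space only. */
static void
space_format_update_epoch(struct tuple_format *format)
{
	format->epoch++;
}

/*
 * With one epoch per space, altering space A leaves caches that
 * reference space B's format valid. A single global counter
 * would invalidate both on any alter.
 */
static bool
slot_cache_is_valid(const struct key_part_cache *cache,
		    const struct tuple_format *format)
{
	return cache->format_epoch == format->epoch;
}
```

The point of the review comment is visible in the last function: validity is checked against the format the cache belongs to, so unrelated spaces never see their caches expire.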
> +
> static uint32_t formats_size = 0, formats_capacity = 0;
>
> static const struct tuple_field tuple_field_default = {
> @@ -69,6 +72,7 @@ tuple_format_create(struct tuple_format *format, struct key_def * const *keys,
> format->fields[i] = tuple_field_default;
>
> int current_slot = 0;
> + bool format_epoch_changed = false;
6. Prefix flags with 'is_'.
>
> /* extract field type info */
> for (uint16_t key_no = 0; key_no < key_count; ++key_no) {
> @@ -542,6 +551,40 @@ tuple_field_go_to_key(const char **field, const char *key, int len)
> return -1;
> }
>
> +const char *
> +tuple_field_by_part(const struct tuple *tuple, const struct key_def *def,
> + uint32_t idx)
> +{
> + struct key_part *part = (struct key_part *)&def->parts[idx];
> + const struct tuple_format *format = tuple_format(tuple);
> + const char *data = tuple_data(tuple);
> + const uint32_t *field_map = tuple_field_map(tuple);
> +
> + uint32_t field_no = part->fieldno;
> + if (likely(field_no < format->index_field_count)) {
> + assert(format->epoch > 0);
> + int32_t offset_slot = TUPLE_OFFSET_SLOT_NIL;
> + if (part->format_epoch != format->epoch) {
> + offset_slot = format->fields[field_no].offset_slot;
> + if (format->epoch > part->format_epoch) {
> + part->slot_cache = offset_slot;
> + part->format_epoch = format->epoch;
> + }
> + }
> + if (offset_slot != TUPLE_OFFSET_SLOT_NIL) {
> + return field_map[offset_slot] != 0 ?
> + data + field_map[offset_slot] : NULL;
> + }
> + }
> + ERROR_INJECT(ERRINJ_TUPLE_FIELD, return NULL);
7. Why do you need it?
> + uint32_t field_count = mp_decode_array(&data);
> + if (unlikely(field_no >= field_count))
> + return NULL;
> + for (uint32_t k = 0; k < field_no; k++)
> + mp_next(&data);
> + return data;
> +}
> +
> int
> tuple_field_raw_by_path(struct tuple_format *format, const char *tuple,
> const uint32_t *field_map, const char *path,
> diff --git a/src/box/tuple_format.h b/src/box/tuple_format.h
> index c7dc48f..3cb1284 100644
> --- a/src/box/tuple_format.h
> +++ b/src/box/tuple_format.h
> @@ -115,6 +115,8 @@ struct tuple_field {
> * Tuple format describes how tuple is stored and information about its fields
> */
> struct tuple_format {
> + /** Tuple format epoch. */
8. Too small comment for such complex feature.
> + uint64_t epoch;
> /** Virtual function table */
> struct tuple_format_vtab vtab;
> /** Pointer to engine-specific data. */
^ permalink raw reply [flat|nested] 14+ messages in thread
* [tarantool-patches] Re: [PATCH v1 2/5] box: introduce slot_cache in key_part
2018-08-08 22:03 ` [tarantool-patches] " Vladislav Shpilevoy
@ 2018-08-15 12:14 ` Kirill Shcherbatov
0 siblings, 0 replies; 14+ messages in thread
From: Kirill Shcherbatov @ 2018-08-15 12:14 UTC (permalink / raw)
To: tarantool-patches, Vladislav Shpilevoy
>> - const struct key_part *part = key_def->parts;
>> + struct key_part *part = (struct key_part *)key_def->parts;
> 1. Why?
Fixed.
> 2. These things were needed to get the format and the tuple data once. Now
> you look up the format and raw data for each part in the loop below, so it
> is part_count times slower. See the main comment below.
> 3. As I told you verbally when we discussed how to implement
> comparators: you do not need slot caches, epochs and other
> complex JSON machinery to compare flat index keys with no JSON
> paths. You should add a new template parameter to
> tuple_compare_slowpath, 'is_flat' or something like that; if it
> is true, use the regular tuple_field or tuple_field_raw, and if
> it is false, try the slot cache, epoch etc. Comparators are
> chosen once when a key_def is created and a template 'if' is
> resolved at compile time, so these 'if's will cost nothing.
>
> Now tuple_field_by_part is much slower than tuple_field_raw - it
> makes an additional lookup of the format (in a huge array of
> tuple formats), of the tuple data (an extra call) and of the
> field map (one more call). Further, both tuple_field_by_part and
> tuple_field_raw do 3 'if's and a lookup in field_map, but
> tuple_field_by_part also updates slot_cache and format_epoch -
> that is 2 extra assignments. So tuple_field_raw is faster, and
> over this whole loop it is part_count times faster in total.
>
> So please, do not use tuple_field_by_part for flat indexes. It
> has no profit for them.
>
> All the same applies to key extraction and hash.
Done as you've suggested.
> 4. Tuple format should not depend on tuple. Vice versa. Please,
> consult other tuple_field methods how to solve this dependency
> cycle.
There is no such problem already.
> 5. Looks like you've ignored my point that a global format
> epoch will expire the epoch in all formats when just one
> space is altered. Please, do not. The epoch should be one per
> space.
I've introduced a new function, space_format_update_epoch, that is called on alter.
> 6. Prefix flags with 'is_'.
Ok, fixed.
> 7. Why do you need it?
I've reworked this function a lot as part of the following patch.
In this patch it is now a wrapper that calls tuple_field_raw.
> 8. Too small comment for such complex feature.
I've written a bigger comment.
^ permalink raw reply [flat|nested] 14+ messages in thread
* [tarantool-patches] [PATCH v1 3/5] box: introduce path field in key_part
2018-08-06 12:26 [tarantool-patches] [PATCH v1 0/5] box: indexes by JSON path Kirill Shcherbatov
2018-08-06 12:26 ` [tarantool-patches] [PATCH v1 1/5] rfc: describe a Tarantool JSON indexes Kirill Shcherbatov
2018-08-06 12:26 ` [tarantool-patches] [PATCH v1 2/5] box: introduce slot_cache in key_part Kirill Shcherbatov
@ 2018-08-06 12:27 ` Kirill Shcherbatov
2018-08-08 22:03 ` [tarantool-patches] " Vladislav Shpilevoy
2018-08-06 12:27 ` [tarantool-patches] [PATCH v1 4/5] box: introduce path_hash and tuple_field tree Kirill Shcherbatov
` (2 subsequent siblings)
5 siblings, 1 reply; 14+ messages in thread
From: Kirill Shcherbatov @ 2018-08-06 12:27 UTC (permalink / raw)
To: tarantool-patches; +Cc: v.shpilevoy, Kirill Shcherbatov
As we need to store a user-defined JSON path in key_part
and key_part_def, we have introduced the path and path_len
fields. The JSON path is verified and transformed to its
canonical form when the index msgpack is unpacked.
Because the field names specified in a format can change, a
key_part path persisted in Tarantool must always start with a
first-level field access via an array index (not by name).
Part of #1012.
---
src/box/key_def.c | 197 +++++++++++++++++++++++++++++++++++++++++++++++----
src/box/key_def.h | 13 +++-
src/box/lua/space.cc | 5 ++
src/box/schema.cc | 8 +--
src/box/vy_log.c | 3 +-
5 files changed, 205 insertions(+), 21 deletions(-)
diff --git a/src/box/key_def.c b/src/box/key_def.c
index 8a4262b..79e07f8 100644
--- a/src/box/key_def.c
+++ b/src/box/key_def.c
@@ -35,12 +35,15 @@
#include "column_mask.h"
#include "schema_def.h"
#include "coll_id_cache.h"
+#include "fiber.h"
+#include "json/path.h"
static const struct key_part_def key_part_def_default = {
0,
field_type_MAX,
COLL_NONE,
false,
+ NULL
};
static int64_t
@@ -53,6 +56,7 @@ part_type_by_name_wrapper(const char *str, uint32_t len)
#define PART_OPT_FIELD "field"
#define PART_OPT_COLLATION "collation"
#define PART_OPT_NULLABILITY "is_nullable"
+#define PART_OPT_PATH "path"
const struct opt_def part_def_reg[] = {
OPT_DEF_ENUM(PART_OPT_TYPE, field_type, struct key_part_def, type,
@@ -61,6 +65,7 @@ const struct opt_def part_def_reg[] = {
OPT_DEF(PART_OPT_COLLATION, OPT_UINT32, struct key_part_def, coll_id),
OPT_DEF(PART_OPT_NULLABILITY, OPT_BOOL, struct key_part_def,
is_nullable),
+ OPT_DEF(PART_OPT_PATH, OPT_STRPTR, struct key_part_def, path),
OPT_END,
};
@@ -103,7 +108,27 @@ key_def_dup(const struct key_def *src)
return NULL;
}
memcpy(res, src, sz);
+ uint32_t i = 0;
+ for (; i < src->part_count; i++) {
+ if (src->parts[i].path == NULL) {
+ res->parts[i].path = NULL;
+ continue;
+ }
+ char *path = strdup(src->parts[i].path);
+ if (path == NULL) {
+ diag_set(OutOfMemory, src->parts[i].path_len + 1,
+ "strdup", "path");
+ goto error;
+ }
+ res->parts[i].path = path;
+ }
return res;
+
+error:
+ for (uint32_t j = 0; j < i; j++)
+ free((void *)res->parts[j].path);
+ free(res);
+ return NULL;
}
void
@@ -118,6 +143,8 @@ key_def_swap(struct key_def *old_def, struct key_def *new_def)
void
key_def_delete(struct key_def *def)
{
+ for (uint32_t i = 0; i < def->part_count; i++)
+ free((void *)def->parts[i].path);
free(def);
}
@@ -160,19 +187,34 @@ key_def_new_with_parts(struct key_part_def *parts, uint32_t part_count)
if (coll_id == NULL) {
diag_set(ClientError, ER_WRONG_INDEX_OPTIONS,
i + 1, "collation was not found by ID");
- key_def_delete(def);
- return NULL;
+ goto error;
}
coll = coll_id->coll;
}
+ char *path = NULL;
+ if (part->path != NULL &&
+ (path = strdup(part->path)) == NULL) {
+ diag_set(OutOfMemory, strlen(part->path) + 1, "strdup",
+ "path");
+ goto error;
+ }
key_def_set_part(def, i, part->fieldno, part->type,
- part->is_nullable, coll, part->coll_id);
+ part->is_nullable, coll, part->coll_id,
+ path);
}
return def;
+error:
+ /*
+ * Don't care about non-initialized fields as they are filled
+ * with 0 via calloc.
+ */
+ key_def_delete(def);
+ return NULL;
}
-void
-key_def_dump_parts(const struct key_def *def, struct key_part_def *parts)
+int
+key_def_dump_parts(struct region *pool, const struct key_def *def,
+ struct key_part_def *parts)
{
for (uint32_t i = 0; i < def->part_count; i++) {
const struct key_part *part = &def->parts[i];
@@ -181,7 +223,20 @@ key_def_dump_parts(const struct key_def *def, struct key_part_def *parts)
part_def->type = part->type;
part_def->is_nullable = part->is_nullable;
part_def->coll_id = part->coll_id;
+ if (part->path != NULL) {
+ part_def->path = region_alloc(pool, part->path_len + 1);
+ if (part_def->path == NULL) {
+ diag_set(OutOfMemory, part->path_len + 1,
+ "region_alloc", "part_def->path");
+ return -1;
+ }
+ memcpy(part_def->path, part->path, part->path_len);
+ part_def->path[part->path_len] = '\0';
+ } else {
+ part_def->path = NULL;
+ }
}
+ return 0;
}
box_key_def_t *
@@ -195,7 +250,7 @@ box_key_def_new(uint32_t *fields, uint32_t *types, uint32_t part_count)
key_def_set_part(key_def, item, fields[item],
(enum field_type)types[item],
key_part_def_default.is_nullable, NULL,
- COLL_NONE);
+ COLL_NONE, NULL);
}
return key_def;
}
@@ -241,6 +296,11 @@ key_part_cmp(const struct key_part *parts1, uint32_t part_count1,
if (part1->is_nullable != part2->is_nullable)
return part1->is_nullable <
part2->is_nullable ? -1 : 1;
+ /* Lexicographic strings order. */
+ uint32_t len = MIN(part1->path_len, part2->path_len);
+ int rc = 0;
+ if ((rc = strncmp(part1->path, part2->path, len)) != 0)
+ return rc;
}
return part_count1 < part_count2 ? -1 : part_count1 > part_count2;
}
@@ -248,7 +308,7 @@ key_part_cmp(const struct key_part *parts1, uint32_t part_count1,
void
key_def_set_part(struct key_def *def, uint32_t part_no, uint32_t fieldno,
enum field_type type, bool is_nullable, struct coll *coll,
- uint32_t coll_id)
+ uint32_t coll_id, char *path)
{
assert(part_no < def->part_count);
assert(type < field_type_MAX);
@@ -260,6 +320,8 @@ key_def_set_part(struct key_def *def, uint32_t part_no, uint32_t fieldno,
def->parts[part_no].coll_id = coll_id;
def->parts[part_no].slot_cache = TUPLE_OFFSET_SLOT_NIL;
def->parts[part_no].format_epoch = 0;
+ def->parts[part_no].path = path;
+ def->parts[part_no].path_len = path != NULL ? strlen(path) : 0;
column_mask_set_fieldno(&def->column_mask, fieldno);
/**
* When all parts are set, initialize the tuple
@@ -304,8 +366,15 @@ key_def_snprint_parts(char *buf, int size, const struct key_part_def *parts,
for (uint32_t i = 0; i < part_count; i++) {
const struct key_part_def *part = &parts[i];
assert(part->type < field_type_MAX);
- SNPRINT(total, snprintf, buf, size, "%d, '%s'",
- (int)part->fieldno, field_type_strs[part->type]);
+ if (part->path != NULL) {
+ SNPRINT(total, snprintf, buf, size, "%d, '%s', '%s'",
+ (int) part->fieldno, part->path,
+ field_type_strs[part->type]);
+ } else {
+ SNPRINT(total, snprintf, buf, size, "%d, '%s'",
+ (int) part->fieldno,
+ field_type_strs[part->type]);
+ }
if (i < part_count - 1)
SNPRINT(total, snprintf, buf, size, ", ");
}
@@ -324,6 +393,8 @@ key_def_sizeof_parts(const struct key_part_def *parts, uint32_t part_count)
count++;
if (part->is_nullable)
count++;
+ if (part->path != NULL)
+ count++;
size += mp_sizeof_map(count);
size += mp_sizeof_str(strlen(PART_OPT_FIELD));
size += mp_sizeof_uint(part->fieldno);
@@ -338,6 +409,10 @@ key_def_sizeof_parts(const struct key_part_def *parts, uint32_t part_count)
size += mp_sizeof_str(strlen(PART_OPT_NULLABILITY));
size += mp_sizeof_bool(part->is_nullable);
}
+ if (part->path != NULL) {
+ size += mp_sizeof_str(strlen(PART_OPT_PATH));
+ size += mp_sizeof_str(strlen(part->path));
+ }
}
return size;
}
@@ -351,6 +426,8 @@ key_def_encode_parts(char *data, const struct key_part_def *parts,
int count = 2;
if (part->coll_id != COLL_NONE)
count++;
+ if (part->path != NULL)
+ count++;
if (part->is_nullable)
count++;
data = mp_encode_map(data, count);
@@ -372,6 +449,12 @@ key_def_encode_parts(char *data, const struct key_part_def *parts,
strlen(PART_OPT_NULLABILITY));
data = mp_encode_bool(data, part->is_nullable);
}
+ if (part->path != NULL) {
+ data = mp_encode_str(data, PART_OPT_PATH,
+ strlen(PART_OPT_PATH));
+ data = mp_encode_str(data, part->path,
+ strlen(part->path));
+ }
}
return data;
}
@@ -432,6 +515,7 @@ key_def_decode_parts_166(struct key_part_def *parts, uint32_t part_count,
fields[part->fieldno].is_nullable :
key_part_def_default.is_nullable);
part->coll_id = COLL_NONE;
+ part->path = NULL;
}
return 0;
}
@@ -445,8 +529,9 @@ key_def_decode_parts(struct key_part_def *parts, uint32_t part_count,
return key_def_decode_parts_166(parts, part_count, data,
fields, field_count);
}
+ struct key_part_def *part;
for (uint32_t i = 0; i < part_count; i++) {
- struct key_part_def *part = &parts[i];
+ part = &parts[i];
if (mp_typeof(**data) != MP_MAP) {
diag_set(ClientError, ER_WRONG_INDEX_OPTIONS,
i + TUPLE_INDEX_BASE,
@@ -456,7 +541,7 @@ key_def_decode_parts(struct key_part_def *parts, uint32_t part_count,
*part = key_part_def_default;
if (opts_decode(part, part_def_reg, data,
ER_WRONG_INDEX_OPTIONS, i + TUPLE_INDEX_BASE,
- NULL) != 0)
+ &fiber()->gc) != 0)
return -1;
if (part->type == field_type_MAX) {
diag_set(ClientError, ER_WRONG_INDEX_OPTIONS,
@@ -473,8 +558,75 @@ key_def_decode_parts(struct key_part_def *parts, uint32_t part_count,
"string and scalar parts");
return -1;
}
+ if (part->path != NULL) {
+ struct region *region = &fiber()->gc;
+ size_t path_len = strlen(part->path);
+ struct json_path_parser parser;
+ struct json_path_node node;
+ json_path_parser_create(&parser, part->path, path_len);
+ /*
+ * A worst-case scenario is .a -> ["a"]
+ * i.e. 3*path_len + 1 is enough.
+ */
+ uint32_t size = region_used(region);
+ char *path =
+ region_alloc(region, 3 * path_len + 1);
+ if (path == NULL) {
+ diag_set(OutOfMemory, 3 * path_len + 1,
+ "region_alloc", "path");
+ return -1;
+ }
+ part->path = path;
+ int rc = json_path_next(&parser, &node);
+ if (rc != 0)
+ goto error_invalid_json;
+ if (node.type != JSON_PATH_NUM) {
+ diag_set(ClientError, ER_WRONG_INDEX_OPTIONS,
+ part->fieldno,
+ "invalid JSON path: first part should "
+ "be defined as array");
+ return -1;
+ }
+ if (node.num - TUPLE_INDEX_BASE != part->fieldno) {
+ diag_set(ClientError, ER_WRONG_INDEX_OPTIONS,
+ part->fieldno,
+ "invalid JSON path: first part refers "
+ "to invalid field");
+ return -1;
+ }
+ uint32_t lexemes = 0;
+ do {
+ if (node.type == JSON_PATH_NUM) {
+ path += sprintf(path, "[%u]",
+ (unsigned)node.num);
+ } else if (node.type == JSON_PATH_STR) {
+ path += sprintf(path, "[\"%.*s\"]",
+ node.len, node.str);
+ } else {
+ unreachable();
+ }
+ lexemes++;
+ } while ((rc = json_path_next(&parser, &node)) == 0 &&
+ node.type != JSON_PATH_END);
+ if (rc != 0 || node.type != JSON_PATH_END)
+ goto error_invalid_json;
+ /* JSON index is useless. */
+ if (lexemes == 1) {
+ region_truncate(region, size);
+ part->path = NULL;
+ } else {
+ region_truncate(region,
+ size + (path - part->path + 1));
+ }
+ }
}
return 0;
+
+error_invalid_json:
+ diag_set(ClientError, ER_WRONG_INDEX_OPTIONS,
+ part->fieldno + TUPLE_INDEX_BASE,
+ "invalid JSON path: path has invalid structure");
+ return -1;
}
int
@@ -497,6 +649,7 @@ key_def_decode_parts_160(struct key_part_def *parts, uint32_t part_count,
fields[part->fieldno].is_nullable :
key_part_def_default.is_nullable);
part->coll_id = COLL_NONE;
+ part->path = NULL;
}
return 0;
}
@@ -558,8 +711,15 @@ key_def_merge(const struct key_def *first, const struct key_def *second)
part = first->parts;
end = part + first->part_count;
for (; part != end; part++) {
+ char *path = NULL;
+ if (part->path != NULL && (path = strdup(part->path)) == NULL) {
+ diag_set(OutOfMemory, part->path_len + 1, "strdup",
+ "path");
+ goto error;
+ }
key_def_set_part(new_def, pos++, part->fieldno, part->type,
- part->is_nullable, part->coll, part->coll_id);
+ part->is_nullable, part->coll, part->coll_id,
+ path);
}
/* Set-append second key def's part to the new key def. */
@@ -568,10 +728,21 @@ key_def_merge(const struct key_def *first, const struct key_def *second)
for (; part != end; part++) {
if (key_def_find(first, part->fieldno))
continue;
+ char *path = NULL;
+ if (part->path != NULL && (path = strdup(part->path)) == NULL) {
+ diag_set(OutOfMemory, part->path_len + 1, "strdup",
+ "path");
+ goto error;
+ }
key_def_set_part(new_def, pos++, part->fieldno, part->type,
- part->is_nullable, part->coll, part->coll_id);
+ part->is_nullable, part->coll, part->coll_id,
+ path);
}
return new_def;
+
+error:
+ key_def_delete(new_def);
+ return NULL;
}
int
diff --git a/src/box/key_def.h b/src/box/key_def.h
index 42c054c..f14a928 100644
--- a/src/box/key_def.h
+++ b/src/box/key_def.h
@@ -54,6 +54,8 @@ struct key_part_def {
uint32_t coll_id;
/** True if a key part can store NULLs. */
bool is_nullable;
+ /** JSON path to data. */
+ char *path;
};
/**
@@ -78,6 +80,10 @@ struct key_part {
uint64_t format_epoch;
/** Cache for corresponding tuple_format slot_offset. */
int32_t slot_cache;
+ /** JSON path to data. */
+ const char *path;
+ /** JSON path length. */
+ uint32_t path_len;
};
struct key_def;
@@ -246,8 +252,9 @@ key_def_new_with_parts(struct key_part_def *parts, uint32_t part_count);
/**
* Dump part definitions of the given key def.
*/
-void
-key_def_dump_parts(const struct key_def *def, struct key_part_def *parts);
+int
+key_def_dump_parts(struct region *pool, const struct key_def *def,
+ struct key_part_def *parts);
/**
* Set a single key part in a key def.
@@ -256,7 +263,7 @@ key_def_dump_parts(const struct key_def *def, struct key_part_def *parts);
void
key_def_set_part(struct key_def *def, uint32_t part_no, uint32_t fieldno,
enum field_type type, bool is_nullable, struct coll *coll,
- uint32_t coll_id);
+ uint32_t coll_id, char *path);
/**
* Update 'has_optional_parts' of @a key_def with correspondence
diff --git a/src/box/lua/space.cc b/src/box/lua/space.cc
index 580e0ea..98bb969 100644
--- a/src/box/lua/space.cc
+++ b/src/box/lua/space.cc
@@ -295,6 +295,11 @@ lbox_fillspace(struct lua_State *L, struct space *space, int i)
lua_pushnumber(L, part->fieldno + TUPLE_INDEX_BASE);
lua_setfield(L, -2, "fieldno");
+ if (part->path != NULL) {
+ lua_pushstring(L, part->path);
+ lua_setfield(L, -2, "path");
+ }
+
lua_pushboolean(L, part->is_nullable);
lua_setfield(L, -2, "is_nullable");
diff --git a/src/box/schema.cc b/src/box/schema.cc
index 433f52c..a01126a 100644
--- a/src/box/schema.cc
+++ b/src/box/schema.cc
@@ -291,13 +291,13 @@ schema_init()
auto key_def_guard = make_scoped_guard([&] { key_def_delete(key_def); });
key_def_set_part(key_def, 0 /* part no */, 0 /* field no */,
- FIELD_TYPE_STRING, false, NULL, COLL_NONE);
+ FIELD_TYPE_STRING, false, NULL, COLL_NONE, NULL);
sc_space_new(BOX_SCHEMA_ID, "_schema", key_def, &on_replace_schema,
NULL);
/* _space - home for all spaces. */
key_def_set_part(key_def, 0 /* part no */, 0 /* field no */,
- FIELD_TYPE_UNSIGNED, false, NULL, COLL_NONE);
+ FIELD_TYPE_UNSIGNED, false, NULL, COLL_NONE, NULL);
/* _collation - collation description. */
sc_space_new(BOX_COLLATION_ID, "_collation", key_def,
@@ -345,10 +345,10 @@ schema_init()
diag_raise();
/* space no */
key_def_set_part(key_def, 0 /* part no */, 0 /* field no */,
- FIELD_TYPE_UNSIGNED, false, NULL, COLL_NONE);
+ FIELD_TYPE_UNSIGNED, false, NULL, COLL_NONE, NULL);
/* index no */
key_def_set_part(key_def, 1 /* part no */, 1 /* field no */,
- FIELD_TYPE_UNSIGNED, false, NULL, COLL_NONE);
+ FIELD_TYPE_UNSIGNED, false, NULL, COLL_NONE, NULL);
sc_space_new(BOX_INDEX_ID, "_index", key_def,
&alter_space_on_replace_index, &on_stmt_begin_index);
}
diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index 3843cad..b1c6659 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -711,7 +711,8 @@ vy_log_record_dup(struct region *pool, const struct vy_log_record *src)
"struct key_part_def");
goto err;
}
- key_def_dump_parts(src->key_def, dst->key_parts);
+ if (key_def_dump_parts(pool, src->key_def, dst->key_parts) != 0)
+ goto err;
dst->key_part_count = src->key_def->part_count;
dst->key_def = NULL;
}
--
2.7.4
^ permalink raw reply [flat|nested] 14+ messages in thread
* [tarantool-patches] Re: [PATCH v1 3/5] box: introduce path field in key_part
2018-08-06 12:27 ` [tarantool-patches] [PATCH v1 3/5] box: introduce path field " Kirill Shcherbatov
@ 2018-08-08 22:03 ` Vladislav Shpilevoy
2018-08-15 12:14 ` Kirill Shcherbatov
0 siblings, 1 reply; 14+ messages in thread
From: Vladislav Shpilevoy @ 2018-08-08 22:03 UTC (permalink / raw)
To: tarantool-patches, Kirill Shcherbatov
Thanks for the patch! See 8 comments below.
On 06/08/2018 15:27, Kirill Shcherbatov wrote:
> As we need to store user-defined JSON path in key_part
> and key_part_def, we have introduced path and path_len
> fields. JSON path is verified and transformed to canonical
> form on index msgpack unpack.
> Because of field names specified as format could be changed
> key_part path persisted in Tarantool should be always started
> with first-level field access via array index(not by name).
>
> Part of #1012.
> ---
> src/box/key_def.c | 197 +++++++++++++++++++++++++++++++++++++++++++++++----
> src/box/key_def.h | 13 +++-
> src/box/lua/space.cc | 5 ++
> src/box/schema.cc | 8 +--
> src/box/vy_log.c | 3 +-
> 5 files changed, 205 insertions(+), 21 deletions(-)
>
> diff --git a/src/box/key_def.c b/src/box/key_def.c
> index 8a4262b..79e07f8 100644
> --- a/src/box/key_def.c
> +++ b/src/box/key_def.c
> @@ -103,7 +108,27 @@ key_def_dup(const struct key_def *src)
> return NULL;
> }
> memcpy(res, src, sz);
> + uint32_t i = 0;
> + for (; i < src->part_count; i++) {
> + if (src->parts[i].path == NULL) {
> + res->parts[i].path = NULL;
> + continue;
> + }
> + char *path = strdup(src->parts[i].path);
1. Please, keep monolithic key_def memory layout.
> + if (path == NULL) {
> + diag_set(OutOfMemory, src->parts[i].path_len + 1,
> + "strdup", "path");
> + goto error;
> + }
> + res->parts[i].path = path;
> + }
> return res;
> +
> +error:
> + for (uint32_t j = 0; j < i; j++)
> + free((void *)res->parts[j].path);
> + free(res);
> + return NULL;
> }
>
> void
> @@ -241,6 +296,11 @@ key_part_cmp(const struct key_part *parts1, uint32_t part_count1,
> if (part1->is_nullable != part2->is_nullable)
> return part1->is_nullable <
> part2->is_nullable ? -1 : 1;
> + /* Lexicographic strings order. */
> + uint32_t len = MIN(part1->path_len, part2->path_len);
> + int rc = 0;
> + if ((rc = strncmp(part1->path, part2->path, len)) != 0)
2. As I have told you at least twice, strncmp treats non-equal strings
as equal when one is a prefix of the other. So this comparison
is not correct.
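The prefix pitfall, and a length tie-break that fixes it, fit in a few lines (a standalone sketch; the helper name is illustrative - note that memcmp over the shorter length alone has exactly the same problem as strncmp):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Compare two canonical JSON paths. Comparing only the first
 * min(len1, len2) bytes reports "[1]" and "[1][2]" as equal, so
 * equal prefixes must be disambiguated by length.
 */
static int
key_part_path_cmp(const char *path1, uint32_t path_len1,
		  const char *path2, uint32_t path_len2)
{
	uint32_t len = path_len1 < path_len2 ? path_len1 : path_len2;
	int rc = memcmp(path1, path2, len);
	if (rc != 0)
		return rc;
	/* A path that is a prefix of a longer one sorts first. */
	return path_len1 < path_len2 ? -1 : path_len1 > path_len2;
}
```

The first assertion in the test below demonstrates the bug the reviewer points at: strncmp over the shorter length cannot distinguish a path from its extension.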
> + return rc;
> }
> return part_count1 < part_count2 ? -1 : part_count1 > part_count2;
> }
> @@ -445,8 +529,9 @@ key_def_decode_parts(struct key_part_def *parts, uint32_t part_count,
> return key_def_decode_parts_166(parts, part_count, data,
> fields, field_count);
> }
> + struct key_part_def *part;
> for (uint32_t i = 0; i < part_count; i++) {
> - struct key_part_def *part = &parts[i];
3. Just do ++part on the previous line.
> + part = &parts[i];
> if (mp_typeof(**data) != MP_MAP) {
> diag_set(ClientError, ER_WRONG_INDEX_OPTIONS,
> i + TUPLE_INDEX_BASE,
> @@ -473,8 +558,75 @@ key_def_decode_parts(struct key_part_def *parts, uint32_t part_count,
> "string and scalar parts");
> return -1;
> }
> + if (part->path != NULL) {
> + struct region *region = &fiber()->gc;
> + size_t path_len = strlen(part->path);
> + struct json_path_parser parser;
> + struct json_path_node node;
> + json_path_parser_create(&parser, part->path, path_len);
> + /*
> + * A worst-case scenario is .a -> ["a"]
> + * i.e. 3*path_len + 1 is enough.
4. As I understand it, the len of ".a" is 2 and the len of ["a"] is 5. So
it is actually 2*len + 1, not 3*len + 1. Is it not?
> + */
> + uint32_t size = region_used(region);
5. You do not need to truncate the region yourself. This is done
later by commit/rollback. But you can reuse the previous region
allocations if you store the region memory pointer, together
with its size, outside of this loop. When a new path's
len * 2 + 1 is greater than the memory size you have, allocate
more; otherwise reuse it. In the best case this reduces the
number of allocations by a factor of part_count; in the worst
case it is no worse than your way.
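The buffer-reuse pattern the reviewer suggests can be sketched as follows (a standalone illustration; malloc/realloc stand in for the region allocator, and all names are hypothetical):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * One scratch buffer kept outside the per-part loop; it only
 * grows when the next path's worst-case size exceeds the
 * current capacity, so most iterations allocate nothing.
 */
struct scratch {
	char *buf;
	size_t capacity;
};

static char *
scratch_reserve(struct scratch *s, size_t need)
{
	if (need > s->capacity) {
		char *p = realloc(s->buf, need);
		if (p == NULL)
			return NULL;
		s->buf = p;
		s->capacity = need;
	}
	return s->buf;
}
```

A decoder loop would call `scratch_reserve(&s, worst_case_len(path))` once per part and canonicalize into the returned buffer; in the best case (paths of similar length) only the first iteration allocates.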
> + char *path =
> + region_alloc(region, 3 * path_len + 1);
> + if (path == NULL) {
> + diag_set(OutOfMemory, 3 * path_len + 1,
> + "region_alloc", "path");
> + return -1;
> + }
> + part->path = path;
> + int rc = json_path_next(&parser, &node);
> + if (rc != 0)
> + goto error_invalid_json;
> + if (node.type != JSON_PATH_NUM) {
> + diag_set(ClientError, ER_WRONG_INDEX_OPTIONS,
> + part->fieldno,
> + "invalid JSON path: first part should "
> + "be defined as array");
6. As array index. Not array.
> + return -1;
> + }
> + if (node.num - TUPLE_INDEX_BASE != part->fieldno) {
> + diag_set(ClientError, ER_WRONG_INDEX_OPTIONS,
> + part->fieldno,
> + "invalid JSON path: first part refers "
> + "to invalid field");
> + return -1;
> + }
> + uint32_t lexemes = 0;
> + do {
> + if (node.type == JSON_PATH_NUM) {
> + path += sprintf(path, "[%u]",
> + (unsigned)node.num);
> + } else if (node.type == JSON_PATH_STR) {
> + path += sprintf(path, "[\"%.*s\"]",
> + node.len, node.str);
> + } else {
> + unreachable();
> + }
> + lexemes++;
> + } while ((rc = json_path_next(&parser, &node)) == 0 &&
> + node.type != JSON_PATH_END);
> + if (rc != 0 || node.type != JSON_PATH_END)
> + goto error_invalid_json;
> + /* JSON index is useless. */
> + if (lexemes == 1) {
> + region_truncate(region, size);
> + part->path = NULL;
> + } else {
> + region_truncate(region,
> + size + (path - part->path + 1));
> + }
> + }
7. Please, reduce the indentation level. You can check
'if part->path == NULL then continue' and move this
code block 1 tab left. Even now you can join some of
these lines into one.
> }
> return 0;
> +
> +error_invalid_json:
> + diag_set(ClientError, ER_WRONG_INDEX_OPTIONS,
> + part->fieldno + TUPLE_INDEX_BASE,
> + "invalid JSON path: path has invalid structure");
8. Please, report at which position the error occurred.
> + return -1;
> }
>
> int
^ permalink raw reply [flat|nested] 14+ messages in thread
* [tarantool-patches] Re: [PATCH v1 3/5] box: introduce path field in key_part
2018-08-08 22:03 ` [tarantool-patches] " Vladislav Shpilevoy
@ 2018-08-15 12:14 ` Kirill Shcherbatov
0 siblings, 0 replies; 14+ messages in thread
From: Kirill Shcherbatov @ 2018-08-15 12:14 UTC (permalink / raw)
To: tarantool-patches, Vladislav Shpilevoy
> 1. Please, keep monolithic key_def memory layout.
Yep, good idea. New patch version attached.
> 2. As I have told you at least twice, strncmp treats non-equal strings
> as equal when one is a prefix of the other. So this comparison
> is not correct.
- if ((rc = strncmp(part1->path, part2->path, len)) != 0)
+ if ((rc = memcmp(part1->path, part2->path, len)) != 0)
> 3. Just do ++part on the previous line.
- struct key_part_def *part;
- for (uint32_t i = 0; i < part_count; i++) {
- part = &parts[i];
+ struct key_part_def *part = parts;
+ for (uint32_t i = 0; i < part_count; i++, part++) {
> 4. As I understand it, the len of ".a" is 2 and the len of ["a"] is 5. So
> it is actually 2*len + 1, not 3*len + 1. Is it not?
No, if I'm not mistaken, 2*len + 1 is not enough in the worst case:
Let |str| be the length of str; the new string is str', so |str'| is the length of the new string.
Let n_lex be the count of lexemes in the string (".a" being the minimal construction),
w_{i}^{ex} the whole length of original lexeme i, and w_{i} the length of its string
part without the special characters.
|str| = \sum_{i=1}^{n_lex} w_{i}^{ex} = \sum_{i=1}^{n_lex}(1 + w_{i}) = n_lex + \sum_{i=1}^{n_lex}w_{i} = n_lex + S
|str'| = \sum_{i=1}^{n_lex}(4 + w_{i}) =
4n_lex + \sum_{i=1}^{n_lex}w_{i} = 4n_lex + S
|str'| = 4n_lex + |str| - n_lex = 3n_lex + |str|
In the worst case, n_lex = \frac{|str|}{2}, so
|str'| = 3*\frac{|str|}{2} + |str| = \frac{5}{2}*|str| = 2|str| + \frac{1}{2}*|str|
That means that for |str| > 2: 2|str| + 1 < 2|str| + \frac{1}{2}*|str|.
We could use 3|str| instead of 3|str| + 1, but 1 extra byte costs nothing, and here
it serves as the terminating-zero slot.
e.g.
|.a.a.a.a| = 4*2 = n*2 = 8
|["a"]["a"]["a"]["a"]| = 4*5 = n*5 = 20
2*|str| + 1 = 2*8 + 1 = 17 < 20, i.e. a segfault
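The arithmetic above can be sanity-checked mechanically (a standalone sketch; the helper names are illustrative, modeling the worst case of n_lex lexemes ".x"):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Worst case: every lexeme is the shortest possible, ".x"
 * (2 bytes), which canonicalizes to ["x"] (5 bytes). So for
 * n_lex lexemes the original path is 2*n_lex bytes and the
 * canonical one is 5*n_lex bytes.
 */
static size_t
original_len(size_t n_lex)
{
	return 2 * n_lex;	/* |.x| == 2 per lexeme */
}

static size_t
canonical_len(size_t n_lex)
{
	return 5 * n_lex;	/* |["x"]| == 5 per lexeme */
}
```

For every n_lex >= 2 the canonical form overflows a 2*len + 1 buffer (5n > 4n + 1) while fitting, terminating zero included, in 3*len + 1 (5n + 1 <= 6n + 1), matching the email's n = 4 example of 20 bytes against a 17-byte buffer.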
> 5. You do not need to truncate the region yourself. This is done
> later by commit/rollback. But you can reuse the previous region
> allocations if you store the region memory pointer, together
> with its size, outside of this loop. When a new path's
> len * 2 + 1 is greater than the memory size you have, allocate
> more; otherwise reuse it. In the best case this reduces the
> number of allocations by a factor of part_count; in the worst
> case it is no worse than your way.
Good idea.
I now reuse the larger of the two buffers: the remainder of the new 3|str| + 1 string or the old |str| string.
> 6. As array index. Not array.
fixed.
- "be defined as array");
+ "be defined as array index");
> 7. Please, reduce the indentation level. You can check
> 'if part->path == NULL then continue' and move this
> code block 1 tab left. Even now you can join some of
> these lines into one.
That looks ugly, so instead I've moved all this code to a new function:
static int
key_def_normalize_json_path(struct region *region, struct key_part_def *part,
char **path_extra, uint32_t *path_extra_size);
> 8. Please, write on which position the error occurred.
ok, done.
^ permalink raw reply [flat|nested] 14+ messages in thread
* [tarantool-patches] [PATCH v1 4/5] box: introduce path_hash and tuple_field tree
2018-08-06 12:26 [tarantool-patches] [PATCH v1 0/5] box: indexes by JSON path Kirill Shcherbatov
` (2 preceding siblings ...)
2018-08-06 12:27 ` [tarantool-patches] [PATCH v1 3/5] box: introduce path field " Kirill Shcherbatov
@ 2018-08-06 12:27 ` Kirill Shcherbatov
2018-08-08 22:03 ` [tarantool-patches] " Vladislav Shpilevoy
2018-08-06 12:27 ` [tarantool-patches] [PATCH v1 5/5] box: specify indexes in user-friendly form Kirill Shcherbatov
2018-08-08 22:03 ` [tarantool-patches] Re: [PATCH v1 0/5] box: indexes by JSON path Vladislav Shpilevoy
5 siblings, 1 reply; 14+ messages in thread
From: Kirill Shcherbatov @ 2018-08-06 12:27 UTC (permalink / raw)
To: tarantool-patches; +Cc: v.shpilevoy, Kirill Shcherbatov
To work with JSON-defined indexes we introduce a format JSON
path hashtable, data_path, and a tree of intermediate path
fields attached to the format's root fields.
<Hashtable>: format->data_path
[2].FIO.fname -> field "fname" {type=str, off_slot=-1}
[2].FIO.sname -> field "sname" {type=str, off_slot=-2}
<Tree>: format->field[2] {type = map}
            |
           FIO {type = map}
        ____|____
       |         |
   "fname"    "sname"
   {type=str,    {type=str,
    off_slot=-1}  off_slot=-2}
Leaf fields used in an index have an initialized offset_slot.
On new tuple creation we walk the field tree and use the leaf
records to initialize the tuple's field_map.
On tuple data access by index we use the data_path hashtable
(when the cached offset_slot is invalid).
The new routine tuple_format_add_json_path constructs all
internal structures for a JSON path on new format creation
and duplication.
Part of #1012.
---
src/box/errcode.h | 2 +-
src/box/index_def.c | 8 +-
src/box/key_def.c | 4 +-
src/box/tuple_extract_key.cc | 23 +-
src/box/tuple_format.c | 531 +++++++++++++++++++++++++++++++++++++++++--
src/box/tuple_format.h | 20 +-
test/box/misc.result | 51 +++--
test/engine/tuple.result | 101 ++++++++
test/engine/tuple.test.lua | 27 +++
9 files changed, 704 insertions(+), 63 deletions(-)
diff --git a/src/box/errcode.h b/src/box/errcode.h
index 3d5f66a..8cbd59d 100644
--- a/src/box/errcode.h
+++ b/src/box/errcode.h
@@ -107,7 +107,7 @@ struct errcode_record {
/* 52 */_(ER_FUNCTION_EXISTS, "Function '%s' already exists") \
/* 53 */_(ER_BEFORE_REPLACE_RET, "Invalid return value of space:before_replace trigger: expected tuple or nil, got %s") \
/* 54 */_(ER_FUNCTION_MAX, "A limit on the total number of functions has been reached: %u") \
- /* 55 */_(ER_UNUSED4, "") \
+ /* 55 */_(ER_DATA_MISMATCH_INDEX_PART, "Tuple doesn't match document structure defined as index") \
/* 56 */_(ER_USER_MAX, "A limit on the total number of users has been reached: %u") \
/* 57 */_(ER_NO_SUCH_ENGINE, "Space engine '%s' does not exist") \
/* 58 */_(ER_RELOAD_CFG, "Can't set option '%s' dynamically") \
diff --git a/src/box/index_def.c b/src/box/index_def.c
index 9cda63c..a2c5bd4 100644
--- a/src/box/index_def.c
+++ b/src/box/index_def.c
@@ -209,8 +209,12 @@ index_def_is_valid(struct index_def *index_def, const char *space_name)
* Courtesy to a user who could have made
* a typo.
*/
- if (index_def->key_def->parts[i].fieldno ==
- index_def->key_def->parts[j].fieldno) {
+ struct key_part *part_a = &index_def->key_def->parts[i];
+ struct key_part *part_b = &index_def->key_def->parts[j];
+ if ((part_a->fieldno == part_b->fieldno &&
+ part_a->path == NULL && part_b->path == NULL) ||
+ (part_a->path != NULL && part_b->path != NULL &&
+ strcmp(part_a->path, part_b->path) == 0)) {
diag_set(ClientError, ER_MODIFY_INDEX,
index_def->name, space_name,
"same key part is indexed twice");
diff --git a/src/box/key_def.c b/src/box/key_def.c
index 79e07f8..c454377 100644
--- a/src/box/key_def.c
+++ b/src/box/key_def.c
@@ -583,8 +583,8 @@ key_def_decode_parts(struct key_part_def *parts, uint32_t part_count,
if (node.type != JSON_PATH_NUM) {
diag_set(ClientError, ER_WRONG_INDEX_OPTIONS,
part->fieldno,
- "invalid JSON path: first part should "
- "be defined as array");
+ "invalid JSON path: first path part "\
+ "should be defined as array");
return -1;
}
if (node.num - TUPLE_INDEX_BASE != part->fieldno) {
diff --git a/src/box/tuple_extract_key.cc b/src/box/tuple_extract_key.cc
index bbd87f5..6a2690b 100644
--- a/src/box/tuple_extract_key.cc
+++ b/src/box/tuple_extract_key.cc
@@ -4,14 +4,22 @@
enum { MSGPACK_NULL = 0xc0 };
+static bool
+key_def_parts_sequential(const struct key_def *def, int i)
+{
+ uint32_t fieldno1 = def->parts[i + 1].fieldno;
+ uint32_t fieldno2 = def->parts[i].fieldno + 1;
+ return fieldno1 == fieldno2 && def->parts[i].path == NULL &&
+ def->parts[i + 1].path == NULL;
+}
+
/** True, if a key can contain two or more parts in sequence. */
static bool
key_def_contains_sequential_parts(const struct key_def *def)
{
- for (uint32_t i = 0; i < def->part_count - 1; ++i) {
- if (def->parts[i].fieldno + 1 == def->parts[i + 1].fieldno)
+ for (uint32_t i = 0; i < def->part_count - 1; ++i)
+ if (key_def_parts_sequential(def, i))
return true;
- }
return false;
}
@@ -120,8 +128,7 @@ tuple_extract_key_slowpath(const struct tuple *tuple,
* minimize tuple_field_raw() calls.
*/
for (; i < part_count - 1; i++) {
- if (key_def->parts[i].fieldno + 1 !=
- key_def->parts[i + 1].fieldno) {
+ if (!key_def_parts_sequential(key_def, i)) {
/*
* End of sequential part.
*/
@@ -160,8 +167,7 @@ tuple_extract_key_slowpath(const struct tuple *tuple,
* minimize tuple_field_raw() calls.
*/
for (; i < part_count - 1; i++) {
- if (key_def->parts[i].fieldno + 1 !=
- key_def->parts[i + 1].fieldno) {
+ if (!key_def_parts_sequential(key_def, i)) {
/*
* End of sequential part.
*/
@@ -227,8 +233,7 @@ tuple_extract_key_slowpath_raw(const char *data, const char *data_end,
uint32_t fieldno = key_def->parts[i].fieldno;
uint32_t null_count = 0;
for (; i < key_def->part_count - 1; i++) {
- if (key_def->parts[i].fieldno + 1 !=
- key_def->parts[i + 1].fieldno)
+ if (!key_def_parts_sequential(key_def, i))
break;
}
uint32_t end_fieldno = key_def->parts[i].fieldno;
diff --git a/src/box/tuple_format.c b/src/box/tuple_format.c
index 40cd48a..f03978d 100644
--- a/src/box/tuple_format.c
+++ b/src/box/tuple_format.c
@@ -32,6 +32,34 @@
#include "tuple.h"
#include "tuple_format.h"
+path_hash_f path_hash;
+
+#define mh_name _field
+struct mh_field_key_t {
+ const char *path;
+ uint32_t path_len;
+ uint32_t path_hash;
+};
+#define mh_key_t struct mh_field_key_t *
+
+struct mh_field_node_t {
+ const char *path;
+ uint32_t path_len;
+ uint32_t path_hash;
+ struct tuple_field *field;
+};
+#define mh_node_t struct mh_field_node_t
+
+#define mh_arg_t void *
+#define mh_hash(a, arg) ((a)->path_hash)
+#define mh_hash_key(a, arg) mh_hash(a, arg)
+#define mh_cmp(a, b, arg) ((a)->path_len != (b)->path_len || \
+ memcmp((a)->path, (b)->path, (a)->path_len))
+#define mh_cmp_key(a, b, arg) mh_cmp(a, b, arg)
+#define MH_SOURCE 1
+#include "salad/mhash.h" /* Create mh_field_t hash. */
+
+
/** Global table of tuple formats */
struct tuple_format **tuple_formats;
static intptr_t recycled_format_ids = FORMAT_ID_NIL;
@@ -41,10 +69,347 @@ static uint64_t format_epoch = 1;
static uint32_t formats_size = 0, formats_capacity = 0;
static const struct tuple_field tuple_field_default = {
- FIELD_TYPE_ANY, TUPLE_OFFSET_SLOT_NIL, false, false,
+ FIELD_TYPE_ANY, TUPLE_OFFSET_SLOT_NIL, false, false, {{NULL, 0}}
};
/**
+ * Propagate @a field to MessagePack(field)[index].
+ * @param[in][out] field Field to propagate.
+ * @param index 1-based index to propagate to.
+ *
+ * @retval 0 Success, the index was found.
+ * @retval -1 Not found.
+ */
+static inline int
+tuple_field_go_to_index(const char **field, uint64_t index);
+
+/**
+ * Propagate @a field to MessagePack(field)[key].
+ * @param[in][out] field Field to propagate.
+ * @param key Key to propagate to.
+ * @param len Length of @a key.
+ *
+ * @retval 0 Success, the index was found.
+ * @retval -1 Not found.
+ */
+static inline int
+tuple_field_go_to_key(const char **field, const char *key, int len);
+
+/**
+ * Get @hashtable record by key @path, @path_len.
+ * @param hashtable to lookup,
+ * @param path string.
+ * @param path_len length of @path.
+ * @retval NULL on nothing found.
+ * @retval hashtable record pointer.
+ */
+static struct mh_field_node_t *
+json_path_hash_get(struct mh_field_t *hashtable, const char *path,
+ uint32_t path_len)
+{
+ assert(hashtable != NULL);
+ uint32_t hash = field_name_hash(path, path_len);
+ struct mh_field_key_t key = {path, path_len, hash};
+ mh_int_t rc = mh_field_find(hashtable, &key, NULL);
+ if (rc == mh_end(hashtable))
+ return NULL;
+ return mh_field_node(hashtable, rc);
+}
+
+/**
+ * Create a new hashtable object.
+ * @param[out] hashtable pointer to object to create.
+ * @retval -1 on error.
+ * @retval 0 on success.
+ */
+static int
+json_path_hash_create(struct mh_field_t **hashtable)
+{
+ assert(*hashtable == NULL);
+ *hashtable = mh_field_new();
+ if (*hashtable == NULL) {
+ diag_set(OutOfMemory, sizeof(*hashtable), "mh_field_new",
+ "hashtable");
+ return -1;
+ }
+ return 0;
+}
+
+/**
+ * Delete @hashtable object.
+ * @param hashtable pointer to object to delete.
+ * @param free_path release key path part.
+ */
+static void
+json_path_hash_delete(struct mh_field_t *hashtable, bool free_path)
+{
+ if (hashtable == NULL)
+ return;
+ while (mh_size(hashtable)) {
+ mh_int_t n = mh_first(hashtable);
+ if (free_path)
+ free((void *)mh_field_node(hashtable, n)->path);
+ mh_field_del(hashtable, n, NULL);
+ }
+ mh_field_delete(hashtable);
+}
+
+/**
+ * Insert a new record to hashtable.
+ * @param hashtable storage to insert new record.
+ * @param path string.
+ * @param path_len length of @path.
+ * @param tuple_field value to store in @hashtable.
+ * @retval -1 on error.
+ * @retval 0 on success.
+ */
+static int
+json_path_hash_insert(struct mh_field_t *hashtable, const char *path,
+ uint32_t path_len, struct tuple_field *field)
+{
+ assert(hashtable != NULL);
+ /* Test if record already present in hash. */
+ uint32_t path_hash = field_name_hash(path, path_len);
+ struct mh_field_node_t name_node =
+ {path, path_len, path_hash, field};
+ mh_int_t rc = mh_field_put(hashtable, &name_node, NULL, NULL);
+ if (rc == mh_end(hashtable)) {
+ diag_set(OutOfMemory, sizeof(*hashtable), "mh_field_put",
+ "hashtable");
+ return -1;
+ }
+ return 0;
+}
+
+/**
+ * Construct field tree level for JSON path part.
+ *
+ * @param[in, out] field_subtree pointer to the record to start
+ *                 from; changed to the record that matches the
+ *                 @part lexeme.
+ * @param fieldno number of root space field.
+ * @param part JSON path lexeme to represent in field tree.
+ * @retval -1 on error.
+ * @retval 0 on success.
+ */
+static int
+json_field_tree_append(struct tuple_field **field_subtree, uint32_t fieldno,
+ struct json_path_node *part)
+{
+ enum field_type type;
+ struct tuple_field *field = *field_subtree;
+ switch (part->type) {
+ case JSON_PATH_NUM: {
+ type = FIELD_TYPE_ARRAY;
+ if (field->type != FIELD_TYPE_ANY && field->type != type)
+ goto error_type_mistmatch;
+ field->type = type;
+ /* Create or resize field array if required. */
+ if (field->array == NULL) {
+ field->array =
+ calloc(part->num, sizeof(struct tuple_field *));
+ if (field->array == NULL) {
+ diag_set(OutOfMemory,
+ part->num * sizeof(struct tuple_field *),
+ "malloc", "field->array");
+ return -1;
+ }
+ field->type = type;
+ field->array_size = part->num;
+ } else if (part->num >= field->array_size) {
+ void *array =
+ realloc(field->array,
+ part->num * sizeof(struct tuple_field *));
+ if (array == NULL) {
+ diag_set(OutOfMemory,
+ sizeof(struct tuple_field *), "realloc",
+ "array");
+ return -1;
+ }
+ field->array = array;
+ memset(field->array + field->array_size, 0,
+        (part->num - field->array_size) *
+        sizeof(struct tuple_field *));
+ field->array_size = part->num;
+ }
+ /* Record already exists. No actions required */
+ if (field->array[part->num - TUPLE_INDEX_BASE] != NULL) {
+ *field_subtree =
+ field->array[part->num - TUPLE_INDEX_BASE];
+ return 0;
+ }
+ break;
+ }
+ case JSON_PATH_STR: {
+ type = FIELD_TYPE_MAP;
+ if (field->type != FIELD_TYPE_ANY && field->type != type)
+ goto error_type_mistmatch;
+ field->type = type;
+ if (field->map == NULL &&
+ json_path_hash_create(&field->map) != 0)
+ return -1;
+ struct mh_field_node_t *node =
+ json_path_hash_get(field->map, part->str, part->len);
+ if (node != NULL) {
+ *field_subtree = node->field;
+ return 0;
+ }
+ break;
+ }
+ default:
+ unreachable();
+ }
+
+ /* Construct and insert a new record. */
+ struct tuple_field *new_field =
+ malloc(sizeof(struct tuple_field));
+ if (new_field == NULL) {
+ diag_set(OutOfMemory, sizeof(struct tuple_field), "malloc",
+ "new_field");
+ return -1;
+ }
+ *new_field = tuple_field_default;
+ if (field->type == FIELD_TYPE_MAP) {
+ if (json_path_hash_insert(field->map, part->str, part->len,
+ new_field) != 0) {
+ free(new_field);
+ return -1;
+ }
+ } else if (field->type == FIELD_TYPE_ARRAY) {
+ field->array[part->num - TUPLE_INDEX_BASE] = new_field;
+ }
+ *field_subtree = new_field;
+
+ return 0;
+
+error_type_mistmatch:
+ diag_set(ClientError, ER_INDEX_PART_TYPE_MISMATCH,
+ tt_sprintf("%d", fieldno + TUPLE_INDEX_BASE),
+ field_type_strs[type], field_type_strs[field->type]);
+ return -1;
+}
+
+/**
+ * Delete @field_subtree object.
+ * @param field_subtree to delete.
+ */
+static void
+json_field_tree_delete(struct tuple_field *field_subtree)
+{
+ if (field_subtree->type == FIELD_TYPE_MAP &&
+ field_subtree->map != NULL) {
+ mh_int_t i;
+ mh_foreach(field_subtree->map, i) {
+ struct tuple_field *field =
+ mh_field_node(field_subtree->map, i)->field;
+ json_field_tree_delete(field);
+ free(field);
+ }
+ /*
+ * Field tree map records refer to string in
+ * format->path_hash.
+ */
+ json_path_hash_delete(field_subtree->map, false);
+ } else if (field_subtree->type == FIELD_TYPE_ARRAY &&
+ field_subtree->array != NULL) {
+ for (uint32_t i = 0; i < field_subtree->array_size; i++) {
+ json_field_tree_delete(field_subtree->array[i]);
+ free(field_subtree->array[i]);
+ }
+ free(field_subtree->array);
+ }
+}
+
+/**
+ * Add new JSON @path to @format.
+ * @param format to modify.
+ * @param path string to add.
+ * @param path_len length of @path.
+ * @param field_type type of field by @path.
+ * @param[out] leaf_field pointer to leaf field.
+ */
+static int
+tuple_format_add_json_path(struct tuple_format *format, const char *path,
+ uint32_t path_len, enum field_type type,
+ struct tuple_field **leaf_field)
+{
+
+ struct mh_field_node_t *leaf_node = NULL;
+ if (format->path_hash == NULL) {
+ /* Initialize path hashtable. */
+ if (json_path_hash_create(&format->path_hash) != 0)
+ return -1;
+ }
+
+ /*
+ * Get root field by index.
+ * Path is specified in canonical form: [i]...
+ */
+ int rc = 0;
+ struct json_path_parser parser;
+ struct json_path_node node;
+ json_path_parser_create(&parser, path, path_len);
+ rc = json_path_next(&parser, &node);
+ assert(rc == 0 && node.type == JSON_PATH_NUM);
+ assert(node.num < format->field_count + 1);
+ uint32_t root_fieldno = node.num - TUPLE_INDEX_BASE;
+ struct tuple_field *field = &format->fields[root_fieldno];
+
+ /* Test if path is already registered. */
+ if ((leaf_node = json_path_hash_get(format->path_hash, path,
+ path_len)) != NULL) {
+ if (leaf_node->field->type != type) {
+ const char *err =
+ tt_sprintf("JSON path '%.*s' has been already "
+ "constructed for '%s' leaf record",
+ path_len, path,
+ field_type_strs[
+ leaf_node->field->type]);
+ diag_set(ClientError, ER_WRONG_INDEX_OPTIONS,
+ node.num, err);
+ return -1;
+ }
+ *leaf_field = leaf_node->field;
+ /* Path already registered. */
+ return 0;
+ }
+
+ /*
+ * The hash table owns the string that the path tree
+ * refers to.
+ */
+ if ((path = strdup(path)) == NULL) {
+ diag_set(OutOfMemory, path_len + 1, "strdup", "path");
+ return -1;
+ }
+
+ /* Initialize dummy path_hash record. */
+ if (json_path_hash_insert(format->path_hash, path, path_len,
+ NULL) != 0) {
+ free((void *)path);
+ return -1;
+ }
+
+ /* Build data path tree. */
+ while ((rc = json_path_next(&parser, &node)) == 0 &&
+ node.type != JSON_PATH_END) {
+ if (json_field_tree_append(&field, root_fieldno, &node) != 0)
+ return -1;
+ }
+ /* The path must be a valid canonical JSON path. */
+ assert(rc == 0 && node.type == JSON_PATH_END);
+ /* The leaf record is a new object, since the JSON path is unique. */
+ field->type = type;
+
+ /* Finish hashtable record init. */
+ leaf_node = json_path_hash_get(format->path_hash, path, path_len);
+ assert(leaf_node != NULL && leaf_node->field == NULL);
+ leaf_node->field = field;
+
+ *leaf_field = field;
+ return 0;
+}
+
+/**
* Extract all available type info from keys and field
* definitions.
*/
@@ -66,6 +431,9 @@ tuple_format_create(struct tuple_format *format, struct key_def * const *keys,
format->fields[i].type = fields[i].type;
format->fields[i].offset_slot = TUPLE_OFFSET_SLOT_NIL;
format->fields[i].is_nullable = fields[i].is_nullable;
+ /* Don't need to init format->fields[i].map. */
+ format->fields[i].array = NULL;
+ format->fields[i].array_size = 0;
}
/* Initialize remaining fields */
for (uint32_t i = field_count; i < format->field_count; i++)
@@ -105,10 +473,12 @@ tuple_format_create(struct tuple_format *format, struct key_def * const *keys,
* used in tuple_format.
*/
if (field_type1_contains_type2(field->type,
- part->type)) {
+ part->type) &&
+ part->path == NULL) {
field->type = part->type;
} else if (! field_type1_contains_type2(part->type,
- field->type)) {
+ field->type) &&
+ part->path == NULL) {
const char *name;
int fieldno = part->fieldno + TUPLE_INDEX_BASE;
if (part->fieldno >= field_count) {
@@ -135,11 +505,26 @@ tuple_format_create(struct tuple_format *format, struct key_def * const *keys,
* First field is always simply accessible,
* so we don't store an offset for it.
*/
- if (field->offset_slot == TUPLE_OFFSET_SLOT_NIL &&
+ if (part->path != NULL) {
+ assert(is_sequential == false);
+ struct tuple_field *leaf_field = NULL;
+ if (tuple_format_add_json_path(format,
+ part->path,
+ part->path_len,
+ part->type,
+ &leaf_field) != 0)
+ return -1;
+ assert(leaf_field != NULL);
+ if (leaf_field->offset_slot ==
+ TUPLE_OFFSET_SLOT_NIL)
+ leaf_field->offset_slot = --current_slot;
+ if (part->slot_cache != current_slot)
+ format_epoch_changed = true;
+ } else if (field->offset_slot == TUPLE_OFFSET_SLOT_NIL &&
is_sequential == false && part->fieldno > 0) {
field->offset_slot = --current_slot;
- if (part->slot_cache != field->offset_slot)
+ if (part->slot_cache != current_slot)
format_epoch_changed = true;
}
}
@@ -248,6 +633,8 @@ tuple_format_alloc(struct key_def * const *keys, uint16_t key_count,
format->index_field_count = index_field_count;
format->exact_field_count = 0;
format->min_field_count = 0;
+ format->path_hash = NULL;
+ format->epoch = format_epoch;
return format;
}
@@ -255,6 +642,9 @@ tuple_format_alloc(struct key_def * const *keys, uint16_t key_count,
static inline void
tuple_format_destroy(struct tuple_format *format)
{
+ for (uint32_t i = 0; i < format->field_count; i++)
+ json_field_tree_delete(&format->fields[i]);
+ json_path_hash_delete(format->path_hash, true);
tuple_dictionary_unref(format->dict);
}
@@ -345,6 +735,32 @@ tuple_format_dup(struct tuple_format *src)
return NULL;
}
memcpy(format, src, total);
+
+ /* Fill with NULLs for normal destruction on error. */
+ format->path_hash = NULL;
+ for (uint32_t i = 0; i < format->field_count; i++) {
+ format->fields[i].array = NULL;
+ format->fields[i].array_size = 0;
+ }
+
+ if (src->path_hash != NULL) {
+ mh_int_t i;
+ mh_foreach(src->path_hash, i) {
+ struct mh_field_node_t *node =
+ mh_field_node(src->path_hash, i);
+ struct tuple_field *field = node->field;
+ int32_t offset_slot = field->offset_slot;
+ if (tuple_format_add_json_path(format, node->path,
+ node->path_len,
+ field->type,
+ &field) != 0)
+ goto error;
+ assert(field != NULL);
+ /* Store offset_slot in a new leaf record. */
+ field->offset_slot = offset_slot;
+ }
+ }
+
tuple_dictionary_ref(format->dict);
format->id = FORMAT_ID_NIL;
format->refs = 0;
@@ -354,6 +770,59 @@ tuple_format_dup(struct tuple_format *src)
return NULL;
}
return format;
+
+error:
+ tuple_format_destroy(format);
+ return NULL;
+}
+
+static int
+tuple_init_json_field_map(const struct tuple_field *field, uint32_t idx,
+ uint32_t *field_map, const char *tuple,
+ const char *offset)
+{
+ if (field->type == FIELD_TYPE_MAP) {
+ mh_int_t i;
+ mh_foreach(field->map, i) {
+ struct mh_field_node_t *node =
+ mh_field_node(field->map, i);
+ const char *raw = offset;
+ if (tuple_field_go_to_key(&raw, node->path,
+ node->path_len) != 0) {
+ diag_set(ClientError,
+ ER_DATA_MISMATCH_INDEX_PART);
+ return -1;
+ }
+ if (tuple_init_json_field_map(node->field, idx,
+ field_map, tuple,
+ raw) != 0)
+ return -1;
+ }
+ } else if (field->type == FIELD_TYPE_ARRAY) {
+ for (uint32_t i = 0; i < field->array_size; i++) {
+ struct tuple_field *child = field->array[i];
+ if (child == NULL)
+ continue;
+ const char *raw = offset;
+ if (tuple_field_go_to_index(&raw,
+     i + TUPLE_INDEX_BASE) != 0) {
+ diag_set(ClientError,
+ ER_DATA_MISMATCH_INDEX_PART);
+ return -1;
+ }
+ if (tuple_init_json_field_map(child, idx, field_map,
+ tuple, raw) != 0)
+ return -1;
+ }
+ } else {
+ /* Leaf field node. */
+ assert(field->offset_slot != TUPLE_OFFSET_SLOT_NIL);
+ if (key_mp_type_validate(field->type, mp_typeof(*offset),
+ ER_KEY_PART_TYPE, idx,
+ field->is_nullable) != 0)
+ return -1;
+ field_map[field->offset_slot] = (uint32_t)(offset - tuple);
+ }
+ return 0;
}
/** @sa declaration for details. */
@@ -411,6 +880,12 @@ tuple_init_field_map(const struct tuple_format *format, uint32_t *field_map,
field_map[field->offset_slot] =
(uint32_t) (pos - tuple);
}
+ if (field->map != NULL) {
+ assert(field->array != NULL);
+ if (tuple_init_json_field_map(field, i, field_map,
+ tuple, pos) != 0)
+ return -1;
+ }
mp_next(&pos);
}
return 0;
@@ -471,14 +946,6 @@ box_tuple_format_unref(box_tuple_format_t *format)
tuple_format_unref(format);
}
-/**
- * Propagate @a field to MessagePack(field)[index].
- * @param[in][out] field Field to propagate.
- * @param index 1-based index to propagate to.
- *
- * @retval 0 Success, the index was found.
- * @retval -1 Not found.
- */
static inline int
tuple_field_go_to_index(const char **field, uint64_t index)
{
@@ -517,15 +984,6 @@ tuple_field_go_to_index(const char **field, uint64_t index)
return -1;
}
-/**
- * Propagate @a field to MessagePack(field)[key].
- * @param[in][out] field Field to propagate.
- * @param key Key to propagate to.
- * @param len Length of @a key.
- *
- * @retval 0 Success, the index was found.
- * @retval -1 Not found.
- */
static inline int
tuple_field_go_to_key(const char **field, const char *key, int len)
{
@@ -565,17 +1023,33 @@ tuple_field_by_part(const struct tuple *tuple, const struct key_def *def,
assert(format->epoch > 0);
int32_t offset_slot = TUPLE_OFFSET_SLOT_NIL;
if (part->format_epoch != format->epoch) {
- offset_slot = format->fields[field_no].offset_slot;
+ /* Cache miss. */
+ if (part->path != NULL) {
+ struct mh_field_t *ht = format->path_hash;
+ struct mh_field_node_t *node =
+ json_path_hash_get(ht, part->path,
+ part->path_len);
+ assert(node != NULL);
+ offset_slot = node->field->offset_slot;
+ } else {
+ offset_slot =
+ format->fields[field_no].offset_slot;
+ }
+ /* Update cache only for greater epoch. */
if (format->epoch > part->format_epoch) {
part->slot_cache = offset_slot;
part->format_epoch = format->epoch;
}
+ } else {
+ /* Cache hit. */
+ offset_slot = part->slot_cache;
}
if (offset_slot != TUPLE_OFFSET_SLOT_NIL) {
return field_map[offset_slot] != 0 ?
data + field_map[offset_slot] : NULL;
}
}
+ assert(part->path == NULL);
ERROR_INJECT(ERRINJ_TUPLE_FIELD, return NULL);
uint32_t field_count = mp_decode_array(&data);
if (unlikely(field_no >= field_count))
@@ -593,6 +1067,17 @@ tuple_field_raw_by_path(struct tuple_format *format, const char *tuple,
{
assert(path_len > 0);
uint32_t fieldno;
+ if (format->path_hash != NULL) {
+ struct mh_field_node_t *ht_record =
+ json_path_hash_get(format->path_hash, path, path_len);
+ if (ht_record != NULL) {
+ int32_t offset_slot = ht_record->field->offset_slot;
+ assert(offset_slot != TUPLE_OFFSET_SLOT_NIL);
+ assert(field_map[offset_slot] != 0);
+ *field = tuple + field_map[offset_slot];
+ return 0;
+ }
+ }
/*
* It is possible, that a field has a name as
* well-formatted JSON. For example 'a.b.c.d' or '[1]' can
diff --git a/src/box/tuple_format.h b/src/box/tuple_format.h
index 3cb1284..1154f74 100644
--- a/src/box/tuple_format.h
+++ b/src/box/tuple_format.h
@@ -63,6 +63,7 @@ enum { TUPLE_OFFSET_SLOT_NIL = INT32_MAX };
struct tuple;
struct tuple_format;
+struct tuple_field_map;
/** Engine-specific tuple format methods. */
struct tuple_format_vtab {
@@ -81,6 +82,11 @@ struct tuple_format_vtab {
const char *end);
};
+struct mh_field_t;
+typedef size_t (*path_hash_f)(const char *str, uint32_t len);
+extern path_hash_f path_hash;
+
+
/** Tuple field meta information for tuple_format. */
struct tuple_field {
/**
@@ -92,7 +98,7 @@ struct tuple_field {
*/
enum field_type type;
/**
- * Offset slot in field map in tuple. Normally tuple
+ * Offset slot in field map in tuple. Normally tuple
* stores field map - offsets of all fields participating
* in indexes. This allows quick access to most used
* fields without parsing entire mspack. This member
@@ -108,6 +114,16 @@ struct tuple_field {
bool is_key_part;
/** True, if a field can store NULL. */
bool is_nullable;
+ /** Tree child records. */
+ union {
+ /** Array of fields. */
+ struct {
+ struct tuple_field **array;
+ uint32_t array_size;
+ };
+ /** Hashtable: path -> tuple_field. */
+ struct mh_field_t *map;
+ };
};
/**
@@ -163,6 +179,8 @@ struct tuple_format {
* Shared names storage used by all formats of a space.
*/
struct tuple_dictionary *dict;
+ /** JSON path hash table */
+ struct mh_field_t *path_hash;
/* Formats of the fields */
struct tuple_field fields[0];
};
diff --git a/test/box/misc.result b/test/box/misc.result
index 4895a78..556f004 100644
--- a/test/box/misc.result
+++ b/test/box/misc.result
@@ -348,7 +348,7 @@ t;
- 'box.error.CANT_CREATE_COLLATION : 150'
- 'box.error.USER_EXISTS : 46'
- 'box.error.WAL_IO : 40'
- - 'box.error.PROC_RET : 21'
+ - 'box.error.RTREE_RECT : 101'
- 'box.error.PRIV_GRANTED : 89'
- 'box.error.CREATE_SPACE : 9'
- 'box.error.GRANT : 88'
@@ -359,7 +359,7 @@ t;
- 'box.error.VINYL_MAX_TUPLE_SIZE : 139'
- 'box.error.LOAD_FUNCTION : 99'
- 'box.error.INVALID_XLOG : 74'
- - 'box.error.READ_VIEW_ABORTED : 130'
+ - 'box.error.PRIV_NOT_GRANTED : 91'
- 'box.error.TRANSACTION_CONFLICT : 97'
- 'box.error.GUEST_USER_PASSWORD : 96'
- 'box.error.PROC_C : 102'
@@ -370,7 +370,7 @@ t;
- 'box.error.CFG : 59'
- 'box.error.NO_SUCH_FIELD : 37'
- 'box.error.CONNECTION_TO_SELF : 117'
- - 'box.error.FUNCTION_MAX : 54'
+ - 'box.error.PROC_LUA : 32'
- 'box.error.ILLEGAL_PARAMS : 1'
- 'box.error.PARTIAL_KEY : 136'
- 'box.error.SAVEPOINT_NO_TRANSACTION : 114'
@@ -397,36 +397,37 @@ t;
- 'box.error.FUNCTION_EXISTS : 52'
- 'box.error.UPDATE_ARG_TYPE : 26'
- 'box.error.CROSS_ENGINE_TRANSACTION : 81'
- - 'box.error.FORMAT_MISMATCH_INDEX_PART : 27'
- 'box.error.injection : table: <address>
+ - 'box.error.FORMAT_MISMATCH_INDEX_PART : 27'
+ - 'box.error.IDENTIFIER : 70'
- 'box.error.FUNCTION_TX_ACTIVE : 30'
- - 'box.error.ITERATOR_TYPE : 72'
- 'box.error.TRANSACTION_YIELD : 154'
+ - 'box.error.NULLABLE_MISMATCH : 153'
- 'box.error.NO_SUCH_ENGINE : 57'
- 'box.error.COMMIT_IN_SUB_STMT : 122'
- - 'box.error.NULLABLE_MISMATCH : 153'
- - 'box.error.UNSUPPORTED : 5'
- - 'box.error.LAST_DROP : 15'
+ - 'box.error.RELOAD_CFG : 58'
- 'box.error.SPACE_FIELD_IS_DUPLICATE : 149'
+ - 'box.error.LAST_DROP : 15'
+ - 'box.error.SEQUENCE_OVERFLOW : 147'
- 'box.error.DECOMPRESSION : 124'
- 'box.error.CREATE_SEQUENCE : 142'
- 'box.error.CREATE_USER : 43'
- - 'box.error.SEQUENCE_OVERFLOW : 147'
+ - 'box.error.DATA_MISMATCH_INDEX_PART : 55'
- 'box.error.INSTANCE_UUID_MISMATCH : 66'
- - 'box.error.RELOAD_CFG : 58'
+ - 'box.error.TUPLE_FORMAT_LIMIT : 16'
- 'box.error.SYSTEM : 115'
- 'box.error.KEY_PART_IS_TOO_LONG : 118'
- - 'box.error.MORE_THAN_ONE_TUPLE : 41'
+ - 'box.error.INJECTION : 8'
- 'box.error.TRUNCATE_SYSTEM_SPACE : 137'
- 'box.error.NO_SUCH_SAVEPOINT : 61'
- 'box.error.VY_QUOTA_TIMEOUT : 135'
- - 'box.error.PRIV_NOT_GRANTED : 91'
+ - 'box.error.READ_VIEW_ABORTED : 130'
- 'box.error.WRONG_INDEX_OPTIONS : 108'
- 'box.error.INVALID_VYLOG_FILE : 133'
- 'box.error.INDEX_FIELD_COUNT_LIMIT : 127'
- - 'box.error.BEFORE_REPLACE_RET : 53'
+ - 'box.error.PROTOCOL : 104'
- 'box.error.USER_MAX : 56'
- - 'box.error.INVALID_MSGPACK : 20'
+ - 'box.error.BEFORE_REPLACE_RET : 53'
- 'box.error.TUPLE_NOT_ARRAY : 22'
- 'box.error.KEY_PART_COUNT : 31'
- 'box.error.ALTER_SPACE : 12'
@@ -435,7 +436,7 @@ t;
- 'box.error.DROP_SEQUENCE : 144'
- 'box.error.INVALID_XLOG_ORDER : 76'
- 'box.error.UNKNOWN_REQUEST_TYPE : 48'
- - 'box.error.PROC_LUA : 32'
+ - 'box.error.PROC_RET : 21'
- 'box.error.SUB_STMT_MAX : 121'
- 'box.error.ROLE_NOT_GRANTED : 92'
- 'box.error.SPACE_EXISTS : 10'
@@ -446,36 +447,36 @@ t;
- 'box.error.REPLICASET_UUID_MISMATCH : 63'
- 'box.error.UPDATE_FIELD : 29'
- 'box.error.INDEX_EXISTS : 85'
- - 'box.error.SPLICE : 25'
+ - 'box.error.DROP_SPACE : 11'
- 'box.error.COMPRESSION : 119'
- 'box.error.INVALID_ORDER : 68'
- - 'box.error.UNKNOWN : 0'
+ - 'box.error.SPLICE : 25'
- 'box.error.NO_SUCH_GROUP : 155'
- - 'box.error.TUPLE_FORMAT_LIMIT : 16'
+ - 'box.error.INVALID_MSGPACK : 20'
- 'box.error.DROP_PRIMARY_KEY : 17'
- 'box.error.NULLABLE_PRIMARY : 152'
- 'box.error.NO_SUCH_SEQUENCE : 145'
- - 'box.error.INJECTION : 8'
+ - 'box.error.FUNCTION_MAX : 54'
- 'box.error.INVALID_UUID : 64'
- - 'box.error.IDENTIFIER : 70'
+ - 'box.error.UNSUPPORTED : 5'
- 'box.error.TIMEOUT : 78'
+ - 'box.error.ITERATOR_TYPE : 72'
- 'box.error.REPLICA_MAX : 73'
- 'box.error.NO_SUCH_ROLE : 82'
- - 'box.error.DROP_SPACE : 11'
- 'box.error.MISSING_REQUEST_FIELD : 69'
- 'box.error.MISSING_SNAPSHOT : 93'
- 'box.error.WRONG_SPACE_OPTIONS : 111'
- 'box.error.READONLY : 7'
- - 'box.error.RTREE_RECT : 101'
+ - 'box.error.UNKNOWN : 0'
- 'box.error.UPSERT_UNIQUE_SECONDARY_KEY : 105'
- 'box.error.NO_CONNECTION : 77'
- 'box.error.UNSUPPORTED_PRIV : 98'
- 'box.error.WRONG_SCHEMA_VERSION : 109'
- 'box.error.ROLLBACK_IN_SUB_STMT : 123'
- - 'box.error.PROTOCOL : 104'
- - 'box.error.INVALID_XLOG_TYPE : 125'
- - 'box.error.INDEX_PART_TYPE_MISMATCH : 24'
+ - 'box.error.MORE_THAN_ONE_TUPLE : 41'
- 'box.error.UNSUPPORTED_INDEX_FEATURE : 112'
+ - 'box.error.INDEX_PART_TYPE_MISMATCH : 24'
+ - 'box.error.INVALID_XLOG_TYPE : 125'
...
test_run:cmd("setopt delimiter ''");
---
diff --git a/test/engine/tuple.result b/test/engine/tuple.result
index 7fb0916..02b92f3 100644
--- a/test/engine/tuple.result
+++ b/test/engine/tuple.result
@@ -891,6 +891,107 @@ t["{"]
s:drop()
---
...
+--
+-- gh-1012: Indexes for JSON-defined paths.
+--
+s_withdata = box.schema.space.create('withdata', {engine = engine})
+---
+...
+s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[3].FIO["fname"]'}, {3, 'str', path = '[3]["FIO"].fname'}}})
+---
+- error: 'Can''t create or modify index ''test1'' in space ''withdata'': same key
+ part is indexed twice'
+...
+s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = 666}, {3, 'str', path = '[3]["FIO"]["fname"]'}}})
+---
+- error: 'Wrong index options (field 2): ''path'' must be string'
+...
+s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = 'field.FIO.fname'}}})
+---
+- error: 'Wrong index options (field 2): invalid JSON path: first path part should
+ be defined as array'
+...
+s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'map', path = '[3].FIO'}}})
+---
+- error: 'Can''t create or modify index ''test1'' in space ''withdata'': field type
+ ''map'' is not supported'
+...
+s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'array', path = '[3][1]'}}})
+---
+- error: 'Can''t create or modify index ''test1'' in space ''withdata'': field type
+ ''array'' is not supported'
+...
+s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[3].FIO'}, {3, 'str', path = '[3]["FIO"].fname'}}})
+---
+- error: Field 3 has type 'map' in one index, but type 'string' in another
+...
+s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[3][1].sname'}, {3, 'str', path = '[3]["FIO"].fname'}}})
+---
+- error: Field 3 has type 'map' in one index, but type 'array' in another
+...
+s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[2].FIO.fname'}}})
+---
+- error: 'Wrong index options (field 2): invalid JSON path: first part refers to invalid
+ field'
+...
+s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[3].FIO....fname'}}})
+---
+- error: 'Wrong index options (field 3): invalid JSON path: path has invalid structure'
+...
+idx = s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[3]["FIO"]["fname"]'}, {3, 'str', path = '[3]["FIO"]["sname"]'}}})
+---
+...
+assert(idx ~= nil)
+---
+- true
+...
+s_withdata:insert{7, 7, {town = 'NY', FIO = 666}, 4, 5}
+---
+- error: Tuple doesn't match document structure defined as index
+...
+s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = 666, sname = 'Bond'}}, 4, 5}
+---
+- error: 'Supplied key type of part 2 does not match index part type: expected string'
+...
+s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = "James"}}, 4, 5}
+---
+- error: Tuple doesn't match document structure defined as index
+...
+s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = 'James', sname = 'Bond'}}, 4, 5}
+---
+- [7, 7, {'town': 'NY', 'FIO': {'fname': 'James', 'sname': 'Bond'}}, 4, 5]
+...
+s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = 'James', sname = 'Bond'}}, 4, 5}
+---
+- error: Duplicate key exists in unique index 'test1' in space 'withdata'
+...
+s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = 'James', sname = 'Bond', data = "extra"}}, 4, 5}
+---
+- error: Duplicate key exists in unique index 'test1' in space 'withdata'
+...
+s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = 'Max', sname = 'Isaev', data = "extra"}}, 4, 5}
+---
+- [7, 7, {'town': 'NY', 'FIO': {'fname': 'Max', 'data': 'extra', 'sname': 'Isaev'}},
+ 4, 5]
+...
+idx:select()
+---
+- - [7, 7, {'town': 'NY', 'FIO': {'fname': 'James', 'sname': 'Bond'}}, 4, 5]
+ - [7, 7, {'town': 'NY', 'FIO': {'fname': 'Max', 'data': 'extra', 'sname': 'Isaev'}},
+ 4, 5]
+...
+idx:min()
+---
+- [7, 7, {'town': 'NY', 'FIO': {'fname': 'James', 'sname': 'Bond'}}, 4, 5]
+...
+idx:max()
+---
+- [7, 7, {'town': 'NY', 'FIO': {'fname': 'Max', 'data': 'extra', 'sname': 'Isaev'}},
+ 4, 5]
+...
+s_withdata:drop()
+---
+...
engine = nil
---
...
diff --git a/test/engine/tuple.test.lua b/test/engine/tuple.test.lua
index 30d6f1a..e106dd0 100644
--- a/test/engine/tuple.test.lua
+++ b/test/engine/tuple.test.lua
@@ -289,5 +289,32 @@ t["{"]
s:drop()
+--
+-- gh-1012: Indexes for JSON-defined paths.
+--
+s_withdata = box.schema.space.create('withdata', {engine = engine})
+s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[3].FIO["fname"]'}, {3, 'str', path = '[3]["FIO"].fname'}}})
+s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = 666}, {3, 'str', path = '[3]["FIO"]["fname"]'}}})
+s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = 'field.FIO.fname'}}})
+s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'map', path = '[3].FIO'}}})
+s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'array', path = '[3][1]'}}})
+s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[3].FIO'}, {3, 'str', path = '[3]["FIO"].fname'}}})
+s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[3][1].sname'}, {3, 'str', path = '[3]["FIO"].fname'}}})
+s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[2].FIO.fname'}}})
+s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[3].FIO....fname'}}})
+idx = s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[3]["FIO"]["fname"]'}, {3, 'str', path = '[3]["FIO"]["sname"]'}}})
+assert(idx ~= nil)
+s_withdata:insert{7, 7, {town = 'NY', FIO = 666}, 4, 5}
+s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = 666, sname = 'Bond'}}, 4, 5}
+s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = "James"}}, 4, 5}
+s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = 'James', sname = 'Bond'}}, 4, 5}
+s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = 'James', sname = 'Bond'}}, 4, 5}
+s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = 'James', sname = 'Bond', data = "extra"}}, 4, 5}
+s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = 'Max', sname = 'Isaev', data = "extra"}}, 4, 5}
+idx:select()
+idx:min()
+idx:max()
+s_withdata:drop()
+
engine = nil
test_run = nil
--
2.7.4
^ permalink raw reply [flat|nested] 14+ messages in thread
* [tarantool-patches] Re: [PATCH v1 4/5] box: introduce path_hash and tuple_field tree
2018-08-06 12:27 ` [tarantool-patches] [PATCH v1 4/5] box: introduce path_hash and tuple_field tree Kirill Shcherbatov
@ 2018-08-08 22:03 ` Vladislav Shpilevoy
2018-08-15 12:14 ` Kirill Shcherbatov
0 siblings, 1 reply; 14+ messages in thread
From: Vladislav Shpilevoy @ 2018-08-08 22:03 UTC (permalink / raw)
To: tarantool-patches, Kirill Shcherbatov
Thanks for the patch! See 18 comments below.
On 06/08/2018 15:27, Kirill Shcherbatov wrote:
> To work with JSON-defined indexes we introduce a per-format JSON
> path hashtable, data_path, and a tree of intermediate path
> fields attached to the format's root fields.
>
> <Hashtable>: format->data_path
> [2].FIO.fname -> field "fname" {type=str, off_slot=-1}
> [2].FIO.sname -> field "sname" {type=str, off_slot=-2}
>
> <Tree>: format->field[2] {type = map}
> |
> FIO {type = map}
> |
> "fname" | "sname"
> {type=str,off_slot=-1} ____|____ {type = str,off_slot=-2}
>
> Leaf fields used in an index have an initialized offset_slot.
> On new tuple creation we traverse the field tree and use the
> leaf records to initialize the tuple's field_map.
> At the same time, the data_path hashtable is used on tuple data
> access by index (when the cached offset_slot is invalid).
>
> The new routine tuple_format_add_json_path is used to construct
> all internal structures for a JSON path on format creation
> and duplication.
>
> Part of #1012.
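For illustration, the tree described above can be modeled with a minimal, self-contained sketch. All names here are hypothetical simplifications: the real struct tuple_field lives in src/box/tuple_format.c and uses mhash tables and realloc'ed child arrays rather than sibling lists.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct tuple_field (hypothetical names). */
enum sketch_type { T_ANY, T_MAP, T_ARRAY, T_STR };

struct sketch_field {
	enum sketch_type type;
	int off_slot;               /* meaningful only for leaf fields */
	const char *key;            /* map key that leads to this child */
	struct sketch_field *child; /* first child */
	struct sketch_field *next;  /* next sibling */
};

/* Look up a direct child of a map-typed field by its key. */
static struct sketch_field *
sketch_child_by_key(const struct sketch_field *f, const char *key)
{
	assert(f->type == T_MAP);
	for (struct sketch_field *c = f->child; c != NULL; c = c->next) {
		if (strcmp(c->key, key) == 0)
			return c;
	}
	return NULL;
}

/* Walk field[2] -> FIO -> fname from the diagram above. */
static int
sketch_demo_off_slot(void)
{
	static struct sketch_field fname = {T_STR, -1, "fname", NULL, NULL};
	static struct sketch_field sname = {T_STR, -2, "sname", NULL, NULL};
	static struct sketch_field fio = {T_MAP, 0, "FIO", NULL, NULL};
	static struct sketch_field root = {T_MAP, 0, NULL, NULL, NULL};
	fname.next = &sname;
	fio.child = &fname;
	root.child = &fio;
	struct sketch_field *f = sketch_child_by_key(&root, "FIO");
	return sketch_child_by_key(f, "fname")->off_slot;
}
```

Here the leaf's off_slot plays the role of offset_slot: once a path is resolved to a leaf, the tuple field offset is read from field_map by that slot.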
> ---
> src/box/errcode.h | 2 +-
> src/box/index_def.c | 8 +-
> src/box/key_def.c | 4 +-
> src/box/tuple_extract_key.cc | 23 +-
> src/box/tuple_format.c | 531 +++++++++++++++++++++++++++++++++++++++++--
> src/box/tuple_format.h | 20 +-
> test/box/misc.result | 51 +++--
> test/engine/tuple.result | 101 ++++++++
> test/engine/tuple.test.lua | 27 +++
> 9 files changed, 704 insertions(+), 63 deletions(-)
>
> diff --git a/src/box/key_def.c b/src/box/key_def.c
> index 79e07f8..c454377 100644
> --- a/src/box/key_def.c
> +++ b/src/box/key_def.c
> @@ -583,8 +583,8 @@ key_def_decode_parts(struct key_part_def *parts, uint32_t part_count,
> if (node.type != JSON_PATH_NUM) {
> diag_set(ClientError, ER_WRONG_INDEX_OPTIONS,
> part->fieldno,
> - "invalid JSON path: first part should "
> - "be defined as array");
> + "invalid JSON path: first path part "\
> + "should be defined as array");
1. A commit of a patchset should not fix errors introduced in
the same patchset. Please, merge this hunk into the previous commit.
> return -1;
> }
> if (node.num - TUPLE_INDEX_BASE != part->fieldno) {
> diff --git a/src/box/tuple_extract_key.cc b/src/box/tuple_extract_key.cc
> index bbd87f5..6a2690b 100644
> --- a/src/box/tuple_extract_key.cc
> +++ b/src/box/tuple_extract_key.cc
> @@ -4,14 +4,22 @@
>
> enum { MSGPACK_NULL = 0xc0 };
>
> +static bool
> +key_def_parts_sequential(const struct key_def *def, int i)
2. For flags and flag calculators please use 'is/are'.
> +{
> + uint32_t fieldno1 = def->parts[i + 1].fieldno;
> + uint32_t fieldno2 = def->parts[i].fieldno + 1;
> + return fieldno1 == fieldno2 && def->parts[i].path != NULL &&
> + def->parts[i + 1].path != NULL;
> +}
> +
> /** True, if a key can contain two or more parts in sequence. */
> static bool
> key_def_contains_sequential_parts(const struct key_def *def)
> {
> - for (uint32_t i = 0; i < def->part_count - 1; ++i) {
> - if (def->parts[i].fieldno + 1 == def->parts[i + 1].fieldno)
> + for (uint32_t i = 0; i < def->part_count - 1; ++i)
> + if (key_def_parts_sequential(def, i))
> return true;
> - }
3. The 'for' consists of two lines, so should have {}.
> return false;
> }
>
4. I found a problem with your keys extractor and tried to
reproduce it with a simple test but got another crash even
earlier than I expected:
Assertion failed: (is_sequential == false), function tuple_format_create, file /Users/v.shpilevoy/Work/Repositories/tarantool/src/box/tuple_format.c, line 509.
Process 93586 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
frame #0: 0x00007fff56cf5b66 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
-> 0x7fff56cf5b66 <+10>: jae 0x7fff56cf5b70 ; <+20>
0x7fff56cf5b68 <+12>: movq %rax, %rdi
0x7fff56cf5b6b <+15>: jmp 0x7fff56cecae9 ; cerror_nocancel
0x7fff56cf5b70 <+20>: retq
Target 0: (tarantool) stopped.
The test is:
box.cfg{}
s = box.schema.create_space('test')
parts = {}
parts[1] = {'[1][1]', 'unsigned'}
pk = s:create_index('pk', {parts = parts})
But this is not what I had wanted to test. The initial
problem I found is about how you extract keys: on extract key
slowpath functions (both raw and not) you iterate over
the very first level of a tuple and treat its entire
fields as key parts, but it is wrong now. In tuple key
extractor you should dig into the field when it is a
JSON path part. And this is where you again need the
template parameter 'is_flat' to omit such check for flat
indexes.
The original test I wanted to run was this:
box.cfg{}
s = box.schema.create_space('test')
parts = {}
parts[1] = {'[1][1]', 'unsigned'}
pk = s:create_index('pk', {parts = parts})
s:insert{{1, 1}}
s:upsert({{1}}, {{'+', 2, 1}})
Here, I think, upsert will try to extract key 1, but
will actually extract entire {1} and crash in some place.
> diff --git a/src/box/tuple_format.c b/src/box/tuple_format.c
> index 40cd48a..f03978d 100644
> --- a/src/box/tuple_format.c
> +++ b/src/box/tuple_format.c
> @@ -32,6 +32,34 @@
> #include "tuple.h"
> #include "tuple_format.h"
>
> +path_hash_f path_hash;
> +
> +#define mh_name _field
> +struct mh_field_key_t {
> + const char *path;
> + uint32_t path_len;
> + uint32_t path_hash;
> +};
> +#define mh_key_t struct mh_field_key_t *
> +
> +struct mh_field_node_t {
> + const char *path;
> + uint32_t path_len;
> + uint32_t path_hash;
> + struct tuple_field *field;
> +};
> +#define mh_node_t struct mh_field_node_t
> +
> +#define mh_arg_t void *
> +#define mh_hash(a, arg) ((a)->path_hash)
> +#define mh_hash_key(a, arg) mh_hash(a, arg)
> +#define mh_cmp(a, b, arg) ((a)->path_len != (b)->path_len || \
> + memcmp((a)->path, (b)->path, (a)->path_len))
> +#define mh_cmp_key(a, b, arg) mh_cmp(a, b, arg)
> +#define MH_SOURCE 1
> +#include "salad/mhash.h" /* Create mh_field_t hash. */
> +
5. Exactly the same hash already exists in assoc.h as mh_strnptr, as
I told you verbally.
> +
> /** Global table of tuple formats */
> struct tuple_format **tuple_formats;
> static intptr_t recycled_format_ids = FORMAT_ID_NIL;
> @@ -41,10 +69,347 @@ static uint64_t format_epoch = 1;
> static uint32_t formats_size = 0, formats_capacity = 0;
>
> static const struct tuple_field tuple_field_default = {
> - FIELD_TYPE_ANY, TUPLE_OFFSET_SLOT_NIL, false, false,
> + FIELD_TYPE_ANY, TUPLE_OFFSET_SLOT_NIL, false, false, {{NULL, 0}, NULL}
> };
>
> /**
> + * Propagate @a field to MessagePack(field)[index].
> + * @param[in][out] field Field to propagate.
> + * @param index 1-based index to propagate to.
> + *
> + * @retval 0 Success, the index was found.
> + * @retval -1 Not found.
> + */
> +static inline int
> +tuple_field_go_to_index(const char **field, uint64_t index);
> +
> +/**
> + * Propagate @a field to MessagePack(field)[key].
> + * @param[in][out] field Field to propagate.
> + * @param key Key to propagate to.
> + * @param len Length of @a key.
> + *
> + * @retval 0 Success, the index was found.
> + * @retval -1 Not found.
> + */
> +static inline int
> +tuple_field_go_to_key(const char **field, const char *key, int len);
6. You do not need to forward-declare these functions. Just move the
implementations here.
> +
> +/**
> + * Get @hashtable record by key @path, @path_len.
> + * @param hashtable to lookup,
> + * @param path string.
> + * @param path_len length of @path.
> + * @retval NULL on nothing found.
> + * @retval hashtable record pointer.
> + */
> +static struct mh_field_node_t *
> +json_path_hash_get(struct mh_field_t *hashtable, const char *path,
> + uint32_t path_len)
> +{
> + assert(hashtable != NULL);
> + uint32_t hash = field_name_hash(path, path_len);
> + struct mh_field_key_t key = {path, path_len, hash};
> + mh_int_t rc = mh_field_find(hashtable, &key, NULL);
> + if (rc == mh_end(hashtable))
> + return NULL;
> + return mh_field_node(hashtable, rc);
> +}
7. Use mh_strnptr_find_inp.
> +
> +/**
> + * Construct field tree level for JSON path part.
> + *
> + * @param[in, out] tuple_field pointer to the record to start with;
> + * it is changed to the record that matches
> + * the @part lexeme.
> + * @param fieldno number of root space field.
> + * @param part JSON path lexeme to represent in field tree.
> + * @retval -1 on error.
> + * @retval 0 on success.
> + */
> +static int
> +json_field_tree_append(struct tuple_field **field_subtree, uint32_t fieldno,
> + struct json_path_node *part)
> +{
> + enum field_type type;
> + struct tuple_field *field = *field_subtree;
> + switch (part->type) {
> + case JSON_PATH_NUM: {
> + type = FIELD_TYPE_ARRAY;
> + if (field->type != FIELD_TYPE_ANY && field->type != type)
> + goto error_type_mistmatch;
> + field->type = type;
> + /* Create or resize field array if required. */
> + if (field->array == NULL) {
> + field->array =
> + calloc(part->num, sizeof(struct tuple_field *));
> + if (field->array == NULL) {
> + diag_set(OutOfMemory,
> + part->num * sizeof(struct tuple_field *),
> + "malloc", "field->array");
> + return -1;
> + }
> + field->type = type;
> + field->array_size = part->num;
> + } else if (part->num >= field->array_size) {
> + void *array =
> + realloc(field->array,
> + part->num * sizeof(struct tuple_field *));
8. Realloc works ok with NULL in the first argument, so you
can merge this if-else branching. If part->num > size then realloc
+ memset of the gap. No special calloc.
> + if (array == NULL) {
> + diag_set(OutOfMemory,
> + sizeof(struct tuple_field *), "realloc",
> + "array");
> + return -1;
> + }
> + memset(field->array[field->array_size], 0,
> + part->num - field->array_size);
> + field->array = array;
> + field->array_size = part->num;
> + }
> + /* Record already exists. No actions required */
> + if (field->array[part->num - TUPLE_INDEX_BASE] != NULL) {
> + *field_subtree =
> + field->array[part->num - TUPLE_INDEX_BASE];
> + return 0;
> + }
> + break;
> + }
> + case JSON_PATH_STR: {
> + type = FIELD_TYPE_MAP;
> + if (field->type != FIELD_TYPE_ANY && field->type != type)
> + goto error_type_mistmatch;
> + field->type = type;
> + if (field->map == NULL &&
> + json_path_hash_create(&field->map) != 0)
> + return -1;
> + struct mh_field_node_t *node =
> + json_path_hash_get(field->map, part->str, part->len);
> + if (node != NULL) {
> + *field_subtree = node->field;
> + return 0;
> + }
> + break;
> + }
> + default:
> + unreachable();
> + }
> +
> + /* Construct and insert a new record. */
> + struct tuple_field *new_field =
> + malloc(sizeof(struct tuple_field));
> + if (new_field == NULL) {
> + diag_set(OutOfMemory, sizeof(struct tuple_field), "malloc",
> + "new_field");
> + return -1;
> + }
> + *new_field = tuple_field_default;
> + if (field->type == FIELD_TYPE_MAP) {
> + if (json_path_hash_insert(field->map, part->str, part->len,
> + new_field) != 0) {
> + free(new_field);
> + return -1;
> + }
> + } else if (field->type == FIELD_TYPE_ARRAY) {
> + field->array[part->num - TUPLE_INDEX_BASE] = new_field;
> + }
> + *field_subtree = new_field;
> +
> + return 0;
> +
> +error_type_mistmatch:
> + diag_set(ClientError, ER_INDEX_PART_TYPE_MISMATCH,
> + tt_sprintf("%d", fieldno + TUPLE_INDEX_BASE),
> + field_type_strs[type], field_type_strs[field->type]);
> + return -1;
> +}
> +
> +/**
> + * Add new JSON @path to @format.
> + * @param format to modify.
> + * @param path string to add.
> + * @param path_len length of @path.
> + * @param field_type type of field by @path.
> + * @param[out] leaf_field pointer to leaf field.
> + */
> +static int
> +tuple_format_add_json_path(struct tuple_format *format, const char *path,
> + uint32_t path_len, enum field_type type,
> + struct tuple_field **leaf_field)
> +{
> +
> + struct mh_field_node_t *leaf_node = NULL;
> + if (format->path_hash == NULL) {
> + /* Initialize path hashtable. */
> + if (json_path_hash_create(&format->path_hash) != 0)
9. Please, preallocate the needed hash elements count once, using
mh reserve() function.
> + return -1;
> + }
> +
> + /*
> + * Get root field by index.
> + * Path is specified in canonical form: [i]...
> + */
> + int rc = 0;
> + struct json_path_parser parser;
> + struct json_path_node node;
> + json_path_parser_create(&parser, path, path_len);
> + rc = json_path_next(&parser, &node);
> + assert(rc == 0 && node.type == JSON_PATH_NUM);
> + assert(node.num < format->field_count + 1);
> + uint32_t root_fieldno = node.num - TUPLE_INDEX_BASE;
> + struct tuple_field *field = &format->fields[root_fieldno];
> +
> + /* Test if path is already registered. */
> + if ((leaf_node = json_path_hash_get(format->path_hash, path,
> + path_len)) != NULL) {
> + if (leaf_node->field->type != type) {
> + const char *err =
> + tt_sprintf("JSON path '%.*s' has been already "
> + "constructed for '%s' leaf record",
> + path_len, path,
> + field_type_strs[
> + leaf_node->field->type]);
> + diag_set(ClientError, ER_WRONG_INDEX_OPTIONS,
> + node.num, err);
> + return -1;
> + }
> + *leaf_field = leaf_node->field;
> + /* Path already registered. */
> + return 0;
> + }
> +
> + /*
> + * The hash table holds the string that the path
> + * tree refers to.
> + */
> + if ((path = strdup(path)) == NULL) {
10. Please, allocate the JSON path memory in the same block as the format.
See tuple_format_alloc. This lets you avoid the strdups here and the
'free_path' flag in json_path_hash_delete, and reduces memory
fragmentation.
> + diag_set(OutOfMemory, path_len + 1, "strdup", "path");
> + return -1;
> + }
> +
> + /* Initialize dummy path_hash record. */
> + if (json_path_hash_insert(format->path_hash, path, path_len,
> + NULL) != 0) {
> + free((void *)path);
> + return -1;
> + }
> +
> + /* Build data path tree. */
> + while ((rc = json_path_next(&parser, &node)) == 0 &&
> + node.type != JSON_PATH_END) {
> + if (json_field_tree_append(&field, root_fieldno, &node) != 0)
> + return -1;
> + }
> + /* It should be canonical valid JSON. */
> + assert(rc == 0 && node.type == JSON_PATH_END);
> + /* Leaf record is a new object as JSON path unique. */
> + field->type = type;
> +
> + /* Finish hashtable record init. */
> + leaf_node = json_path_hash_get(format->path_hash, path, path_len);
> + assert(leaf_node != NULL && leaf_node->field == NULL);
> + leaf_node->field = field;
> +
> + *leaf_field = field;
> + return 0;
> +}
> +
> +/**
> * Extract all available type info from keys and field
> * definitions.
> */
> @@ -354,6 +770,59 @@ tuple_format_dup(struct tuple_format *src)
> return NULL;
> }
> return format;
> +
> +error:
> + tuple_format_destroy(format);
> + return NULL;
> +}
> +
> +static int
> +tuple_init_json_field_map(const struct tuple_field *field, uint32_t idx,
> + uint32_t *field_map, const char *tuple,
> + const char *offset)
> +{
> + if (field->type == FIELD_TYPE_MAP) {
> + mh_int_t i;
> + mh_foreach(field->map, i) {
> + struct mh_field_node_t *node =
> + mh_field_node(field->map, i);
> + const char *raw = offset;
> + if (tuple_field_go_to_key(&raw, node->path,
> + node->path_len) != 0) {
> + diag_set(ClientError,
> + ER_DATA_MISMATCH_INDEX_PART);
> + return -1;
> + }
> + if (tuple_init_json_field_map(node->field, idx,
> + field_map, tuple,
> + raw) != 0)
> + return -1;
> + }
> + } else if (field->type == FIELD_TYPE_ARRAY) {
> + for (uint32_t i = 0; i < field->array_size; i++) {
> + struct tuple_field *field = field->array[i];
> + if (field == NULL)
> + continue;
> + const char *raw = offset;
> + if (tuple_field_go_to_index(&raw, i) != 0) {
> + diag_set(ClientError,
> + ER_DATA_MISMATCH_INDEX_PART);
> + return -1;
> + }
11. When a field is an array, you can iterate over its entries and check
whether the corresponding field->array[i] is not NULL. This allows iterating
over the array only once, while now you rescan it O(N) times, where
N is the size of field->array.
12. For the map above you can do another thing. Each time you do
go_to_key() you can remember the key index (number of iterations
inside go_to_key()) and the result pointer. During the cycle you
calculate the maximal key index and its pointer. And then you
finish the field decoding starting from this point. This is a
strong optimization. For example, now for such field
{a = {b = {c = d, e = f}, g = h}}
and key in a.b.c you decode it as
{a = {b = {c = d, e = f}, g = h}}
{b = {c = d, e = f}
{c = d, e = f}
This is because after you found the key you still need to
skip the entire map field to find the next one (see the next
comment).
With my method you decode it as
{a = {b = {c = d, e = f}, g = h}}
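The single-pass idea can be illustrated with a toy, self-contained model (hypothetical names; MessagePack decoding is replaced here by a flat array of key/value entries in decode order):

```c
#include <assert.h>
#include <string.h>

/* Toy model of the map: a flat list of entries in decode order. */
struct toy_entry { const char *key; int value_offset; };

/*
 * Collect the offsets of all wanted keys in one forward scan,
 * stopping right after the last (deepest) wanted key instead of
 * re-skipping the map once per key. Returns the number of entries
 * visited; out[i] is -1 when wanted[i] is absent.
 */
static int
toy_scan_once(const struct toy_entry *map, int map_len,
	      const char **wanted, int n_wanted, int *out)
{
	int found = 0, visited = 0;
	for (int i = 0; i < n_wanted; i++)
		out[i] = -1;
	for (int e = 0; e < map_len && found < n_wanted; e++) {
		visited++;
		for (int i = 0; i < n_wanted; i++) {
			if (out[i] == -1 &&
			    strcmp(map[e].key, wanted[i]) == 0) {
				out[i] = map[e].value_offset;
				found++;
				break;
			}
		}
	}
	return visited;
}
```

With the deepest wanted key remembered this way, decoding can finish from that point onward, which is the optimization sketched above.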
> + if (tuple_init_json_field_map(field, idx, field_map,
> + tuple, raw) != 0)
> + return -1;
> + }
> + } else {
> + /* Leaf field node. */
> + assert(field->offset_slot != TUPLE_OFFSET_SLOT_NIL);
> + if (key_mp_type_validate(field->type, mp_typeof(*offset),
> + ER_KEY_PART_TYPE, idx,
> + field->is_nullable) != 0)
> + return -1;
> + field_map[field->offset_slot] = (uint32_t)(offset - tuple);
> + }
> + return 0;
> }
>
> /** @sa declaration for details. */
> @@ -411,6 +880,12 @@ tuple_init_field_map(const struct tuple_format *format, uint32_t *field_map,
> field_map[field->offset_slot] =
> (uint32_t) (pos - tuple);
> }
> + if (field->map != NULL) {
> + assert(field->array != NULL);
> + if (tuple_init_json_field_map(field, i, field_map,
> + tuple, pos) != 0)
13. Now you rescan the field pointed to by pos twice: first
inside tuple_init_json_field_map, then again in the mp_next below.
You should not.
> + return -1;
> + }
> mp_next(&pos);
> }
> return 0;
> @@ -565,17 +1023,33 @@ tuple_field_by_part(const struct tuple *tuple, const struct key_def *def,
> assert(format->epoch > 0);
> int32_t offset_slot = TUPLE_OFFSET_SLOT_NIL;
> if (part->format_epoch != format->epoch) {
> - offset_slot = format->fields[field_no].offset_slot;
> + /* Cache miss. */
> + if (part->path != NULL) {
> + struct mh_field_t *ht = format->path_hash;
> + struct mh_field_node_t *node =
> + json_path_hash_get(ht, part->path,
> + part->path_len);
14. As I told you verbally, instead of recalculating the hash from
the path you can store it in the key_part together with the slot cache
and epoch, and use a light version of json_path_hash_get which takes
a path with an already calculated hash.
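The suggested caching can be sketched as follows (hypothetical toy types; FNV-1a merely stands in for whatever hash the real mh_strnptr table uses):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Sketch: compute the path hash once when the key part is built and
 * reuse it on every lookup, instead of re-hashing the path string on
 * each cache miss in tuple_field_by_part().
 */
struct toy_key_part {
	const char *path;
	uint32_t path_len;
	uint32_t path_hash;	/* precomputed once, reused on lookups */
};

/* Any string hash works for the sketch; FNV-1a is shown here. */
static uint32_t
toy_hash(const char *s, uint32_t len)
{
	uint32_t h = 2166136261u;
	for (uint32_t i = 0; i < len; i++) {
		h ^= (unsigned char)s[i];
		h *= 16777619u;
	}
	return h;
}

static void
toy_key_part_create(struct toy_key_part *part, const char *path)
{
	part->path = path;
	part->path_len = (uint32_t)strlen(path);
	/* Paid once here, not on every field lookup. */
	part->path_hash = toy_hash(path, part->path_len);
}
```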
> + assert(node != NULL);
> + offset_slot = node->field->offset_slot;
> + } else {
> + offset_slot =
> + format->fields[field_no].offset_slot;
> + }
> + /* Update cache only for greater epoch. */
> if (format->epoch > part->format_epoch) {
> part->slot_cache = offset_slot;
> part->format_epoch = format->epoch;
> }
> + } else {
> + /* Cache hit. */
> + offset_slot = part->slot_cache;
> }
> if (offset_slot != TUPLE_OFFSET_SLOT_NIL) {
> return field_map[offset_slot] != 0 ?
> data + field_map[offset_slot] : NULL;
> }
> }
> diff --git a/src/box/tuple_format.h b/src/box/tuple_format.h
> index 3cb1284..1154f74 100644
> --- a/src/box/tuple_format.h
> +++ b/src/box/tuple_format.h
> @@ -63,6 +63,7 @@ enum { TUPLE_OFFSET_SLOT_NIL = INT32_MAX };
>
> struct tuple;
> struct tuple_format;
> +struct tuple_field_map;
15. What is it?
>
> /** Engine-specific tuple format methods. */
> struct tuple_format_vtab {
> @@ -81,6 +82,11 @@ struct tuple_format_vtab {
> const char *end);
> };
>
> +struct mh_field_t;
> +typedef size_t (*path_hash_f)(const char *str, uint32_t len);
> +extern path_hash_f path_hash;
16. What is it? I do not see where is it
assigned or used.
> +
> +
> /** Tuple field meta information for tuple_format. */
> struct tuple_field {
> /**
> @@ -92,7 +98,7 @@ struct tuple_field {
> */
> enum field_type type;
> /**
> - * Offset slot in field map in tuple. Normally tuple
> + * Offset slot ix`n field map in tuple. Normally tuple
17. !?
> * stores field map - offsets of all fields participating
> * in indexes. This allows quick access to most used
> * fields without parsing entire mspack. This member
> diff --git a/test/engine/tuple.test.lua b/test/engine/tuple.test.lua
> index 30d6f1a..e106dd0 100644
> --- a/test/engine/tuple.test.lua
> +++ b/test/engine/tuple.test.lua
> @@ -289,5 +289,32 @@ t["{"]
>
> s:drop()
>
> +--
> +-- gh-1012: Indexes for JSON-defined paths.
> +--
> +s_withdata = box.schema.space.create('withdata', {engine = engine})
> +s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[3].FIO["fname"]'}, {3, 'str', path = '[3]["FIO"].fname'}}})
> +s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = 666}, {3, 'str', path = '[3]["FIO"]["fname"]'}}})
> +s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = 'field.FIO.fname'}}})
> +s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'map', path = '[3].FIO'}}})
> +s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'array', path = '[3][1]'}}})
> +s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[3].FIO'}, {3, 'str', path = '[3]["FIO"].fname'}}})
> +s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[3][1].sname'}, {3, 'str', path = '[3]["FIO"].fname'}}})
> +s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[2].FIO.fname'}}})
> +s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[3].FIO....fname'}}})
> +idx = s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[3]["FIO"]["fname"]'}, {3, 'str', path = '[3]["FIO"]["sname"]'}}})
> +assert(idx ~= nil)
> +s_withdata:insert{7, 7, {town = 'NY', FIO = 666}, 4, 5}
> +s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = 666, sname = 'Bond'}}, 4, 5}
> +s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = "James"}}, 4, 5}
> +s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = 'James', sname = 'Bond'}}, 4, 5}
> +s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = 'James', sname = 'Bond'}}, 4, 5}
> +s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = 'James', sname = 'Bond', data = "extra"}}, 4, 5}
> +s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = 'Max', sname = 'Isaev', data = "extra"}}, 4, 5}
> +idx:select()
> +idx:min()
> +idx:max()
> +s_withdata:drop()
18. Sorry, but this is too few tests. You did not even test a primary
JSON index, nor key extraction (upsert, vinyl replace/update, etc.) or
tuple hashes (via dumps in vinyl, for example).
^ permalink raw reply [flat|nested] 14+ messages in thread
* [tarantool-patches] Re: [PATCH v1 4/5] box: introduce path_hash and tuple_field tree
2018-08-08 22:03 ` [tarantool-patches] " Vladislav Shpilevoy
@ 2018-08-15 12:14 ` Kirill Shcherbatov
0 siblings, 0 replies; 14+ messages in thread
From: Kirill Shcherbatov @ 2018-08-15 12:14 UTC (permalink / raw)
To: tarantool-patches, Vladislav Shpilevoy
> 1. A commit of a patchset should not fix errors introduced in
> the same patchset. Please, merge this hunk into the previous commit.
This problem no longer exists.
> 2. For flags and flag calculators please use 'is/are'.
Ok, fixed.
> 3. The 'for' consists of two lines, so should have {}.
Ok, fixed.
> 4. I found a problem with your keys extractor and tried to
> reproduce it with a simple test but got another crash even
> earlier than I expected:
Thanks, fixed as a part of global reworking.
> But this is not what I had wanted to test. The initial
> problem I found is about how you extract keys: on extract key
> slowpath functions (both raw and not) you iterate over
> the very first level of a tuple and treat its entire
> fields as key parts, but it is wrong now. In tuple key
> extractor you should dig into the field when it is a
> JSON path part. And this is where you again need the
> template parameter 'is_flat' to omit such check for flat
> indexes.
>
> The original test I wanted to run was this:
>
> box.cfg{}
> s = box.schema.create_space('test')
> parts = {}
> parts[1] = {'[1][1]', 'unsigned'}
> pk = s:create_index('pk', {parts = parts})
> s:insert{{1, 1}}
> s:upsert({{1}}, {{'+', 2, 1}})
>
> Here, I think, upsert will try to extract key 1, but
> will actually extract entire {1} and crash in some place.
This problem no longer exists. I've included such tests in my patchset.
> 5. Exactly the same hash already exists in assoc.h as mh_strnptr, as
> I told you verbally.
Ok, I use it now.
> 6. You do not need to forward-declare these functions. Just move the
> implementations here.
I still need the forward declarations for now.
> 7. Use mh_strnptr_find_inp.
Ok, everywhere when possible.
> 8. Realloc works ok with NULL in the first argument, so you
> can merge this if-else branching. If part->num > size then realloc
> + memset of the gap. No special calloc.
Ok, done.
> 9. Please, preallocate the needed hash elements count once, using
> mh reserve() function.
Ok, I do this for format hash now.
> 10. Please, allocate the JSON path memory in the same block as the format.
> See tuple_format_alloc. This lets you avoid the strdups here and the
> 'free_path' flag in json_path_hash_delete, and reduces memory
> fragmentation.
Ok, done.
> 11. When a field is an array, you can iterate over its entries and check
> whether the corresponding field->array[i] is not NULL. This allows iterating
> over the array only once, while now you rescan it O(N) times, where
> N is the size of field->array.
Good idea, I've done it.
>
> 12. For the map above you can do another thing. Each time you do
> go_to_key() you can remember the key index (number of iterations
> inside go_to_key()) and the result pointer. During the cycle you
> calculate the maximal key index and its pointer. And then you
> finish the field decoding starting from this point. This is a
> strong optimization. For example, now for such field
>
> {a = {b = {c = d, e = f}, g = h}}
>
> and key in a.b.c you decode it as
>
> {a = {b = {c = d, e = f}, g = h}}
> {b = {c = d, e = f}
> {c = d, e = f}
>
> This is because after you found the key you still need to
> skip the entire map field to find the next one (see the next
> comment).
>
> With my method you decode it as
>
> {a = {b = {c = d, e = f}, g = h}}
If I've understood correctly, this feature is already implemented in my patch.
> 13. Now you rescan the field pointed to by pos twice: first
> inside tuple_init_json_field_map, then again in the mp_next below.
> You should not.
The external mp_next moves across fields; at the same time
tuple_init_json_field_map (now called json_field_tree_exec_routine)
iterates inside of them. I don't know how I could use an internal
offset to go to the next field, e.g. in [{}, {}].
I believe this is not the only thing I'll fix next time.
> 14. As I told you verbally, instead of recalculating the hash from
> the path you can store it in the key_part together with the slot cache
> and epoch, and use a light version of json_path_hash_get which takes
> a path with an already calculated hash.
Ok, done.
>> +struct tuple_field_map;
> 15. What is it?
crap. dropped.
> 16. What is it? I do not see where is it
> assigned or used.
dropped.
>> - * Offset slot in field map in tuple. Normally tuple
>> + * Offset slot ix`n field map in tuple. Normally tuple
>
> 17. !?
Fixed.
> 18. Sorry, but this is too few tests. You did not even test a primary
> JSON index, nor key extraction (upsert, vinyl replace/update, etc.) or
> tuple hashes (via dumps in vinyl, for example).
I've written many thorough tests.
^ permalink raw reply [flat|nested] 14+ messages in thread
* [tarantool-patches] [PATCH v1 5/5] box: specify indexes in user-friendly form
2018-08-06 12:26 [tarantool-patches] [PATCH v1 0/5] box: indexes by JSON path Kirill Shcherbatov
` (3 preceding siblings ...)
2018-08-06 12:27 ` [tarantool-patches] [PATCH v1 4/5] box: introduce path_hash and tuple_field tree Kirill Shcherbatov
@ 2018-08-06 12:27 ` Kirill Shcherbatov
2018-08-08 22:03 ` [tarantool-patches] Re: [PATCH v1 0/5] box: indexes by JSON path Vladislav Shpilevoy
5 siblings, 0 replies; 14+ messages in thread
From: Kirill Shcherbatov @ 2018-08-06 12:27 UTC (permalink / raw)
To: tarantool-patches; +Cc: v.shpilevoy, Kirill Shcherbatov
From now on it is possible to create indexes by JSON path
using field names specified in the format.
@TarantoolBot document
Title: Indexes by JSON path
Sometimes field data can have a complex document structure.
When this structure is consistent across the whole space,
you can create an index by JSON path.
Example:
s:create_index('json_index',
{parts = {{'data.FIO["fname"]', 'str'}}})
Part of #1012.
---
src/box/lua/schema.lua | 58 +++++++++++++++++++++++++++++++++++++++------
test/engine/iterator.result | 2 +-
test/engine/tuple.result | 36 ++++++++++++++++++++++++++++
test/engine/tuple.test.lua | 10 ++++++++
4 files changed, 98 insertions(+), 8 deletions(-)
diff --git a/src/box/lua/schema.lua b/src/box/lua/schema.lua
index b9b8c90..62b83aa 100644
--- a/src/box/lua/schema.lua
+++ b/src/box/lua/schema.lua
@@ -556,6 +556,48 @@ local function update_index_parts_1_6_0(parts)
return result
end
+local function format_field_index_by_name(format, name)
+ for k,v in pairs(format) do
+ if v.name == name then
+ return k
+ end
+ end
+ return nil
+end
+
+local function format_field_resolve(format, name)
+ local idx = nil
+ local field_name = nil
+ -- try to resolve the whole name
+ idx = format_field_index_by_name(format, name)
+ if idx ~= nil then
+ return idx, nil
+ end
+ -- try to resolve the field by name
+ field_name = string.match(name, "^[a-z][a-z0-9]*")
+ if field_name ~= nil then
+ idx = format_field_index_by_name(format, field_name)
+ local suffix = string.sub(name, string.len(field_name) + 2)
+ -- string.sub returns '' (never nil) when out of range
+ if idx ~= nil and suffix ~= '' then
+ return idx, string.format("[%d]%s", idx, suffix)
+ end
+ -- simplified json_path
+ return idx, nil
+ end
+ -- try to resolve the field by index
+ field_name = string.match(name, "^%[[0-9]*%]")
+ if field_name ~= nil then
+ idx = tonumber(string.sub(field_name, 2, -2))
+ local suffix = string.sub(name, string.len(field_name) + 1)
+ if suffix == '' then
+ -- simplified json_path
+ return idx, nil
+ end
+ return idx, name
+ end
+ return nil, nil
+end
+
local function update_index_parts(format, parts)
if type(parts) ~= "table" then
box.error(box.error.ILLEGAL_PARAMS,
@@ -607,16 +649,18 @@ local function update_index_parts(format, parts)
box.error(box.error.ILLEGAL_PARAMS,
"options.parts[" .. i .. "]: field (name or number) is expected")
elseif type(part.field) == 'string' then
- for k,v in pairs(format) do
- if v.name == part.field then
- part.field = k
- break
- end
- end
- if type(part.field) == 'string' then
+ local idx, path = format_field_resolve(format, part.field)
+ if idx == nil then
box.error(box.error.ILLEGAL_PARAMS,
"options.parts[" .. i .. "]: field was not found by name '" .. part.field .. "'")
end
+ if part.path ~= nil and part.path ~= path then
+ box.error(box.error.ILLEGAL_PARAMS,
+ "options.parts[" .. i .. "]: field path '"..part.path.." doesn't math path resolved by name '" .. part.field .. "'")
+ end
+ parts_can_be_simplified = parts_can_be_simplified and path == nil
+ part.field = idx
+ part.path = path or part.path
elseif part.field == 0 then
box.error(box.error.ILLEGAL_PARAMS,
"options.parts[" .. i .. "]: field (number) must be one-based")
diff --git a/test/engine/iterator.result b/test/engine/iterator.result
index 10097ed..0bca87e 100644
--- a/test/engine/iterator.result
+++ b/test/engine/iterator.result
@@ -4213,7 +4213,7 @@ s:replace{35}
...
state, value = gen(param,state)
---
-- error: 'builtin/box/schema.lua:1034: usage: next(param, state)'
+- error: 'builtin/box/schema.lua:1078: usage: next(param, state)'
...
value
---
diff --git a/test/engine/tuple.result b/test/engine/tuple.result
index 02b92f3..9d8d06f 100644
--- a/test/engine/tuple.result
+++ b/test/engine/tuple.result
@@ -945,6 +945,42 @@ assert(idx ~= nil)
---
- true
...
+format = {{'int1', 'unsigned'}, {'int2', 'unsigned'}, {'data', 'array'}, {'int3', 'unsigned'}, {'int4', 'unsigned'}}
+---
+...
+s_withdata:format(format)
+---
+- error: Field 3 has type 'map' in one index, but type 'array' in another
+...
+format = {{'int1', 'unsigned'}, {'int2', 'unsigned'}, {'data', 'map'}, {'int3', 'unsigned'}, {'int4', 'unsigned'}}
+---
+...
+s_withdata:format(format)
+---
+...
+idx2 = s_withdata:create_index('test2', {parts = {{2, 'number'}, {'[3]["FIO"]["fname"]', 'str', path = '[3].FIO.fname'}}})
+---
+- error: 'Illegal parameters, options.parts[2]: field path ''[3].FIO.fname'' doesn''t
+ match the path resolved by name ''[3]["FIO"]["fname"]'''
+...
+idx2 = s_withdata:create_index('test2', {parts = {{2, 'number'}, {'[3]["FIO"]["fname"]', 'str', path = '[3]["FIO"]["fname"]'}}})
+---
+...
+assert(idx2 ~= nil)
+---
+- true
+...
+idx3 = s_withdata:create_index('test3', {parts = {{2, 'number'}, {'data.FIO["fname"]', 'str'}}})
+---
+...
+assert(idx3 ~= nil)
+---
+- true
+...
+assert(idx2.parts[2].path == "[3][\"FIO\"][\"fname\"]")
+---
+- true
+...
s_withdata:insert{7, 7, {town = 'NY', FIO = 666}, 4, 5}
---
- error: Tuple doesn't match document structure defined as index
diff --git a/test/engine/tuple.test.lua b/test/engine/tuple.test.lua
index e106dd0..282973a 100644
--- a/test/engine/tuple.test.lua
+++ b/test/engine/tuple.test.lua
@@ -304,6 +304,16 @@ s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[2]
s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[3].FIO....fname'}}})
idx = s_withdata:create_index('test1', {parts = {{2, 'number'}, {3, 'str', path = '[3]["FIO"]["fname"]'}, {3, 'str', path = '[3]["FIO"]["sname"]'}}})
assert(idx ~= nil)
+format = {{'int1', 'unsigned'}, {'int2', 'unsigned'}, {'data', 'array'}, {'int3', 'unsigned'}, {'int4', 'unsigned'}}
+s_withdata:format(format)
+format = {{'int1', 'unsigned'}, {'int2', 'unsigned'}, {'data', 'map'}, {'int3', 'unsigned'}, {'int4', 'unsigned'}}
+s_withdata:format(format)
+idx2 = s_withdata:create_index('test2', {parts = {{2, 'number'}, {'[3]["FIO"]["fname"]', 'str', path = '[3].FIO.fname'}}})
+idx2 = s_withdata:create_index('test2', {parts = {{2, 'number'}, {'[3]["FIO"]["fname"]', 'str', path = '[3]["FIO"]["fname"]'}}})
+assert(idx2 ~= nil)
+idx3 = s_withdata:create_index('test3', {parts = {{2, 'number'}, {'data.FIO["fname"]', 'str'}}})
+assert(idx3 ~= nil)
+assert(idx2.parts[2].path == "[3][\"FIO\"][\"fname\"]")
s_withdata:insert{7, 7, {town = 'NY', FIO = 666}, 4, 5}
s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = 666, sname = 'Bond'}}, 4, 5}
s_withdata:insert{7, 7, {town = 'NY', FIO = {fname = "James"}}, 4, 5}
--
2.7.4
* [tarantool-patches] Re: [PATCH v1 0/5] box: indexes by JSON path
2018-08-06 12:26 [tarantool-patches] [PATCH v1 0/5] box: indexes by JSON path Kirill Shcherbatov
` (4 preceding siblings ...)
2018-08-06 12:27 ` [tarantool-patches] [PATCH v1 5/5] box: specify indexes in user-friendly form Kirill Shcherbatov
@ 2018-08-08 22:03 ` Vladislav Shpilevoy
2018-08-15 12:14 ` Kirill Shcherbatov
5 siblings, 1 reply; 14+ messages in thread
From: Vladislav Shpilevoy @ 2018-08-08 22:03 UTC (permalink / raw)
To: tarantool-patches
Hi! Thanks for the patchset! See 3 comments below.
On 06/08/2018 15:26, Kirill Shcherbatov wrote:
> Sometimes field data could have complex document structure.
> When this structure is consistent across whole document,
> you are able to create an index by JSON path.
> This came possible with auxiliary structures per tuple_format:
> tree of intermediate path fields and hashtable referes to leaf field
1. Typo: no such word 'referes'.
2. On the Travis engine/upsert test fails:
https://travis-ci.org/tarantool/tarantool/jobs/412620927
3. I've got the build errors:
/Users/v.shpilevoy/Work/Repositories/tarantool/src/box/tuple_format.c:72:67: error: excess elements in union initializer [-Werror]
FIELD_TYPE_ANY, TUPLE_OFFSET_SLOT_NIL, false, false, {{NULL, 0}, NULL}
^~~~
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/9.1.0/include/stddef.h:105:16: note: expanded from macro 'NULL'
# define NULL ((void*)0)
^~~~~~~~~~
/Users/v.shpilevoy/Work/Repositories/tarantool/src/box/tuple_format.c:803:32: error: variable 'field' is uninitialized when used within its own initialization [-Werror,-Wuninitialized]
struct tuple_field *field = field->array[i];
~~~~~ ^~~~~
2 errors generated.
> that use path as key.
> To speed-up data access by JSON index key_part structure extended
> with offset_slot cache that points to field_map item containing
> data offset for current tuple.
> RFC contains detailed description of those concepts.
> Finally, supported ability to define JSON paths in user-friendly
> form containing format field name (that could be changed).
>
> Branch: http://github.com/tarantool/tarantool/tree/kshch/gh-1012-json-indexes
> Issue: https://github.com/tarantool/tarantool/issues/1012
>
> Kirill Shcherbatov (5):
> rfc: describe a Tarantool JSON indexes
> box: introduce slot_cache in key_part
> box: introduce path field in key_part
> box: introduce path_hash and tuple_field tree
> box: specify indexes in user-friendly form
>
> doc/rfc/1012-json-indexes.md | 188 ++++++++++++++
> src/box/errcode.h | 2 +-
> src/box/index_def.c | 8 +-
> src/box/key_def.c | 199 ++++++++++++++-
> src/box/key_def.h | 17 +-
> src/box/lua/schema.lua | 58 ++++-
> src/box/lua/space.cc | 5 +
> src/box/memtx_bitset.c | 4 +-
> src/box/memtx_rtree.c | 2 +-
> src/box/schema.cc | 8 +-
> src/box/tuple_bloom.c | 8 +-
> src/box/tuple_compare.cc | 48 ++--
> src/box/tuple_extract_key.cc | 36 ++-
> src/box/tuple_format.c | 570 +++++++++++++++++++++++++++++++++++++++++--
> src/box/tuple_format.h | 35 ++-
> src/box/tuple_hash.cc | 28 +--
> src/box/tuple_hash.h | 5 +-
> src/box/vy_log.c | 3 +-
> src/box/vy_stmt.h | 2 +-
> test/box/misc.result | 51 ++--
> test/engine/iterator.result | 2 +-
> test/engine/tuple.result | 137 +++++++++++
> test/engine/tuple.test.lua | 37 +++
> 23 files changed, 1300 insertions(+), 153 deletions(-)
> create mode 100644 doc/rfc/1012-json-indexes.md
>