From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id C1C0827435 for ; Thu, 26 Jul 2018 07:15:13 -0400 (EDT) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id xcBLLiAfkngI for ; Thu, 26 Jul 2018 07:15:13 -0400 (EDT) Received: from smtpng3.m.smailru.net (smtpng3.m.smailru.net [94.100.177.149]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 06D6E2187D for ; Thu, 26 Jul 2018 07:15:12 -0400 (EDT) From: Kirill Shcherbatov Subject: [tarantool-patches] [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes Date: Thu, 26 Jul 2018 14:15:08 +0300 Message-Id: <7192ba6c28bf9cd637f7e1e5263bbf9771cc6f44.1532603654.git.kshcherbatov@tarantool.org> Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: tarantool-patches@freelists.org Cc: v.shpilevoy@tarantool.org, Kirill Shcherbatov Part of #1012 --- Branch: http://github.com/tarantool/tarantool/tree/kshch/gh-1012-json-indexes Issue: https://github.com/tarantool/tarantool/issues/1012 doc/rfc/1012-json-indexes.md | 119 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 119 insertions(+) create mode 100644 doc/rfc/1012-json-indexes.md diff --git a/doc/rfc/1012-json-indexes.md b/doc/rfc/1012-json-indexes.md new file mode 100644 index 0000000..a02daef --- /dev/null +++ b/doc/rfc/1012-json-indexes.md @@ -0,0 +1,119 @@ +# Tarantool JSON Indexes + +* **Status**: In progress +* **Start date**: 26-07-2018 +* **Authors**: Kirill Shcherbatov @kshcherbatov kshcherbatov@tarantool.org +* **Issues**: [#1012](https://github.com/tarantool/tarantool/issues/1012) + +## Summary +Tarantool JSON Indexes would allow to index a parts of complex objects stored in fields. +``` +box.space.test:create_index('contributer', {parts = {{1, 'number'}, {2, 'str', data_path = "name.last"}}}) +``` + +## Background and motivation +Indexes are consists of key_part(s). We should allow to specify `"data_path"` strings on indexes creation. +According to index definition, indexed information should be acceded really fast. Previously, only fields indexes were allowed. +This was achieved with special metadata stored in tuples before main payload. +``` + +----------------------+-----------------------+-------------+ + tuple_begin, ..., raw = | extra_size | offset N ... offset 1 | MessagePack | + | +----------------------+-----------------------+-------------+ + | ^ + +-----------------------------------------------------------------------data_offset +``` +Tuple format field used in index has initialized `offset_slot`: index in metadata data offsets array for this specific tuple. +Those offsets are used to access data without unpacking whole tuple on lookup. +We can use same approach to gain similar results with JSON indexes. + +## Detailed design + +### Concept idea +We suppose to extend 'tuple_format' with 'HASH: json_path -> offset_slot' +``` +struct tuple_format { + ... + /** Hash table with mapping json_path -> offset_slot. */ + struct mh_strni32_t *json_index_hash; +} +``` +### Constructing new tuples +1. User specify an index with json-defined structure: +``` +box.space.test:create_index('contributer', {parts = {{1, 'number'}, {2, 'str', data_path = "name.last"}}}) +``` +2. `key_part_def` restored from msgpack generates `key_part` for field '2' with 'data_path' = "name.last". Verification of json validity is also carried. + +3. tuple_format_create constructs new space format and initialize field included in the indexes with `tuple_field_add_json_index` routine: +``` +if (part->data_path != NULL) { + if (tuple_field_add_json_index(format, + part->data_path, + --current_slot) != 0) + return -1; +} +``` +`tuple_field_add_json_index` constructs a new hash table if it is not exists, and insert `` + +4. Finally, on new tuple creation, `tuple_init_field_map` uses hash to fill tuple map and check that data corresponds data_path restriction: +``` +for (; i < defined_field_count; ++i, ++field) { + struct mh_strni32_t *hash = format->json_index_hash; + if (hash != NULL) { + for (mh_int_t i = mh_first(hash); i != mh_end(hash); + i = mh_next(hash, i)) { + int32_t offset_slot = + mh_strni32_node(hash, i)->val; + const char *field = NULL; + const char *data_path = mh_strni32_node(hash, i)->str; + if (tuple_field_raw_by_path(format, raw, field_map, + data_path, strlen(data_path), + 0, &field) != 0 || + field == NULL) + return -1; + field_map[offset_slot] = + (uint32_t) (field - raw); + } + } +} +``` + +### Speeding up access with offset_slot cache in key_part object +Hash lookup could be an expensive operation, so we suppose to cache offset_slots for JSON indexes. + +We suppose to extend 'key_part' structure with 'data_path' field initializing on index creation, 'field_epoch' and reallocatable 'offset_slot' int32_t array. +``` +struct key_part { + /** Tuple field index for this part */ + uint32_t fieldno; + ... + + /** Data JSON path. */ + const char *data_format; + /** Epoch of tuple_format cached in offset_slot. */ + uint32_t slot_epoch; + /** Cache of corresponding tuple_format slot_offset. */ + int32_t slot_cache; +}; +``` + +``` +struct tuple_format { + /** Tuple format epoch. */ + uint32_t epoch; +}; +``` + +### Acceding tuple by JSON-path +#### tuple_field_raw_by_path +``` +tuple_field_raw_by_path(const struct tuple_format *format, const char *tuple, + const uint32_t *field_map, const char *path, + uint32_t path_len, uint32_t path_hash, + const char **field) +``` +We can trivially test if name is an index looking to the hash and obtain offset to specifying offset to `tuple_field_raw`. + + +## Rationale and alternatives +... -- 2.7.4