Tarantool development patches archive
 help / color / mirror / Atom feed
From: Kirill Shcherbatov <kshcherbatov@tarantool.org>
To: tarantool-patches@freelists.org
Cc: v.shpilevoy@tarantool.org,
	Kirill Shcherbatov <kshcherbatov@tarantool.org>
Subject: [tarantool-patches] [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes
Date: Thu, 26 Jul 2018 14:15:08 +0300	[thread overview]
Message-ID: <7192ba6c28bf9cd637f7e1e5263bbf9771cc6f44.1532603654.git.kshcherbatov@tarantool.org> (raw)

Part of #1012
---
Branch: http://github.com/tarantool/tarantool/tree/kshch/gh-1012-json-indexes
Issue: https://github.com/tarantool/tarantool/issues/1012

 doc/rfc/1012-json-indexes.md | 119 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 119 insertions(+)
 create mode 100644 doc/rfc/1012-json-indexes.md

diff --git a/doc/rfc/1012-json-indexes.md b/doc/rfc/1012-json-indexes.md
new file mode 100644
index 0000000..a02daef
--- /dev/null
+++ b/doc/rfc/1012-json-indexes.md
@@ -0,0 +1,119 @@
+# Tarantool JSON Indexes
+
+* **Status**: In progress
+* **Start date**: 26-07-2018
+* **Authors**: Kirill Shcherbatov @kshcherbatov kshcherbatov@tarantool.org
+* **Issues**: [#1012](https://github.com/tarantool/tarantool/issues/1012)
+
+## Summary
+Tarantool JSON Indexes would allow to index a parts of complex objects stored in fields.
+```
+box.space.test:create_index('contributer', {parts = {{1, 'number'}, {2, 'str', data_path = "name.last"}}})
+```
+
+## Background and motivation
+Indexes are consists of key_part(s). We should allow to specify `"data_path"` strings on indexes creation.
+According to index definition, indexed information should be acceded really fast. Previously, only fields indexes were allowed. 
+This was achieved with special metadata stored in tuples before main payload.
+```
+                           +----------------------+-----------------------+-------------+
+  tuple_begin, ..., raw =  |      extra_size      | offset N ... offset 1 | MessagePack |
+  |                        +----------------------+-----------------------+-------------+
+  |                                                                       ^
+  +-----------------------------------------------------------------------data_offset
+```
+Tuple format field used in index has initialized `offset_slot`: index in metadata data offsets array for this specific tuple.
+Those offsets are used to access data without unpacking whole tuple on lookup.
+We can use same approach to gain similar results with JSON indexes.
+
+## Detailed design
+
+### Concept idea
+We suppose to extend 'tuple_format' with 'HASH: json_path -> offset_slot'
+```
+struct tuple_format {
+	...
+	/** Hash table with mapping json_path -> offset_slot. */
+	struct mh_strni32_t *json_index_hash;
+}
+```
+### Constructing new tuples
+1. User specify an index with json-defined structure:
+```
+box.space.test:create_index('contributer', {parts = {{1, 'number'}, {2, 'str', data_path = "name.last"}}})
+```
+2. `key_part_def` restored from msgpack generates `key_part` for field '2' with 'data_path' = "name.last".  Verification of json validity is also carried.
+
+3. tuple_format_create constructs new space format and initialize field included in the indexes with `tuple_field_add_json_index` routine: 
+```
+if (part->data_path != NULL) {
+	if (tuple_field_add_json_index(format,
+					part->data_path,
+					--current_slot) != 0)
+		return -1;
+}
+```
+`tuple_field_add_json_index` constructs a new hash table if it is not exists, and insert `<data_path, hash, slot_offset>`
+
+4. Finally, on new tuple creation, `tuple_init_field_map` uses hash to fill tuple map and check that data corresponds data_path restriction:
+```
+for (; i < defined_field_count; ++i, ++field) {
+	struct mh_strni32_t *hash = format->json_index_hash;
+	if (hash != NULL) {
+		for (mh_int_t i = mh_first(hash); i != mh_end(hash);
+		     i = mh_next(hash, i)) {
+			int32_t offset_slot =
+				mh_strni32_node(hash, i)->val;
+			const char *field = NULL;
+			const char *data_path = mh_strni32_node(hash, i)->str;
+			if (tuple_field_raw_by_path(format, raw, field_map,
+						    data_path, strlen(data_path),
+						    0, &field) != 0 || 
+			    field == NULL)
+				return -1;
+			field_map[offset_slot] =
+				(uint32_t) (field - raw);
+		}
+	}
+}
+```
+
+### Speeding up access with offset_slot cache in key_part object
+Hash lookup could be an expensive operation, so we suppose to cache offset_slots for JSON indexes. 
+
+We suppose to extend 'key_part' structure with 'data_path' field initializing on index creation, 'field_epoch' and reallocatable 'offset_slot' int32_t array.
+```
+struct key_part {
+	/** Tuple field index for this part */
+	uint32_t fieldno;
+	...
+
+	/** Data JSON path. */
+	const char *data_format;
+	/** Epoch of tuple_format cached in offset_slot. */
+	uint32_t slot_epoch;
+	/** Cache of corresponding tuple_format slot_offset. */
+	int32_t slot_cache;
+};
+```
+
+```
+struct tuple_format {
+	/** Tuple format epoch. */
+	uint32_t epoch;
+};
+```
+
+### Acceding tuple by JSON-path
+#### tuple_field_raw_by_path
+```
+tuple_field_raw_by_path(const struct tuple_format *format, const char *tuple,
+                        const uint32_t *field_map, const char *path,
+                        uint32_t path_len, uint32_t path_hash,
+                        const char **field)
+```
+We can trivially test if name is an index looking to the hash and obtain offset to specifying offset to `tuple_field_raw`.
+
+
+## Rationale and alternatives
+...
-- 
2.7.4

             reply	other threads:[~2018-07-26 11:15 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-07-26 11:15 Kirill Shcherbatov [this message]
2018-07-27 15:10 ` Vladimir Davydov
2018-07-30 12:02   ` [tarantool-patches] " Kirill Shcherbatov
2018-07-30 12:21     ` Vladimir Davydov
2018-07-30 13:45     ` Vladislav Shpilevoy
2018-07-30 14:09       ` Kirill Shcherbatov
2018-07-30 16:14       ` Kirill Shcherbatov
2018-07-30 18:46         ` Vladislav Shpilevoy
2018-07-30 19:23           ` Kirill Shcherbatov
2018-07-30 20:19             ` Vladislav Shpilevoy
2018-07-30 23:17               ` Vladislav Shpilevoy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7192ba6c28bf9cd637f7e1e5263bbf9771cc6f44.1532603654.git.kshcherbatov@tarantool.org \
    --to=kshcherbatov@tarantool.org \
    --cc=tarantool-patches@freelists.org \
    --cc=v.shpilevoy@tarantool.org \
    --subject='Re: [tarantool-patches] [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox