From: Vladimir Davydov <vdavydov.dev@gmail.com> To: Kirill Shcherbatov <kshcherbatov@tarantool.org> Cc: tarantool-patches@freelists.org, v.shpilevoy@tarantool.org Subject: Re: [tarantool-patches] [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes Date: Fri, 27 Jul 2018 18:10:13 +0300 [thread overview] Message-ID: <20180727151013.goyfa4uuf7nl7nou@esperanza> (raw) In-Reply-To: <7192ba6c28bf9cd637f7e1e5263bbf9771cc6f44.1532603654.git.kshcherbatov@tarantool.org> On Thu, Jul 26, 2018 at 02:15:08PM +0300, Kirill Shcherbatov wrote: > Part of #1012 > --- > Branch: http://github.com/tarantool/tarantool/tree/kshch/gh-1012-json-indexes > Issue: https://github.com/tarantool/tarantool/issues/1012 > > doc/rfc/1012-json-indexes.md | 119 +++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 119 insertions(+) > create mode 100644 doc/rfc/1012-json-indexes.md > > diff --git a/doc/rfc/1012-json-indexes.md b/doc/rfc/1012-json-indexes.md > new file mode 100644 > index 0000000..a02daef > --- /dev/null > +++ b/doc/rfc/1012-json-indexes.md > @@ -0,0 +1,119 @@ > +# Tarantool JSON Indexes > + > +* **Status**: In progress > +* **Start date**: 26-07-2018 > +* **Authors**: Kirill Shcherbatov @kshcherbatov kshcherbatov@tarantool.org > +* **Issues**: [#1012](https://github.com/tarantool/tarantool/issues/1012) > + > +## Summary > +Tarantool JSON Indexes would allow to index a parts of complex objects stored in fields. > +``` > +box.space.test:create_index('contributer', {parts = {{1, 'number'}, {2, 'str', data_path = "name.last"}}}) > +``` Please improve the summary. Describe what indexes one can create in tarantool now, and what you're suggesting to add. An example of the new API usage isn't enough. Please describe the new API like this is a documentation page. A couple of questions regarding the new API: - I don't like 'data_path', why not simply 'path'? - Are we planning to support the format proposed by a customer in the ticket (parts = {{"name.first", "str"}, {"name.last", "str"}})? - How do json indexes interact with the space format? In other words what type should have a field that has a json index over it? I suspect it is implied to be a 'map'. Can we restrict the format of a particular json path without indexing it (i.e. via space format)? > + > +## Background and motivation The content of this section doesn't match its caption. In "background and motivation" section you should describe why we need to implement json path indexes, not how. Description of tuple format should go to "Detailed design" section, I think. > +Indexes are consists of key_part(s). We should allow to specify `"data_path"` strings on indexes creation. > +According to index definition, indexed information should be acceded really fast. Previously, only fields indexes were allowed. > +This was achieved with special metadata stored in tuples before main payload. > +``` > + +----------------------+-----------------------+-------------+ > + tuple_begin, ..., raw = | extra_size | offset N ... offset 1 | MessagePack | > + | +----------------------+-----------------------+-------------+ > + | ^ > + +-----------------------------------------------------------------------data_offset > +``` > +Tuple format field used in index has initialized `offset_slot`: index in metadata data offsets array for this specific tuple. > +Those offsets are used to access data without unpacking whole tuple on lookup. > +We can use same approach to gain similar results with JSON indexes. > + > +## Detailed design > + > +### Concept idea > +We suppose to extend 'tuple_format' with 'HASH: json_path -> offset_slot' > +``` > +struct tuple_format { > + ... > + /** Hash table with mapping json_path -> offset_slot. */ > + struct mh_strni32_t *json_index_hash; > +} > +``` Is that it? Is this the whole "Concept idea"? In the "Detailed design" design you should describe in plain English what challenges you need to solve and how you're planning to solve them, using code snippets only to support your words (not instead of!) so that even someone who's not closely familiar with tarantool internals can understand what's you're going to do here. And please try not to make grammatical mistakes - they make the text really difficult to read. > +### Constructing new tuples > +1. User specify an index with json-defined structure: > +``` > +box.space.test:create_index('contributer', {parts = {{1, 'number'}, {2, 'str', data_path = "name.last"}}}) > +``` > +2. `key_part_def` restored from msgpack generates `key_part` for field '2' with 'data_path' = "name.last". Verification of json validity is also carried. > + > +3. tuple_format_create constructs new space format and initialize field included in the indexes with `tuple_field_add_json_index` routine: > +``` > +if (part->data_path != NULL) { > + if (tuple_field_add_json_index(format, > + part->data_path, > + --current_slot) != 0) > + return -1; > +} > +``` > +`tuple_field_add_json_index` constructs a new hash table if it is not exists, and insert `<data_path, hash, slot_offset>` You could just say that for each indexed json path you allocate a slot in the offset map instead of writing these 3 bullet points. That would be much easier to understand... What if the same json path is indexed twice (by two indexes)? I assume you allocate just one slot in the offset map. This should be clearly put in the text. Where inside the offset map do you allocate index slots for json path fields? After regular fields or before? Again, this should be described in the RFC, probably with a diagram, because it's impossible to fathom from what you wrote. > + > +4. Finally, on new tuple creation, `tuple_init_field_map` uses hash to fill tuple map and check that data corresponds data_path restriction: > +``` > +for (; i < defined_field_count; ++i, ++field) { > + struct mh_strni32_t *hash = format->json_index_hash; > + if (hash != NULL) { > + for (mh_int_t i = mh_first(hash); i != mh_end(hash); > + i = mh_next(hash, i)) { > + int32_t offset_slot = > + mh_strni32_node(hash, i)->val; > + const char *field = NULL; > + const char *data_path = mh_strni32_node(hash, i)->str; > + if (tuple_field_raw_by_path(format, raw, field_map, > + data_path, strlen(data_path), > + 0, &field) != 0 || > + field == NULL) > + return -1; > + field_map[offset_slot] = > + (uint32_t) (field - raw); > + } > + } > +} > +``` I don't like that if the same json path prefix is used by multiple indexes, you'll look it up multiple times for the same tuple, e.g. suppose there are indexes over employee.name.first and employee.name.last, then you'll lookup employee.name twice for every tuple you insert into the space. This is sub-optimal. IMO the whole idea of mapping the whole json path as it was put in by the user to offset map slot is flawed, because two different strings can represent the same json path (e.g. because one of them has some spaces while the other doesn't). Suppose the user creates two indexes that index the same json path, but address it by different (in terms of memcpy) strings. What happens then? There will be two offset maps used instead of one? That said, I think you should come up with a better implementation for mapping json path to an offset map rather than simply maintaining json path => offset slot hash. I guess, you should split json path in nodes and store those nodes in tuple_format somehow so that no matter how the user enters a json path, you can find it in the format. This would also allow you to avoid multiple accesses to the same json path prefix when constructing a field map for a new tuple: you could decode the msgpack, looking up indexed paths in the format on the go, i.e. you would decode the tuple just once, no matter how many indexes are out there. > + > +### Speeding up access with offset_slot cache in key_part object > +Hash lookup could be an expensive operation, so we suppose to cache offset_slots for JSON indexes. > + > +We suppose to extend 'key_part' structure with 'data_path' field initializing on index creation, 'field_epoch' and reallocatable 'offset_slot' int32_t array. > +``` And how do you expect me to understand what they will be used for? Describe what you need them for in English (the comments in the code snippet below don't make it any clearer). > +struct key_part { > + /** Tuple field index for this part */ > + uint32_t fieldno; > + ... > + > + /** Data JSON path. */ > + const char *data_format; 'data_format'? 'data_path' you mean? Why not simply 'path'? > + /** Epoch of tuple_format cached in offset_slot. */ > + uint32_t slot_epoch; > + /** Cache of corresponding tuple_format slot_offset. */ > + int32_t slot_cache; What are these new fields initialized to for non-json key parts? > +}; > +``` > + > +``` > +struct tuple_format { > + /** Tuple format epoch. */ > + uint32_t epoch; > +}; > +``` So basically, the cached slots of existing indexes will be cleared when the space format changes (e.g. a new index is created)? Or will you rebuild them? It's unclear from the text. What I'm interested in is the following case. Suppose you have a space with one json index. It has a lot of tuples. One day you decide to create a new index over a field that wasn't indexed before. Currently (non-json indexes), the old index will be fast for both old and new tuples while the new index will be fast only for new tuples and slow for old tuples, because old tuples don't have the corresponding field in the field map. What about json indexes? What will happen with them in such a case? If you update key parts after the new index is created, then the old index will be slow for all tuples, which is bad. If you don't, the new index will be slow for all tuples, which is also bad. Anyway, it's worse than the behavior we have now. May be, we could somehow store field map offsets for all formats that are currently used by tuples stored in the space in each key part, i.e. turn slot_cache into an array: key_part->slot_cache[format->epoch] > + > +### Acceding tuple by JSON-path Acceding? > +#### tuple_field_raw_by_path > +``` > +tuple_field_raw_by_path(const struct tuple_format *format, const char *tuple, > + const uint32_t *field_map, const char *path, > + uint32_t path_len, uint32_t path_hash, > + const char **field) > +``` > +We can trivially test if name is an index looking to the hash and obtain offset to specifying offset to `tuple_field_raw`. I don't understand this sentence... > + > + > +## Rationale and alternatives > +... No rationale, no alternatives?
next prev parent reply other threads:[~2018-07-27 15:10 UTC|newest] Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top 2018-07-26 11:15 Kirill Shcherbatov 2018-07-27 15:10 ` Vladimir Davydov [this message] 2018-07-30 12:02 ` [tarantool-patches] " Kirill Shcherbatov 2018-07-30 12:21 ` Vladimir Davydov 2018-07-30 13:45 ` Vladislav Shpilevoy 2018-07-30 14:09 ` Kirill Shcherbatov 2018-07-30 16:14 ` Kirill Shcherbatov 2018-07-30 18:46 ` Vladislav Shpilevoy 2018-07-30 19:23 ` Kirill Shcherbatov 2018-07-30 20:19 ` Vladislav Shpilevoy 2018-07-30 23:17 ` Vladislav Shpilevoy
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20180727151013.goyfa4uuf7nl7nou@esperanza \ --to=vdavydov.dev@gmail.com \ --cc=kshcherbatov@tarantool.org \ --cc=tarantool-patches@freelists.org \ --cc=v.shpilevoy@tarantool.org \ --subject='Re: [tarantool-patches] [PATCH v1 1/1] rfc: describe a Tarantool JSON indexes' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox