Date: Thu, 28 Mar 2019 13:21:21 +0300
From: Vladimir Davydov
Subject: Re: [PATCH v2 0/5] box: introduce multikey indexes in memtx
Message-ID: <20190328102121.3arq2o4iflksxsio@esperanza>
To: Kirill Shcherbatov
Cc: tarantool-patches@freelists.org

First, please address comments by Kostja - you seem to be silently
ignoring them, which is impolite.

Also, please stop replying to comments with a new patch. Mail threads
exist to discuss, not to mindlessly fix everything that a review
spotted. A reviewer may be wrong now and then, in which case you have
to think for yourself and argue if necessary. Once the change set is
big enough, send a new series with a proper change log. Take a look at
how Cyrill handles it, for example.

Also, your hunger to quickly deliver a patch while sacrificing code
quality and making dubious design decisions is really annoying. And I
dare say, I'm not alone in thinking that - ask Nikita or Vlad. In a
real-world open-source project (e.g. Linux kernel), your emails would
soon land in junk - people out there are nowhere near as patient as we
are. You should finally start putting much more time and effort into
thinking through the design, self-reviewing and polishing your code and
comments rather than mindlessly typing and thus wasting reviewers'
time.

I haven't reviewed the patches carefully yet (I hope you'll do it
yourself before submitting a new version). Here are a few things that
caught my eye after looking through the code.

On Tue, Mar 19, 2019 at 03:32:05PM +0300, Kirill Shcherbatov wrote:
> lib: introduce json_path_is_multikey helper

The helper returns true iff multikey_len equals path hence the bool
return value is useless. Return multikey offset instead. And rename
the function appropriately.

And as Kostja mentioned, the new function isn't tested properly.
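For illustration, a minimal sketch of a helper that returns an offset
rather than a bool. The function name and the "return path_len when
there is no placeholder" convention are assumptions, not part of the
patch, and the scan below is a naive byte search rather than a real
JSON path tokenizer:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (name and contract are assumptions).
 * Return the offset of the first "[*]" placeholder in the path,
 * or path_len if the path has no placeholder.
 * Naive byte scan: the path is not tokenized, so a "[*]" that
 * appears inside a quoted map key would be misdetected.
 */
static inline size_t
json_path_multikey_offset_sketch(const char *path, size_t path_len)
{
    static const char placeholder[] = "[*]";
    const size_t placeholder_len = sizeof(placeholder) - 1;
    assert(path != NULL || path_len == 0);
    for (size_t i = 0; i + placeholder_len <= path_len; i++) {
        if (memcmp(path + i, placeholder, placeholder_len) == 0)
            return i;
    }
    return path_len;
}
```

With such a contract a path is multikey iff the returned offset is
less than path_len, so a separate bool helper isn't needed.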
> lib: introduce is_weak_cmp flag for json_path_cmp

This flag is ugly, both semantically and esthetically. As a matter of
fact, I don't see why you would need such a function at all. You can
check if paths to the first multikey part are the same right in
key_def_new.

BTW I don't like that this check lives in index_def_is_valid, because
it means that there may be a key_def with invalid configuration.
Besides, this makes checks scattered all over the code, making it
difficult to follow.

> box: move offset_slot init to tuple_format_add_field
> box: introduce field_map_builder for field_map init

Cutting a function in two halves at a seemingly random place, as you
did in case of tuple_field_map_create, is a wrong way to refactor it.
What you need to do is move out semantically independent parts. E.g.
required fields initialization/checking, tuple data/format traversal.

There's a lot of code duplication between vy_stmt_new_surrogate_delete
and tuple_field_map_create - you should try to eliminate it before
extending field map.

Also, field map querying should live in this new module, too, because
keeping code building and using something in one place is easier for
understanding.

BTW, better call field_map.[hc] not tuple_field_map.[hc], because the
object is called simply field_map.

> +/** Preliminary field_map atom. */
> +typedef uint32_t field_map_builder_item;

We don't use typedefs for things like that. The name's bad, too: 'item'
is too vague.

Other names are far from perfect, either. E.g. consider
field_map_builder_size. Is it the size of a field_map_builder? No, it's
the size of a field map it will build. Please try to come up with
better names.

Also, comments as they are aren't enough. You should explain in a
nutshell how a builder is used, what field map looks like, why offsets
are negative, everything that may raise questions.
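As an example of the kind of explanation meant here, a minimal sketch
of a builder whose comments state the layout and the reason for
negative slots. All names are hypothetical, and the layout (the map
stored right before the tuple data and indexed backwards from its end)
is an assumption made for the sketch, not a description of the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { FMAP_SKETCH_MAX_SLOTS = 16 };

/*
 * Hypothetical field map builder.
 *
 * Assumed layout: the field map is an array of uint32_t offsets
 * placed immediately before the tuple data. A reader holds a
 * pointer to the END of the map (i.e. to the data start) and
 * indexes it with a negative slot: map_end[-1] is the first slot,
 * map_end[-2] the second, and so on. Slots are negative because
 * the map grows backwards from the data.
 *
 * Usage: create -> set each slot while parsing -> build.
 */
struct fmap_sketch_builder {
    uint32_t slots[FMAP_SKETCH_MAX_SLOTS];
    uint32_t slot_count;
};

static inline void
fmap_sketch_builder_create(struct fmap_sketch_builder *b, uint32_t slot_count)
{
    assert(slot_count <= FMAP_SKETCH_MAX_SLOTS);
    memset(b->slots, 0, sizeof(b->slots));
    b->slot_count = slot_count;
}

/* Remember the data offset for a (negative) slot. */
static inline void
fmap_sketch_builder_set(struct fmap_sketch_builder *b, int32_t slot,
                        uint32_t offset)
{
    assert(slot < 0 && (uint32_t)-slot <= b->slot_count);
    b->slots[b->slot_count + slot] = offset;
}

/* Size in bytes of the field map that will be built. */
static inline uint32_t
fmap_sketch_build_size(const struct fmap_sketch_builder *b)
{
    return (uint32_t)(b->slot_count * sizeof(uint32_t));
}

/* Write the map so that map_end[slot] yields the stored offset. */
static inline void
fmap_sketch_build(const struct fmap_sketch_builder *b, uint32_t *map_end)
{
    memcpy(map_end - b->slot_count, b->slots, fmap_sketch_build_size(b));
}
```

Note that the size helper is named after what it returns (the size of
the map to be built), not after the builder itself.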
> @@ -877,17 +819,17 @@ tuple_field_map_create(struct tuple_format *format, const char *tuple,
>              token.num = idx;
>              break;
>          case MP_MAP:
> -            if (mp_typeof(*pos) != MP_STR) {
> +            if (mp_typeof(**pos) != MP_STR) {
>                  /*
>                   * JSON path support only string
>                   * keys for map. Skip this entry.
>                   */
> -                mp_next(&pos);
> -                mp_next(&pos);
> +                mp_next(pos);
> +                mp_next(pos);

I don't see why you'd need to change *pos to **pos.

> +static inline void
> +field_map_builder_build(struct field_map_builder *builder, char *wptr)

The next patch makes this function quite a big one. I think we should
move it to a C file.

> box: introduce multikey indexes in memtx
>
> +/** The field_map extent. */
> +struct field_map_ext {

The comment is useless: it simply reiterates the struct name without
shedding any light on the nature of the object.

It seems to me that your comments are bad not because of your poor
English (it isn't much worse than that of other members of our team),
but because of your attitude - you simply don't believe that they are
worth spending your precious time on so you hand this work over to a
reviewer. Well, I'm tired of it. You ought to write good descriptive
comments for every non-trivial thing you add. Comments are useful,
especially when you try to understand the code some time later. Use a
translator/dictionary if you need to. And it's absolutely okay to spend
substantially more time on writing comments than on the code itself.

BTW you must update the comment to hint_t, pointing out what it's used
for in case of multikey indexes.

> +/** Return field_map extention allocation size. */
> +static inline uint32_t
> +field_map_ext_size(uint32_t items)
> +{
> +    return sizeof(struct field_map_ext) + items * sizeof(uint32_t);
> +}
> +
> +/** Return field_map extention for field_map and offset_slot. */
> +static inline struct field_map_ext *
> +field_map_ext_get(const uint32_t *field_map, int32_t offset_slot)
> +{
> +    return (struct field_map_ext *)((char *)field_map -
> +                                    field_map[offset_slot]);
> +}

Neither of these functions is used outside field_map.h. Better inline
them or move them to a C file. A general rule is to keep the API as
concise as possible provided it doesn't hurt readability.

> @@ -234,6 +234,18 @@ struct key_def {
>      bool has_optional_parts;
>      /** Key fields mask. @sa column_mask.h for details. */
>      uint64_t column_mask;
> +    /**
> +     * In case of multikey index, the index of the key_part
> +     * containing JSON path with array index placeholder "[*]".
> +     * Otherwise multikey_part_idx == part_count.
> +     */
> +    uint32_t multikey_part_idx;
> +    /**
> +     * In case of multikey index, the length of the
> +     * parts[multikey_part_idx].path substring "...[*]"
> +     * @see json_path_is_multikey().
> +     */
> +    uint32_t multikey_path_len;

I don't like mixing part_idx and path_len in one struct. Better store
field id and path in here instead IMO.

> @@ -876,6 +940,10 @@ tuple_field_map_initialize(struct field_map_builder *builder,
>              mp_decode_array(pos) :
>              mp_decode_map(pos);
>          mp_stack_push(&stack, type, size);
> +        if (json_token_is_multikey(&field->token)) {
> +            assert(type == MP_ARRAY);
> +            mk_parent_frame = &frames[stack.used - 1];
> +        }

This proves mp_stack API inadequate: clearly, mp_stack_top() is
missing.

> @@ -699,6 +699,12 @@ vinyl_space_check_index_def(struct space *space, struct index_def *index_def)
>              return -1;
>          }
>      }
> +    if (key_def_is_multikey(index_def->key_def)) {
> +        diag_set(ClientError, ER_MODIFY_INDEX,
> +                 index_def->name, space_name(space),
> +                 "vinyl space index cannot be multikey");
> +        return -1;
> +    }

Should be ER_UNSUPPORTED.
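On the missing mp_stack_top() noted above: a minimal sketch of what
such an accessor could look like. The struct definitions are
simplified stand-ins written for this example (only the frames array
and the used counter are implied by the quoted hunk); they are not the
actual msgpuck definitions, and every name is hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a stack frame; not msgpuck's definition. */
struct sketch_frame {
    int type;   /* container type tag */
    int count;  /* number of items in the container */
};

/* Simplified stand-in for the stack; not msgpuck's definition. */
struct sketch_stack {
    int size;                     /* capacity of frames[] */
    int used;                     /* number of frames in use */
    struct sketch_frame *frames;  /* caller-provided storage */
};

static inline void
sketch_stack_create(struct sketch_stack *stack, int size,
                    struct sketch_frame *frames)
{
    stack->size = size;
    stack->used = 0;
    stack->frames = frames;
}

static inline void
sketch_stack_push(struct sketch_stack *stack, int type, int count)
{
    assert(stack->used < stack->size);
    struct sketch_frame *frame = &stack->frames[stack->used++];
    frame->type = type;
    frame->count = count;
}

/* Return the most recently pushed frame; the stack must not be empty. */
static inline struct sketch_frame *
sketch_stack_top(struct sketch_stack *stack)
{
    assert(stack->used > 0);
    return &stack->frames[stack->used - 1];
}
```

A caller would then take the top frame through the accessor instead of
indexing frames[stack.used - 1] directly, which keeps the stack
internals out of the traversal code.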
> + if (child_field->token.type == JSON_TOKEN_ANY &&
> + !json_token_is_multikey(&parent_field->token) &&
> + !json_token_is_leaf(&parent_field->token)) {
> + assert(expected_type == FIELD_TYPE_ARRAY);

Bad indentation.

> +int
> +memtx_tree_index_replace_multikey(struct index *base, struct tuple *old_tuple,
> +                                  struct tuple *new_tuple,
> +                                  enum dup_replace_mode mode,
> +                                  struct tuple **result)

The function should be static, apparently.

> -template
> +template
>  static void
>  key_def_set_compare_func_json(struct key_def *def)

I split these functions by meaning: 'json' and 'plain' so that
set_compare_func only checks is_nullable flag. 'multikey' case should
be handled as a part of 'json', similarly to how 'sequential' is
handled in 'plain'.

> @@ -1763,12 +1794,12 @@ key_def_set_compare_func_json(struct key_def *def)
>      def->tuple_compare = tuple_compare_slowpath
>      ;
>      def->tuple_compare_hinted = tuple_compare_slowpath_hinted
> -    ;
> +    ;
>      def->tuple_compare_with_key = tuple_compare_with_key_slowpath
>      ;
>      def->tuple_compare_with_key_hinted =
>      tuple_compare_with_key_slowpath_hinted
> -    ;
> +    ;

tuple_compare and tuple_compare_with_key don't make any sense for
multikey indexes. You should add stubs for them. I guess the stubs
should simply assert(0). The same goes for extract_key and hints.

> @@ -621,7 +695,8 @@ memtx_tree_index_create_iterator(struct index *base, enum iterator_type type,
>      it->type = type;
>      it->key_data.key = key;
>      it->key_data.part_count = part_count;
> -    it->key_data.hint = key_hint(key, part_count, cmp_def);
> +    if (!key_def_is_multikey(cmp_def))
> +        it->key_data.hint = key_hint(key, part_count, cmp_def);

Please get rid of this 'if'. Overload key_hint instead.

> diff --git a/test/engine/multikey_idx.test.lua b/test/engine/multikey_idx.test.lua

I'd call the test simply 'multikey.test.lua'.

> +-- Duplicates in multikey parts.
> +s:insert({5, {1, 1, 1}, {{fname='A', sname='B'}, {fname='C', sname='D'}, {fname='A', sname='B'}}})
> +---
> +- error: Duplicate key exists in unique index 'idx' in space 'withdata'
> +...

AFAIR the RFC, this shouldn't result in an error. Am I wrong?

> @TarantoolBot document
> Title: introduce multikey indexes in memtx
> Multikey indexes allows you to automatically index set of documents
> by JSON paths having array index placeholder "[*]". Multikey index
> cannot be primary as it cannot be unique(by definition).

But it can! I can set is_unique, can't I?

> Multikey index parts must be compatible: only one "[*]" placeholder
> is allowed in same position(for all JSON paths in index parts).
>
> Example:
> s = box.schema.space.create('withdata')
> pk = s:create_index('pk')
> parts = {
>     {2, 'str', path = 'data[*].name'},
>     {2, 'str', path = 'data[*].extra.phone'}
> }
> idx = s:create_index('idx', {parts = parts})
> s:insert({1, {data = {{name="A", extra={phone="111"}},
>                       {name="B", extra={phone="111"}}},
>               garbage = 1}}
> idx:get({'A', '111'})

The docbot request is vastly insufficient:
 - how unique indexes behave
 - how duplicates are handled
 - what incompatible parts are

Please explain every aspect that may raise questions.