From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <vdavydov.dev@gmail.com>
Date: Thu, 28 Mar 2019 13:21:21 +0300
From: Vladimir Davydov <vdavydov.dev@gmail.com>
Subject: Re: [PATCH v2 0/5] box: introduce multikey indexes in memtx
Message-ID: <20190328102121.3arq2o4iflksxsio@esperanza>
References: <cover.1552998554.git.kshcherbatov@tarantool.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <cover.1552998554.git.kshcherbatov@tarantool.org>
To: Kirill Shcherbatov <kshcherbatov@tarantool.org>
Cc: tarantool-patches@freelists.org
List-ID: <tarantool-patches.dev.tarantool.org>

First, please address comments by Kostja - you seem to be silently
ignoring them, which is impolite.

Also, please stop replying to comments with a new patch. Mail threads
exist to discuss, not to mindlessly fix everything that a review
spotted. A reviewer may be wrong now and then, in which case you have
to think yourself and argue if necessary. Once the change set is big
enough, send a new series with a proper change log. Take a look at how
Cyrill handles it, for example. Also, your hunger to quickly deliver a
patch while sacrificing code quality and making dubious design decision
is really annoying. And I dare say, I'm not alone thinking that - ask
Nikita or Vlad. In a real-world open-source project (e.g. Linux kernel),
your emails would soon land in junk - people out there are far not as
patient as we are. You should finally start putting much more time and
efforts into thinking through the design, self-reviewing and polishing
your code and comments rather than mindlessly typing and thus wasting
reviewer's time.

I haven't reviewed the patches carefully yet (I hope you'll do it
yourself before submitting a new version). Here's a few things that
caught my eye after looking through the code.

On Tue, Mar 19, 2019 at 03:32:05PM +0300, Kirill Shcherbatov wrote:
>   lib: introduce json_path_is_multikey helper

The helper returns true iff multikey_len equals path hence the bool
return value is useless. Return multikey offset instead. And rename
the function appropriately.

And as Kostja mentioned, the new function isn't tested properly.

>   lib: introduce is_weak_cmp flag for json_path_cmp

This flag is ugly, both semantically and esthetically. As a matter fact,
I don't see why you would need such a function at all. You can check if
paths to the first multikey part is the same right in key_def_new.

BTW I don't like that this check lives in index_def_is_valid, because it
means that there may be a key_def with invalid configuration. Besides,
this makes checks scattered all over the code, making it difficult to
follow.

>   box: move offset_slot init to tuple_format_add_field
>   box: introduce field_map_builder for field_map init

Cutting a function in two halves at a seemingly random place, as you did
in case of tuple_field_map_create, is a wrong way to refactor it. What
you need to do is move out semantically independent parts. E.g. required
fields initialization/checking, tuple data/format traversal. There's a
lot of code duplication between vy_stmt_new_surrogate_delete and
tuple_create_field_map - you should try to eliminate it before extending
field map.

Also, field map querying should live in this new module, too, because
keeping code building and using something in one place is easier for
understanding. BTW, better call field_map.[hc] not tuple_field_map.[hc],
because the object is called simply field_map.

> +/** Preliminary field_map atom. */
> +typedef uint32_t field_map_builder_item;

We don't use typedefs for things like that. The name's bad, too: 'item'
is too vague. Other names are far not perfect, either. E.g. consider
field_map_builder_size. Is it the size of a field_map_builder? No, it's
the size of a field map it will build. Please try to come up with better
names.

Also, comments as they are aren't enough. You should explain in
a nutshell how a builder is used, what field map looks like, why
offsets are negative, everything that may raise questions.

> @@ -877,17 +819,17 @@ tuple_field_map_create(struct tuple_format *format, const char *tuple,
>                         token.num = idx;
>                         break;
>                 case MP_MAP:
> -                       if (mp_typeof(*pos) != MP_STR) {
> +                       if (mp_typeof(**pos) != MP_STR) {
>                                 /*
>                                  * JSON path support only string
>                                  * keys for map. Skip this entry.
>                                  */
> -                               mp_next(&pos);
> -                               mp_next(&pos);
> +                               mp_next(pos);
> +                               mp_next(pos);

I don't see why you'd need to change *pos to **pos.

> +static inline void
> +field_map_builder_build(struct field_map_builder *builder, char *wptr)

The next patch makes this function quite a big one. I think we should
move it to C file.

>   box: introduce multikey indexes in memtx

> +/** The field_map extent. */
> +struct field_map_ext {

The comment is useless: it simply reiterates the struct name without
shedding any light on the nature of the object.

It seems to me that your comments are bad not because of your poor
English (it isn't much worse than of other members of our team), but
because of your attitude - you simply don't believe that they are worth
spending your precious time on so you hand this work over to a reviewer.
Well, I'm tired of it. You ought to write good descriptive comments to
every non-trivial thing you add. Comments are useful, especially when
you try to understand the code in some time. Use translator/dictionary
if you need to. And it's absolutely okay to spend substantially more
time on writing comments than on the code itself.

BTW you must update the comment to hint_t, pointing out what it's used
for in case of multikey indexes.

> +/** Return field_map extention allocation size. */
> +static inline uint32_t
> +field_map_ext_size(uint32_t items)
> +{
> +	return sizeof(struct field_map_ext) + items * sizeof(uint32_t);
> +}
> +
> +/** Return field_map extention for field_map and offset_slot. */
> +static inline struct field_map_ext *
> +field_map_ext_get(const uint32_t *field_map, int32_t offset_slot)
> +{
> +	return (struct field_map_ext *)((char *)field_map -
> +					field_map[offset_slot]);
> +}

Neither of these functions is used outside field_map.h. Better inline
them or move them to C file. A general rule is to keep the API as
concise as possible provided it doesn't hurt readability.

> @@ -234,6 +234,18 @@ struct key_def {
>  	bool has_optional_parts;
>  	/** Key fields mask. @sa column_mask.h for details. */
>  	uint64_t column_mask;
> +	/**
> +	 * In case of multikey index, the index of the key_part
> +	 * containing JSON path with array index placeholder "[*]".
> +	 * Otherwise multikey_part_idx == part_count.
> +	 */
> +	uint32_t multikey_part_idx;
> +	/**
> +	 * In case of multikey index, the length of the
> +	 * parts[multikey_part_idx].path substring "...[*]"
> +	 * @see json_path_is_multikey().
> +	 */
> +	uint32_t multikey_path_len;

I don't like mixing part_idx and path_len in one struct. Better store
field id and path in here instead IMO.

> @@ -876,6 +940,10 @@ tuple_field_map_initialize(struct field_map_builder *builder,
>  					mp_decode_array(pos) :
>  					mp_decode_map(pos);
>  			mp_stack_push(&stack, type, size);
> +			if (json_token_is_multikey(&field->token)) {
> +				assert(type == MP_ARRAY);
> +				mk_parent_frame = &frames[stack.used - 1];
> +			}

This proves mp_stack API inadequate: clearly, mp_stack_top() is missing.

> @@ -699,6 +699,12 @@ vinyl_space_check_index_def(struct space *space, struct index_def *index_def)
>  			return -1;
>  		}
>  	}
> +	if (key_def_is_multikey(index_def->key_def)) {
> +		diag_set(ClientError, ER_MODIFY_INDEX,
> +			 index_def->name, space_name(space),
> +			 "vinyl space index cannot be multikey");
> +		return -1;
> +	}

Should be ER_UNSUPPORTED.


> +	if (child_field->token.type == JSON_TOKEN_ANY &&
> +		!json_token_is_multikey(&parent_field->token) &&
> +		!json_token_is_leaf(&parent_field->token)) {
> +		assert(expected_type == FIELD_TYPE_ARRAY);

Bad indentation.


> +int
> +memtx_tree_index_replace_multikey(struct index *base, struct tuple *old_tuple,
> +				  struct tuple *new_tuple,
> +				  enum dup_replace_mode mode,
> +				  struct tuple **result)

The function should be static, apparently.

> -template<bool is_nullable, bool has_optional_parts>
> +template<bool is_nullable, bool has_optional_parts, bool is_multikey>
>  static void
>  key_def_set_compare_func_json(struct key_def *def)

I split these functions by meaning: 'json' and 'plain' so that
set_compare_func only checks is_nullable flag. 'multikey' case
should be handled as a part of 'json', similarly to how
'sequential' is handled in 'plain'.

> @@ -1763,12 +1794,12 @@ key_def_set_compare_func_json(struct key_def *def)
>  	def->tuple_compare = tuple_compare_slowpath
>  			<is_nullable, has_optional_parts, true>;
>  	def->tuple_compare_hinted = tuple_compare_slowpath_hinted
> -			<is_nullable, has_optional_parts, true>;
> +			<is_nullable, has_optional_parts, true, is_multikey>;
>  	def->tuple_compare_with_key = tuple_compare_with_key_slowpath
>  			<is_nullable, has_optional_parts, true>;
>  	def->tuple_compare_with_key_hinted =
>  			tuple_compare_with_key_slowpath_hinted
> -			<is_nullable, has_optional_parts, true>;
> +			<is_nullable, has_optional_parts, true, is_multikey>;

tuple_compare and tuple_compare_with_key don't make any sense for
multikey indexes. You should add stubs for them. I guess the stubs
should simply assert(0). The same's fair for extract_key and hints.

> @@ -621,7 +695,8 @@ memtx_tree_index_create_iterator(struct index *base, enum iterator_type type,
>  	it->type = type;
>  	it->key_data.key = key;
>  	it->key_data.part_count = part_count;
> -	it->key_data.hint = key_hint(key, part_count, cmp_def);
> +	if (!key_def_is_multikey(cmp_def))
> +		it->key_data.hint = key_hint(key, part_count, cmp_def);

Please get rid of this 'if'. Overload key_hint instead.

> diff --git a/test/engine/multikey_idx.test.lua b/test/engine/multikey_idx.test.lua

I'd call the test simply 'multikey.test.lua'.

> +-- Duplicates in multikey parts.
> +s:insert({5, {1, 1, 1}, {{fname='A', sname='B'}, {fname='C', sname='D'}, {fname='A', sname='B'}}})
> +---
> +- error: Duplicate key exists in unique index 'idx' in space 'withdata'
> +...

AFAIR the RFC, this shouldn't result in an error. Am I wrong?

>     @TarantoolBot document
>     Title: introduce multikey indexes in memtx
>     Multikey indexes allows you to automatically index set of documents
>     by JSON paths having array index placeholder "[*]". Multikey index
>     cannot be primary as it cannot be unique(by definition).

But it can! I can set is_unique, can't I?

>     Multikey index parts must be compatible: only one "[*]" placeholder
>     is allowed in same position(for all JSON paths in index parts).
>     
>     Example:
>     s = box.schema.space.create('withdata')
>     pk = s:create_index('pk')
>     parts = {
>             {2, 'str', path = 'data[*].name'},
>             {2, 'str', path = 'data[*].extra.phone'}
>     }
>     idx = s:create_index('idx', {parts = parts})
>     s:insert({1, {data = {{name="A", extra={phone="111"}},
>                           {name="B", extra={phone="111"}}},
>                  garbage = 1}}
>     idx:get({'A', '111'})

The docbot request is vastly insufficient:
 - how unique indexes behave
 - how duplicates are handled
 - what's incompatible parts

Please explain every aspect that may raise questions.