From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <kshcherbatov@tarantool.org>
Subject: Re: [tarantool-patches] Re: [PATCH v4 3/3] box: introduce multikey
 indexes in memtx
References: <cover.1555682707.git.kshcherbatov@tarantool.org>
 <c800be65c6587346bc5866079b830d8456a3d037.1555682707.git.kshcherbatov@tarantool.org>
 <20190429160615.5a2uedzpzj6rw76k@esperanza>
From: Kirill Shcherbatov <kshcherbatov@tarantool.org>
Message-ID: <0ea77506-42b9-1e08-cda2-062d2948a0ff@tarantool.org>
Date: Tue, 30 Apr 2019 11:22:18 +0300
MIME-Version: 1.0
In-Reply-To: <20190429160615.5a2uedzpzj6rw76k@esperanza>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
To: tarantool-patches@freelists.org, Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: kostja@tarantool.org
List-ID: <tarantool-patches.dev.tarantool.org>

Hi! Thank for review. 
This is not new code but some questions and answers letter =)

> Unnecessary inclusion?
I've included it because my static analyzer suggested it because
of NULL. But everything is compiled fine without it.

> Please don't assume that JSON_TOKEN_ANY is represented by "[*]" outside
> json library: suppose we'll allow the user to add some spaces ("[ * ]")
> - this code will break then.
> 
> Either use json_lexer to skip the multikey token or add a helper
> function to the json lib.

I also don't like this thing. Look: this is ok?
	/* Skip JSON_TOKEN_ANY token. */
	struct json_lexer lexer;
	struct json_token token;
	json_lexer_create(&lexer, path + multikey_path_len,
			  path_len - multikey_path_len, TUPLE_INDEX_BASE);
	json_lexer_next_token(&lexer, &token);
	assert(token.type == JSON_TOKEN_ANY);

	/* The rest of JSON path couldn't be multikey. */
	int multikey_path_suffix_len =
		path_len - multikey_path_len - lexer.offset;
	if (json_path_multikey_offset(path + multikey_path_len + lexer.offset,
				      multikey_path_suffix_len,
				      TUPLE_INDEX_BASE) !=
	    multikey_path_suffix_len) {
		diag_set(ClientError, ER_WRONG_INDEX_OPTIONS,
			 part_no + TUPLE_INDEX_BASE,
			 "no more than one array index placeholder [*] is "
			 "allowed in JSON path");
		return -1;
	}
	return 0;

>> +/**
>> + * Perform tuple insertion by given multikey index.
>> + * In case of replacement, all old tuple entries are deleted
>> + * by all it's multikey indexes.
>> + */
>> +static int
>> +memtx_tree_index_insert_multikey(struct memtx_tree_index *index,
> 
> Bad name: we have replace_multikey and insert_multikey - one would think
> the only difference between them is the operation type (REPLACE/INSERT),
> but it would be totally wrong, as one is used as a helper for another.
> 
> Please come up with a better name. What about
> 
>   memtx_tree_index_replace_multikey_one
> 
> ?
> 
> That would emphasize that it inserts a single multikey entry.

I like "memtx_tree_index_replace_multikey_step". What do you think?

>> +	/*
>> +	 * Propagate dup_data.tuple deletion for all multikey
>> +	 * index keys extracted by dup_data.tuple.
>> +	 */
>> +	int conflict_idx = (int)dup_data.hint;
>> +	struct key_def *cmp_def = memtx_tree_cmp_def(&index->tree);
>> +	uint32_t multikey_count = tuple_multikey_count(dup_data.tuple, cmp_def);
>> +	for (int i = 0; (uint32_t) i < multikey_count; i++) {
>> +		if (conflict_idx == i)
>> +			continue;
>> +		dup_data.hint = i;
>> +		memtx_tree_delete(&index->tree, dup_data);
>> +	}
>> +	/*
>> +	 * The new_tuple multikey index key (by conflict_idx) may
>> +	 * be deleted from index when old_tuple has another key by
>> +	 * some other multikey index != conflict_idx. Restore it
>> +	 * as a precaution.
>> +	 */
>> +	memtx_tree_insert(&index->tree, new_data, NULL);
> 
> Would be better if we didn't delete a tuple in this case. Or at least,
> memtx_tree_delete returned the deleted tuple so that we wouldn't call
> memtx_tree_insert when we didn't need to. Not a show stopper though.
As I have verbally told you, we don't know If there is an other duplicate
in old tuple and it's index.
Imagine old keys [1, 2, 3, 1, 1]  replaced with new keys [4, 1, 3, 1, 1]
The first conflict would be on multikey_idx_new = 1, key = 1 -->
that corresponds old tuple multikey_idx_old = 0.

Then all keys [1, 2, 3, 1, 1] excluding multikey_idx_old would be deleted
[2, 3, 1, 1]; Yep. There are some other keys = 1 and new key = 1 by  
multikey_idx_new = 1 would be deleted.
Insert new element again is a easy, fast and straightforward solution.
Moreover, the only deletion would be on first conflict. I mean, there
would no conflict with keys 2 and 3 because they are already deleted.

>> +/* Initialize the next element of the index build_array. */
>>  static int
>> -memtx_tree_index_build_next(struct index *base, struct tuple *tuple)
>> +memtx_tree_index_build_array_append(struct memtx_tree_index *index,
>> +				    struct tuple *tuple, hint_t hint)
>>  {
>> -	struct memtx_tree_index *index = (struct memtx_tree_index *)base;
>> -	struct key_def *cmp_def = memtx_tree_cmp_def(&index->tree);
>> +	bool need_realloc = false;
>>  	if (index->build_array == NULL) {
>> -		index->build_array = malloc(MEMTX_EXTENT_SIZE);
>> -		if (index->build_array == NULL) {
>> -			diag_set(OutOfMemory, MEMTX_EXTENT_SIZE,
>> -				 "memtx_tree_index", "build_next");
>> -			return -1;
>> -		}
>>  		index->build_array_alloc_size =
>>  			MEMTX_EXTENT_SIZE / sizeof(index->build_array[0]);
>> +		need_realloc = true;
>> +	}
>> +	if (index->build_array_size >= index->build_array_alloc_size) {
>> +		index->build_array_alloc_size +=
>> +			MAX(index->build_array_alloc_size / 2,
>> +			    index->build_array_size + 1 -
>> +			    index->build_array_alloc_size);
>> +		need_realloc = true;
> 
> Please avoid unnecessary refactoring - it complicates review.
> The function looks fine to me as it is - no need to merge malloc
> and realloc paths IMO.
Yep. I've replaced old deletion
>> -		index->build_array_alloc_size = index->build_array_alloc_size +
>> -					index->build_array_alloc_size / 2;
with DIV_ROUND_UP and everything works fine.

> This smells O(N^2). Apparently, one can remove duplicates from a sorted
> array in O(N). Please fix.
Done.
At the moment I've wrote this code, I've thought that memmove is faster than
manual copying(because of stdlib SSE instructions or other magic). 
  
>> +/**
>> + * Check if child_field may be attached to parent_field,
>> + * update the parent_field container type if required.
>> + */
>> +static int
>> +tuple_field_ensure_child_compatibility(struct tuple_field *parent,
>> +				       struct tuple_field *child)
> 
> Wouldn't tuple_format_check_field be a better name?
I am not shure.
This code not only check some cases, but also may change field type
when it is possible.
"enshure compatibility" -- make everything okey or die