From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Subject: Re: [tarantool-patches] Re: [PATCH v4 3/3] box: introduce multikey indexes in memtx References: <20190429160615.5a2uedzpzj6rw76k@esperanza> From: Kirill Shcherbatov Message-ID: <0ea77506-42b9-1e08-cda2-062d2948a0ff@tarantool.org> Date: Tue, 30 Apr 2019 11:22:18 +0300 MIME-Version: 1.0 In-Reply-To: <20190429160615.5a2uedzpzj6rw76k@esperanza> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit To: tarantool-patches@freelists.org, Vladimir Davydov Cc: kostja@tarantool.org List-ID: Hi! Thank for review. This is not new code but some questions and answers letter =) > Unnecessary inclusion? I've included it because my static analyzer suggested it because of NULL. But everything is compiled fine without it. > Please don't assume that JSON_TOKEN_ANY is represented by "[*]" outside > json library: suppose we'll allow the user to add some spaces ("[ * ]") > - this code will break then. > > Either use json_lexer to skip the multikey token or add a helper > function to the json lib. I also don't like this thing. Look: this is ok? /* Skip JSON_TOKEN_ANY token. */ struct json_lexer lexer; struct json_token token; json_lexer_create(&lexer, path + multikey_path_len, path_len - multikey_path_len, TUPLE_INDEX_BASE); json_lexer_next_token(&lexer, &token); assert(token.type == JSON_TOKEN_ANY); /* The rest of JSON path couldn't be multikey. */ int multikey_path_suffix_len = path_len - multikey_path_len - lexer.offset; if (json_path_multikey_offset(path + multikey_path_len + lexer.offset, multikey_path_suffix_len, TUPLE_INDEX_BASE) != multikey_path_suffix_len) { diag_set(ClientError, ER_WRONG_INDEX_OPTIONS, part_no + TUPLE_INDEX_BASE, "no more than one array index placeholder [*] is " "allowed in JSON path"); return -1; } return 0; >> +/** >> + * Perform tuple insertion by given multikey index. >> + * In case of replacement, all old tuple entries are deleted >> + * by all it's multikey indexes. >> + */ >> +static int >> +memtx_tree_index_insert_multikey(struct memtx_tree_index *index, > > Bad name: we have replace_multikey and insert_multikey - one would think > the only difference between them is the operation type (REPLACE/INSERT), > but it would be totally wrong, as one is used as a helper for another. > > Please come up with a better name. What about > > memtx_tree_index_replace_multikey_one > > ? > > That would emphasize that it inserts a single multikey entry. I like "memtx_tree_index_replace_multikey_step". What do you think? >> + /* >> + * Propagate dup_data.tuple deletion for all multikey >> + * index keys extracted by dup_data.tuple. >> + */ >> + int conflict_idx = (int)dup_data.hint; >> + struct key_def *cmp_def = memtx_tree_cmp_def(&index->tree); >> + uint32_t multikey_count = tuple_multikey_count(dup_data.tuple, cmp_def); >> + for (int i = 0; (uint32_t) i < multikey_count; i++) { >> + if (conflict_idx == i) >> + continue; >> + dup_data.hint = i; >> + memtx_tree_delete(&index->tree, dup_data); >> + } >> + /* >> + * The new_tuple multikey index key (by conflict_idx) may >> + * be deleted from index when old_tuple has another key by >> + * some other multikey index != conflict_idx. Restore it >> + * as a precaution. >> + */ >> + memtx_tree_insert(&index->tree, new_data, NULL); > > Would be better if we didn't delete a tuple in this case. Or at least, > memtx_tree_delete returned the deleted tuple so that we wouldn't call > memtx_tree_insert when we didn't need to. Not a show stopper though. As I have verbally told you, we don't know If there is an other duplicate in old tuple and it's index. Imagine old keys [1, 2, 3, 1, 1] replaced with new keys [4, 1, 3, 1, 1] The first conflict would be on multikey_idx_new = 1, key = 1 --> that corresponds old tuple multikey_idx_old = 0. Then all keys [1, 2, 3, 1, 1] excluding multikey_idx_old would be deleted [2, 3, 1, 1]; Yep. There are some other keys = 1 and new key = 1 by multikey_idx_new = 1 would be deleted. Insert new element again is a easy, fast and straightforward solution. Moreover, the only deletion would be on first conflict. I mean, there would no conflict with keys 2 and 3 because they are already deleted. >> +/* Initialize the next element of the index build_array. */ >> static int >> -memtx_tree_index_build_next(struct index *base, struct tuple *tuple) >> +memtx_tree_index_build_array_append(struct memtx_tree_index *index, >> + struct tuple *tuple, hint_t hint) >> { >> - struct memtx_tree_index *index = (struct memtx_tree_index *)base; >> - struct key_def *cmp_def = memtx_tree_cmp_def(&index->tree); >> + bool need_realloc = false; >> if (index->build_array == NULL) { >> - index->build_array = malloc(MEMTX_EXTENT_SIZE); >> - if (index->build_array == NULL) { >> - diag_set(OutOfMemory, MEMTX_EXTENT_SIZE, >> - "memtx_tree_index", "build_next"); >> - return -1; >> - } >> index->build_array_alloc_size = >> MEMTX_EXTENT_SIZE / sizeof(index->build_array[0]); >> + need_realloc = true; >> + } >> + if (index->build_array_size >= index->build_array_alloc_size) { >> + index->build_array_alloc_size += >> + MAX(index->build_array_alloc_size / 2, >> + index->build_array_size + 1 - >> + index->build_array_alloc_size); >> + need_realloc = true; > > Please avoid unnecessary refactoring - it complicates review. > The function looks fine to me as it is - no need to merge malloc > and realloc paths IMO. Yep. I've replaced old deletion >> - index->build_array_alloc_size = index->build_array_alloc_size + >> - index->build_array_alloc_size / 2; with DIV_ROUND_UP and everything works fine. > This smells O(N^2). Apparently, one can remove duplicates from a sorted > array in O(N). Please fix. Done. At the moment I've wrote this code, I've thought that memmove is faster than manual copying(because of stdlib SSE instructions or other magic). >> +/** >> + * Check if child_field may be attached to parent_field, >> + * update the parent_field container type if required. >> + */ >> +static int >> +tuple_field_ensure_child_compatibility(struct tuple_field *parent, >> + struct tuple_field *child) > > Wouldn't tuple_format_check_field be a better name? I am not shure. This code not only check some cases, but also may change field type when it is possible. "enshure compatibility" -- make everything okey or die