From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 48D43294E5 for ; Mon, 27 Aug 2018 03:37:14 -0400 (EDT) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 3U1RUoiuqAJE for ; Mon, 27 Aug 2018 03:37:14 -0400 (EDT) Received: from smtpng2.m.smailru.net (smtpng2.m.smailru.net [94.100.179.3]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id CC6A3294AA for ; Mon, 27 Aug 2018 03:37:13 -0400 (EDT) From: Kirill Shcherbatov Subject: [tarantool-patches] Re: [PATCH v2 3/5] box: introduce path field in key_part References: <776c2c4fe710320a4fe117fc96683efbd30b5c74.1534332920.git.kshcherbatov@tarantool.org> <1209797c-ea66-184c-af9f-f7e0df0531d6@tarantool.org> Message-ID: Date: Mon, 27 Aug 2018 10:37:11 +0300 MIME-Version: 1.0 In-Reply-To: <1209797c-ea66-184c-af9f-f7e0df0531d6@tarantool.org> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: tarantool-patches@freelists.org, Vladislav Shpilevoy > 1. Please, make extra_size be also an argument of key_def_sizeof > and name it literally paths_size. struct key_def * -key_def_new(uint32_t part_count, size_t extra_size) +key_def_new(uint32_t part_count, size_t paths_size) static inline size_t -key_def_sizeof(uint32_t part_count) +key_def_sizeof(uint32_t part_count, size_t paths_size) { - return sizeof(struct key_def) + sizeof(struct key_part) * part_count; + return sizeof(struct key_def) + sizeof(struct key_part) * part_count + + paths_size; } > 2. Same as strncmp and has the same bug: if a path is prefix of > another path, then you declare them as equal, but it is wrong. /* Lexicographic strings order. */ - uint32_t len = MIN(part1->path_len, part2->path_len); + if (part1->path_len != part2->path_len) + return part1->path_len - part2->path_len; int rc = 0; - if ((rc = memcmp(part1->path, part2->path, len)) != 0) + if ((rc = memcmp(part1->path, part2->path, + part1->path_len)) != 0) return rc; > Please, write a test on it. It should alter an index, extending > path of one of its parts. In such a case it should be rebuilt, > but your patch will not do it. I've checked this code reachable with assert. But I don't now how handle it manually.. Please, help. > 3. Why not const char *path? Okey. > 4. Please, be consistent in names: you already use 'is_flat' > in the previous patch, not 'has_json_paths'. Ok. is_flat now > 5. Why do you need additional parentheses? Ok. > 6. As usual. Start sentences with a capital letter, finish with the > dot. Fixed. > 7. Typo: 'alloated'. Ok. > 8. Your calculations are diligent, but they are unreadable, sorry. > I tried to calculate again, and found that 2.5*len is enough. > A path in the worst case looks like '.a'*. So the biggest > possible number of path parts is len(path)/len('.a') = len / 2. > A part is either in the correct format already, or requires > 3 additional symbols. So the worst new len is: > (len / 2) * 3 + len = 2.5 * len > \____________/ > number of > additional > characters > > It is not? I failed to find a counter-example. Yep, this match my calculations exacly. 5/2 ~= 3 for me) I didn't like multiply on frac value. > 9. As I remember, we have decided to do not store > first path part in the string, because we have it in > part->fieldno already decoded. Please, cut the first > path part off. Also, it is faster when we need to > follow the path along tuple field tree in tuple_format, > because we do not need useless decoding of the first > part. Not exactly. We decided to use path hash per format so we need this prefix to distinguish same path behind different fields. > 10. Here parser.src_len is always equal to path_len - it was > initialized by this value at the beginning of the function. You are right here: - if ((uint32_t)parser.src_len > path_len) { + if (allocated_size > parser.src_len) { > 11. What is more, I do not understand, how can it work. When > you have two key_part_defs, first takes the region memory, > and the second can take exactly the same memory and rewrite it, > if its path len < than the former's. It was not clear on the > previous implementation, but now it is evident. So all > key_part_defs has the same path, if they are in descending > order of their path lens. No, I'd like to reuse the biggest available chunk: old path buffer or rest of new path buffer; I mean: old = "[0].FIO.fname\0"; ^old new = "[0]["FIO"]["fname"]\0\0\0\0\0\0\0\" ^rest if (allocated_size > (uint32_t)parser.src_len) { /* Use rest of new buffer next time. */ *path_extra = path; *path_extra_size = allocated_size; } else { /* Reuse old path buffer. */ *path_extra = (char *)parser.src; *path_extra_size = parser.src_len; } allocated_size and path were previously updated to match new buffer tail > 12. Symbol_count is not the same as error position. Please, use > json_path_next result as the position. err_msg = tt_sprintf("invalid JSON path '%.*s': path has invalid " "structure (error at position %d)", parser.src_len, - parser.src, parser.symbol_count); + parser.src, rc); > 13. Because key_def_find does not care about paths yourself, > now key_def_contains() is broken, and Vinyl is too as a consequence. > Please, write a test on it. Now it is possible to insert duplicates > into unique Vinyl index, if a secondary index contains field numbers > of the primary one, but with different paths. It is wrong. You are right, let's better refractore key_def_find function: const struct key_part * -key_def_find(const struct key_def *key_def, uint32_t fieldno) +key_def_find(const struct key_def *key_def, uint32_t fieldno, const char *path, + uint32_t path_len) { const struct key_part *part = key_def->parts; const struct key_part *end = part + key_def->part_count; + assert(path == NULL || strlen(path) == path_len); for (; part != end; part++) { - if (part->fieldno == fieldno) + if (part->fieldno == fieldno && part->path_len == path_len && + (part->path == NULL || + memcmp(part->path, path, path_len) == 0)) return part; } I.e. - const struct key_part *duplicate = - key_def_find(first, part->fieldno); - if (duplicate != NULL && - part->path_len == duplicate->path_len && - memcmp(part->path, duplicate->path, part->path_len) == 0) + if (key_def_find(first, part->fieldno, part->path, + part->path_len) != NULL) everywhere. >> + /** JSON path to data. */ > 14. Please, note that it does not include the first > path part, that is decoded already into fieldno. It is not correct for values jet extracted from msgpack and I don't trim first field as I've already noted below. >> + /** JSON path to data. */ >> + char *path; > 15. The same. - /** JSON path to data. */ + /** JSON path to data in canonical form. */ > 16. As I see, you did update neither key_def_contains_sequential_parts > nor points of its result usage (if (! contains_sequential_parts) ... ). > So in your implementation such parts as '[1][2][3]' and '[2][4][5]' are > considered to be sequential, but it is wrong obviously. Their MessagePack's > do not follow each other. After I reviewed the whole patchset I see that > you updated these functions in the next commits, but please, do it in > this commit. +static bool +key_def_parts_are_sequential(const struct key_def *def, int i) +{ + uint32_t fieldno1 = def->parts[i].fieldno + 1; + uint32_t fieldno2 = def->parts[i + 1].fieldno; + return fieldno1 == fieldno2 && def->parts[i].path == NULL && + def->parts[i + 1].path == NULL; +} + - if (def->parts[i].fieldno + 1 == def->parts[i + 1].fieldno) + if (key_def_parts_are_sequential(def, i)) - if (key_def->parts[i].fieldno + 1 != - key_def->parts[i + 1].fieldno) { + if (!key_def_parts_are_sequential(key_def, i)) { etc. > 17. Why this patch has no tests? You already now can test paths > normalization and the first path part cut into fieldno. This path didn't introduce working feature. After small discussion, we decided to merge 2nd and 3rd paths.