[tarantool-patches] Re: [PATCH v2 3/5] box: introduce path field in key_part

Kirill Shcherbatov kshcherbatov at tarantool.org
Mon Aug 27 10:37:11 MSK 2018


> 1. Please, make extra_size be also an argument of key_def_sizeof
> and name it literally paths_size.
 struct key_def *
-key_def_new(uint32_t part_count, size_t extra_size)
+key_def_new(uint32_t part_count, size_t paths_size)

 static inline size_t
-key_def_sizeof(uint32_t part_count)
+key_def_sizeof(uint32_t part_count, size_t paths_size)
 {
-       return sizeof(struct key_def) + sizeof(struct key_part) * part_count;
+       return sizeof(struct key_def) + sizeof(struct key_part) * part_count +
+              paths_size;
 }
> 2. Same as strncmp and has the same bug: if a path is prefix of
> another path, then you declare them as equal, but it is wrong.
                /* Lexicographic strings order. */
-               uint32_t len = MIN(part1->path_len, part2->path_len);
+               if (part1->path_len != part2->path_len)
+                       return part1->path_len - part2->path_len;
                int rc = 0;
-               if ((rc = memcmp(part1->path, part2->path, len)) != 0)
+               if ((rc = memcmp(part1->path, part2->path,
+                                part1->path_len)) != 0)
                        return rc;

> Please, write a test on it. It should alter an index, extending
> path of one of its parts. In such a case it should be rebuilt,
> but your patch will not do it.
I've checked this code reachable with assert. But I don't now how handle it manually..
Please, help.

> 3. Why not const char *path?
Okey.

> 4. Please, be consistent in names: you already use 'is_flat'
> in the previous patch, not 'has_json_paths'.
Ok. is_flat now

> 5. Why do you need additional parentheses?
Ok.

> 6. As usual. Start sentences with a capital letter, finish with the
> dot.
Fixed.

> 7. Typo: 'alloated'.
Ok.

> 8. Your calculations are diligent, but they are unreadable, sorry.
> I tried to calculate again, and found that 2.5*len is enough.
> A path in the worst case looks like '.a'*. So the biggest
> possible number of path parts is len(path)/len('.a') = len / 2.
> A part is either in the correct format already, or requires
> 3 additional symbols. So the worst new len is:
> (len / 2) * 3 + len = 2.5 * len
> \____________/
>    number of
>    additional
>    characters
> 
> It is not? I failed to find a counter-example.
Yep, this match my calculations exacly. 5/2 ~= 3 for me)
I didn't like multiply on frac value.
> 9. As I remember, we have decided to do not store
> first path part in the string, because we have it in
> part->fieldno already decoded. Please, cut the first
> path part off. Also, it is faster when we need to
> follow the path along tuple field tree in tuple_format,
> because we do not need useless decoding of the first
> part.
Not exactly. We decided to use path hash per format so we need this prefix
to distinguish same path behind different fields.

> 10. Here parser.src_len is always equal to path_len - it was
> initialized by this value at the beginning of the function.
You are right here:
-       if ((uint32_t)parser.src_len > path_len) {
+       if (allocated_size > parser.src_len) {

> 11. What is more, I do not understand, how can it work. When
> you have two key_part_defs, first takes the region memory,
> and the second can take exactly the same memory and rewrite it,
> if its path len < than the former's. It was not clear on the
> previous implementation, but now it is evident. So all
> key_part_defs has the same path, if they are in descending
> order of their path lens.
No, I'd like to reuse the biggest available chunk:
old path buffer or rest of new path buffer; I mean:
old = "[0].FIO.fname\0";
          ^old
new = "[0]["FIO"]["fname"]\0\0\0\0\0\0\0\"
                                                   ^rest
        if (allocated_size > (uint32_t)parser.src_len) {
		/* Use rest of new buffer next time. */
		*path_extra = path;
		*path_extra_size = allocated_size;
	} else {
		/* Reuse old path buffer. */
		*path_extra = (char *)parser.src;
		*path_extra_size = parser.src_len;
	}
allocated_size and path were previously updated to match new buffer tail

> 12. Symbol_count is not the same as error position. Please, use
> json_path_next result as the position.
        err_msg = tt_sprintf("invalid JSON path '%.*s': path has invalid "
                             "structure (error at position %d)", parser.src_len,
-                            parser.src, parser.symbol_count);
+                            parser.src, rc);

> 13. Because key_def_find does not care about paths yourself,
> now key_def_contains() is broken, and Vinyl is too as a consequence.
> Please, write a test on it. Now it is possible to insert duplicates
> into unique Vinyl index, if a secondary index contains field numbers
> of the primary one, but with different paths. It is wrong.
You are right, let's better refractore key_def_find function:

 const struct key_part *
-key_def_find(const struct key_def *key_def, uint32_t fieldno)
+key_def_find(const struct key_def *key_def, uint32_t fieldno, const char *path,
+            uint32_t path_len)
 {
        const struct key_part *part = key_def->parts;
        const struct key_part *end = part + key_def->part_count;
+       assert(path == NULL || strlen(path) == path_len);
        for (; part != end; part++) {
-               if (part->fieldno == fieldno)
+               if (part->fieldno == fieldno && part->path_len == path_len &&
+                   (part->path == NULL ||
+                    memcmp(part->path, path, path_len) == 0))
                        return part;
        }

I.e.
-               const struct key_part *duplicate =
-                       key_def_find(first, part->fieldno);
-               if (duplicate != NULL &&
-                   part->path_len == duplicate->path_len &&
-                   memcmp(part->path, duplicate->path, part->path_len) == 0)
+               if (key_def_find(first, part->fieldno, part->path,
+                                part->path_len) != NULL)
everywhere.

>> +	/** JSON path to data. */
> 14. Please, note that it does not include the first
> path part, that is decoded already into fieldno.
It is not correct for values jet extracted from msgpack and I don't
trim first field as I've already noted below. 
>> +	/** JSON path to data. */
>> +	char *path;
> 15. The same.
-       /** JSON path to data. */
+       /** JSON path to data in canonical form. */

> 16. As I see, you did update neither key_def_contains_sequential_parts
> nor points of its result usage (if (! contains_sequential_parts) ... ).
> So in your implementation such parts as '[1][2][3]' and '[2][4][5]' are
> considered to be sequential, but it is wrong obviously. Their MessagePack's
> do not follow each other. After I reviewed the whole patchset I see that
> you updated these functions in the next commits, but please, do it in
> this commit.
+static bool
+key_def_parts_are_sequential(const struct key_def *def, int i)
+{
+       uint32_t fieldno1 = def->parts[i].fieldno + 1;
+       uint32_t fieldno2 = def->parts[i + 1].fieldno;
+       return fieldno1 == fieldno2 && def->parts[i].path == NULL &&
+               def->parts[i + 1].path == NULL;
+}
+

-               if (def->parts[i].fieldno + 1 == def->parts[i + 1].fieldno)
+               if (key_def_parts_are_sequential(def, i))

-                               if (key_def->parts[i].fieldno + 1 !=
-                                   key_def->parts[i + 1].fieldno) {
+                               if (!key_def_parts_are_sequential(key_def, i)) {

etc.

> 17. Why this patch has no tests? You already now can test paths
> normalization and the first path part cut into fieldno.
This path didn't introduce working feature. After small discussion, we decided to merge 2nd and 3rd paths.




More information about the Tarantool-patches mailing list