[tarantool-patches] Re: [PATCH v2 3/5] box: introduce path field in key_part

Mon Aug 27 10:37:11 MSK 2018

> 1. Please, make extra_size be also an argument of key_def_sizeof
> and name it literally paths_size.
 struct key_def *
-key_def_new(uint32_t part_count, size_t extra_size)
+key_def_new(uint32_t part_count, size_t paths_size)

 static inline size_t
-key_def_sizeof(uint32_t part_count)
+key_def_sizeof(uint32_t part_count, size_t paths_size)
 {
-       return sizeof(struct key_def) + sizeof(struct key_part) * part_count;
+       return sizeof(struct key_def) + sizeof(struct key_part) * part_count +
+              paths_size;
 }
> 2. Same as strncmp and has the same bug: if a path is prefix of
> another path, then you declare them as equal, but it is wrong.
                /* Lexicographic strings order. */
-               uint32_t len = MIN(part1->path_len, part2->path_len);
+               if (part1->path_len != part2->path_len)
+                       return part1->path_len - part2->path_len;
                int rc = 0;
-               if ((rc = memcmp(part1->path, part2->path, len)) != 0)
+               if ((rc = memcmp(part1->path, part2->path,
+                                part1->path_len)) != 0)
                        return rc;

> Please, write a test on it. It should alter an index, extending
> path of one of its parts. In such a case it should be rebuilt,
> but your patch will not do it.
I've checked this code reachable with assert. But I don't now how handle it manually..
Please, help.

> 3. Why not const char *path?
Okey.

> 4. Please, be consistent in names: you already use 'is_flat'
> in the previous patch, not 'has_json_paths'.
Ok. is_flat now

> 5. Why do you need additional parentheses?
Ok.

> 6. As usual. Start sentences with a capital letter, finish with the
> dot.
Fixed.

> 7. Typo: 'alloated'.
Ok.

> 8. Your calculations are diligent, but they are unreadable, sorry.
> I tried to calculate again, and found that 2.5*len is enough.
> A path in the worst case looks like '.a'*. So the biggest
> possible number of path parts is len(path)/len('.a') = len / 2.
> A part is either in the correct format already, or requires
> 3 additional symbols. So the worst new len is:
> (len / 2) * 3 + len = 2.5 * len
> \____________/
>    number of
>    additional
>    characters
> 
> It is not? I failed to find a counter-example.
Yep, this match my calculations exacly. 5/2 ~= 3 for me)
I didn't like multiply on frac value.
> 9. As I remember, we have decided to do not store
> first path part in the string, because we have it in
> part->fieldno already decoded. Please, cut the first
> path part off. Also, it is faster when we need to
> follow the path along tuple field tree in tuple_format,
> because we do not need useless decoding of the first
> part.
Not exactly. We decided to use path hash per format so we need this prefix
to distinguish same path behind different fields.

> 10. Here parser.src_len is always equal to path_len - it was
> initialized by this value at the beginning of the function.
You are right here:
-       if ((uint32_t)parser.src_len > path_len) {
+       if (allocated_size > parser.src_len) {

> 11. What is more, I do not understand, how can it work. When
> you have two key_part_defs, first takes the region memory,
> and the second can take exactly the same memory and rewrite it,
> if its path len < than the former's. It was not clear on the
> previous implementation, but now it is evident. So all
> key_part_defs has the same path, if they are in descending
> order of their path lens.
No, I'd like to reuse the biggest available chunk:
old path buffer or rest of new path buffer; I mean:
old = "[0].FIO.fname\0";
          ^old
new = "[0]["FIO"]["fname"]\0\0\0\0\0\0\0\"
                                                   ^rest
        if (allocated_size > (uint32_t)parser.src_len) {
		/* Use rest of new buffer next time. */
		*path_extra = path;
		*path_extra_size = allocated_size;
	} else {
		/* Reuse old path buffer. */
		*path_extra = (char *)parser.src;
		*path_extra_size = parser.src_len;
	}
allocated_size and path were previously updated to match new buffer tail

> 12. Symbol_count is not the same as error position. Please, use
> json_path_next result as the position.
        err_msg = tt_sprintf("invalid JSON path '%.*s': path has invalid "
                             "structure (error at position %d)", parser.src_len,
-                            parser.src, parser.symbol_count);
+                            parser.src, rc);

> 13. Because key_def_find does not care about paths yourself,
> now key_def_contains() is broken, and Vinyl is too as a consequence.
> Please, write a test on it. Now it is possible to insert duplicates
> into unique Vinyl index, if a secondary index contains field numbers
> of the primary one, but with different paths. It is wrong.
You are right, let's better refractore key_def_find function:

 const struct key_part *
-key_def_find(const struct key_def *key_def, uint32_t fieldno)
+key_def_find(const struct key_def *key_def, uint32_t fieldno, const char *path,
+            uint32_t path_len)
 {
        const struct key_part *part = key_def->parts;
        const struct key_part *end = part + key_def->part_count;
+       assert(path == NULL || strlen(path) == path_len);
        for (; part != end; part++) {
-               if (part->fieldno == fieldno)
+               if (part->fieldno == fieldno && part->path_len == path_len &&
+                   (part->path == NULL ||
+                    memcmp(part->path, path, path_len) == 0))
                        return part;
        }

I.e.
-               const struct key_part *duplicate =
-                       key_def_find(first, part->fieldno);
-               if (duplicate != NULL &&
-                   part->path_len == duplicate->path_len &&
-                   memcmp(part->path, duplicate->path, part->path_len) == 0)
+               if (key_def_find(first, part->fieldno, part->path,
+                                part->path_len) != NULL)
everywhere.

>> +	/** JSON path to data. */
> 14. Please, note that it does not include the first
> path part, that is decoded already into fieldno.
It is not correct for values jet extracted from msgpack and I don't
trim first field as I've already noted below. 
>> +	/** JSON path to data. */
>> +	char *path;
> 15. The same.
-       /** JSON path to data. */
+       /** JSON path to data in canonical form. */

> 16. As I see, you did update neither key_def_contains_sequential_parts
> nor points of its result usage (if (! contains_sequential_parts) ... ).
> So in your implementation such parts as '[1][2][3]' and '[2][4][5]' are
> considered to be sequential, but it is wrong obviously. Their MessagePack's
> do not follow each other. After I reviewed the whole patchset I see that
> you updated these functions in the next commits, but please, do it in
> this commit.
+static bool
+key_def_parts_are_sequential(const struct key_def *def, int i)
+{
+       uint32_t fieldno1 = def->parts[i].fieldno + 1;
+       uint32_t fieldno2 = def->parts[i + 1].fieldno;
+       return fieldno1 == fieldno2 && def->parts[i].path == NULL &&
+               def->parts[i + 1].path == NULL;
+}
+

-               if (def->parts[i].fieldno + 1 == def->parts[i + 1].fieldno)
+               if (key_def_parts_are_sequential(def, i))

-                               if (key_def->parts[i].fieldno + 1 !=
-                                   key_def->parts[i + 1].fieldno) {
+                               if (!key_def_parts_are_sequential(key_def, i)) {

etc.

> 17. Why this patch has no tests? You already now can test paths
> normalization and the first path part cut into fieldno.
This path didn't introduce working feature. After small discussion, we decided to merge 2nd and 3rd paths.