[tarantool-patches] Re: [PATCH v2 4/5] box: introduce path_hash and tuple_field tree

Mon Aug 27 10:37:08 MSK 2018

> 1. Document structure in future could be defined not only
> via indexes.
-       /* 55 */_(ER_DATA_MISMATCH_INDEX_PART,  "Tuple doesn't math document structure defined as index") \
+       /* 55 */_(ER_DATA_STRUCTURE_MISMATCH,   "Tuple doesn't math document structure") \
> 2. A comment, please.
> 3. Please, add suffix '_cb' which we use for callbacks.
> 4. Why not just 'return key_mp_type_validate(...);'?

/**
 * Watch json_field_tree_routine description.
 * Validate field tree leaf has correct type.
 */
static int
tuple_validate_json_path_data_cb(const struct tuple_field *field, uint32_t idx,
				 const char *tuple, const char *offset,
				 void *ctx)
{
	(void)tuple;
	(void)ctx;
	return key_mp_type_validate(field->type, mp_typeof(*offset),
				    ER_KEY_PART_TYPE, idx, field->is_nullable);
}

/**
 * Watch json_field_tree_routine description.
 * Initialize tuple field map using field tree leaf offset_slot.
 * @param ctx is field_map
 */
 static int
-tuple_init_json_field_map_routine(const struct tuple_field *field, uint32_t idx,
-                                 const char *tuple, const char *offset,
-                                 void *ctx)
+tuple_init_json_field_map_cb(const struct tuple_field *field, uint32_t idx,
+                            const char *tuple, const char *offset, void *ctx)

> 5. The same problem as in the previous version - you double decode
> the field. First inside the tree walker and second here.
> 
> Please, find a way how to avoid it. For help see my comments
> for the walker.
Believe, this is already fixed.

> 6. It should be part of the previous patch together with any
> changes of this file.
Ok

> 7. Static inline and a comment. Also, this function is called
> from places having 'is_flat' template parameter which
> allows to do not check path for NULL on compilation time, but
> you do not use it. Please, make it template too.
Ok, done.

> 8. All the changes between this point and point 7 should be
> part of the previous patch.
Ok.

> 9. First level fieldno should not be encoded in the path. See
> the previous patch for comments.
It is not so, already answered.
> 10. You have exactly the same code in tuple_field_raw_by_path. Please,
> extract either the whole while cycle or this switch into a function.
I've introduce new tuple_field_dig_with_parser routine.
It calls tuple_field_go_to_index and tuple_field_go_to_key using parser,
so I like this name.

> 11. At first, it is a part of the previous patch. At second, it
> run too far. It is time to declare a matrix of functions, maybe
> three dimensional or two, indexes of which are calculated from
> contains_sequential_parts, has_optional_parts and is_flat. See
> field_type1_contains_type2 for an example.
> Same for other functions, having >= 3 template parameters.
Done as a part of previous path.

> 12. Why do you return node_t if you always use only node->val? Why not
> to return val?
It was not nessesary in path that you've revieved but it is important now:
I update leaf field manually and change str pointer to point to format
chunk path memory.

> 13. Why so complicated? Just rename it to new and return the hash
> instead of int.
Simplified.

> 14. Please, put an empty line between the functions.
Done.

> 15. Do not use SQLite style of freeing objects, please.
Done.

> 16. How is it possible that the record exists already? And
> where do you check it here?
Legacy. Dropped.

> 17. Fits in one line.
Done.

> 18. Out of 80. Use sizeof(array[0]). The same below.
-                                       part->num * sizeof(struct tuple_field *));
+                                       part->num * sizeof(array[0]));
Done.

> 19. Size here is not correct. You forgot about * part->num.
-                                       sizeof(struct tuple_field *), "realloc",
-                                       "array");
+                                        part->num * sizeof(array[0]),
+                                        "realloc","array");
Fixed.

> 20. Why are these cases different? If I substitute field->array_size
> (which is true when field->array == NULL) with 0, I will match the
> first branch. It is not?
-                       if (field->array == NULL) {
-                               memset(array, 0, part->num *
-                                       sizeof(struct tuple_field *));
-                       } else {
-                               memset(&array[field->array_size], 0,
-                                      (part->num - field->array_size) *
-                                      sizeof(struct tuple_field *));
-                       }
+                       memset(&array[field->array_size], 0,
+                               (part->num - field->array_size) *
+                               sizeof(array[0]));

> 21. I think, it is rather 'else' branch. Evidently, the condition
> is unreachable, if the field was just created in the code above.
Yep, upgraded.

> 22. If field->map == NULL then sure you do not
> need to check node existence below. Please,
> split those two cases into 2 branches.
Done.

> 23. How can it be NULL?
Fixed.
>> +int
>> +json_field_tree_exec_routine(const struct tuple_field *field, uint32_t idx,
>> +			     const char *tuple, const char *offset,
>> +			     json_field_tree_routine routine, void *routine_ctx)
> 
> 24. I got the idea and I like it - have a walker applicable
> for any puprose. But it calls a function by pointer on a hot
> path - initialization of a tuple field map + validation.
> 
> Lets make it macros. I know, that a macros can not be recursive,
> but we can write a generator of such walker. A macros taking a
> function and generating new json_field_tree_exec_routine.
I've reworked this function a lot. It is complicated enough. Believe, it is not necessary now.

> 25. Talking about one of my first comments (how to iterate over
> the field only once when possible): here, for array, you can
> merely continue iteration and mp_next calling behind field->array_size
> up to count.
> 
> For map the things are more complex. And you can apply the method
> I have described on the previous review. You learn map size and
> on each tuple_field_go_to_key remember maximal number and position
> of the decoded key-value pair. When the iteration is finished,
> you decode the tail starting from the maximal decoded key-value
> pair until the end. In such a case the tail is decoded only once
> always.

> 26. No such parameter 'field_type'. Only 'type'.
Ok, fixed.

> 27. Do not use SQLite style of variables declaration. It
> can be declared and assigned at the same time below.
Ok.

> 28. I do not see a test on this. Please, use gcov to be sure that
> test coverage of all the new code is 100% (except OOM errors).
Added test.

> 29. What about nullability mismatch?Ugum, there was a lot changes in this scenario. I've fixed it now.

> 30. Please. consider this:
> 
>   	field->type = type;
> -	if (json_path_hash_insert(format->path_hash, path, path_len,
> -				  field) != 0)
> -		return -1;
> -
>   	*leaf_field = field;
> -	return 0;
> +	return json_path_hash_insert(format->path_hash, path, path_len, field);
>   }
Ok, applied.
> 31. The aforementioned two hunks are complete mess. Please,
> move the 'is_key_part = true' before them. Then after it at
> first check for 'if (part->path != NULL) { ...; continue; }'.
> Then remove checks for 'path == NULL'.
Done.

> 32. When some indexes has common fields, you count
> and allocate their paths twice.
I've fixed this using hashtable init on format alloc and on-fact leaf initialization.
It is also important to patch hashtable string source and recreate parser with new string.

>> +	for (uint32_t i = 0; i < format->field_count; i++) {
>> +		format->fields[i].array = NULL;
>> +		format->fields[i].array_size = 0;
> 
> 33. Just memset them.
Those fields has other data that I wouldn't copy again.

>> +	if (tuple_format_register(format) != 0)
>> +		goto error;
>>   	return format;
> 
> 34. Consider thisOk, tnx.

> 35. Why here do you check for map != NULL but in all other places
> for array != NULL?
I've introduce childs field that I use in such situations.

> 36. If this is true, then the previous 'if' is unreachable
> (field->offset_slot != TUPLE_OFFSET_SLOT_NIL). Complex field
> can not have an offset slot. So this is 'else' branch, I guess.
Yep.
> 37. After you fix my other comments, this mp_next should
> not be called, if json tree walker was used.
Ugum, done.

> 38. Just inline it.
Ok, done.

> 39. Inversely, it is likely.
Fixed.

> 40. Do here 'return' and reduce indentation level below.
Done.
> 41. As I understand, when a part is JSON and its epoch
> matches the format, it always has a valid slot_cache and you
> do not need to check for '< field_map_size'.
This also related to 42th comment.
Vinyl now has two formats with compatible field map of same size.
But format field count could differ a little. If I not mistaken, disk format
may have less field_count. So smaller format-tuple field map of is initialized
with 0 where appropriate fields don't present.
On the other hand, part cache is looking valid as format epoch are similar.
And it is so, except the fact that such data is not exists for this format type.
Code is appended below.

> Moreover, I do not understand why do you check this slot
> to be not SLOT_NIL and has not zero value in field_map. It
> is 100% valid and you already now can do
> 'return data + field_map[part->slot_cache]', it is not?> 42. The same here. When an offset slot is stored in a field
> from format->path_hash, it always has not zero field_map
> value and not SLOT_NIL offset. So the branch below is actually
> 'else', I guess.

const char *
tuple_field_by_part_raw(const struct tuple_format *format, const char *data,
			const uint32_t *field_map, struct key_part *part)
{
	if (likely(part->path == NULL))
		return tuple_field_raw(format, data, field_map, part->fieldno);

	struct tuple_field *field = NULL;
	int32_t offset_slot = TUPLE_OFFSET_SLOT_NIL;
	if (likely(part->offset_slot_epoch == format->epoch)) {
		offset_slot = part->offset_slot;
	} else if (format->path_hash != NULL &&
		   (field = json_path_hash_get(format->path_hash, part->path,
					       part->path_len,
					       part->path_hash)) != NULL) {
		offset_slot = field->offset_slot;
	} else {
		/*
		 * Legacy tuple having no field map for
		 * JSON index.
		 */
		uint32_t path_hash =
			field_name_hash(part->path, part->path_len);
		const char *raw = NULL;
		if (tuple_field_raw_by_path(format, data, field_map,
					    part->path, part->path_len,
					    path_hash, &raw) != 0)
			raw = NULL;
		return raw;
	}
	assert(offset_slot < 0);
	assert(-offset_slot * sizeof(uint32_t) <= format->field_map_size);
	if (unlikely(field_map[offset_slot] == 0))
		return NULL;
	/* Cache offset_slot if required. */
	if (part->offset_slot_epoch < format->epoch) {
		part->offset_slot = offset_slot;
		part->offset_slot_epoch = format->epoch;
	}
	return data + field_map[offset_slot];
}

> 43. Please, move this function above within its original commit.
Done.

> 44. Actually, it is not used in tree-walker.
                /** Leaf argument for manual tuple construction. */
		void *arg;

> 45. I see, how do you struggle with checking that the
> field has no childs, so I propose to add a member in the
> union 'void *child'. And check it for NULL. It is better
> than check explicitly array or map in places where it
> does not matter.
Done.

> 46. All these changes should be done in the previous commit.
Done.

> 47. Why do you check for child != NULL instead of part->path != NULL?
Fixed.

>> +static void
>> +vy_stmt_msgpack_build(struct tuple_field *field, char *tuple,
>> +		      uint32_t *field_map, char **offset, bool write_data)
> 
> 48. Why you do not use json_field_tree_exec_routine? Because you need
> to do something before and afer decoding the field? Then just implement
> it as a macro as I said and add 2 parameters:
> start_field_cb(field_type, field_size) and end_field_cb(field_type).
> And if a _cb is not defined, its call is not generated.
I have no callbacks now.

> 49. Double ';'.Fixed.
Done.