[PATCH v6 0/8] box: Indexes by JSON path

Tue Dec 18 23:46:57 MSK 2018

On Mon, Dec 17, 2018 at 09:52:44AM +0300, Kirill Shcherbatov wrote:
> Sometimes field data could have complex document structure.
> When this structure is consistent across whole document,
> you are able to create an index by JSON path.
> 
> Changes in version 6:
>  - key_part(s) path string is not 0-terminated anymore
>  - global formats_epoch pool used to initialize tuple_format:epoch
>    numeric field
>  - more informative error messages
>  - splitted changes to few new additional preparatory commits
>  - got rid off hashtable in vy_stmt_build
>  - prevent traversing field tree twice in vy_stmt_build
>  - inline basic field_by_part with slowpath alternative
>  - simplified tuple_format1_can_store_format2_tuples
>  - splitted tuple_init_field_map to helper routines
>  - moved tests to separate file 
>  - large and descriptive comments for many things
> 
> http://github.com/tarantool/tarantool/tree/kshch/gh-1012-json-indexes
> https://github.com/tarantool/tarantool/issues/1012
> 
> Kirill Shcherbatov (8):
>   box: refactor tuple_validate_raw
>   box: refactor ER_{FIELD_TYPE, ACTION_MISMATCH}
>   box: build path to field string uniformly on error

I'd love to apply these preparatory patches before looking at the rest,
but they are badly split and leave the code in a rather incomplete
state. Let's sort them out first.

What I want to see:

1. Implement tuple_validate with tuple_init_field_map. Do not mix
   field_map_template in this patch, as it has nothing to do with that.
   And it's not just refactoring. Actually, it can degrade the function
   performance. Please fix the commit message accordingly.

2. Refactor field type/nullability checks. Basically, this is what patch
   #2 of your series does, but it's not good enough.
   
   First, mp_type_is_compatible and key_mp_type_validate share the same
   piece of code. You should factor it out instead of copying and
   pasting, e.g.

   - field_mp_type_is_compatible returns true if given field and msgpack
     types are compatible. It should probably be defined in field_def.c.
     key_mp_type should probably be renamed to field_mp_type and moved
     to field_def.c, as it deals with field_type. I think it's also
     worth moving mp_type_strs to field_def.c, too.

   - key_part_validate takes a pointer to msgpack and a field type.
     Calls field_mp_type_is_compatible(mp_typeof(*key), field_type) to
     check if the given key part is valid. Sets diag on error.

   - key_validate_parts calls key_part_validate for each msgpack part.

   I'm not sure about function names - please think/try by yourself.

   Second, key_mp_type_validate doesn't need to take an error code
   anymore - now we always pass ER_KEY_PART_COUNT to it.

   Third, changing %u to %s in error messages should be done later, see
   step 4. And you should use int2str instead of tt_snprintf for it.

   This patch is going to do quite a bit of work - that's OK I think,
   just enumerate all the changes done by it in a bullet list in the
   commit message.

3. Add a helper function for getting a path to a json_token in a
   json_tree. Should be defined in json.c. We will use this function for
   reporting tuple format violations. I understand that if we go down
   this path, we won't be able to report top-level field names as they
   are defined in a tuple_dictionary, not in a json_tree. I think it's
   OK as the messages will be descriptive enough even if we use field
   numbers instead. At the same time, this will allow us to move a
   relatively big chunk of code that doesn't really have anything to do
   with tuples out of tuple_format.c.

   This function should be implemented without an alloc() callback. It
   should be passed a buffer of a fixed length. It should return the
   number of bytes that would be printed if there were enough space in
   the buffer, in snprintf-like manner, so that one could first call it
   with a 0-length buffer to figure out the buffer length, then allocate
   a buffer and call it again, this time for real.

   Don't forget to add a unit test.

4. Use the helper introduced at step #3 to report field names in error
   messages. You should change %u to %s in error messages in this patch,
   not in patch 2, because after patch 2 we still use field numbers, not
   paths.

Let's leave field_map validation introduction to the main patch of the
series for now. It would be great to add it in a separate patch, but it
wouldn't work for plain indexes as missing fields are already checked
with ER_MIN_FIELD_COUNT, i.e. we would just add some dead code, and
unfortunately we can't get rid of ER_MIN_FIELD_COUNT, because
space.format fields aren't assigned slots in the field map so we can't
use the field map to check if they are present. This makes me wonder
what we are going to do when we get to support setting JSON schema in
space.format. How will we check that all formatted fields are present
then? We need to figure it out before proceeding with the main patch.
Let's discuss it.

If you agree with my points above, please send the four patches in a
separate series. Do not resubmit the rest of the series before I've
committed the preparatory patches.

Side notes:
 - Never add dead code in a patch. All code added by a patch should be
   tested. Introducing tuple_field_map_validate before implementing JSON
   index support violates this rule.
 - Try not to rewrite the same piece of code multiple times in the same
   series, because this practically doubles my work as a reviewer.
   Patches of a patch set should fit in like pieces of a jigsaw puzzle.
   Here's an example of you doing this:

@@ -547,8 +559,10 @@ tuple_init_field_map(const struct tuple_format *format, uint32_t *field_map,
 			if (validate &&
 			    !mp_type_is_compatible(type, field->type,
 						   is_nullable) != 0) {
-				diag_set(ClientError, ER_FIELD_TYPE,
-					 tt_sprintf("%u", token.num + 1),
+				const char *path =
+					tuple_field_json_path(format, field);
+				assert(path != NULL);
+				diag_set(ClientError, ER_FIELD_TYPE, path,
 					 field_type_strs[field->type]);
 				return -1;
 			}

and in the next patch

@@ -560,10 +922,13 @@ tuple_init_field_map(const struct tuple_format *format, uint32_t *field_map,
 			    !mp_type_is_compatible(type, field->type,
 						   is_nullable) != 0) {
 				const char *path =
-					tuple_field_json_path(format, field);
-				assert(path != NULL);
-				diag_set(ClientError, ER_FIELD_TYPE, path,
-					 field_type_strs[field->type]);
+					tuple_field_json_path(format, field,
+							      region);
+				if (path != NULL) {
+					diag_set(ClientError, ER_FIELD_TYPE,
+						 path,
+						 field_type_strs[field->type]);
+				}
 				return -1;
 			}
 			if (field->offset_slot != TUPLE_OFFSET_SLOT_NIL) {