[Tarantool-patches] [PATCH 4/4] box: introduce indices by UUID

Mon Apr 6 00:29:12 MSK 2020

Thanks for the patch!

See 6 comments below.

On 04/04/2020 01:02, Serge Petrenko wrote:
> It is now possible to create an index over UUID values, either returned
> by something like `uuid.new()` or 36- and 32- byte strings in format
> "88ebbb9a-8c18-480f-bd04-dd5345f1573c" or "88ebbb9a8c18480fbd04dd5345f1573c".
> 16-byte binary strings are also supported.
> 
> Closes #4268
> Closes #2916
> 
> @TarantoolBot document
> Title: Document uuid field type.
> 
> There's a new field type -- UUID, it accepts values returned by
> `uuid.new()`, as well as strings representing uuids, like
> `88ebbb9a-8c18-480f-bd04-dd5345f1573c` and
> `88ebbb9a8c18480fbd04dd5345f1573c` and 16 byte binary values, encoded in
> msgpack.
> 
> The index may be either unique or non-unique, nullable or non-nullable,
> and may be a primary key.
> 
> To create an index over a uuid field for space `test`, say:
> ```
> box.space.test:create_index("pk", {parts={1, 'uuid'}})
> ```
> Now you may insert uuids into the space:
> ```
> tarantool> box.space.test:insert{uuid.new()}
> ---
> - [e631fdcc-0e8a-4d2f-83fd-b0ce6762b13f]
> ...
> 
> tarantool> box.space.test:insert{"64d22e4d-ac92-4a23-899a-e59f34af5479"}
> ---
> - ['64d22e4d-ac92-4a23-899a-e59f34af5479']
> ...
> 
> tarantool> box.space.test:insert{"856dc1177dcc49969f1407f8c6c8a371"}
> ---
> - ['856dc1177dcc49969f1407f8c6c8a371']
> ...
> 
> tarantool> box.space.test:select{}
> ---
> - - ['64d22e4d-ac92-4a23-899a-e59f34af5479']
>   - ['856dc1177dcc49969f1407f8c6c8a371']
>   - [e631fdcc-0e8a-4d2f-83fd-b0ce6762b13f]

1. Wait. Why are they printed differently? They should be all the
same 36 strings when you print them, no?

Seems like you understood Mons to the letter. I believe he meant
we should provide a way to insert strings to simplify moving data
to a new space, but it does not mean we should store them as strings.
It makes comparators slower. sscanf() to extract uuid from a string
on *each* comparison is a huge performance hole. From what I measured
recently (not in Tarantool), sscanf() is a *really* slow thing. Its
usage in comparators is dangerous.

On the other hand ability to insert strings means we need to convert
them to MP_EXT after insertion, which means full scan of each tuple
by field types and possibly its recreation. We had a similar problem
with 'double' type. After all it was decided to keep as is and let
users enforce the needed type when they want using ffi.cast('double').

IMO, we should just disallow both ability to store strings, and to
insert strings. I mean, you can't insert a string into a decimal field,
for example. So a user anyway will never be able to insert only Lua
simple types. Cdata is enforced for decimal and double right now.

Better keep it simple. If someone wants to convert their existing spaces
to mp uuid, they easily can write a trivial function which would replace
string uuid with cdata uuid before insertion into the new space. We
should not sacrifice perf for that.

This is only my opinion, you may want to ask Mons and Kirill.

2. It is worth mentioning comparison order. Are they compared as
strings, or individual uuid fields are compared? In the latter case
in what order the fields are compared?

Also probably string comparison vs by-field comparison vs memcmp
comparison will be the same in case fields are stored in msgpack
in the same order as in the string, and in big-endian. I didn't check
but it is worth investigation. It would be good to be able to use
memcmp.

> diff --git a/src/box/tuple_compare.cc b/src/box/tuple_compare.cc
> index 3f8a0ce24..aad69340d 100644
> --- a/src/box/tuple_compare.cc
> +++ b/src/box/tuple_compare.cc
> @@ -378,6 +382,61 @@ mp_compare_bin(const char *field_a, const char *field_b)
>  	return COMPARE_RESULT(size_a, size_b);
>  }
>  
> +static int
> +mp_compare_uuid_with_type(const char *field_a, enum mp_type type_a,
> +			  const char *field_b, enum mp_type type_b)
> +{
> +	struct tt_uuid uuid_a, uuid_b;
> +	struct tt_uuid *ret;
> +	const char *str;
> +	uint32_t len;
> +	int rc;
> +	switch (type_a) {
> +	case MP_STR:
> +		str = mp_decode_str(&field_a, &len);
> +		rc = tt_uuid_from_lstring(str, len, &uuid_a);

3. This is perf killer. Basically it would be faster to store
uuids as a string, or just as 16 bytes binary. Because memcmp
could be used directly then.

> +		assert(rc == 0);
> +		break;
> +	case MP_BIN:
> +		str = mp_decode_bin(&field_a, &len);
> +		ret = uuid_unpack(&str, len, &uuid_a);
> +		assert(ret != NULL);
> +		break;
> +	case MP_EXT:
> +		ret = mp_decode_uuid(&field_a, &uuid_a);
> +		assert(ret != NULL);
> +		break;
> +	default:
> +		unreachable();
> +	}
> +	switch (type_b) {
> +	case MP_STR:
> +		str = mp_decode_str(&field_b, &len);
> +		rc = tt_uuid_from_lstring(str, len, &uuid_b);
> +		assert(rc == 0);
> +		break;
> +	case MP_BIN:
> +		str = mp_decode_bin(&field_b, &len);
> +		ret = uuid_unpack(&str, len, &uuid_b);
> +		assert(ret != NULL);
> +		break;
> +	case MP_EXT:
> +		ret = mp_decode_uuid(&field_b, &uuid_b);
> +		assert(ret != NULL);
> +		break;
> +	default:
> +		unreachable();
> +	}
> +	return tt_uuid_compare(&uuid_a, &uuid_b);

4. If we will decide to store uuids as MP_EXT, it may be
useful to introduce a function mp_compare_uuid, by analogue
with other mp_compare_* functions. To compare them directly
using memcmp, without unpacking. But see above my comment
about order. memcmp validness depends on how uuid is stored
in MessagePack.

> +}
> +
> +static inline int
> +mp_compare_uuid(const char *field_a, const char *field_b)
> +{
> +	return mp_compare_uuid_with_type(field_a, mp_typeof(*field_a),
> +					 field_b, mp_typeof(*field_b));
> +}
> @@ -1578,6 +1642,21 @@ hint_decimal(decimal_t *dec)
>  	return hint_create(MP_CLASS_NUMBER, val);
>  }
>  
> +static inline hint_t
> +hint_uuid(struct tt_uuid *uuid)
> +{
> +	/* Simply take the first part of the UUID as hint. */

5. Why only first?

> +	uint64_t val = 0;
> +	val |= uuid->time_low;
> +	val <<= sizeof(uuid->time_mid) * CHAR_BIT;
> +	val |= uuid->time_mid;
> +	val <<= sizeof(uuid->time_hi_and_version) * CHAR_BIT;
> +	val |= uuid->time_hi_and_version;
> +	/* Make space for class representation. */
> +	val >>= HINT_CLASS_BITS;
> +	return hint_create(MP_CLASS_UUID, val);
> +}
> diff --git a/test/engine/gh-4268-uuid.result b/test/engine/gh-4268-uuid.result
> new file mode 100644
> index 000000000..928204507
> --- /dev/null
> +++ b/test/engine/gh-4268-uuid.result

6. Since this is a feature, we probably should not call the
test file using name pattern used for bugs. Just uuid.test.lua.