[Tarantool-patches] [PATCH 4/4] box: introduce indices by UUID

Fri Apr 10 02:46:51 MSK 2020

> 6 апр. 2020 г., в 00:29, Vladislav Shpilevoy <v.shpilevoy at tarantool.org> написал(а):
> 
> Thanks for the patch!

Hi! Thanks for the review!

> 
> See 6 comments below.
> 
> On 04/04/2020 01:02, Serge Petrenko wrote:
>> It is now possible to create an index over UUID values, either returned
>> by something like `uuid.new()` or 36- and 32- byte strings in format
>> "88ebbb9a-8c18-480f-bd04-dd5345f1573c" or "88ebbb9a8c18480fbd04dd5345f1573c".
>> 16-byte binary strings are also supported.
>> 
>> Closes #4268
>> Closes #2916
>> 
>> @TarantoolBot document
>> Title: Document uuid field type.
>> 
>> There's a new field type -- UUID, it accepts values returned by
>> `uuid.new()`, as well as strings representing uuids, like
>> `88ebbb9a-8c18-480f-bd04-dd5345f1573c` and
>> `88ebbb9a8c18480fbd04dd5345f1573c` and 16 byte binary values, encoded in
>> msgpack.
>> 
>> The index may be either unique or non-unique, nullable or non-nullable,
>> and may be a primary key.
>> 
>> To create an index over a uuid field for space `test`, say:
>> ```
>> box.space.test:create_index("pk", {parts={1, 'uuid'}})
>> ```
>> Now you may insert uuids into the space:
>> ```
>> tarantool> box.space.test:insert{uuid.new()}
>> ---
>> - [e631fdcc-0e8a-4d2f-83fd-b0ce6762b13f]
>> ...
>> 
>> tarantool> box.space.test:insert{"64d22e4d-ac92-4a23-899a-e59f34af5479"}
>> ---
>> - ['64d22e4d-ac92-4a23-899a-e59f34af5479']
>> ...
>> 
>> tarantool> box.space.test:insert{"856dc1177dcc49969f1407f8c6c8a371"}
>> ---
>> - ['856dc1177dcc49969f1407f8c6c8a371']
>> ...
>> 
>> tarantool> box.space.test:select{}
>> ---
>> - - ['64d22e4d-ac92-4a23-899a-e59f34af5479']
>>  - ['856dc1177dcc49969f1407f8c6c8a371']
>>  - [e631fdcc-0e8a-4d2f-83fd-b0ce6762b13f]
> 
> 1. Wait. Why are they printed differently? They should be all the
> same 36 strings when you print them, no?
> 
> Seems like you understood Mons to the letter. I believe he meant
> we should provide a way to insert strings to simplify moving data
> to a new space, but it does not mean we should store them as strings.
> It makes comparators slower. sscanf() to extract uuid from a string
> on *each* comparison is a huge performance hole. From what I measured
> recently (not in Tarantool), sscanf() is a *really* slow thing. Its
> usage in comparators is dangerous.
> 
> On the other hand ability to insert strings means we need to convert
> them to MP_EXT after insertion, which means full scan of each tuple
> by field types and possibly its recreation. We had a similar problem
> with 'double' type. After all it was decided to keep as is and let
> users enforce the needed type when they want using ffi.cast('double').
> 
> IMO, we should just disallow both ability to store strings, and to
> insert strings. I mean, you can't insert a string into a decimal field,
> for example. So a user anyway will never be able to insert only Lua
> simple types. Cdata is enforced for decimal and double right now.
> 
> Better keep it simple. If someone wants to convert their existing spaces
> to mp uuid, they easily can write a trivial function which would replace
> string uuid with cdata uuid before insertion into the new space. We
> should not sacrifice perf for that.
> 
> This is only my opinion, you may want to ask Mons and Kirill.

So, I consulted. Mons, he’s ok with allowing only cdata.

> 
> 
> 2. It is worth mentioning comparison order. Are they compared as
> strings, or individual uuid fields are compared? In the latter case
> in what order the fields are compared?

Ok, mentioned in commit message. The order’s lexicographical.
> 
> Also probably string comparison vs by-field comparison vs memcmp
> comparison will be the same in case fields are stored in msgpack
> in the same order as in the string, and in big-endian. I didn't check
> but it is worth investigation. It would be good to be able to use
> memcmp.
> 
>> diff --git a/src/box/tuple_compare.cc b/src/box/tuple_compare.cc
>> index 3f8a0ce24..aad69340d 100644
>> --- a/src/box/tuple_compare.cc
>> +++ b/src/box/tuple_compare.cc
>> @@ -378,6 +382,61 @@ mp_compare_bin(const char *field_a, const char *field_b)
>> 	return COMPARE_RESULT(size_a, size_b);
>> }
>> 
>> +static int
>> +mp_compare_uuid_with_type(const char *field_a, enum mp_type type_a,
>> +			  const char *field_b, enum mp_type type_b)
>> +{
>> +	struct tt_uuid uuid_a, uuid_b;
>> +	struct tt_uuid *ret;
>> +	const char *str;
>> +	uint32_t len;
>> +	int rc;
>> +	switch (type_a) {
>> +	case MP_STR:
>> +		str = mp_decode_str(&field_a, &len);
>> +		rc = tt_uuid_from_lstring(str, len, &uuid_a);
> 
> 3. This is perf killer. Basically it would be faster to store
> uuids as a string, or just as 16 bytes binary. Because memcmp
> could be used directly then.

Switched to memcmp.

> 
>> +		assert(rc == 0);
>> +		break;
>> +	case MP_BIN:
>> +		str = mp_decode_bin(&field_a, &len);
>> +		ret = uuid_unpack(&str, len, &uuid_a);
>> +		assert(ret != NULL);
>> +		break;
>> +	case MP_EXT:
>> +		ret = mp_decode_uuid(&field_a, &uuid_a);
>> +		assert(ret != NULL);
>> +		break;
>> +	default:
>> +		unreachable();
>> +	}
>> +	switch (type_b) {
>> +	case MP_STR:
>> +		str = mp_decode_str(&field_b, &len);
>> +		rc = tt_uuid_from_lstring(str, len, &uuid_b);
>> +		assert(rc == 0);
>> +		break;
>> +	case MP_BIN:
>> +		str = mp_decode_bin(&field_b, &len);
>> +		ret = uuid_unpack(&str, len, &uuid_b);
>> +		assert(ret != NULL);
>> +		break;
>> +	case MP_EXT:
>> +		ret = mp_decode_uuid(&field_b, &uuid_b);
>> +		assert(ret != NULL);
>> +		break;
>> +	default:
>> +		unreachable();
>> +	}
>> +	return tt_uuid_compare(&uuid_a, &uuid_b);
> 
> 4. If we will decide to store uuids as MP_EXT, it may be
> useful to introduce a function mp_compare_uuid, by analogue
> with other mp_compare_* functions. To compare them directly
> using memcmp, without unpacking. But see above my comment
> about order. memcmp validness depends on how uuid is stored
> in MessagePack.
> 

Done. uuid fields are stored in big-endian manner and in the same
order they’re compared, so memcmp is valid.

>> +}
>> +
>> +static inline int
>> +mp_compare_uuid(const char *field_a, const char *field_b)
>> +{
>> +	return mp_compare_uuid_with_type(field_a, mp_typeof(*field_a),
>> +					 field_b, mp_typeof(*field_b));
>> +}
>> @@ -1578,6 +1642,21 @@ hint_decimal(decimal_t *dec)
>> 	return hint_create(MP_CLASS_NUMBER, val);
>> }
>> 
>> +static inline hint_t
>> +hint_uuid(struct tt_uuid *uuid)
>> +{
>> +	/* Simply take the first part of the UUID as hint. */
> 
> 5. Why only first?

I meant, take the ‘high’ part of the UUID. The one which’s compared first.

> 
>> +	uint64_t val = 0;
>> +	val |= uuid->time_low;
>> +	val <<= sizeof(uuid->time_mid) * CHAR_BIT;
>> +	val |= uuid->time_mid;
>> +	val <<= sizeof(uuid->time_hi_and_version) * CHAR_BIT;
>> +	val |= uuid->time_hi_and_version;
>> +	/* Make space for class representation. */
>> +	val >>= HINT_CLASS_BITS;
>> +	return hint_create(MP_CLASS_UUID, val);
>> +}
>> diff --git a/test/engine/gh-4268-uuid.result b/test/engine/gh-4268-uuid.result
>> new file mode 100644
>> index 000000000..928204507
>> --- /dev/null
>> +++ b/test/engine/gh-4268-uuid.result
> 
> 6. Since this is a feature, we probably should not call the
> test file using name pattern used for bugs. Just uuid.test.lua.

Done.

--
Serge Petrenko
sergepetrenko at tarantool.org