[Tarantool-patches] [PATCH 4/4] box: introduce indices by UUID
Serge Petrenko
sergepetrenko at tarantool.org
Fri Apr 10 02:46:51 MSK 2020
> 6 апр. 2020 г., в 00:29, Vladislav Shpilevoy <v.shpilevoy at tarantool.org> написал(а):
>
> Thanks for the patch!
Hi! Thanks for the review!
>
> See 6 comments below.
>
> On 04/04/2020 01:02, Serge Petrenko wrote:
>> It is now possible to create an index over UUID values, either returned
>> by something like `uuid.new()` or 36- and 32- byte strings in format
>> "88ebbb9a-8c18-480f-bd04-dd5345f1573c" or "88ebbb9a8c18480fbd04dd5345f1573c".
>> 16-byte binary strings are also supported.
>>
>> Closes #4268
>> Closes #2916
>>
>> @TarantoolBot document
>> Title: Document uuid field type.
>>
>> There's a new field type -- UUID, it accepts values returned by
>> `uuid.new()`, as well as strings representing uuids, like
>> `88ebbb9a-8c18-480f-bd04-dd5345f1573c` and
>> `88ebbb9a8c18480fbd04dd5345f1573c` and 16 byte binary values, encoded in
>> msgpack.
>>
>> The index may be either unique or non-unique, nullable or non-nullable,
>> and may be a primary key.
>>
>> To create an index over a uuid field for space `test`, say:
>> ```
>> box.space.test:create_index("pk", {parts={1, 'uuid'}})
>> ```
>> Now you may insert uuids into the space:
>> ```
>> tarantool> box.space.test:insert{uuid.new()}
>> ---
>> - [e631fdcc-0e8a-4d2f-83fd-b0ce6762b13f]
>> ...
>>
>> tarantool> box.space.test:insert{"64d22e4d-ac92-4a23-899a-e59f34af5479"}
>> ---
>> - ['64d22e4d-ac92-4a23-899a-e59f34af5479']
>> ...
>>
>> tarantool> box.space.test:insert{"856dc1177dcc49969f1407f8c6c8a371"}
>> ---
>> - ['856dc1177dcc49969f1407f8c6c8a371']
>> ...
>>
>> tarantool> box.space.test:select{}
>> ---
>> - - ['64d22e4d-ac92-4a23-899a-e59f34af5479']
>> - ['856dc1177dcc49969f1407f8c6c8a371']
>> - [e631fdcc-0e8a-4d2f-83fd-b0ce6762b13f]
>
> 1. Wait. Why are they printed differently? They should be all the
> same 36 strings when you print them, no?
>
> Seems like you understood Mons to the letter. I believe he meant
> we should provide a way to insert strings to simplify moving data
> to a new space, but it does not mean we should store them as strings.
> It makes comparators slower. sscanf() to extract uuid from a string
> on *each* comparison is a huge performance hole. From what I measured
> recently (not in Tarantool), sscanf() is a *really* slow thing. Its
> usage in comparators is dangerous.
>
> On the other hand ability to insert strings means we need to convert
> them to MP_EXT after insertion, which means full scan of each tuple
> by field types and possibly its recreation. We had a similar problem
> with 'double' type. After all it was decided to keep as is and let
> users enforce the needed type when they want using ffi.cast('double').
>
> IMO, we should just disallow both ability to store strings, and to
> insert strings. I mean, you can't insert a string into a decimal field,
> for example. So a user anyway will never be able to insert only Lua
> simple types. Cdata is enforced for decimal and double right now.
>
> Better keep it simple. If someone wants to convert their existing spaces
> to mp uuid, they easily can write a trivial function which would replace
> string uuid with cdata uuid before insertion into the new space. We
> should not sacrifice perf for that.
>
> This is only my opinion, you may want to ask Mons and Kirill.
So, I consulted. Mons, he’s ok with allowing only cdata.
>
>
> 2. It is worth mentioning comparison order. Are they compared as
> strings, or individual uuid fields are compared? In the latter case
> in what order the fields are compared?
Ok, mentioned in commit message. The order’s lexicographical.
>
> Also probably string comparison vs by-field comparison vs memcmp
> comparison will be the same in case fields are stored in msgpack
> in the same order as in the string, and in big-endian. I didn't check
> but it is worth investigation. It would be good to be able to use
> memcmp.
>
>> diff --git a/src/box/tuple_compare.cc b/src/box/tuple_compare.cc
>> index 3f8a0ce24..aad69340d 100644
>> --- a/src/box/tuple_compare.cc
>> +++ b/src/box/tuple_compare.cc
>> @@ -378,6 +382,61 @@ mp_compare_bin(const char *field_a, const char *field_b)
>> return COMPARE_RESULT(size_a, size_b);
>> }
>>
>> +static int
>> +mp_compare_uuid_with_type(const char *field_a, enum mp_type type_a,
>> + const char *field_b, enum mp_type type_b)
>> +{
>> + struct tt_uuid uuid_a, uuid_b;
>> + struct tt_uuid *ret;
>> + const char *str;
>> + uint32_t len;
>> + int rc;
>> + switch (type_a) {
>> + case MP_STR:
>> + str = mp_decode_str(&field_a, &len);
>> + rc = tt_uuid_from_lstring(str, len, &uuid_a);
>
> 3. This is perf killer. Basically it would be faster to store
> uuids as a string, or just as 16 bytes binary. Because memcmp
> could be used directly then.
Switched to memcmp.
>
>> + assert(rc == 0);
>> + break;
>> + case MP_BIN:
>> + str = mp_decode_bin(&field_a, &len);
>> + ret = uuid_unpack(&str, len, &uuid_a);
>> + assert(ret != NULL);
>> + break;
>> + case MP_EXT:
>> + ret = mp_decode_uuid(&field_a, &uuid_a);
>> + assert(ret != NULL);
>> + break;
>> + default:
>> + unreachable();
>> + }
>> + switch (type_b) {
>> + case MP_STR:
>> + str = mp_decode_str(&field_b, &len);
>> + rc = tt_uuid_from_lstring(str, len, &uuid_b);
>> + assert(rc == 0);
>> + break;
>> + case MP_BIN:
>> + str = mp_decode_bin(&field_b, &len);
>> + ret = uuid_unpack(&str, len, &uuid_b);
>> + assert(ret != NULL);
>> + break;
>> + case MP_EXT:
>> + ret = mp_decode_uuid(&field_b, &uuid_b);
>> + assert(ret != NULL);
>> + break;
>> + default:
>> + unreachable();
>> + }
>> + return tt_uuid_compare(&uuid_a, &uuid_b);
>
> 4. If we will decide to store uuids as MP_EXT, it may be
> useful to introduce a function mp_compare_uuid, by analogue
> with other mp_compare_* functions. To compare them directly
> using memcmp, without unpacking. But see above my comment
> about order. memcmp validness depends on how uuid is stored
> in MessagePack.
>
Done. uuid fields are stored in big-endian manner and in the same
order they’re compared, so memcmp is valid.
>> +}
>> +
>> +static inline int
>> +mp_compare_uuid(const char *field_a, const char *field_b)
>> +{
>> + return mp_compare_uuid_with_type(field_a, mp_typeof(*field_a),
>> + field_b, mp_typeof(*field_b));
>> +}
>> @@ -1578,6 +1642,21 @@ hint_decimal(decimal_t *dec)
>> return hint_create(MP_CLASS_NUMBER, val);
>> }
>>
>> +static inline hint_t
>> +hint_uuid(struct tt_uuid *uuid)
>> +{
>> + /* Simply take the first part of the UUID as hint. */
>
> 5. Why only first?
I meant, take the ‘high’ part of the UUID. The one which’s compared first.
>
>> + uint64_t val = 0;
>> + val |= uuid->time_low;
>> + val <<= sizeof(uuid->time_mid) * CHAR_BIT;
>> + val |= uuid->time_mid;
>> + val <<= sizeof(uuid->time_hi_and_version) * CHAR_BIT;
>> + val |= uuid->time_hi_and_version;
>> + /* Make space for class representation. */
>> + val >>= HINT_CLASS_BITS;
>> + return hint_create(MP_CLASS_UUID, val);
>> +}
>> diff --git a/test/engine/gh-4268-uuid.result b/test/engine/gh-4268-uuid.result
>> new file mode 100644
>> index 000000000..928204507
>> --- /dev/null
>> +++ b/test/engine/gh-4268-uuid.result
>
> 6. Since this is a feature, we probably should not call the
> test file using name pattern used for bugs. Just uuid.test.lua.
Done.
--
Serge Petrenko
sergepetrenko at tarantool.org
More information about the Tarantool-patches
mailing list