[tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
n.pettik
korablev at tarantool.org
Mon Feb 25 18:10:27 MSK 2019
> On 25 Feb 2019, at 14:09, i.koptelov <ivan.koptelov at tarantool.org> wrote:
>> On 22 Feb 2019, at 15:59, n.pettik <korablev at tarantool.org> wrote:
>>> On 20 Feb 2019, at 22:24, i.koptelov <ivan.koptelov at tarantool.org> wrote:
>>>>>
>>>>> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov at tarantool.org> wrote:
>>>>>
>>>>> Thanks to Alexander, I fixed my patch to use a function
>>>>> from icu to count the length of the string.
>>>>>
>>>>> Changes:
>>
>> Travis has failed. Please, make sure it is OK before sending the patch.
>> It doesn’t fail on my local (Mac) machine, so I guess this fail appears
>> only on Linux system.
> The problem is with badutf test (LENGTH tests).
> I’ve tried to reproduce the problem on my machine (using Docker with Ubuntu),
> but with no success. It seems like that different versions of icu4c lib
> provide different behavior of U8_FWD_1_UNSAFE.
> I propose to just inline these two lines (which we need) into
> some util function. Logic of these lines seems to be quite simple
> and obvious (after you read about utf8 on wikipedia), so I see no
> problem.
>
> #define U8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \
> (((uint8_t)(leadByte)>=0xc2)+((uint8_t)(leadByte)>=0xe0)+((uint8_t)(leadByte)>=0xf0))
>
> #define U8_FWD_1_UNSAFE(s, i) { \
> (i)+=1+U8_COUNT_TRAIL_BYTES_UNSAFE((s)[i]); \
> }
That’s I was talking about. But using the macros with the same
name as in utf library doesn’t look like a good pattern. Yep, you
can use define guards like:
#ifdef U8_COUNT_TRAIL_BYTES_UNSAFE
#undef U8_COUNT_TRAIL_BYTES_UNSAFE
#endif
#define U8_COUNT_TRAIL_BYTES_UNSAFE
But I’d rather just give it another name.
Hence, taking into account comment below,
we are going to substitute SQL_SKIP_UTF8() with
implementation borrowed from icu library.
>>>> Furthermore, description says that it “assumes well-formed UTF-8”,
>>>> which in our case is not true. So who knows what may happen if we pass
>>>> malformed byte sequence. I am not even saying that behaviour of
>>>> this function on invalid inputs may change later.
>>>
>>> In it's current implementation U8_FWD_1_UNSAFE satisfy our needs safely. Returned
>>> symbol length would never exceed byte_len.
>>>
>>> static int
>>> utf8_char_count(const unsigned char *str, int byte_len)
>>> {
>>> int symbol_count = 0;
>>> for (int i = 0; i < byte_len;) {
>>> U8_FWD_1_UNSAFE(str, i);
>>> symbol_count++;
>>> }
>>> return symbol_count;
>>> }
>>>
>>> I agree that it is a bad idea to relay on lib behaviour which may
>>> change lately. So maybe I would just inline these one line macros?
>>> Or use my own implementation, since it’s more efficient (but less beautiful)
>>
>> Nevermind, let's keep it as is.
>> I really worry only about the fact that in other places SQL_SKIP_UTF8
>> is used instead. It handles only two-bytes utf8 symbols, meanwhile
>> U8_FWD_1_UNSAFE() accounts three and four bytes length symbols.
>> Can we use everywhere the same pattern?
> Yes, I think, we can.
Ok, then will be waiting for updates.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.tarantool.org/pipermail/tarantool-patches/attachments/20190225/f5556275/attachment.html>
More information about the Tarantool-patches
mailing list