[tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'

Mon Feb 25 14:09:05 MSK 2019


> On 22 Feb 2019, at 15:59, n.pettik <korablev at tarantool.org> wrote:
> 
> 
> 
>> On 20 Feb 2019, at 22:24, i.koptelov <ivan.koptelov at tarantool.org> wrote:
>> 
>>>> 
>>>> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov at tarantool.org> wrote:
>>>> 
>>>> Thanks to Alexander, I fixed my patch to use a function
>>>> from icu to count the length of the string.
>>>> 
>>>> Changes:
> 
> Travis has failed. Please, make sure it is OK before sending the patch.
> It doesn’t fail on my local (Mac) machine, so I guess this fail appears
> only on Linux system.
The problem is with badutf test (LENGTH tests).
I’ve tried to reproduce the problem on my machine (using Docker with Ubuntu),
but with no success. It seems like that different versions of icu4c lib
provide different behavior of U8_FWD_1_UNSAFE.
I propose to just inline these two lines (which we need) into
some util function. Logic of these lines seems to be quite simple
and obvious (after you read about utf8 on wikipedia), so I see no
problem.

#define U8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \
   (((uint8_t)(leadByte)>=0xc2)+((uint8_t)(leadByte)>=0xe0)+((uint8_t)(leadByte)>=0xf0))

#define U8_FWD_1_UNSAFE(s, i) { \
   (i)+=1+U8_COUNT_TRAIL_BYTES_UNSAFE((s)[i]); \
}


> 
>>> Furthermore, description says that it “assumes well-formed UTF-8”,
>>> which in our case is not true. So who knows what may happen if we pass
>>> malformed byte sequence. I am not even saying that behaviour of
>>> this function on invalid inputs may change later.
>> 
>> In it's current implementation U8_FWD_1_UNSAFE satisfy our needs safely. Returned
>> symbol length would never exceed byte_len.
>> 
>> static int
>> utf8_char_count(const unsigned char *str, int byte_len)
>> {
>> 	int symbol_count = 0;
>> 	for (int i = 0; i < byte_len;) {
>> 		U8_FWD_1_UNSAFE(str, i);
>> 		symbol_count++;
>> 	}
>> 	return symbol_count;
>> }
>> 
>> I agree that it is a bad idea to relay on lib behaviour which may
>> change lately. So maybe I would just inline these one line macros?
>> Or use my own implementation, since it’s more efficient (but less beautiful)
> 
> Nevermind, let's keep it as is.
> I really worry only about the fact that in other places SQL_SKIP_UTF8
> is used instead. It handles only two-bytes utf8 symbols, meanwhile
> U8_FWD_1_UNSAFE() accounts three and four bytes length symbols.
> Can we use everywhere the same pattern?
Yes, I think, we can.