[tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
i.koptelov
ivan.koptelov at tarantool.org
Wed Feb 20 22:24:56 MSK 2019
>> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov at tarantool.org> wrote:
>>
>> Thanks to Alexander, I fixed my patch to use a function
>> from icu to count the length of the string.
>>
>> Changes:
>>
>
> Look, each new implementation again changes the results of
> certain tests. Let's first define the exact behaviour of the
> length() function and then write a function that satisfies these
> requirements, not vice versa. Is this the final version?
I thought that these changes in the ‘badutf’ tests are OK because we came
to an agreement that we don’t care about the results of LENGTH() on
invalid strings.
> Moreover, since Konstantin suggests as fast an implementation
> as we can, I propose to consider a sort of asm-written variant:
>
> .global ap_strlen_utf8_s
> ap_strlen_utf8_s:
> push %esi
> cld
> mov 8(%esp), %esi
> xor %ecx, %ecx
> loopa: dec %ecx
> loopb: lodsb
> shl $1, %al
> js loopa
> jc loopb
> jnz loopa
> mov %ecx, %eax
> not %eax
> pop %esi
> ret
>
>
> It is taken from http://canonical.org/~kragen/strlen-utf8
> and the author claims that it is quite fast (it seems that it doesn’t
> handle \0, but we can patch it). I didn’t bench it, so I am
> not absolutely sure that it is ‘way faster’ than other implementations.
I also came across this solution, but I considered it to be overkill.
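For comparison, here is a minimal C sketch of what the assembly above appears
to do: it counts every byte that is not a UTF-8 continuation byte (10xxxxxx).
The function name is hypothetical and not part of the patch. Unlike the
assembly, which stops at the first zero byte, this variant is bounded by
byte_len, so an embedded '\0' is counted as an ordinary one-byte symbol.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not from the patch: count symbols by counting
 * bytes that are NOT continuation bytes (top two bits != 10).
 * The loop is bounded by byte_len rather than by a terminating zero.
 */
static int
utf8_count_non_continuation(const unsigned char *str, int byte_len)
{
	assert(str != NULL || byte_len <= 0);
	int symbol_count = 0;
	for (int i = 0; i < byte_len; i++) {
		if ((str[i] & 0xc0) != 0x80)
			symbol_count++;
	}
	return symbol_count;
}
```

Note that this approach yields yet another set of results on malformed input:
a stray continuation byte, for example, is not counted at all.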
>
>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>> index 233ea2901..8ddb9780f 100644
>> --- a/src/box/sql/func.c
>> +++ b/src/box/sql/func.c
>> @@ -149,16 +149,7 @@ utf8_char_count(const unsigned char *str, int byte_len)
>> {
>> int symbol_count = 0;
>> for (int i = 0; i < byte_len;) {
>> - if ((str[i] & 0x80) == 0)
>> - i += 1;
>> - else if ((str[i] & 0xe0) == 0xc0)
>> - i += 2;
>> - else if ((str[i] & 0xf0) == 0xe0)
>> - i += 3;
>> - else if ((str[i] & 0xf8) == 0xf0)
>> - i += 4;
>> - else
>> - i += 1;
>> + U8_FWD_1_UNSAFE(str, i);
>
> This function does not handle strings in the way we’ve discussed.
Because it always does three comparisons?
#define U8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \
(((uint8_t)(leadByte)>=0xc2)+((uint8_t)(leadByte)>=0xe0)+((uint8_t)(leadByte)>=0xf0))
#define U8_FWD_1_UNSAFE(s, i) { \
(i)+=1+U8_COUNT_TRAIL_BYTES_UNSAFE((s)[i]); \
}
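To make the difference concrete, the sketch below puts the step size of the
removed branchy code next to the step size of U8_FWD_1_UNSAFE as quoted above.
The function names are hypothetical. The two agree on every valid lead byte
and differ only on bytes that cannot start a well-formed sequence, which is
one possible reason for the changed ‘badutf’ results.

```c
#include <assert.h>
#include <stdint.h>

/* Step of the code removed by the diff, for one lead byte. */
static int
step_old(uint8_t lead)
{
	if ((lead & 0x80) == 0)
		return 1;
	if ((lead & 0xe0) == 0xc0)
		return 2;
	if ((lead & 0xf0) == 0xe0)
		return 3;
	if ((lead & 0xf8) == 0xf0)
		return 4;
	return 1;
}

/* Step of U8_FWD_1_UNSAFE: one byte plus the trail byte count. */
static int
step_icu_unsafe(uint8_t lead)
{
	return 1 + (lead >= 0xc2) + (lead >= 0xe0) + (lead >= 0xf0);
}

/* Both variants agree on all valid lead bytes: 00..7f and c2..f4. */
static void
check_valid_leads_agree(void)
{
	for (int b = 0x00; b <= 0x7f; b++)
		assert(step_old((uint8_t)b) == step_icu_unsafe((uint8_t)b));
	for (int b = 0xc2; b <= 0xf4; b++)
		assert(step_old((uint8_t)b) == step_icu_unsafe((uint8_t)b));
}
```

On invalid lead bytes they diverge: for 0xc0 and 0xc1 the old code steps by 2
and the macro by 1, while for 0xf8..0xff the old code steps by 1 and the macro
by 4.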
> Furthermore, the description says that it “assumes well-formed UTF-8”,
> which in our case is not true. So who knows what may happen if we pass
> a malformed byte sequence. Not to mention that the behaviour of
> this function on invalid inputs may change later.
In its current implementation, U8_FWD_1_UNSAFE satisfies our needs safely: the
loop reads only str[i] with i < byte_len, and the returned symbol count can
never exceed byte_len.
static int
utf8_char_count(const unsigned char *str, int byte_len)
{
	int symbol_count = 0;
	for (int i = 0; i < byte_len;) {
		U8_FWD_1_UNSAFE(str, i);
		symbol_count++;
	}
	return symbol_count;
}
I agree that it is a bad idea to rely on library behaviour which may
change later. So maybe I should just inline these one-line macros?
Or use my own implementation, since it’s more efficient (but less beautiful).
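A minimal sketch of the inlining option, assuming we copy the current macro
logic into func.c under our own name (UTF8_TRAIL_BYTES and
utf8_char_count_inlined are hypothetical names). The behaviour on invalid
input is then fixed in our tree and no longer depends on the ICU version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copy of the trail byte logic; the macro name is hypothetical. */
#define UTF8_TRAIL_BYTES(lead) \
	(((uint8_t)(lead) >= 0xc2) + ((uint8_t)(lead) >= 0xe0) + \
	 ((uint8_t)(lead) >= 0xf0))

static int
utf8_char_count_inlined(const unsigned char *str, int byte_len)
{
	assert(str != NULL || byte_len <= 0);
	int symbol_count = 0;
	/*
	 * Only the lead byte str[i] with i < byte_len is ever read, and i
	 * grows by at least 1 per symbol, so the result is <= byte_len.
	 */
	for (int i = 0; i < byte_len; symbol_count++)
		i += 1 + UTF8_TRAIL_BYTES(str[i]);
	return symbol_count;
}
```

With this version an embedded '\0' counts as one symbol, and a truncated
sequence at the end of the buffer counts as one symbol without reading past
byte_len.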