[tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'

n.pettik korablev at tarantool.org
Wed Feb 20 19:04:46 MSK 2019



> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov at tarantool.org> wrote:
> 
> Thanks to Alexander, I fixed my patch to use a function
> from icu to count the length of the string.
> 
> Changes:
> 

Look, each new implementation again changes the results of
certain tests. Let's first define the exact behaviour of the
length() function and then write a function that satisfies these
requirements, not vice versa. Is this the final version?
Moreover, since Konstantin suggests as fast an implementation
as we can get, I propose considering a sort of asm-written variant:

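        .global ap_strlen_utf8_s
ap_strlen_utf8_s:
        push %esi
        cld                     # make lodsb walk forward
        mov 8(%esp), %esi       # %esi = string pointer argument
        xor %ecx, %ecx          # %ecx will hold -(count + 1)
loopa:  dec %ecx                # first pass: -1; later: one more char counted
loopb:  lodsb                   # %al = next byte, advance %esi
        shl $1, %al             # CF = bit 7, SF = bit 6, ZF = low 7 bits zero
        js loopa                # 01xxxxxx or 11xxxxxx: count it
        jc loopb                # 10xxxxxx: continuation byte, skip it
        jnz loopa               # nonzero 00xxxxxx: ASCII, count it
        mov %ecx, %eax          # zero byte reached: end of string
        not %eax                # ~(-(count + 1)) == count
        pop %esi
        ret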


It is taken from http://canonical.org/~kragen/strlen-utf8
and the author claims that it is quite fast (it seems that it
doesn’t handle \0, but we can patch it). I didn’t benchmark it, so I am
not absolutely sure that it is ‘way faster’ than other implementations.
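
For illustration only, here is a rough C sketch of the same counting rule
with the \0 problem removed. The routine above counts every byte that is
not a continuation byte (10xxxxxx) and stops at the first zero byte; the
sketch takes an explicit byte length instead, so an embedded \0 is counted
like any other ASCII byte. The name utf8_count_non_continuation is
hypothetical, this is not the patch's code, and I have not benchmarked it.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not taken from the patch.
 * Counts the bytes that are not UTF-8 continuation bytes (10xxxxxx).
 * The length is passed explicitly, so '\0' does not end the scan.
 */
static size_t
utf8_count_non_continuation(const unsigned char *str, size_t byte_len)
{
	size_t count = 0;
	for (size_t i = 0; i < byte_len; i++) {
		/* The comparison yields 1 for lead/ASCII bytes, 0 otherwise. */
		count += (str[i] & 0xC0) != 0x80;
	}
	return count;
}
```

Note that under this rule a stray continuation byte is not counted at all,
which may or may not be what we want from length().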

> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> index 233ea2901..8ddb9780f 100644
> --- a/src/box/sql/func.c
> +++ b/src/box/sql/func.c
> @@ -149,16 +149,7 @@ utf8_char_count(const unsigned char *str, int byte_len)
> {
> 	int symbol_count = 0;
> 	for (int i = 0; i < byte_len;) {
> -		if ((str[i] & 0x80) == 0)
> -			i += 1;
> -		else if ((str[i] & 0xe0) == 0xc0)
> -			i += 2;
> -		else if ((str[i] & 0xf0) == 0xe0)
> -			i += 3;
> -		else if ((str[i] & 0xf8) == 0xf0)
> -			i += 4;
> -		else
> -			i += 1;
> +		U8_FWD_1_UNSAFE(str, i);

This macro handles strings not in the way we’ve discussed.
Furthermore, its description says that it “assumes well-formed UTF-8”,
which is not true in our case. So who knows what may happen if we pass
a malformed byte sequence. Not to mention that the behaviour of
this macro on invalid inputs may change later.
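
To make the concern concrete, below is a minimal sketch of a counter whose
behaviour on malformed input is fully defined by our own code. The policy
is only an assumption for the sake of the example (it is not necessarily
the one we agreed on): a lead byte together with the continuation bytes
that really follow it is one symbol, and any other byte is one symbol by
itself. The name utf8_char_count_lenient is hypothetical.

```c
/*
 * Hypothetical helper, not part of the patch. Assumed policy:
 * a lead byte plus the continuation bytes that actually follow it
 * count as one symbol; any other byte counts as one symbol by itself.
 */
static int
utf8_char_count_lenient(const unsigned char *str, int byte_len)
{
	int symbol_count = 0;
	int i = 0;
	while (i < byte_len) {
		unsigned char lead = str[i++];
		/* Number of continuation bytes the lead byte announces. */
		int expected = 0;
		if ((lead & 0xe0) == 0xc0)
			expected = 1;
		else if ((lead & 0xf0) == 0xe0)
			expected = 2;
		else if ((lead & 0xf8) == 0xf0)
			expected = 3;
		/* Stop at the buffer end or at the first non-continuation byte. */
		while (expected-- > 0 && i < byte_len &&
		       (str[i] & 0xc0) == 0x80)
			i++;
		symbol_count++;
	}
	return symbol_count;
}
```

With this policy a truncated sequence never makes the loop step past
byte_len, and the result for any byte string does not depend on the
library version.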




