[tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'


From: "n.pettik" <korablev@tarantool.org>
To: tarantool-patches@freelists.org
Cc: Ivan Koptelov <ivan.koptelov@tarantool.org>
Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
Date: Wed, 20 Feb 2019 19:04:46 +0300
Message-ID: <DD522CAF-BD70-4E66-B8A4-C1837370B81D@tarantool.org>
In-Reply-To: <427EE913-3E58-413F-A645-DBF83C809334@tarantool.org>

> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov@tarantool.org> wrote:
> 
> Thanks to Alexander, I fixed my patch to use a function
> from icu to count the length of the string.
> 
> Changes:
> 

Look, each new implementation changes the results of certain
tests yet again. Let's first define the exact behaviour of the
length() function and then write a function that satisfies those
requirements, not vice versa. Is this the final version?
Moreover, since Konstantin suggests making the implementation as
fast as we can, I propose considering an asm-written variant:

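        .global ap_strlen_utf8_s
ap_strlen_utf8_s:
        push %esi               # save callee-saved register
        cld                     # make lodsb walk forward
        mov 8(%esp), %esi       # esi = string pointer (first argument)
        xor %ecx, %ecx          # ecx = 0, the counter runs negative
loopa:  dec %ecx                # count one character (initially -1)
loopb:  lodsb                   # al = next byte
        shl $1, %al             # CF = bit 7, SF = bit 6 of the byte
        js loopa                # bit 6 set: ASCII 0x40-0x7f or lead byte
        jc loopb                # 10xxxxxx: continuation byte, skip it
        jnz loopa               # non-zero ASCII below 0x40
        mov %ecx, %eax          # hit '\0': ecx == -(count + 1)
        not %eax                # ~(-(count + 1)) == count
        pop %esi
        ret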

It is taken from http://canonical.org/~kragen/strlen-utf8
and the author claims that it is quite fast (it seems that it
doesn’t handle \0, but we can patch it). I didn’t bench it, so I am
not absolutely sure that it is ‘way faster’ than other implementations.
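
For reference, the same counting idea can be written in C with an
explicit byte length, which is one way the '\0' limitation could be
patched. This is only a sketch: the name utf8_len_counted is
hypothetical, it has not been benchmarked, and it treats any byte
that is not a continuation byte as the start of a character, so
stray continuation bytes in malformed input are silently skipped.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count UTF-8 characters in a buffer of known byte length.
 * Same idea as the asm above: count every byte that is not a
 * continuation byte (10xxxxxx). The loop is bounded by byte_len
 * instead of stopping at '\0', so an embedded zero byte is counted
 * as an ordinary one-byte character.
 * Hypothetical helper, not part of the patch.
 */
static size_t
utf8_len_counted(const unsigned char *str, size_t byte_len)
{
	assert(str != NULL || byte_len == 0);
	size_t count = 0;
	for (size_t i = 0; i < byte_len; i++) {
		/* (b & 0xc0) == 0x80 holds only for continuation bytes. */
		if ((str[i] & 0xc0) != 0x80)
			count++;
	}
	return count;
}
```

With this variant the three-byte buffer "a\0b" has length 3, while
the asm version would stop at the zero byte and report 1.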

> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> index 233ea2901..8ddb9780f 100644
> --- a/src/box/sql/func.c
> +++ b/src/box/sql/func.c
> @@ -149,16 +149,7 @@ utf8_char_count(const unsigned char *str, int byte_len)
> {
> 	int symbol_count = 0;
> 	for (int i = 0; i < byte_len;) {
> -		if ((str[i] & 0x80) == 0)
> -			i += 1;
> -		else if ((str[i] & 0xe0) == 0xc0)
> -			i += 2;
> -		else if ((str[i] & 0xf0) == 0xe0)
> -			i += 3;
> -		else if ((str[i] & 0xf8) == 0xf0)
> -			i += 4;
> -		else
> -			i += 1;
> +		U8_FWD_1_UNSAFE(str, i);

This function doesn’t handle strings in the way we’ve discussed.
Furthermore, its description says that it “assumes well-formed UTF-8”,
which is not true in our case. So who knows what may happen if we pass
a malformed byte sequence. Not to mention that the behaviour of
this function on invalid inputs may change later.
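
To illustrate what a defined behaviour on malformed input could
look like, here is a sketch of a bounded step function. The policy
it implements is an assumption made for the example, not the one
agreed on in this thread: a byte that does not start a structurally
complete sequence counts as a single one-byte symbol. The names
utf8_step_checked and utf8_char_count_checked are hypothetical, and
the check covers only lead/continuation structure, not overlong
forms or surrogates.

```c
#include <stddef.h>

/*
 * Return how many bytes the symbol starting at str[i] occupies,
 * never stepping past byte_len. Assumed policy: an invalid lead
 * byte, a stray continuation byte, or a truncated or broken
 * sequence is consumed as one byte.
 */
static size_t
utf8_step_checked(const unsigned char *str, size_t i, size_t byte_len)
{
	unsigned char b = str[i];
	size_t need;
	if ((b & 0x80) == 0)
		return 1;	/* ASCII, including '\0' */
	else if ((b & 0xe0) == 0xc0)
		need = 2;
	else if ((b & 0xf0) == 0xe0)
		need = 3;
	else if ((b & 0xf8) == 0xf0)
		need = 4;
	else
		return 1;	/* stray continuation or invalid lead byte */
	if (byte_len - i < need)
		return 1;	/* sequence is cut off by the end of buffer */
	for (size_t k = 1; k < need; k++) {
		if ((str[i + k] & 0xc0) != 0x80)
			return 1;	/* expected a continuation byte */
	}
	return need;
}

/* Count symbols in a buffer of byte_len bytes under the policy above. */
static size_t
utf8_char_count_checked(const unsigned char *str, size_t byte_len)
{
	size_t count = 0;
	for (size_t i = 0; i < byte_len; count++)
		i += utf8_step_checked(str, i, byte_len);
	return count;
}
```

Unlike a step that trusts the lead byte, this one cannot move the
index past byte_len on a truncated sequence, so the result for
every input is fixed by the code itself.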

Thread overview: 22+ messages
2019-01-29  9:56 [tarantool-patches] " Ivan Koptelov
2019-01-29 16:35 ` [tarantool-patches] " n.pettik
2019-02-04 12:34   ` Ivan Koptelov
2019-02-05 13:50     ` n.pettik
2019-02-07 15:14       ` i.koptelov
2019-02-11 13:15         ` n.pettik
2019-02-13 15:46           ` i.koptelov
2019-02-14 12:57             ` n.pettik
2019-02-20 13:54               ` i.koptelov
2019-02-20 15:47                 ` i.koptelov
2019-02-20 16:04                   ` n.pettik [this message]
2019-02-20 18:08                     ` Vladislav Shpilevoy
2019-02-20 19:24                     ` i.koptelov
2019-02-22 12:59                       ` n.pettik
2019-02-25 11:09                         ` i.koptelov
2019-02-25 15:10                           ` n.pettik
2019-02-26 13:33                             ` i.koptelov
2019-02-26 17:50                               ` n.pettik
2019-02-26 18:44                                 ` i.koptelov
2019-02-26 20:16                                   ` Vladislav Shpilevoy
2019-03-04 11:59                                     ` i.koptelov
2019-03-04 15:30 ` Kirill Yukhin