From: "n.pettik" <korablev@tarantool.org>
To: tarantool-patches@freelists.org
Cc: Ivan Koptelov <ivan.koptelov@tarantool.org>
Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
Date: Wed, 20 Feb 2019 19:04:46 +0300
Message-ID: <DD522CAF-BD70-4E66-B8A4-C1837370B81D@tarantool.org>
In-Reply-To: <427EE913-3E58-413F-A645-DBF83C809334@tarantool.org>

> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>
> Thanks to Alexander, I fixed my patch to use a function
> from icu to count the length of the string.
>
> Changes:
>

Look, each new implementation changes the results of certain tests
yet again. Let’s first define the exact behaviour of the length()
function and then write a function that satisfies those requirements,
not vice versa. Is this the final version?

Moreover, since Konstantin suggests making the implementation as fast
as we can, I propose considering an asm-written variant:

.global ap_strlen_utf8_s
ap_strlen_utf8_s:
        push %esi
        cld
        mov 8(%esp), %esi
        xor %ecx, %ecx
loopa:  dec %ecx
loopb:  lodsb
        shl $1, %al
        js loopa
        jc loopb
        jnz loopa
        mov %ecx, %eax
        not %eax
        pop %esi
        ret

It is taken from http://canonical.org/~kragen/strlen-utf8 and the
author claims that it is quite fast (it seems it doesn’t handle \0,
but we can patch it). I didn’t bench it, so I am not absolutely sure
that it is ‘way faster’ than other implementations.

> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> index 233ea2901..8ddb9780f 100644
> --- a/src/box/sql/func.c
> +++ b/src/box/sql/func.c
> @@ -149,16 +149,7 @@ utf8_char_count(const unsigned char *str, int byte_len)
>  {
>  	int symbol_count = 0;
>  	for (int i = 0; i < byte_len;) {
> -		if ((str[i] & 0x80) == 0)
> -			i += 1;
> -		else if ((str[i] & 0xe0) == 0xc0)
> -			i += 2;
> -		else if ((str[i] & 0xf0) == 0xe0)
> -			i += 3;
> -		else if ((str[i] & 0xf8) == 0xf0)
> -			i += 4;
> -		else
> -			i += 1;
> +		U8_FWD_1_UNSAFE(str, i);

This function doesn’t handle strings in the way we’ve discussed.
Furthermore, its description says that it “assumes well-formed UTF-8”,
which is not true in our case. So who knows what may happen if we pass
a malformed byte sequence. Not to mention that the behaviour of this
function on invalid input may change later.
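
For comparison, the counting rule used by the asm routine above (count
every byte that is not a 10xxxxxx continuation byte) can be sketched in
C with an explicit length bound instead of a NUL terminator, so that an
embedded '\0' is counted like any other one-byte character. This is
only an illustration under assumptions: the name
utf8_count_non_continuation is hypothetical, it is not part of the
patch, and it is not benchmarked.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch under review.
 * Counts the bytes in [str, str + byte_len) that are NOT UTF-8
 * continuation bytes (bit pattern 10xxxxxx). For well-formed UTF-8
 * this equals the number of code points. The loop is bounded by
 * byte_len, so '\0' bytes inside the string are counted too.
 */
size_t
utf8_count_non_continuation(const unsigned char *str, size_t byte_len)
{
	assert(str != NULL || byte_len == 0);
	size_t count = 0;
	for (size_t i = 0; i < byte_len; i++) {
		/* 0xc0 masks the two top bits; 0x80 means "10xxxxxx". */
		if ((str[i] & 0xc0) != 0x80)
			count++;
	}
	return count;
}
```

On malformed input the result is still well defined: a stray
continuation byte adds nothing and a lead byte with missing
continuation bytes adds one. Whether that is the behaviour we want
from length() is exactly the question that has to be settled first.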