From: "i.koptelov" <ivan.koptelov@tarantool.org>
To: tarantool-patches@freelists.org
Cc: "n.pettik" <korablev@tarantool.org>
Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
Date: Wed, 20 Feb 2019 22:24:56 +0300
Message-ID: <583EC402-D1FF-45C4-B18B-8A06D4362200@tarantool.org>
In-Reply-To: <DD522CAF-BD70-4E66-B8A4-C1837370B81D@tarantool.org>

>> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>>
>> Thanks to Alexander, I fixed my patch to use a function
>> from icu to count the length of the string.
>>
>> Changes:
>>
>
> Look, each next implementation again and again changes
> results of certain tests. Let's first define the exact behaviour of
> the length() function and then write a function which will satisfy these
> requirements, not vice versa. Is this the final version?

I thought that these changes in the ‘badutf’ tests were OK, because we
agreed that we don’t care about the results of LENGTH() on invalid
strings.

> Moreover, since Konstantin suggests as fast an implementation
> as we can, I propose to consider a sort of asm-written variant:
>
> .global ap_strlen_utf8_s
> ap_strlen_utf8_s:
>     push %esi
>     cld
>     mov 8(%esp), %esi
>     xor %ecx, %ecx
> loopa: dec %ecx
> loopb: lodsb
>     shl $1, %al
>     js loopa
>     jc loopb
>     jnz loopa
>     mov %ecx, %eax
>     not %eax
>     pop %esi
>     ret
>
>
> It is taken from http://canonical.org/~kragen/strlen-utf8
> and the author claims that it is quite fast (seems like it doesn’t
> handle \0, but we can patch it). I didn’t bench it, so I am
> not absolutely sure that it is ‘way faster’ than other implementations.

I’ve also come across this solution, but I considered it to be kind of
overkill.
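For reference, here is a rough C sketch of what that asm loop appears to compute: it counts every byte that is not a UTF-8 continuation byte (10xxxxxx). The function name is hypothetical and not part of the patch. Unlike the asm, this variant is bounded by byte_len instead of stopping at NUL, so an embedded '\0' counts as one character. I have not benchmarked it.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not taken from the patch: count the bytes
 * that are NOT UTF-8 continuation bytes (10xxxxxx). The loop is
 * bounded by byte_len rather than by a terminating NUL, so '\0'
 * is counted like any other one-byte character.
 */
static int
utf8_count_non_continuation(const unsigned char *str, int byte_len)
{
	assert(str != NULL && byte_len >= 0);
	int count = 0;
	for (int i = 0; i < byte_len; i++) {
		/* Lead bytes and ASCII bytes start a new character. */
		if ((str[i] & 0xc0) != 0x80)
			count++;
	}
	return count;
}
```

On well-formed input this should give the same count as walking lead bytes; on malformed input it simply counts non-continuation bytes, which is one more possible definition of LENGTH() for invalid strings.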
>
>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>> index 233ea2901..8ddb9780f 100644
>> --- a/src/box/sql/func.c
>> +++ b/src/box/sql/func.c
>> @@ -149,16 +149,7 @@ utf8_char_count(const unsigned char *str, int byte_len)
>>  {
>>  	int symbol_count = 0;
>>  	for (int i = 0; i < byte_len;) {
>> -		if ((str[i] & 0x80) == 0)
>> -			i += 1;
>> -		else if ((str[i] & 0xe0) == 0xc0)
>> -			i += 2;
>> -		else if ((str[i] & 0xf0) == 0xe0)
>> -			i += 3;
>> -		else if ((str[i] & 0xf8) == 0xf0)
>> -			i += 4;
>> -		else
>> -			i += 1;
>> +		U8_FWD_1_UNSAFE(str, i);
>
> This function handles string not in the way we’ve discussed.

Because it always does three comparisons?

#define U8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \
    (((uint8_t)(leadByte)>=0xc2)+((uint8_t)(leadByte)>=0xe0)+((uint8_t)(leadByte)>=0xf0))

#define U8_FWD_1_UNSAFE(s, i) { \
    (i)+=1+U8_COUNT_TRAIL_BYTES_UNSAFE((s)[i]); \
}

> Furthermore, description says that it “assumes well-formed UTF-8”,
> which in our case is not true. So who knows what may happen if we pass
> malformed byte sequence. I am not even saying that behaviour of
> this function on invalid inputs may change later.

In its current implementation, U8_FWD_1_UNSAFE satisfies our needs
safely. The returned symbol count can never exceed byte_len, because
each iteration advances i by at least one byte, and str[i] is only read
while i < byte_len.

static int
utf8_char_count(const unsigned char *str, int byte_len)
{
	int symbol_count = 0;
	for (int i = 0; i < byte_len;) {
		U8_FWD_1_UNSAFE(str, i);
		symbol_count++;
	}
	return symbol_count;
}

I agree that it is a bad idea to rely on library behaviour which may
change later. So maybe I should just inline these one-line macros? Or
use my own implementation, since it’s more efficient (but less
beautiful).
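To make the inlining option concrete, here is a minimal sketch, assuming we copy the comparison logic of the two ICU macros quoted above into a local helper. The names trail_bytes(), utf8_char_count_inlined() and utf8_char_count_branches() are hypothetical; the second function is just the branch-based loop removed by the diff, kept for comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copy of the quoted ICU logic; the name is hypothetical. */
static inline int
trail_bytes(uint8_t lead)
{
	return (lead >= 0xc2) + (lead >= 0xe0) + (lead >= 0xf0);
}

/* Variant with the macro logic inlined, no dependency on ICU headers. */
static int
utf8_char_count_inlined(const unsigned char *str, int byte_len)
{
	assert(str != NULL && byte_len >= 0);
	int symbol_count = 0;
	for (int i = 0; i < byte_len;) {
		/* Advances by 1..4 bytes; str[i] is read only while i < byte_len. */
		i += 1 + trail_bytes(str[i]);
		symbol_count++;
	}
	return symbol_count;
}

/* The branch-based loop that the diff removes, for comparison. */
static int
utf8_char_count_branches(const unsigned char *str, int byte_len)
{
	assert(str != NULL && byte_len >= 0);
	int symbol_count = 0;
	for (int i = 0; i < byte_len;) {
		if ((str[i] & 0x80) == 0)
			i += 1;
		else if ((str[i] & 0xe0) == 0xc0)
			i += 2;
		else if ((str[i] & 0xf0) == 0xe0)
			i += 3;
		else if ((str[i] & 0xf8) == 0xf0)
			i += 4;
		else
			i += 1;
		symbol_count++;
	}
	return symbol_count;
}
```

Both variants agree on well-formed strings, including ones with an embedded '\0'. They can disagree on malformed input: for example, a 0xf8 lead byte followed by three 0x80 bytes is one symbol for the inlined variant and four for the branch-based one, which would explain why results on invalid strings shift between implementations.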