From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id A19CE2794A for ; Wed, 20 Feb 2019 14:24:59 -0500 (EST) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id ZIRrsf8cMRtJ for ; Wed, 20 Feb 2019 14:24:59 -0500 (EST) Received: from smtp5.mail.ru (smtp5.mail.ru [94.100.179.24]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id DABDC27913 for ; Wed, 20 Feb 2019 14:24:58 -0500 (EST) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 12.0 \(3445.100.39\)) Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0' From: "i.koptelov" In-Reply-To: Date: Wed, 20 Feb 2019 22:24:56 +0300 Content-Transfer-Encoding: quoted-printable Message-Id: <583EC402-D1FF-45C4-B18B-8A06D4362200@tarantool.org> References: <15e143f4-3ea7-c7d6-d8ac-8a0e20b76449@tarantool.org> <1560FF96-FECD-4368-8AF8-F8F2AE7696E3@tarantool.org> <07DBA796-6DD4-41DD-8438-104FE3AE05BB@tarantool.org> <4F4E0A7E-199C-4647-A49C-DD0E8A216527@tarantool.org> <8EF5CE57-C6B5-493C-94CC-AA3C88639485@tarantool.org> <7E6CE8AA-512D-4472-9DBD-8159073386C5@tarantool.org> <25649276-74CD-46E7-A1EB-F4CE299E637C@tarantool.org> <427EE913-3E58-413F-A645-DBF83C809334@tarantool.org> Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: tarantool-patches@freelists.org Cc: "n.pettik" >> On 20 Feb 2019, at 18:47, i.koptelov = wrote: >>=20 >> Thanks to Alexander, I fixed my patch to use a function >> from icu to count the length of the string. >>=20 >> Changes: >>=20 >=20 > Look, each next implementation again and again changes > results of certain tests. Lets firstly define exact behaviour of > length() function and then write function which will satisfy these > requirements, not vice versa. Is this the final version? I thought that these changes in =E2=80=98badutf=E2=80=99 tests are OK = because we came to an agreement that we don=E2=80=99t care for results of LENGTH() on invalid strings. > Moreover, since Konstantin suggest as fast implementation > as we can, I propose to consider sort of asm written variant: >=20 > .global ap_strlen_utf8_s > ap_strlen_utf8_s: > push %esi > cld > mov 8(%esp), %esi > xor %ecx, %ecx > loopa: dec %ecx > loopb: lodsb > shl $1, %al > js loopa > jc loopb > jnz loopa > mov %ecx, %eax > not %eax > pop %esi > ret >=20 >=20 > It is taken from http://canonical.org/~kragen/strlen-utf8 > and author claims that quite fast (seems like it doesn=E2=80=99t > handle \0, but we can patch it). I didn=E2=80=99t bench it, so I am > not absolutely sure that it =E2=80=98way faster=E2=80=99 than other = implementations. I=E2=80=99ve also came across this solution, but I considered it to be = kind of overkill. >=20 >> diff --git a/src/box/sql/func.c b/src/box/sql/func.c >> index 233ea2901..8ddb9780f 100644 >> --- a/src/box/sql/func.c >> +++ b/src/box/sql/func.c >> @@ -149,16 +149,7 @@ utf8_char_count(const unsigned char *str, int = byte_len) >> { >> int symbol_count =3D 0; >> for (int i =3D 0; i < byte_len;) { >> - if ((str[i] & 0x80) =3D=3D 0) >> - i +=3D 1; >> - else if ((str[i] & 0xe0) =3D=3D 0xc0) >> - i +=3D 2; >> - else if ((str[i] & 0xf0) =3D=3D 0xe0) >> - i +=3D 3; >> - else if ((str[i] & 0xf8) =3D=3D 0xf0) >> - i +=3D 4; >> - else >> - i +=3D 1; >> + U8_FWD_1_UNSAFE(str, i); >=20 > This function handles string not in the way we=E2=80=99ve discussed. Because it always does three comparisons? #define U8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \ = (((uint8_t)(leadByte)>=3D0xc2)+((uint8_t)(leadByte)>=3D0xe0)+((uint8_t)(le= adByte)>=3D0xf0)) #define U8_FWD_1_UNSAFE(s, i) { \ (i)+=3D1+U8_COUNT_TRAIL_BYTES_UNSAFE((s)[i]); \ } > Furthermore, description says that it =E2=80=9Cassumes well-formed = UTF-8=E2=80=9D, > which in our case is not true. So who knows what may happen if we pass > malformed byte sequence. I am not even saying that behaviour of > this function on invalid inputs may change later. In it's current implementation U8_FWD_1_UNSAFE satisfy our needs safely. = Returned symbol length would never exceed byte_len. static int utf8_char_count(const unsigned char *str, int byte_len) { int symbol_count =3D 0; for (int i =3D 0; i < byte_len;) { U8_FWD_1_UNSAFE(str, i); symbol_count++; } return symbol_count; } I agree that it is a bad idea to relay on lib behaviour which may change lately. So maybe I would just inline these one line macros? Or use my own implementation, since it=E2=80=99s more efficient (but = less beautiful)=