From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id F1A4020449 for ; Wed, 20 Feb 2019 11:04:49 -0500 (EST) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id HCTS9W0YUVCE for ; Wed, 20 Feb 2019 11:04:49 -0500 (EST) Received: from smtpng3.m.smailru.net (smtpng3.m.smailru.net [94.100.177.149]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 61520203EB for ; Wed, 20 Feb 2019 11:04:49 -0500 (EST) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 12.2 \(3445.102.3\)) Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0' From: "n.pettik" In-Reply-To: <427EE913-3E58-413F-A645-DBF83C809334@tarantool.org> Date: Wed, 20 Feb 2019 19:04:46 +0300 Content-Transfer-Encoding: quoted-printable Message-Id: References: <15e143f4-3ea7-c7d6-d8ac-8a0e20b76449@tarantool.org> <1560FF96-FECD-4368-8AF8-F8F2AE7696E3@tarantool.org> <07DBA796-6DD4-41DD-8438-104FE3AE05BB@tarantool.org> <4F4E0A7E-199C-4647-A49C-DD0E8A216527@tarantool.org> <8EF5CE57-C6B5-493C-94CC-AA3C88639485@tarantool.org> <7E6CE8AA-512D-4472-9DBD-8159073386C5@tarantool.org> <25649276-74CD-46E7-A1EB-F4CE299E637C@tarantool.org> <427EE913-3E58-413F-A645-DBF83C809334@tarantool.org> Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: tarantool-patches@freelists.org Cc: Ivan Koptelov > On 20 Feb 2019, at 18:47, i.koptelov = wrote: >=20 > Thanks to Alexander, I fixed my patch to use a function > from icu to count the length of the string. >=20 > Changes: >=20 Look, each next implementation again and again changes results of certain tests. Lets firstly define exact behaviour of length() function and then write function which will satisfy these requirements, not vice versa. Is this the final version? Moreover, since Konstantin suggest as fast implementation as we can, I propose to consider sort of asm written variant: .global ap_strlen_utf8_s ap_strlen_utf8_s: push %esi cld mov 8(%esp), %esi xor %ecx, %ecx loopa: dec %ecx loopb: lodsb shl $1, %al js loopa jc loopb jnz loopa mov %ecx, %eax not %eax pop %esi ret It is taken from http://canonical.org/~kragen/strlen-utf8 and author claims that quite fast (seems like it doesn=E2=80=99t handle \0, but we can patch it). I didn=E2=80=99t bench it, so I am not absolutely sure that it =E2=80=98way faster=E2=80=99 than other = implementations. > diff --git a/src/box/sql/func.c b/src/box/sql/func.c > index 233ea2901..8ddb9780f 100644 > --- a/src/box/sql/func.c > +++ b/src/box/sql/func.c > @@ -149,16 +149,7 @@ utf8_char_count(const unsigned char *str, int = byte_len) > { > int symbol_count =3D 0; > for (int i =3D 0; i < byte_len;) { > - if ((str[i] & 0x80) =3D=3D 0) > - i +=3D 1; > - else if ((str[i] & 0xe0) =3D=3D 0xc0) > - i +=3D 2; > - else if ((str[i] & 0xf0) =3D=3D 0xe0) > - i +=3D 3; > - else if ((str[i] & 0xf8) =3D=3D 0xf0) > - i +=3D 4; > - else > - i +=3D 1; > + U8_FWD_1_UNSAFE(str, i); This function handles string not in the way we=E2=80=99ve discussed. Furthermore, description says that it =E2=80=9Cassumes well-formed = UTF-8=E2=80=9D, which in our case is not true. So who knows what may happen if we pass malformed byte sequence. I am not even saying that behaviour of this function on invalid inputs may change later.