From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tarantool-patches-bounce@freelists.org>
Received: from localhost (localhost [127.0.0.1])
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id F1A4020449
	for <tarantool-patches@freelists.org>; Wed, 20 Feb 2019 11:04:49 -0500 (EST)
Received: from turing.freelists.org ([127.0.0.1])
	by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id HCTS9W0YUVCE for <tarantool-patches@freelists.org>;
	Wed, 20 Feb 2019 11:04:49 -0500 (EST)
Received: from smtpng3.m.smailru.net (smtpng3.m.smailru.net [94.100.177.149])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 61520203EB
	for <tarantool-patches@freelists.org>; Wed, 20 Feb 2019 11:04:49 -0500 (EST)
Content-Type: text/plain;
	charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 12.2 \(3445.102.3\))
Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
From: "n.pettik" <korablev@tarantool.org>
In-Reply-To: <427EE913-3E58-413F-A645-DBF83C809334@tarantool.org>
Date: Wed, 20 Feb 2019 19:04:46 +0300
Content-Transfer-Encoding: quoted-printable
Message-Id: <DD522CAF-BD70-4E66-B8A4-C1837370B81D@tarantool.org>
References: <15e143f4-3ea7-c7d6-d8ac-8a0e20b76449@tarantool.org>
 <1560FF96-FECD-4368-8AF8-F8F2AE7696E3@tarantool.org>
 <fd70561d-6a0e-70dc-4c20-bdcac764040a@tarantool.org>
 <07DBA796-6DD4-41DD-8438-104FE3AE05BB@tarantool.org>
 <4F4E0A7E-199C-4647-A49C-DD0E8A216527@tarantool.org>
 <8EF5CE57-C6B5-493C-94CC-AA3C88639485@tarantool.org>
 <7E6CE8AA-512D-4472-9DBD-8159073386C5@tarantool.org>
 <FC2CA838-42C3-4494-BC6C-37C864D4BF79@tarantool.org>
 <25649276-74CD-46E7-A1EB-F4CE299E637C@tarantool.org>
 <427EE913-3E58-413F-A645-DBF83C809334@tarantool.org>
Sender: tarantool-patches-bounce@freelists.org
Errors-to: tarantool-patches-bounce@freelists.org
Reply-To: tarantool-patches@freelists.org
List-help: <mailto:ecartis@freelists.org?Subject=help>
List-unsubscribe: <tarantool-patches-request@freelists.org?Subject=unsubscribe>
List-software: Ecartis version 1.0.0
List-Id: tarantool-patches <tarantool-patches.freelists.org>
List-subscribe: <tarantool-patches-request@freelists.org?Subject=subscribe>
List-owner: <mailto:>
List-post: <mailto:tarantool-patches@freelists.org>
List-archive: <http://www.freelists.org/archives/tarantool-patches>
To: tarantool-patches@freelists.org
Cc: Ivan Koptelov <ivan.koptelov@tarantool.org>


> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>
> Thanks to Alexander, I fixed my patch to use a function
> from icu to count the length of the string.
>
> Changes:
>

Look, each new implementation keeps changing the results of
certain tests. Let's first define the exact behaviour of the
length() function and then write a function that satisfies these
requirements, not vice versa. Is this the final version?
Moreover, since Konstantin suggests making the implementation as
fast as we can, I propose considering an asm-written variant:

        .global ap_strlen_utf8_s
ap_strlen_utf8_s:
        push %esi
        cld
        mov 8(%esp), %esi
        xor %ecx, %ecx
loopa:  dec %ecx
loopb:  lodsb
        shl $1, %al
        js loopa
        jc loopb
        jnz loopa
        mov %ecx, %eax
        not %eax
        pop %esi
        ret


It is taken from http://canonical.org/~kragen/strlen-utf8
and the author claims that it is quite fast (it seems it doesn’t
handle \0, since it stops at the first zero byte, but we can
patch it). I didn’t bench it, so I am not absolutely sure that
it is ‘way faster’ than other implementations.
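
For reference, the asm above counts every byte that is not a UTF-8
continuation byte (10xxxxxx) until it meets a zero byte. Below is a
minimal C sketch of the same counting rule, bounded by an explicit
byte length instead of a terminator, so an embedded \0 is counted as
one character. The function name is hypothetical, and this is not
benchmarked against the asm variant or the ICU-based one.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count UTF-8 code points in the first byte_len bytes of str.
 * Every byte that is not a continuation byte (10xxxxxx) is taken
 * as the start of a new code point. The loop is bounded by
 * byte_len, not by a terminator, so '\0' is counted like any
 * other one-byte character.
 */
static size_t
utf8_count_non_continuation(const unsigned char *str, size_t byte_len)
{
	assert(str != NULL || byte_len == 0);
	size_t count = 0;
	for (size_t i = 0; i < byte_len; i++) {
		if ((str[i] & 0xc0) != 0x80)
			count++;
	}
	return count;
}
```

Note that this rule ignores stray continuation bytes entirely, so
its results on malformed input differ from a lead-byte-driven loop.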

> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> index 233ea2901..8ddb9780f 100644
> --- a/src/box/sql/func.c
> +++ b/src/box/sql/func.c
> @@ -149,16 +149,7 @@ utf8_char_count(const unsigned char *str, int byte_len)
> {
> 	int symbol_count = 0;
> 	for (int i = 0; i < byte_len;) {
> -		if ((str[i] & 0x80) == 0)
> -			i += 1;
> -		else if ((str[i] & 0xe0) == 0xc0)
> -			i += 2;
> -		else if ((str[i] & 0xf0) == 0xe0)
> -			i += 3;
> -		else if ((str[i] & 0xf8) == 0xf0)
> -			i += 4;
> -		else
> -			i += 1;
> +		U8_FWD_1_UNSAFE(str, i);

This function does not handle strings in the way we’ve discussed.
Furthermore, its description says that it “assumes well-formed
UTF-8”, which is not true in our case. So who knows what may happen
if we pass a malformed byte sequence. Not to mention that the
behaviour of this function on invalid inputs may change later.
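
To make the concern concrete, here is a rough sketch of a counter
whose behaviour on malformed input is fixed by our own code rather
than by a library macro. The policy for broken sequences (count the
lead byte as one symbol and move on by one byte) is only an
assumption for illustration, not necessarily what we agreed on, and
both function names are hypothetical. It does not reject overlong
encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: sequence length announced by the lead
 * byte, or 1 for bytes that cannot start a sequence.
 */
static size_t
utf8_expected_len(unsigned char lead)
{
	if ((lead & 0x80) == 0)
		return 1;
	if ((lead & 0xe0) == 0xc0)
		return 2;
	if ((lead & 0xf0) == 0xe0)
		return 3;
	if ((lead & 0xf8) == 0xf0)
		return 4;
	return 1;
}

/*
 * Count symbols in the first byte_len bytes of str without ever
 * reading past byte_len, whatever the input looks like.
 */
static size_t
utf8_char_count_checked(const unsigned char *str, size_t byte_len)
{
	assert(str != NULL || byte_len == 0);
	size_t count = 0;
	size_t i = 0;
	while (i < byte_len) {
		size_t len = utf8_expected_len(str[i]);
		size_t j = 1;
		/* Accept only in-bounds continuation bytes. */
		while (j < len && i + j < byte_len &&
		       (str[i + j] & 0xc0) == 0x80)
			j++;
		/*
		 * Assumed policy: a truncated or broken sequence
		 * counts its lead byte as a single symbol.
		 */
		i += (j == len) ? len : 1;
		count++;
	}
	return count;
}
```

With this policy a truncated three-byte sequence such as E2 82
yields 2 symbols, and the result cannot change unless we change
the code ourselves.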