From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 266792623D for ; Fri, 22 Feb 2019 07:59:38 -0500 (EST) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id M282QZJx7KT8 for ; Fri, 22 Feb 2019 07:59:38 -0500 (EST) Received: from smtpng1.m.smailru.net (smtpng1.m.smailru.net [94.100.181.251]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 4879C26210 for ; Fri, 22 Feb 2019 07:59:37 -0500 (EST) From: "n.pettik" Message-Id: <3377BC01-D943-4D24-A9A2-BA9B9C67EA92@tarantool.org> Content-Type: multipart/alternative; boundary="Apple-Mail=_FA0A9AFA-4649-4B63-9F15-DEEC87BA3284" Mime-Version: 1.0 (Mac OS X Mail 12.2 \(3445.102.3\)) Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0' Date: Fri, 22 Feb 2019 15:59:35 +0300 In-Reply-To: <583EC402-D1FF-45C4-B18B-8A06D4362200@tarantool.org> References: <15e143f4-3ea7-c7d6-d8ac-8a0e20b76449@tarantool.org> <1560FF96-FECD-4368-8AF8-F8F2AE7696E3@tarantool.org> <07DBA796-6DD4-41DD-8438-104FE3AE05BB@tarantool.org> <4F4E0A7E-199C-4647-A49C-DD0E8A216527@tarantool.org> <8EF5CE57-C6B5-493C-94CC-AA3C88639485@tarantool.org> <7E6CE8AA-512D-4472-9DBD-8159073386C5@tarantool.org> <25649276-74CD-46E7-A1EB-F4CE299E637C@tarantool.org> <427EE913-3E58-413F-A645-DBF83C809334@tarantool.org> <583EC402-D1FF-45C4-B18B-8A06D4362200@tarantool.org> Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: tarantool-patches@freelists.org Cc: Ivan Koptelov 

> On 20 Feb 2019, at 22:24, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>
>>> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>>>
>>> Thanks to Alexander, I fixed my patch to use a function
>>> from ICU to count the length of the string.
>>>
>>> Changes:

Travis has failed. Please make sure it passes before sending the patch.
It doesn’t fail on my local (Mac) machine, so I guess this failure appears
only on Linux.

>> Furthermore, the description says that it “assumes well-formed UTF-8”,
>> which is not true in our case. So who knows what may happen if we pass
>> a malformed byte sequence. Not to mention that the behaviour of
>> this function on invalid inputs may change later.

> In its current implementation, U8_FWD_1_UNSAFE satisfies our needs safely: the returned
> symbol count never exceeds byte_len.

> static int
> utf8_char_count(const unsigned char *str, int byte_len)
> {
> 	int symbol_count = 0;
> 	for (int i = 0; i < byte_len;) {
> 		U8_FWD_1_UNSAFE(str, i);
> 		symbol_count++;
> 	}
> 	return symbol_count;
> }
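
For the malformed-input concern raised above, one option is a counter that clamps every step to the remaining byte_len. The following is a minimal sketch, not ICU's implementation: utf8_seq_len() and utf8_char_count_bounded() are hypothetical names, and the sequence length is guessed from the lead byte alone, without validating continuation bytes.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not an ICU function): length of a UTF-8
 * sequence judged by its lead byte only. Continuation bytes are
 * not validated.
 */
static int
utf8_seq_len(unsigned char lead)
{
	if (lead < 0xC0)
		return 1;	/* ASCII, '\0', or a stray continuation byte */
	if (lead < 0xE0)
		return 2;
	if (lead < 0xF0)
		return 3;
	return 4;
}

/*
 * Hypothetical bounded counter: counts symbols in str[0..byte_len)
 * and never advances the index past byte_len.
 */
static int
utf8_char_count_bounded(const unsigned char *str, int byte_len)
{
	int symbol_count = 0;
	int i = 0;
	while (i < byte_len) {
		int step = utf8_seq_len(str[i]);
		if (step > byte_len - i)
			step = byte_len - i;	/* truncated tail: one symbol */
		i += step;
		symbol_count++;
	}
	return symbol_count;
}
```

With this clamp, a truncated tail (for example, the first two bytes of a four-byte symbol) is counted as one symbol, and an embedded '\0' is counted like any other one-byte symbol.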

> I agree that it is a bad idea to rely on library behaviour which may
> change later. So maybe I should just inline these one-line macros?
> Or use my own implementation, since it’s more efficient (but less beautiful).

Never mind, let's keep it as is.
My only real concern is that SQL_SKIP_UTF8 is used in other places
instead. It handles only two-byte UTF-8 symbols, whereas
U8_FWD_1_UNSAFE() also accounts for three- and four-byte symbols.
Can we use the same pattern everywhere?
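
To make the mismatch concrete, here is a small sketch that assumes the behaviour described above: one skipper that treats every non-ASCII lead byte as the start of a two-byte symbol, and one that derives the length from the lead byte. Both functions are hypothetical stand-ins written for illustration; they are not the actual definitions of SQL_SKIP_UTF8 or U8_FWD_1_UNSAFE.

```c
/*
 * Hypothetical stand-in: every lead byte >= 0xC0 is assumed to
 * start a two-byte symbol. Returns the index of the next symbol.
 */
static int
skip_two_byte_only(const unsigned char *str, int i)
{
	return i + (str[i] >= 0xC0 ? 2 : 1);
}

/*
 * Hypothetical stand-in: the sequence length (1 to 4 bytes) is
 * derived from the lead byte. Returns the index of the next symbol.
 */
static int
skip_by_lead_byte(const unsigned char *str, int i)
{
	unsigned char lead = str[i];
	if (lead < 0xC0)
		return i + 1;
	if (lead < 0xE0)
		return i + 2;
	if (lead < 0xF0)
		return i + 3;
	return i + 4;
}

typedef int (*skip_fn)(const unsigned char *str, int i);

/* Count symbols in str[0..byte_len) using the given skipper. */
static int
count_with(skip_fn skip, const unsigned char *str, int byte_len)
{
	int count = 0;
	for (int i = 0; i < byte_len; i = skip(str, i))
		count++;
	return count;
}
```

For the three-byte symbol "€" (bytes E2 82 AC), count_with() reports two symbols with the first skipper and one with the second, which is the kind of inconsistency that mixing the two patterns could introduce.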