From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 22D6724D8C for ; Mon, 25 Feb 2019 10:10:30 -0500 (EST) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id yJGXEaanoGSH for ; Mon, 25 Feb 2019 10:10:30 -0500 (EST) Received: from smtpng3.m.smailru.net (smtpng3.m.smailru.net [94.100.177.149]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id A752824543 for ; Mon, 25 Feb 2019 10:10:29 -0500 (EST) From: "n.pettik" Message-Id: Content-Type: multipart/alternative; boundary="Apple-Mail=_C65209DF-9E8F-43AA-BE0A-2A041EEC4284" Mime-Version: 1.0 (Mac OS X Mail 12.2 \(3445.102.3\)) Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0' Date: Mon, 25 Feb 2019 18:10:27 +0300 In-Reply-To: References: <15e143f4-3ea7-c7d6-d8ac-8a0e20b76449@tarantool.org> <1560FF96-FECD-4368-8AF8-F8F2AE7696E3@tarantool.org> <07DBA796-6DD4-41DD-8438-104FE3AE05BB@tarantool.org> <4F4E0A7E-199C-4647-A49C-DD0E8A216527@tarantool.org> <8EF5CE57-C6B5-493C-94CC-AA3C88639485@tarantool.org> <7E6CE8AA-512D-4472-9DBD-8159073386C5@tarantool.org> <25649276-74CD-46E7-A1EB-F4CE299E637C@tarantool.org> <427EE913-3E58-413F-A645-DBF83C809334@tarantool.org> <583EC402-D1FF-45C4-B18B-8A06D4362200@tarantool.org> <3377BC01-D943-4D24-A9A2-BA9B9C67EA92@tarantool.org> Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: tarantool-patches@freelists.org Cc: Ivan Koptelov --Apple-Mail=_C65209DF-9E8F-43AA-BE0A-2A041EEC4284 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 > On 25 Feb 2019, at 14:09, i.koptelov = wrote: >> On 22 Feb 2019, at 15:59, n.pettik wrote: >>> On 20 Feb 2019, at 22:24, i.koptelov = wrote: >>>>>=20 >>>>> On 20 Feb 2019, at 18:47, i.koptelov = wrote: >>>>>=20 >>>>> Thanks to Alexander, I fixed my patch to use a function >>>>> from icu to count the length of the string. >>>>>=20 >>>>> Changes: >>=20 >> Travis has failed. Please, make sure it is OK before sending the = patch. >> It doesn=E2=80=99t fail on my local (Mac) machine, so I guess this = fail appears >> only on Linux system. > The problem is with badutf test (LENGTH tests). > I=E2=80=99ve tried to reproduce the problem on my machine (using = Docker with Ubuntu), > but with no success. It seems like that different versions of icu4c = lib > provide different behavior of U8_FWD_1_UNSAFE. > I propose to just inline these two lines (which we need) into > some util function. Logic of these lines seems to be quite simple > and obvious (after you read about utf8 on wikipedia), so I see no > problem. >=20 > #define U8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \ > = (((uint8_t)(leadByte)>=3D0xc2)+((uint8_t)(leadByte)>=3D0xe0)+((uint8_t)(le= adByte)>=3D0xf0)) >=20 > #define U8_FWD_1_UNSAFE(s, i) { \ > (i)+=3D1+U8_COUNT_TRAIL_BYTES_UNSAFE((s)[i]); \ > } That=E2=80=99s I was talking about. But using the macros with the same name as in utf library doesn=E2=80=99t look like a good pattern. Yep, = you can use define guards like: #ifdef U8_COUNT_TRAIL_BYTES_UNSAFE #undef U8_COUNT_TRAIL_BYTES_UNSAFE #endif #define U8_COUNT_TRAIL_BYTES_UNSAFE But I=E2=80=99d rather just give it another name. Hence, taking into account comment below, we are going to substitute SQL_SKIP_UTF8() with implementation borrowed from icu library. >>>> Furthermore, description says that it =E2=80=9Cassumes well-formed = UTF-8=E2=80=9D, >>>> which in our case is not true. So who knows what may happen if we = pass >>>> malformed byte sequence. I am not even saying that behaviour of >>>> this function on invalid inputs may change later. >>>=20 >>> In it's current implementation U8_FWD_1_UNSAFE satisfy our needs = safely. Returned >>> symbol length would never exceed byte_len. >>>=20 >>> static int >>> utf8_char_count(const unsigned char *str, int byte_len) >>> { >>> int symbol_count =3D 0; >>> for (int i =3D 0; i < byte_len;) { >>> U8_FWD_1_UNSAFE(str, i); >>> symbol_count++; >>> } >>> return symbol_count; >>> } >>>=20 >>> I agree that it is a bad idea to relay on lib behaviour which may >>> change lately. So maybe I would just inline these one line macros? >>> Or use my own implementation, since it=E2=80=99s more efficient (but = less beautiful) >>=20 >> Nevermind, let's keep it as is. >> I really worry only about the fact that in other places SQL_SKIP_UTF8 >> is used instead. It handles only two-bytes utf8 symbols, meanwhile >> U8_FWD_1_UNSAFE() accounts three and four bytes length symbols. >> Can we use everywhere the same pattern? > Yes, I think, we can. Ok, then will be waiting for updates. --Apple-Mail=_C65209DF-9E8F-43AA-BE0A-2A041EEC4284 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=utf-8

On 25 Feb 2019, at 14:09, i.koptelov <ivan.koptelov@tarantool.org> wrote:
On 22 = Feb 2019, at 15:59, n.pettik <korablev@tarantool.org> wrote:
On 20 Feb 2019, at = 22:24, i.koptelov <ivan.koptelov@tarantool.org> wrote:

On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov@tarantool.org> wrote:

Thanks to Alexander, I fixed my patch to use a function
from icu to count the length of the string.

Changes:

Travis = has failed. Please, make sure it is OK before sending the patch.
It doesn=E2=80=99t fail on my local (Mac) machine, so I guess = this fail appears
only on Linux system.
The problem is with badutf test (LENGTH tests).
I=E2=80=99ve tried to reproduce = the problem on my machine (using Docker with Ubuntu),
but with no success. It seems = like that different versions of icu4c lib
provide different behavior of U8_FWD_1_UNSAFE.
I propose to just inline these = two lines (which we need) into
some util function. Logic of these lines seems to be quite = simple
and obvious = (after you read about utf8 on wikipedia), so I see no
problem.

#define = U8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \
  (((uint8_t)(leadByte)>=3D0xc2)+((uint8_t)(leadBy= te)>=3D0xe0)+((uint8_t)(leadByte)>=3D0xf0))

#define U8_FWD_1_UNSAFE(s, i) { = \
  (i)+=3D1+U8_COUNT_TRAIL_BYTES_UNSAFE((s)[i]); = \
}

That=E2=80=99= s I was talking about. But using the macros with the same
name = as in utf library doesn=E2=80=99t look like a good pattern. Yep, = you
can use define guards like:

#ifdef =  U8_COUNT_TRAIL_BYTES_UNSAFE
#undef = U8_COUNT_TRAIL_BYTES_UNSAFE
#endif
#define = U8_COUNT_TRAIL_BYTES_UNSAFE

But = I=E2=80=99d rather just give it another name.
Hence, taking = into account comment below,
we are going to substitute = SQL_SKIP_UTF8() with
implementation borrowed from icu = library.

Furthermore, description says that it =E2=80=9Cassumes = well-formed UTF-8=E2=80=9D,
which in our case is not true. = So who knows what may happen if we pass
malformed byte = sequence. I am not even saying that behaviour of
this = function on invalid inputs may change later.

In it's current implementation = U8_FWD_1_UNSAFE satisfy our needs safely. Returned
symbol = length would never exceed byte_len.

static = int
utf8_char_count(const unsigned char *str, int = byte_len)
{
int symbol_count =3D 0;
= for (int i =3D 0; i < byte_len;) {
= U8_FWD_1_UNSAFE(str, i);
= symbol_count++;
}
return = symbol_count;
}

I agree that = it is a bad idea to relay on lib behaviour which may
change = lately. So maybe I would just inline these one line macros?
Or use my own implementation, since it=E2=80=99s more = efficient (but less beautiful)

Nevermind, let's keep it as is.
I really worry = only about the fact that in other places SQL_SKIP_UTF8
is = used instead. It handles only two-bytes utf8 symbols, meanwhile
U8_FWD_1_UNSAFE() accounts three and four bytes length = symbols.
Can we use everywhere the same pattern?
Yes, I think, we can.

Ok, then will be waiting for updates.

= --Apple-Mail=_C65209DF-9E8F-43AA-BE0A-2A041EEC4284--