From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id F009C26AC3 for ; Mon, 25 Feb 2019 06:09:08 -0500 (EST) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id EHtmDKFSxoCK for ; Mon, 25 Feb 2019 06:09:08 -0500 (EST) Received: from smtp48.i.mail.ru (smtp48.i.mail.ru [94.100.177.108]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 4990723E26 for ; Mon, 25 Feb 2019 06:09:08 -0500 (EST) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 12.0 \(3445.100.39\)) Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0' From: "i.koptelov" In-Reply-To: <3377BC01-D943-4D24-A9A2-BA9B9C67EA92@tarantool.org> Date: Mon, 25 Feb 2019 14:09:05 +0300 Content-Transfer-Encoding: quoted-printable Message-Id: References: <15e143f4-3ea7-c7d6-d8ac-8a0e20b76449@tarantool.org> <1560FF96-FECD-4368-8AF8-F8F2AE7696E3@tarantool.org> <07DBA796-6DD4-41DD-8438-104FE3AE05BB@tarantool.org> <4F4E0A7E-199C-4647-A49C-DD0E8A216527@tarantool.org> <8EF5CE57-C6B5-493C-94CC-AA3C88639485@tarantool.org> <7E6CE8AA-512D-4472-9DBD-8159073386C5@tarantool.org> <25649276-74CD-46E7-A1EB-F4CE299E637C@tarantool.org> <427EE913-3E58-413F-A645-DBF83C809334@tarantool.org> <583EC402-D1FF-45C4-B18B-8A06D4362200@tarantool.org> <3377BC01-D943-4D24-A9A2-BA9B9C67EA92@tarantool.org> Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: tarantool-patches@freelists.org Cc: "n.pettik" > On 22 Feb 2019, at 15:59, n.pettik wrote: >=20 >=20 >=20 >> On 20 Feb 2019, at 22:24, i.koptelov = wrote: >>=20 >>>>=20 >>>> On 20 Feb 2019, at 18:47, i.koptelov = wrote: >>>>=20 >>>> Thanks to Alexander, I fixed my patch to use a function >>>> from icu to count the length of the string. >>>>=20 >>>> Changes: >=20 > Travis has failed. Please, make sure it is OK before sending the = patch. > It doesn=E2=80=99t fail on my local (Mac) machine, so I guess this = fail appears > only on Linux system. The problem is with badutf test (LENGTH tests). I=E2=80=99ve tried to reproduce the problem on my machine (using Docker = with Ubuntu), but with no success. It seems like that different versions of icu4c lib provide different behavior of U8_FWD_1_UNSAFE. I propose to just inline these two lines (which we need) into some util function. Logic of these lines seems to be quite simple and obvious (after you read about utf8 on wikipedia), so I see no problem. #define U8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \ = (((uint8_t)(leadByte)>=3D0xc2)+((uint8_t)(leadByte)>=3D0xe0)+((uint8_t)(le= adByte)>=3D0xf0)) #define U8_FWD_1_UNSAFE(s, i) { \ (i)+=3D1+U8_COUNT_TRAIL_BYTES_UNSAFE((s)[i]); \ } >=20 >>> Furthermore, description says that it =E2=80=9Cassumes well-formed = UTF-8=E2=80=9D, >>> which in our case is not true. So who knows what may happen if we = pass >>> malformed byte sequence. I am not even saying that behaviour of >>> this function on invalid inputs may change later. >>=20 >> In it's current implementation U8_FWD_1_UNSAFE satisfy our needs = safely. Returned >> symbol length would never exceed byte_len. >>=20 >> static int >> utf8_char_count(const unsigned char *str, int byte_len) >> { >> int symbol_count =3D 0; >> for (int i =3D 0; i < byte_len;) { >> U8_FWD_1_UNSAFE(str, i); >> symbol_count++; >> } >> return symbol_count; >> } >>=20 >> I agree that it is a bad idea to relay on lib behaviour which may >> change lately. So maybe I would just inline these one line macros? >> Or use my own implementation, since it=E2=80=99s more efficient (but = less beautiful) >=20 > Nevermind, let's keep it as is. > I really worry only about the fact that in other places SQL_SKIP_UTF8 > is used instead. It handles only two-bytes utf8 symbols, meanwhile > U8_FWD_1_UNSAFE() accounts three and four bytes length symbols. > Can we use everywhere the same pattern? Yes, I think, we can.