From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 186AA26892 for ; Mon, 11 Feb 2019 08:15:41 -0500 (EST) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id pUUF7W5EIY-h for ; Mon, 11 Feb 2019 08:15:41 -0500 (EST) Received: from smtpng2.m.smailru.net (smtpng2.m.smailru.net [94.100.179.3]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id A764126873 for ; Mon, 11 Feb 2019 08:15:40 -0500 (EST) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 12.2 \(3445.102.3\)) Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0' From: "n.pettik" In-Reply-To: <4F4E0A7E-199C-4647-A49C-DD0E8A216527@tarantool.org> Date: Mon, 11 Feb 2019 16:15:37 +0300 Content-Transfer-Encoding: quoted-printable Message-Id: <8EF5CE57-C6B5-493C-94CC-AA3C88639485@tarantool.org> References: <15e143f4-3ea7-c7d6-d8ac-8a0e20b76449@tarantool.org> <1560FF96-FECD-4368-8AF8-F8F2AE7696E3@tarantool.org> <07DBA796-6DD4-41DD-8438-104FE3AE05BB@tarantool.org> <4F4E0A7E-199C-4647-A49C-DD0E8A216527@tarantool.org> Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: tarantool-patches@freelists.org Cc: Ivan Koptelov > On 7 Feb 2019, at 18:14, i.koptelov = wrote: >> On 5 Feb 2019, at 16:50, n.pettik wrote: >>> On 29/01/2019 19:35, n.pettik wrote: >>>>> Fixes LIKE and LENGTH functions. '\0' now treated as >>>>>=20 >>>> Nit: is treated. >>> Fixed. >>>>> a usual symbol. Strings with '\0' are now processed >>>>> entirely. Consider examples: >>>>>=20 >>>>> LENGTH(CHAR(65,00,65)) =3D=3D 3 >>>>> LIKE(CHAR(65,00,65), CHAR(65,00,66)) =3D=3D False >>>>>=20 >>>> Also, I see that smth wrong with text in this mail again >>> I hope now the mail text is ok. >>=20 >> Not quite. It is still highlighted in some way. Have no idea. > Oh, I am really sorry. Does it look ok now? (I changed the mailing = client)=20 Yes, it does. >>> src/box/sql/func.c | 88 +++++++++++++----- >>> src/box/sql/vdbeInt.h | 2 +- >>> test/sql-tap/func.test.lua | 220 = ++++++++++++++++++++++++++++++++++++++++++++- >>> 3 files changed, 284 insertions(+), 26 deletions(-) >>>=20 >>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c >>> index e46b162d9..2978af983 100644 >>> --- a/src/box/sql/func.c >>> +++ b/src/box/sql/func.c >>> @@ -128,6 +128,30 @@ typeofFunc(sqlite3_context * context, int = NotUsed, sqlite3_value ** argv) >>> sqlite3_result_text(context, z, -1, SQLITE_STATIC); >>> } >>>=20 >>> +/** >>> + * Return number of chars in the given string. >>> + * >>> + * Number of chars !=3D byte size of string because some characters >>> + * are encoded with more than one byte. Also note that all >>> + * characters from 'str' to 'str + byte_len' would be counted, >>> + * even if there is a '\0' somewhere between them. >>> + * @param str String to be counted. >>> + * @param byte_len Byte length of given string. >>> + * @return >>>=20 >> Return what? > Fixed. >>> + */ >>> +static int >>> +count_chars(const unsigned char *str, size_t byte_len) >>>=20 >> Quite poor naming. I would call it utf8_str_len or >> smth with utf8 prefix. Mb it is worth to put it some utils source = file. >> Also, consider using native U8_NEXT function from utf8.c, >> instead of custom SQLITE_SKIP_UTF8. It may be not so fast >> but safer I suppose. I don't insist though. >>> +{ >> What if str is NULL? Add at least an assertion. >>> + int n_chars =3D 0; >>> + const unsigned char *prev_z; >>> + for (size_t cnt =3D 0; cnt < byte_len; cnt +=3D (str - prev_z)) = { >>> + n_chars++; >>> + prev_z =3D str; >>> + SQLITE_SKIP_UTF8(str); >>> + } >>> + return n_chars; >>> +} >>>=20 >> You can rewrite this function in a simpler way without using SQLITE = macroses. >> Read this topic: = https://stackoverflow.com/questions/3911536/utf-8-unicode-whats-with-0xc0-= and-0x80/3911566#3911566 >> It is quite useful. You may borrow implementation from there. > Both usage of U8_NEXT and the solution from stack overflow causes = sql-tap/badutf1 test to fail, > while func.test (and other tests) are OK. In other words, = SQLITE_SKIP_UTF8 and U8_NEXT > proceeds Invalid byte sequences differently. > As far as I understand, the purpose of the test is to check that = malformed UTF-8 would not cause a crash. > So, I just used U8_NEXT and fixed the test. >=20 > (Since the function is used only in func.c, I prefer to keep it there = and don=E2=80=99t move to separate utils file) >=20 > diff --git a/src/box/sql/func.c b/src/box/sql/func.c > index 2978af983..8ad75adb8 100644 > --- a/src/box/sql/func.c > +++ b/src/box/sql/func.c > @@ -129,27 +129,29 @@ typeofFunc(sqlite3_context * context, int = NotUsed, sqlite3_value ** argv) > } >=20 > /** > - * Return number of chars in the given string. > + * Return number of symbols in the given string. > * > - * Number of chars !=3D byte size of string because some characters > + * Number of symbols !=3D byte size of string because some symbols > * are encoded with more than one byte. Also note that all > - * characters from 'str' to 'str + byte_len' would be counted, > + * symbols from 'str' to 'str + byte_len' would be counted, > * even if there is a '\0' somewhere between them. > * @param str String to be counted. > * @param byte_len Byte length of given string. > - * @return > + * @return number of symbols in the given string. > */ > static int > -count_chars(const unsigned char *str, size_t byte_len) > +char_count(const unsigned char *str, size_t byte_len) As I pointed in previous review, use utf8_ prefix: utf8_strlen, utf8_str_char_count or sort of. Why unsigned char? > { > - int n_chars =3D 0; > - const unsigned char *prev_z; > - for (size_t cnt =3D 0; cnt < byte_len; cnt +=3D (str - prev_z)) = { > - n_chars++; > - prev_z =3D str; > - SQLITE_SKIP_UTF8(str); > + int symbol_len =3D 0; Name is confusing: it sounds like a length of a single symbol. symbol_count would be better. > + int offset =3D 0; > + UChar32 res; > + while (offset < (int) byte_len) { > + U8_NEXT(str, offset, (int) byte_len, res) Instead of two explicit casts to int, lets make byte_len be of type int. > + if (res < 0) > + break; Hm, so if a sequence contains non-utf8 symbols, you simply cut the string. Not sure it is solid approach. Could you please check how other DBs behave (do they throw an error? Or ignore invalid symbols?) in similar situation and what standard says. > + symbol_len++; > } > - return n_chars; > + return symbol_len; > } >=20 > diff --git a/test/sql-tap/badutf1.test.lua = b/test/sql-tap/badutf1.test.lua > index 534c762ba..0a90d1b17 100755 > --- a/test/sql-tap/badutf1.test.lua > +++ b/test/sql-tap/badutf1.test.lua > @@ -215,7 +215,7 @@ test:do_test( > return test:execsql2("SELECT length('\x80') AS x") > end, { > -- > - "X", 1 > + "X", 0 Why this and the rest of tests below has been changed? I guess because now they contain invalid utf8 symbols. > -- > }) >=20 > @@ -235,7 +235,7 @@ test:do_test( > return test:execsql2("SELECT length('\x7f\x80\x81') AS x") > end, { > -- > - "X", 3 > + "X", 1 But wait, I execute tarantool> SELECT length('\x7f\x80\x81') AS x --- - - [12] ... Looks extremely strangle. What do these tests check at all? Why we can=E2=80=99t use simple sqlexecute or smth? This test suite is so broken... >>> @@ -388,12 +405,21 @@ substrFunc(sqlite3_context * context, int = argc, sqlite3_value ** argv) >>> } >>> assert(p1 >=3D 0 && p2 >=3D 0); >>> if (p0type !=3D SQLITE_BLOB) { >>> - while (*z && p1) { >>> + /* >>> + * In the code below 'cnt' and 'n_chars' is >>> + * used because '\0' is not supposed to be >>> + * end-of-string symbol. >>> + */ >>> + int n_chars =3D count_chars(z, = sqlite3_value_bytes(argv[0])); >>>=20 >> I=E2=80=99d better call it char_count or symbol_count or char_count. >=20 >>> diff --git a/test/sql-tap/func.test.lua b/test/sql-tap/func.test.lua >>> index b7de1d955..8c712bd5e 100755 >>> --- a/test/sql-tap/func.test.lua >>> +++ b/test/sql-tap/func.test.lua >>> +-- REPLACE >>> +test:do_execsql_test( >>> + "func-62", >>> + "SELECT REPLACE(CHAR(00,65,00,65), CHAR(00), CHAR(65)) LIKE = 'AAAA';", >>> + {1}) >>> + >>> +test:do_execsql_test( >>> + "func-63", >>> + "SELECT REPLACE(CHAR(00,65,00,65), CHAR(65), CHAR(00)) \ >>> + LIKE CHAR(00,00,00,00);", >>> + {1}) >>> + >>> +-- SUBSTR >>> +test:do_execsql_test( >>> + "func-64", >>> + "SELECT SUBSTR(CHAR(65,00,66,67), 3, 2) LIKE CHAR(66, 67);", >>> + {1}) >>> + >>> +test:do_execsql_test( >>> + "func-65", >>> + "SELECT SUBSTR(CHAR(00,00,00,65), 1, 4) LIKE = CHAR(00,00,00,65);", >>> + {1}) >>> + >>>=20 >> Just wondering: why do you use LIKE function almost in all tests? >>=20 > To compare actual result with expected one. But now I think It would = be more readable like this: Moreover, in case LIKE fails for some reason, these tests will fail as well. Meanwhile, their purpose to test not LIKE function. Glad you fixed that.