Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
From: i.koptelov
Date: Wed, 20 Feb 2019 16:54:29 +0300
Message-Id: <25649276-74CD-46E7-A1EB-F4CE299E637C@tarantool.org>
To: tarantool-patches@freelists.org
Cc: "n.pettik"

> On 14 Feb 2019, at 15:57, n.pettik wrote:
>
> I see some strange file on your branch:
>
> https://github.com/tarantool/tarantool/commit/0a34e978c8023dcb4f99965d93203a99adf21520
>
> I guess it is an artefact from your investigation.
> If you want to share some results/proposals, just
> attach them to this letter or write to the dev list.

Sorry, I committed it by mistake. It is removed now.

>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>> index 8ad75adb8..bab102988 100644
>> --- a/src/box/sql/func.c
>> +++ b/src/box/sql/func.c
>> @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
>>  * @return number of symbols in the given string.
>>  */
>> static int
>> -char_count(const unsigned char *str, size_t byte_len)
>> +utf8_char_count(const unsigned char *str, int byte_len)
>> {
>>>
>>> Why unsigned char?
>>
>> I used unsigned char because sqlite3_value_text() returns a result
>> of this type.
>
> Ok, let it be.
>
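For context, the function under discussion (quoted in full below) counts symbols with ICU's U8_NEXT and stops at the first invalid byte. A minimal standalone sketch of that approach — purely illustrative, assuming ICU4C's unicode/utf8.h is available; the helper name is mine, not from the patch:

/*
 * Sketch: count symbols the same way the U8_NEXT-based
 * utf8_char_count() quoted below does. It stops at the first
 * invalid byte, so a broken sequence truncates the count --
 * exactly the behaviour questioned further down in this thread.
 */
#include <stdio.h>
#include <unicode/utf8.h>

static int
count_with_u8_next(const unsigned char *str, int byte_len)
{
	int symbol_count = 0;
	int offset = 0;
	UChar32 res;
	while (offset < byte_len) {
		U8_NEXT(str, offset, byte_len, res);
		if (res < 0)
			break;
		symbol_count++;
	}
	return symbol_count;
}

int
main(void)
{
	const unsigned char ok[] = { 0x61, 0xc3, 0xa9 }; /* "a" + U+00E9 */
	const unsigned char bad[] = { 0x61, 0x80 };      /* 0x80 is invalid */
	printf("%d %d\n", count_with_u8_next(ok, 3),
	       count_with_u8_next(bad, 2));              /* prints "2 1" */
	return 0;
}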
>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>> index 8ad75adb8..bab102988 100644
>> --- a/src/box/sql/func.c
>> +++ b/src/box/sql/func.c
>> @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
>>  * @return number of symbols in the given string.
>>  */
>> static int
>> -char_count(const unsigned char *str, size_t byte_len)
>> +utf8_char_count(const unsigned char *str, int byte_len)
>> {
>> -	int symbol_len = 0;
>> +	int symbol_count = 0;
>> 	int offset = 0;
>> 	UChar32 res;
>> -	while (offset < (int) byte_len) {
>> -		U8_NEXT(str, offset, (int) byte_len, res)
>> +	while (offset < byte_len) {
>> +		U8_NEXT(str, offset, byte_len, res)
>> 		if (res < 0)
>> 			break;
>> -		symbol_len++;
>> +		symbol_count++;
>> 	}
>> -	return symbol_len;
>> +	return symbol_count;
>> }
>>
>>>
>>>> +		if (res < 0)
>>>> +			break;
>>>
>>> Hm, so if a sequence contains non-utf8 symbols,
>>> you simply cut the string. Not sure it is a solid approach.
>>> Could you please check how other DBs behave (do they throw
>>> an error? Or ignore invalid symbols?) in a similar situation
>>> and what the standard says.
>>
>> I have tested statements with LENGTH from the badutf1 test in different
>> DBs. PostgreSQL raised an error: invalid byte sequence for encoding "UTF8".
>> MySQL, DB2 and MSSQL all behaved the same way: each invalid byte is
>> counted as a symbol. For example:
>>
>> 0x80, 0x7f, 0x81 are all invalid first bytes from the
>> UTF8 point of view, 0xC03F is a bad two-byte sequence where
>> the first byte is ok and the second is broken, and 0xE0800F is
>> a bad three-byte sequence where the first two bytes are ok but
>> the third is broken.
>>
>> (these are MySQL statements)
>>
>> select length(X'80');                                   1
>> select length(concat(X'7f', X'80', X'81'));             3
>> select length(concat(X'61', X'c0'));                    2
>> select length(concat(X'61', X'c0', X'80', X'80', X'80',
>>        X'80', X'80', X'80', X'80', X'80', X'80', X'80')); 12
>>
>> select length(X'C03F');                                 2
>> select length(concat(X'C03F', 'a'));                    3
>> select length(X'E0800F');                               3
>> select length(concat(X'E0800F', 'a'));                  4
>>
>> I have not found anything in the standard about dealing with
>> invalid UTF8 sequences.
>>
>> Even before the patch, the test gave results different from those
>> of all the other DBs.
>>
>> I propose to behave as all the other DBs do, as described above.
>> Is it ok?
>
> PostgreSQL makes me feel uncertain concerning this question.
> Anyway, I still believe that it is better to raise an error.
> I propose you ask other members of the core team for their opinion.
> Additionally, you could ask the author of the original issue and
> our expert P. Gulutzan.

After discussions, we decided to use the following algorithm:
starting from the first byte in the string, we determine what kind of
byte it is: the first byte of a 1, 2, 3 or 4 byte sequence. Then we
skip the corresponding number of bytes and increment the symbol count
(e.g. 2 bytes for a 2 byte sequence). If the current byte is not a
valid first byte of any sequence, we skip it and increment the symbol
count anyway.

diff --git a/src/box/sql/func.c b/src/box/sql/func.c
index bab102988..8406eb42f 100644
--- a/src/box/sql/func.c
+++ b/src/box/sql/func.c
@@ -135,6 +135,11 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
  * are encoded with more than one byte. Also note that all
  * symbols from 'str' to 'str + byte_len' would be counted,
  * even if there is a '\0' somewhere between them.
+ *
+ * This function is implemented to be fast and indifferent to
+ * the correctness of the string being processed. If the input
+ * string contains even one invalid utf-8 sequence, the resulting
+ * length could be arbitrary within these boundaries (0 < len < byte_len).
  * @param str String to be counted.
  * @param byte_len Byte length of given string.
  * @return number of symbols in the given string.
@@ -143,12 +148,17 @@ static int
 utf8_char_count(const unsigned char *str, int byte_len)
 {
 	int symbol_count = 0;
-	int offset = 0;
-	UChar32 res;
-	while (offset < byte_len) {
-		U8_NEXT(str, offset, byte_len, res)
-		if (res < 0)
-			break;
+	for (int i = 0; i < byte_len;) {
+		if ((str[i] & 0x80) == 0)
+			i += 1;
+		else if ((str[i] & 0xe0) == 0xc0)
+			i += 2;
+		else if ((str[i] & 0xf0) == 0xe0)
+			i += 3;
+		else if ((str[i] & 0xf8) == 0xf0)
+			i += 4;
+		else
+			i += 1;
 		symbol_count++;
 	}
 	return symbol_count;
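For clarity, here is a standalone sketch of that leading-byte classification applied to the byte sequences from the comparison above (illustrative only, not part of the patch). Note that it is not byte-for-byte identical to MySQL: a valid leading byte always consumes its declared length even when its continuation bytes are broken, so X'C03F' counts as one symbol here while MySQL reports two.

#include <stdio.h>

/* Same leading-byte classification as in the patch above. */
static int
count_symbols(const unsigned char *str, int byte_len)
{
	int symbol_count = 0;
	for (int i = 0; i < byte_len;) {
		if ((str[i] & 0x80) == 0)		/* ASCII */
			i += 1;
		else if ((str[i] & 0xe0) == 0xc0)	/* 2-byte lead */
			i += 2;
		else if ((str[i] & 0xf0) == 0xe0)	/* 3-byte lead */
			i += 3;
		else if ((str[i] & 0xf8) == 0xf0)	/* 4-byte lead */
			i += 4;
		else					/* invalid byte */
			i += 1;
		symbol_count++;
	}
	return symbol_count;
}

int
main(void)
{
	const unsigned char a[] = { 0x80 };             /* -> 1 */
	const unsigned char b[] = { 0x7f, 0x80, 0x81 }; /* -> 3 */
	const unsigned char c[] = { 0x61, 0xc0 };       /* -> 2 */
	const unsigned char d[] = { 0xc0, 0x3f };       /* -> 1, MySQL says 2 */
	const unsigned char e[] = { 0xe0, 0x80, 0x0f }; /* -> 1, MySQL says 3 */
	printf("%d %d %d %d %d\n",
	       count_symbols(a, 1), count_symbols(b, 3), count_symbols(c, 2),
	       count_symbols(d, 2), count_symbols(e, 3));
	return 0;
}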
>
>>>> +		symbol_len++;
>>>> 	}
>>>> -	return n_chars;
>>>> +	return symbol_len;
>>>> }
>>>>
>>>> diff --git a/test/sql-tap/badutf1.test.lua b/test/sql-tap/badutf1.test.lua
>>>> index 534c762ba..0a90d1b17 100755
>>>> --- a/test/sql-tap/badutf1.test.lua
>>>> +++ b/test/sql-tap/badutf1.test.lua
>>>> @@ -215,7 +215,7 @@ test:do_test(
>>>>         return test:execsql2("SELECT length('\x80') AS x")
>>>>     end, {
>>>>         --
>>>> -        "X", 1
>>>> +        "X", 0
>>>
>>> Why have this and the rest of the tests below been changed?
>>> I guess because now they contain invalid utf8 symbols.
>>
>> They contained invalid utf8 symbols before. The tests are changed because
>> I changed the way we handle malformed utf8 strings.
>
> Ok, but at least underline this fact in the commit message.

Ok.

>
>>>>         --
>>>>     })
>>>>
>>>> @@ -235,7 +235,7 @@ test:do_test(
>>>>         return test:execsql2("SELECT length('\x7f\x80\x81') AS x")
>>>>     end, {
>>>>         --
>>>> -        "X", 3
>>>> +        "X", 1
>>>
>>> But wait, I execute
>>> tarantool> SELECT length('\x7f\x80\x81') AS x
>>> ---
>>> - - [12]
>>> ...
>>
>> Maybe you got this result because '\x7f' is treated by tarantool SQL
>> as a simple string with 4 characters?
>
> And why is it treated the other way in the tests?

When you use the SQL interface in the console and type
SELECT LENGTH('\x7f\x80\x81') AS x;
things like '\x7f' are not treated as hexadecimal literals, they are
just treated as the symbols '\', 'x', '7', 'f'.
(And if you want to use hexadecimal literals, you should use:

> \set language sql
> select length(X'7f' || X'80' || X'81');
---
- - [3]
)

>
>>> Looks extremely strange. What do these tests check at all?
>>> Why can't we use simple sqlexecute or smth? This test suite
>>> is so broken...
>>
>> I think these tests check that the DB does not crash if malformed utf8
>> is encountered.
>
> I mean why are the results from a console run different from the test
> results in our suite? Still can't understand that. Now I execute this:
> tarantool> box.sql.execute('select length("\x7f\x80\x81")')
>
> And got a strange error:
> ---
> - error: !!binary bm8gc3VjaCBjb2x1bW46IH+AgQ==
> ...

I would like to investigate it in the scope of a separate task, because
it seems to be quite a time-consuming problem.
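Coming back to the length-12 result above, a minimal sketch (purely illustrative) of why the console reports 12: in that context the literal is the twelve plain ASCII characters '\', 'x', '7', 'f', and so on, and for ASCII-only input every counting variant discussed here returns one symbol per byte.

#include <stdio.h>
#include <string.h>

int
main(void)
{
	/* The backslashes are kept verbatim, as the console keeps them. */
	const char *literal = "\\x7f\\x80\\x81";
	/* Every byte is < 0x80, so symbols == bytes. */
	printf("%zu\n", strlen(literal));	/* prints 12 */
	return 0;
}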
At the end, we got these changes in badutf1.test.lua:

diff --git a/test/sql-tap/badutf1.test.lua b/test/sql-tap/badutf1.test.lua
index 534c762ba..2b9f9b209 100755
--- a/test/sql-tap/badutf1.test.lua
+++ b/test/sql-tap/badutf1.test.lua
@@ -255,7 +255,7 @@ test:do_test(
         return test:execsql2("SELECT length('\x61\xc0\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80') AS x")
     end, {
         --
-        "X", 2
+        "X", 11
         --
     })

@@ -265,7 +265,7 @@ test:do_test(
         return test:execsql2("SELECT length('\xc0\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80') AS x")
     end, {
         --
-        "X", 1
+        "X", 10
         --
     })

@@ -285,7 +285,7 @@ test:do_test(
         return test:execsql2("SELECT length('\x80\x80\x80\x80\x80\xf0\x80\x80\x80\x80') AS x")
     end, {
         --
-        "X", 6
+        "X", 7
         --
     })
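As a sanity check (illustrative, not part of the patch), applying the same leading-byte classification to these three test strings reproduces the updated expectations of 11, 10 and 7:

#include <stdio.h>
#include <string.h>

/* Same leading-byte classification as utf8_char_count() above. */
static int
count_symbols(const unsigned char *str, int byte_len)
{
	int symbol_count = 0;
	for (int i = 0; i < byte_len;) {
		if ((str[i] & 0x80) == 0)
			i += 1;
		else if ((str[i] & 0xe0) == 0xc0)
			i += 2;
		else if ((str[i] & 0xf0) == 0xe0)
			i += 3;
		else if ((str[i] & 0xf8) == 0xf0)
			i += 4;
		else
			i += 1;
		symbol_count++;
	}
	return symbol_count;
}

int
main(void)
{
	const char *t1 = "\x61\xc0\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80";
	const char *t2 = "\xc0\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80";
	const char *t3 = "\x80\x80\x80\x80\x80\xf0\x80\x80\x80\x80";
	printf("%d %d %d\n",
	       count_symbols((const unsigned char *)t1, (int)strlen(t1)),
	       count_symbols((const unsigned char *)t2, (int)strlen(t2)),
	       count_symbols((const unsigned char *)t3, (int)strlen(t3)));
	/* prints "11 10 7" */
	return 0;
}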