From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id D100820CBD for ; Thu, 14 Feb 2019 07:57:51 -0500 (EST) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id K1_ommcoAuja for ; Thu, 14 Feb 2019 07:57:51 -0500 (EST) Received: from smtpng3.m.smailru.net (smtpng3.m.smailru.net [94.100.177.149]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 194FA1FB33 for ; Thu, 14 Feb 2019 07:57:49 -0500 (EST) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 12.2 \(3445.102.3\)) Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0' From: "n.pettik" In-Reply-To: <7E6CE8AA-512D-4472-9DBD-8159073386C5@tarantool.org> Date: Thu, 14 Feb 2019 15:57:45 +0300 Content-Transfer-Encoding: quoted-printable Message-Id: References: <15e143f4-3ea7-c7d6-d8ac-8a0e20b76449@tarantool.org> <1560FF96-FECD-4368-8AF8-F8F2AE7696E3@tarantool.org> <07DBA796-6DD4-41DD-8438-104FE3AE05BB@tarantool.org> <4F4E0A7E-199C-4647-A49C-DD0E8A216527@tarantool.org> <8EF5CE57-C6B5-493C-94CC-AA3C88639485@tarantool.org> <7E6CE8AA-512D-4472-9DBD-8159073386C5@tarantool.org> Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: tarantool-patches@freelists.org Cc: Ivan Koptelov I see some strange file on your branch: = https://github.com/tarantool/tarantool/commit/0a34e978c8023dcb4f99965d9320= 3a99adf21520 I guess it=E2=80=99s an artefact from your investigation. If you want to share some results/proposals, just attach them to the this letter or write to dev list. > diff --git a/src/box/sql/func.c b/src/box/sql/func.c > index 8ad75adb8..bab102988 100644 > --- a/src/box/sql/func.c > +++ b/src/box/sql/func.c > @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int = NotUsed, sqlite3_value ** argv) > * @return number of symbols in the given string. > */ > static int > -char_count(const unsigned char *str, size_t byte_len) > +utf8_char_count(const unsigned char *str, int byte_len) > { >>=20 >> Why unsigned char? > I used unsigned char because sqlite3_value_text() returns result > of this type. Ok, let it be. > diff --git a/src/box/sql/func.c b/src/box/sql/func.c > index 8ad75adb8..bab102988 100644 > --- a/src/box/sql/func.c > +++ b/src/box/sql/func.c > @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int = NotUsed, sqlite3_value ** argv) > * @return number of symbols in the given string. > */ > static int > -char_count(const unsigned char *str, size_t byte_len) > +utf8_char_count(const unsigned char *str, int byte_len) > { > - int symbol_len =3D 0; > + int symbol_count =3D 0; > int offset =3D 0; > UChar32 res; > - while (offset < (int) byte_len) { > - U8_NEXT(str, offset, (int) byte_len, res) > + while (offset < byte_len) { > + U8_NEXT(str, offset, byte_len, res) > if (res < 0) > break; > - symbol_len++; > + symbol_count++; > } > - return symbol_len; > + return symbol_count; > } >=20 >>=20 >>> + if (res < 0) >>> + break; >>=20 >> Hm, so if a sequence contains non-utf8 symbols, >> you simply cut the string. Not sure it is solid approach. >> Could you please check how other DBs behave (do they throw >> an error? Or ignore invalid symbols?) in similar situation and >> what standard says. > I=E2=80=99ve tested statements with LENGTH from the test badutf1 in = different > DBs. PostgreSQL raised an error "invalid byte sequence for encoding = =E2=80=9CUTF8=E2=80=9D. > MySQL, DB2 and MSSQL behaved all the same way: count each invalid byte = as > a symbol. For example: >=20 > 0x80, 0x7f, 0x81 are all invalid first bytes from > UTF8 point of view, 0xC03F is bad two byte seq. where > first byte is ok and second is broken, 0xE0800F is > bad three byte seq. where first two bytes are ok, but the > third is broken. >=20 > (this is MySQL SQL) >=20 > select length(X'80'); 1 > select length(concat(X'7f', X'80', X'81')); 3 > select length(concat(X'61', X'c0')); 2 > select length(concat(X'61', X'c0', X'80', X'80', X'80', > X'80', X'80', X'80', X'80', X'80', X'80', X'80')); 12 >=20 > select length(X'C03F'); 2 > select length(concat(X'C03F', 'a')); 3 > select length(X'E0800F'); 3 > select length(concat(X'E0800F', 'a')); 4 >=20 > I have not found anything in standard about dealing with > invalid UTF8 sequences. >=20 > Even before the patch test gave results different from what > all others DBs. >=20 > I propose to behave all other DBs do, as I describe above. > Is it ok? PostgreSQL makes me feel uncertain concerning this question. Anyway, I still believe that it=E2=80=99s better to raise an error. I propose you ask other members of core team for their opinion. Additionally, you could ask author of origin issue and our expert P. = Gulutzan. >>> + symbol_len++; >>> } >>> - return n_chars; >>> + return symbol_len; >>> } >>>=20 >>> diff --git a/test/sql-tap/badutf1.test.lua = b/test/sql-tap/badutf1.test.lua >>> index 534c762ba..0a90d1b17 100755 >>> --- a/test/sql-tap/badutf1.test.lua >>> +++ b/test/sql-tap/badutf1.test.lua >>> @@ -215,7 +215,7 @@ test:do_test( >>> return test:execsql2("SELECT length('\x80') AS x") >>> end, { >>> -- >>> - "X", 1 >>> + "X", 0 >>=20 >> Why this and the rest of tests below has been changed? >> I guess because now they contain invalid utf8 symbols. > They contained invalid utf8 symbols before. The tests are changed = because > I changed the way we handle malformed utf8 strings. Ok, but at least underline this fact in commit message. >>> -- >>> }) >>>=20 >>> @@ -235,7 +235,7 @@ test:do_test( >>> return test:execsql2("SELECT length('\x7f\x80\x81') AS x") >>> end, { >>> -- >>> - "X", 3 >>> + "X", 1 >>=20 >> But wait, I execute >> tarantool> SELECT length('\x7f\x80\x81') AS x >> --- >> - - [12] >> ... >>=20 > Maybe you got this result because =E2=80=98\x7f=E2=80=99 is treated by = tarantool sql > as a simple string with 4 characters? And why in tests it is treated in other way? >> Looks extremely strangle. What do these tests check at all? >> Why we can=E2=80=99t use simple sqlexecute or smth? This test suite >> is so broken... > I think these tests check that DB does not crash if malformed utf8 is > encountered. I mean why results from console run are different from tests results in our suite? Still can=E2=80=99t understand that. Now I execute = this: tarantool> box.sql.execute('select length("\x7f\x80\x81")=E2=80=99) And got strange error: --- - error: !!binary bm8gc3VjaCBjb2x1bW46IH+AgQ=3D=3D ...