From: i.koptelov <ivan.koptelov@tarantool.org> To: tarantool-patches@freelists.org Cc: "n.pettik" <korablev@tarantool.org> Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0' Date: Wed, 20 Feb 2019 16:54:29 +0300 [thread overview] Message-ID: <25649276-74CD-46E7-A1EB-F4CE299E637C@tarantool.org> (raw) In-Reply-To: <FC2CA838-42C3-4494-BC6C-37C864D4BF79@tarantool.org> > On 14 Feb 2019, at 15:57, n.pettik <korablev@tarantool.org> wrote: > > I see some strange file on your branch: > > https://github.com/tarantool/tarantool/commit/0a34e978c8023dcb4f99965d93203a99adf21520 > > I guess it’s an artefact from your investigation. > If you want to share some results/proposals, just > attach them to the this letter or write to dev list. Sorry, I committed it by mistake. Removed now. > >> diff --git a/src/box/sql/func.c b/src/box/sql/func.c >> index 8ad75adb8..bab102988 100644 >> --- a/src/box/sql/func.c >> +++ b/src/box/sql/func.c >> @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv) >> * @return number of symbols in the given string. >> */ >> static int >> -char_count(const unsigned char *str, size_t byte_len) >> +utf8_char_count(const unsigned char *str, int byte_len) >> { >>> >>> Why unsigned char? >> I used unsigned char because sqlite3_value_text() returns result >> of this type. > > Ok, let it be. > >> diff --git a/src/box/sql/func.c b/src/box/sql/func.c >> index 8ad75adb8..bab102988 100644 >> --- a/src/box/sql/func.c >> +++ b/src/box/sql/func.c >> @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv) >> * @return number of symbols in the given string. >> */ >> static int >> -char_count(const unsigned char *str, size_t byte_len) >> +utf8_char_count(const unsigned char *str, int byte_len) >> { >> - int symbol_len = 0; >> + int symbol_count = 0; >> int offset = 0; >> UChar32 res; >> - while (offset < (int) byte_len) { >> - U8_NEXT(str, offset, (int) byte_len, res) >> + while (offset < byte_len) { >> + U8_NEXT(str, offset, byte_len, res) >> if (res < 0) >> break; >> - symbol_len++; >> + symbol_count++; >> } >> - return symbol_len; >> + return symbol_count; >> } >> >>> >>>> + if (res < 0) >>>> + break; >>> >>> Hm, so if a sequence contains non-utf8 symbols, >>> you simply cut the string. Not sure it is solid approach. >>> Could you please check how other DBs behave (do they throw >>> an error? Or ignore invalid symbols?) in similar situation and >>> what standard says. >> I’ve tested statements with LENGTH from the test badutf1 in different >> DBs. PostgreSQL raised an error "invalid byte sequence for encoding “UTF8”. >> MySQL, DB2 and MSSQL behaved all the same way: count each invalid byte as >> a symbol. For example: >> >> 0x80, 0x7f, 0x81 are all invalid first bytes from >> UTF8 point of view, 0xC03F is bad two byte seq. where >> first byte is ok and second is broken, 0xE0800F is >> bad three byte seq. where first two bytes are ok, but the >> third is broken. >> >> (this is MySQL SQL) >> >> select length(X'80'); 1 >> select length(concat(X'7f', X'80', X'81')); 3 >> select length(concat(X'61', X'c0')); 2 >> select length(concat(X'61', X'c0', X'80', X'80', X'80', >> X'80', X'80', X'80', X'80', X'80', X'80', X'80')); 12 >> >> select length(X'C03F'); 2 >> select length(concat(X'C03F', 'a')); 3 >> select length(X'E0800F'); 3 >> select length(concat(X'E0800F', 'a')); 4 >> >> I have not found anything in standard about dealing with >> invalid UTF8 sequences. >> >> Even before the patch test gave results different from what >> all others DBs. >> >> I propose to behave all other DBs do, as I describe above. >> Is it ok? > > PostgreSQL makes me feel uncertain concerning this question. > Anyway, I still believe that it’s better to raise an error. > I propose you ask other members of core team for their opinion. > Additionally, you could ask author of origin issue and our expert P. Gulutzan. After discussions, we decided to use the following algorithm: Starting from the first byte in string, we try to determine what kind of byte is it: first byte of 1,2,3 or 4 byte sequence. Then we skip corresponding amount of bytes and increment symbol length. (e.g 2 bytes for 2 byte sequence). If current byte is not a valid first byte of any sequence, when we skip it and increment symbol length. diff --git a/src/box/sql/func.c b/src/box/sql/func.c index bab102988..8406eb42f 100644 --- a/src/box/sql/func.c +++ b/src/box/sql/func.c @@ -135,6 +135,11 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv) * are encoded with more than one byte. Also note that all * symbols from 'str' to 'str + byte_len' would be counted, * even if there is a '\0' somewhere between them. + * + * This function is implemented to be fast and indifferent to + * correctness of string being processed. If input string has + * even one invalid utf-8 sequence, then the resulting length + * could be arbitary in these boundaries (0 < len < byte_len). * @param str String to be counted. * @param byte_len Byte length of given string. * @return number of symbols in the given string. @@ -143,12 +148,17 @@ static int utf8_char_count(const unsigned char *str, int byte_len) { int symbol_count = 0; - int offset = 0; - UChar32 res; - while (offset < byte_len) { - U8_NEXT(str, offset, byte_len, res) - if (res < 0) - break; + for (int i = 0; i < byte_len;) { + if ((str[i] & 0x80) == 0) + i += 1; + else if ((str[i] & 0xe0) == 0xc0) + i += 2; + else if ((str[i] & 0xf0) == 0xe0) + i += 3; + else if ((str[i] & 0xf8) == 0xf0) + i += 4; + else + i += 1; symbol_count++; } return symbol_count; > >>>> + symbol_len++; >>>> } >>>> - return n_chars; >>>> + return symbol_len; >>>> } >>>> >>>> diff --git a/test/sql-tap/badutf1.test.lua b/test/sql-tap/badutf1.test.lua >>>> index 534c762ba..0a90d1b17 100755 >>>> --- a/test/sql-tap/badutf1.test.lua >>>> +++ b/test/sql-tap/badutf1.test.lua >>>> @@ -215,7 +215,7 @@ test:do_test( >>>> return test:execsql2("SELECT length('\x80') AS x") >>>> end, { >>>> -- <badutf-3.1> >>>> - "X", 1 >>>> + "X", 0 >>> >>> Why this and the rest of tests below has been changed? >>> I guess because now they contain invalid utf8 symbols. >> They contained invalid utf8 symbols before. The tests are changed because >> I changed the way we handle malformed utf8 strings. > > Ok, but at least underline this fact in commit message. Ok. > >>>> -- </badutf-3.1> >>>> }) >>>> >>>> @@ -235,7 +235,7 @@ test:do_test( >>>> return test:execsql2("SELECT length('\x7f\x80\x81') AS x") >>>> end, { >>>> -- <badutf-3.3> >>>> - "X", 3 >>>> + "X", 1 >>> >>> But wait, I execute >>> tarantool> SELECT length('\x7f\x80\x81') AS x >>> --- >>> - - [12] >>> ... >>> x§ >> Maybe you got this result because ‘\x7f’ is treated by tarantool sql >> as a simple string with 4 characters? > > And why in tests it is treated in other way? When you use SQL interface in console and type SELECT LENGTH (‘\x7f\x80\x81’) AS x; things like ‘\x7f’ are not treated as hexadecimal literals, they are just treated as symbols ‘\’, ‘x’, ’7’, ‘f’ (And if you want to use hexadecimal literals, you should use: > \set language sql > select length(X'7f' || X'80' || X'81' ); --- - - [3] ) > >>> Looks extremely strangle. What do these tests check at all? >>> Why we can’t use simple sqlexecute or smth? This test suite >>> is so broken... >> I think these tests check that DB does not crash if malformed utf8 is >> encountered. > > I mean why results from console run are different from tests > results in our suite? Still can’t understand that. Now I execute this: > tarantool> box.sql.execute('select length("\x7f\x80\x81")’) > > And got strange error: > --- > - error: !!binary bm8gc3VjaCBjb2x1bW46IH+AgQ== > ... I would like to investigate it in scope of separate task, because it seems to be quite time consuming problem. At the end, we got these changes in badutf1.test.lua diff --git a/test/sql-tap/badutf1.test.lua b/test/sql-tap/badutf1.test.lua index 534c762ba..2b9f9b209 100755 --- a/test/sql-tap/badutf1.test.lua +++ b/test/sql-tap/badutf1.test.lua @@ -255,7 +255,7 @@ test:do_test( return test:execsql2("SELECT length('\x61\xc0\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80') AS x") end, { -- <badutf-3.5> - "X", 2 + "X", 11 -- </badutf-3.5> }) @@ -265,7 +265,7 @@ test:do_test( return test:execsql2("SELECT length('\xc0\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80') AS x") end, { -- <badutf-3.6> - "X", 1 + "X", 10 -- </badutf-3.6> }) @@ -285,7 +285,7 @@ test:do_test( return test:execsql2("SELECT length('\x80\x80\x80\x80\x80\xf0\x80\x80\x80\x80') AS x") end, { -- <badutf-3.8> - "X", 6 + "X", 7 -- </badutf-3.8> })
next prev parent reply other threads:[~2019-02-20 13:54 UTC|newest] Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top 2019-01-29 9:56 [tarantool-patches] " Ivan Koptelov 2019-01-29 16:35 ` [tarantool-patches] " n.pettik 2019-02-04 12:34 ` Ivan Koptelov 2019-02-05 13:50 ` n.pettik 2019-02-07 15:14 ` i.koptelov 2019-02-11 13:15 ` n.pettik 2019-02-13 15:46 ` i.koptelov 2019-02-14 12:57 ` n.pettik 2019-02-20 13:54 ` i.koptelov [this message] 2019-02-20 15:47 ` i.koptelov 2019-02-20 16:04 ` n.pettik 2019-02-20 18:08 ` Vladislav Shpilevoy 2019-02-20 19:24 ` i.koptelov 2019-02-22 12:59 ` n.pettik 2019-02-25 11:09 ` i.koptelov 2019-02-25 15:10 ` n.pettik 2019-02-26 13:33 ` i.koptelov 2019-02-26 17:50 ` n.pettik 2019-02-26 18:44 ` i.koptelov 2019-02-26 20:16 ` Vladislav Shpilevoy 2019-03-04 11:59 ` i.koptelov 2019-03-04 15:30 ` Kirill Yukhin
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=25649276-74CD-46E7-A1EB-F4CE299E637C@tarantool.org \ --to=ivan.koptelov@tarantool.org \ --cc=korablev@tarantool.org \ --cc=tarantool-patches@freelists.org \ --subject='[tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\''\0'\''' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox