From: i.koptelov <ivan.koptelov@tarantool.org> To: tarantool-patches@freelists.org Cc: "n.pettik" <korablev@tarantool.org> Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0' Date: Wed, 13 Feb 2019 18:46:24 +0300 [thread overview] Message-ID: <7E6CE8AA-512D-4472-9DBD-8159073386C5@tarantool.org> (raw) In-Reply-To: <8EF5CE57-C6B5-493C-94CC-AA3C88639485@tarantool.org> > On 11 Feb 2019, at 16:15, n.pettik <korablev@tarantool.org> wrote: > >> >> On 7 Feb 2019, at 18:14, i.koptelov <ivan.koptelov@tarantool.org> wrote: >>> On 5 Feb 2019, at 16:50, n.pettik <korablev@tarantool.org> wrote: >>>> On 29/01/2019 19:35, n.pettik wrote: >>>>>> Fixes LIKE and LENGTH functions. '\0' now treated as >>>>>> >>>>> Nit: is treated. >>>> Fixed. >>>>>> a usual symbol. Strings with '\0' are now processed >>>>>> entirely. Consider examples: >>>>>> >>>>>> LENGTH(CHAR(65,00,65)) == 3 >>>>>> LIKE(CHAR(65,00,65), CHAR(65,00,66)) == False >>>>>> >>>>> Also, I see that smth wrong with text in this mail again >>>> I hope now the mail text is ok. >>> >>> Not quite. It is still highlighted in some way. Have no idea. >> Oh, I am really sorry. Does it look ok now? (I changed the mailing client) > > Yes, it does. > >>>> src/box/sql/func.c | 88 +++++++++++++----- >>>> src/box/sql/vdbeInt.h | 2 +- >>>> test/sql-tap/func.test.lua | 220 ++++++++++++++++++++++++++++++++++++++++++++- >>>> 3 files changed, 284 insertions(+), 26 deletions(-) >>>> >>>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c >>>> index e46b162d9..2978af983 100644 >>>> --- a/src/box/sql/func.c >>>> +++ b/src/box/sql/func.c >>>> @@ -128,6 +128,30 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv) >>>> sqlite3_result_text(context, z, -1, SQLITE_STATIC); >>>> } >>>> >>>> +/** >>>> + * Return number of chars in the given string. >>>> + * >>>> + * Number of chars != byte size of string because some characters >>>> + * are encoded with more than one byte. Also note that all >>>> + * characters from 'str' to 'str + byte_len' would be counted, >>>> + * even if there is a '\0' somewhere between them. >>>> + * @param str String to be counted. >>>> + * @param byte_len Byte length of given string. >>>> + * @return >>>> >>> Return what? >> Fixed. >>>> + */ >>>> +static int >>>> +count_chars(const unsigned char *str, size_t byte_len) >>>> >>> Quite poor naming. I would call it utf8_str_len or >>> smth with utf8 prefix. Mb it is worth to put it some utils source file. >>> Also, consider using native U8_NEXT function from utf8.c, >>> instead of custom SQLITE_SKIP_UTF8. It may be not so fast >>> but safer I suppose. I don't insist though. >>>> +{ >>> What if str is NULL? Add at least an assertion. >>>> + int n_chars = 0; >>>> + const unsigned char *prev_z; >>>> + for (size_t cnt = 0; cnt < byte_len; cnt += (str - prev_z)) { >>>> + n_chars++; >>>> + prev_z = str; >>>> + SQLITE_SKIP_UTF8(str); >>>> + } >>>> + return n_chars; >>>> +} >>>> >>> You can rewrite this function in a simpler way without using SQLITE macroses. >>> Read this topic: https://stackoverflow.com/questions/3911536/utf-8-unicode-whats-with-0xc0-and-0x80/3911566#3911566 >>> It is quite useful. You may borrow implementation from there. >> Both usage of U8_NEXT and the solution from stack overflow causes sql-tap/badutf1 test to fail, >> while func.test (and other tests) are OK. In other words, SQLITE_SKIP_UTF8 and U8_NEXT >> proceeds Invalid byte sequences differently. >> As far as I understand, the purpose of the test is to check that malformed UTF-8 would not cause a crash. >> So, I just used U8_NEXT and fixed the test. >> >> (Since the function is used only in func.c, I prefer to keep it there and don’t move to separate utils file) >> >> diff --git a/src/box/sql/func.c b/src/box/sql/func.c >> index 2978af983..8ad75adb8 100644 >> --- a/src/box/sql/func.c >> +++ b/src/box/sql/func.c >> @@ -129,27 +129,29 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv) >> } >> >> /** >> - * Return number of chars in the given string. >> + * Return number of symbols in the given string. >> * >> - * Number of chars != byte size of string because some characters >> + * Number of symbols != byte size of string because some symbols >> * are encoded with more than one byte. Also note that all >> - * characters from 'str' to 'str + byte_len' would be counted, >> + * symbols from 'str' to 'str + byte_len' would be counted, >> * even if there is a '\0' somewhere between them. >> * @param str String to be counted. >> * @param byte_len Byte length of given string. >> - * @return >> + * @return number of symbols in the given string. >> */ >> static int >> -count_chars(const unsigned char *str, size_t byte_len) >> +char_count(const unsigned char *str, size_t byte_len) > > As I pointed in previous review, use utf8_ prefix: > utf8_strlen, utf8_str_char_count or sort of. Sorry, overlooked this. Fixed now. diff --git a/src/box/sql/func.c b/src/box/sql/func.c index 8ad75adb8..bab102988 100644 --- a/src/box/sql/func.c +++ b/src/box/sql/func.c @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv) * @return number of symbols in the given string. */ static int -char_count(const unsigned char *str, size_t byte_len) +utf8_char_count(const unsigned char *str, int byte_len) { > > Why unsigned char? I used unsigned char because sqlite3_value_text() returns result of this type. > >> { >> - int n_chars = 0; >> - const unsigned char *prev_z; >> - for (size_t cnt = 0; cnt < byte_len; cnt += (str - prev_z)) { >> - n_chars++; >> - prev_z = str; >> - SQLITE_SKIP_UTF8(str); >> + int symbol_len = 0; > > Name is confusing: it sounds like a length of a single symbol. > symbol_count would be better. Ok, thank you, fixed. > >> + int offset = 0; >> + UChar32 res; >> + while (offset < (int) byte_len) { >> + U8_NEXT(str, offset, (int) byte_len, res) > > Instead of two explicit casts to int, lets make byte_len > be of type int. Ok. diff --git a/src/box/sql/func.c b/src/box/sql/func.c index 8ad75adb8..bab102988 100644 --- a/src/box/sql/func.c +++ b/src/box/sql/func.c @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv) * @return number of symbols in the given string. */ static int -char_count(const unsigned char *str, size_t byte_len) +utf8_char_count(const unsigned char *str, int byte_len) { - int symbol_len = 0; + int symbol_count = 0; int offset = 0; UChar32 res; - while (offset < (int) byte_len) { - U8_NEXT(str, offset, (int) byte_len, res) + while (offset < byte_len) { + U8_NEXT(str, offset, byte_len, res) if (res < 0) break; - symbol_len++; + symbol_count++; } - return symbol_len; + return symbol_count; } > >> + if (res < 0) >> + break; > > Hm, so if a sequence contains non-utf8 symbols, > you simply cut the string. Not sure it is solid approach. > Could you please check how other DBs behave (do they throw > an error? Or ignore invalid symbols?) in similar situation and > what standard says. I’ve tested statements with LENGTH from the test badutf1 in different DBs. PostgreSQL raised an error "invalid byte sequence for encoding “UTF8”. MySQL, DB2 and MSSQL behaved all the same way: count each invalid byte as a symbol. For example: 0x80, 0x7f, 0x81 are all invalid first bytes from UTF8 point of view, 0xC03F is bad two byte seq. where first byte is ok and second is broken, 0xE0800F is bad three byte seq. where first two bytes are ok, but the third is broken. (this is MySQL SQL) select length(X'80'); 1 select length(concat(X'7f', X'80', X'81')); 3 select length(concat(X'61', X'c0')); 2 select length(concat(X'61', X'c0', X'80', X'80', X'80', X'80', X'80', X'80', X'80', X'80', X'80', X'80')); 12 select length(X'C03F'); 2 select length(concat(X'C03F', 'a')); 3 select length(X'E0800F'); 3 select length(concat(X'E0800F', 'a')); 4 I have not found anything in standard about dealing with invalid UTF8 sequences. Even before the patch test gave results different from what all others DBs. I propose to behave all other DBs do, as I describe above. Is it ok? > >> + symbol_len++; >> } >> - return n_chars; >> + return symbol_len; >> } >> >> diff --git a/test/sql-tap/badutf1.test.lua b/test/sql-tap/badutf1.test.lua >> index 534c762ba..0a90d1b17 100755 >> --- a/test/sql-tap/badutf1.test.lua >> +++ b/test/sql-tap/badutf1.test.lua >> @@ -215,7 +215,7 @@ test:do_test( >> return test:execsql2("SELECT length('\x80') AS x") >> end, { >> -- <badutf-3.1> >> - "X", 1 >> + "X", 0 > > Why this and the rest of tests below has been changed? > I guess because now they contain invalid utf8 symbols. They contained invalid utf8 symbols before. The tests are changed because I changed the way we handle malformed utf8 strings. > >> -- </badutf-3.1> >> }) >> >> @@ -235,7 +235,7 @@ test:do_test( >> return test:execsql2("SELECT length('\x7f\x80\x81') AS x") >> end, { >> -- <badutf-3.3> >> - "X", 3 >> + "X", 1 > > But wait, I execute > tarantool> SELECT length('\x7f\x80\x81') AS x > --- > - - [12] > ... > Maybe you got this result because ‘\x7f’ is treated by tarantool sql as a simple string with 4 characters? > Looks extremely strangle. What do these tests check at all? > Why we can’t use simple sqlexecute or smth? This test suite > is so broken... I think these tests check that DB does not crash if malformed utf8 is encountered. > >>>> @@ -388,12 +405,21 @@ substrFunc(sqlite3_context * context, int argc, sqlite3_value ** argv) >>>> } >>>> assert(p1 >= 0 && p2 >= 0); >>>> if (p0type != SQLITE_BLOB) { >>>> - while (*z && p1) { >>>> + /* >>>> + * In the code below 'cnt' and 'n_chars' is >>>> + * used because '\0' is not supposed to be >>>> + * end-of-string symbol. >>>> + */ >>>> + int n_chars = count_chars(z, sqlite3_value_bytes(argv[0])); >>>> >>> I’d better call it char_count or symbol_count or char_count. >> >>>> diff --git a/test/sql-tap/func.test.lua b/test/sql-tap/func.test.lua >>>> index b7de1d955..8c712bd5e 100755 >>>> --- a/test/sql-tap/func.test.lua >>>> +++ b/test/sql-tap/func.test.lua >>>> +-- REPLACE >>>> +test:do_execsql_test( >>>> + "func-62", >>>> + "SELECT REPLACE(CHAR(00,65,00,65), CHAR(00), CHAR(65)) LIKE 'AAAA';", >>>> + {1}) >>>> + >>>> +test:do_execsql_test( >>>> + "func-63", >>>> + "SELECT REPLACE(CHAR(00,65,00,65), CHAR(65), CHAR(00)) \ >>>> + LIKE CHAR(00,00,00,00);", >>>> + {1}) >>>> + >>>> +-- SUBSTR >>>> +test:do_execsql_test( >>>> + "func-64", >>>> + "SELECT SUBSTR(CHAR(65,00,66,67), 3, 2) LIKE CHAR(66, 67);", >>>> + {1}) >>>> + >>>> +test:do_execsql_test( >>>> + "func-65", >>>> + "SELECT SUBSTR(CHAR(00,00,00,65), 1, 4) LIKE CHAR(00,00,00,65);", >>>> + {1}) >>>> + >>>> >>> Just wondering: why do you use LIKE function almost in all tests? >>> >> To compare actual result with expected one. But now I think It would be more readable like this: > > Moreover, in case LIKE fails for some reason, these tests will > fail as well. Meanwhile, their purpose to test not LIKE function. > Glad you fixed that. Thanks.
next prev parent reply other threads:[~2019-02-13 15:46 UTC|newest] Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top 2019-01-29 9:56 [tarantool-patches] " Ivan Koptelov 2019-01-29 16:35 ` [tarantool-patches] " n.pettik 2019-02-04 12:34 ` Ivan Koptelov 2019-02-05 13:50 ` n.pettik 2019-02-07 15:14 ` i.koptelov 2019-02-11 13:15 ` n.pettik 2019-02-13 15:46 ` i.koptelov [this message] 2019-02-14 12:57 ` n.pettik 2019-02-20 13:54 ` i.koptelov 2019-02-20 15:47 ` i.koptelov 2019-02-20 16:04 ` n.pettik 2019-02-20 18:08 ` Vladislav Shpilevoy 2019-02-20 19:24 ` i.koptelov 2019-02-22 12:59 ` n.pettik 2019-02-25 11:09 ` i.koptelov 2019-02-25 15:10 ` n.pettik 2019-02-26 13:33 ` i.koptelov 2019-02-26 17:50 ` n.pettik 2019-02-26 18:44 ` i.koptelov 2019-02-26 20:16 ` Vladislav Shpilevoy 2019-03-04 11:59 ` i.koptelov 2019-03-04 15:30 ` Kirill Yukhin
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=7E6CE8AA-512D-4472-9DBD-8159073386C5@tarantool.org \ --to=ivan.koptelov@tarantool.org \ --cc=korablev@tarantool.org \ --cc=tarantool-patches@freelists.org \ --subject='[tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\''\0'\''' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox