From: "n.pettik" <korablev@tarantool.org>
To: tarantool-patches@freelists.org
Cc: Ivan Koptelov <ivan.koptelov@tarantool.org>
Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
Date: Thu, 14 Feb 2019 15:57:45 +0300 [thread overview]
Message-ID: <FC2CA838-42C3-4494-BC6C-37C864D4BF79@tarantool.org> (raw)
In-Reply-To: <7E6CE8AA-512D-4472-9DBD-8159073386C5@tarantool.org>
I see some strange file on your branch:
https://github.com/tarantool/tarantool/commit/0a34e978c8023dcb4f99965d93203a99adf21520
I guess it’s an artefact from your investigation.
If you want to share some results/proposals, just
attach them to the this letter or write to dev list.
> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> index 8ad75adb8..bab102988 100644
> --- a/src/box/sql/func.c
> +++ b/src/box/sql/func.c
> @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
> * @return number of symbols in the given string.
> */
> static int
> -char_count(const unsigned char *str, size_t byte_len)
> +utf8_char_count(const unsigned char *str, int byte_len)
> {
>>
>> Why unsigned char?
> I used unsigned char because sqlite3_value_text() returns result
> of this type.
Ok, let it be.
> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> index 8ad75adb8..bab102988 100644
> --- a/src/box/sql/func.c
> +++ b/src/box/sql/func.c
> @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
> * @return number of symbols in the given string.
> */
> static int
> -char_count(const unsigned char *str, size_t byte_len)
> +utf8_char_count(const unsigned char *str, int byte_len)
> {
> - int symbol_len = 0;
> + int symbol_count = 0;
> int offset = 0;
> UChar32 res;
> - while (offset < (int) byte_len) {
> - U8_NEXT(str, offset, (int) byte_len, res)
> + while (offset < byte_len) {
> + U8_NEXT(str, offset, byte_len, res)
> if (res < 0)
> break;
> - symbol_len++;
> + symbol_count++;
> }
> - return symbol_len;
> + return symbol_count;
> }
>
>>
>>> + if (res < 0)
>>> + break;
>>
>> Hm, so if a sequence contains non-utf8 symbols,
>> you simply cut the string. Not sure it is solid approach.
>> Could you please check how other DBs behave (do they throw
>> an error? Or ignore invalid symbols?) in similar situation and
>> what standard says.
> I’ve tested statements with LENGTH from the test badutf1 in different
> DBs. PostgreSQL raised an error "invalid byte sequence for encoding “UTF8”.
> MySQL, DB2 and MSSQL behaved all the same way: count each invalid byte as
> a symbol. For example:
>
> 0x80, 0x7f, 0x81 are all invalid first bytes from
> UTF8 point of view, 0xC03F is bad two byte seq. where
> first byte is ok and second is broken, 0xE0800F is
> bad three byte seq. where first two bytes are ok, but the
> third is broken.
>
> (this is MySQL SQL)
>
> select length(X'80'); 1
> select length(concat(X'7f', X'80', X'81')); 3
> select length(concat(X'61', X'c0')); 2
> select length(concat(X'61', X'c0', X'80', X'80', X'80',
> X'80', X'80', X'80', X'80', X'80', X'80', X'80')); 12
>
> select length(X'C03F'); 2
> select length(concat(X'C03F', 'a')); 3
> select length(X'E0800F'); 3
> select length(concat(X'E0800F', 'a')); 4
>
> I have not found anything in standard about dealing with
> invalid UTF8 sequences.
>
> Even before the patch test gave results different from what
> all others DBs.
>
> I propose to behave all other DBs do, as I describe above.
> Is it ok?
PostgreSQL makes me feel uncertain concerning this question.
Anyway, I still believe that it’s better to raise an error.
I propose you ask other members of core team for their opinion.
Additionally, you could ask author of origin issue and our expert P. Gulutzan.
>>> + symbol_len++;
>>> }
>>> - return n_chars;
>>> + return symbol_len;
>>> }
>>>
>>> diff --git a/test/sql-tap/badutf1.test.lua b/test/sql-tap/badutf1.test.lua
>>> index 534c762ba..0a90d1b17 100755
>>> --- a/test/sql-tap/badutf1.test.lua
>>> +++ b/test/sql-tap/badutf1.test.lua
>>> @@ -215,7 +215,7 @@ test:do_test(
>>> return test:execsql2("SELECT length('\x80') AS x")
>>> end, {
>>> -- <badutf-3.1>
>>> - "X", 1
>>> + "X", 0
>>
>> Why this and the rest of tests below has been changed?
>> I guess because now they contain invalid utf8 symbols.
> They contained invalid utf8 symbols before. The tests are changed because
> I changed the way we handle malformed utf8 strings.
Ok, but at least underline this fact in commit message.
>>> -- </badutf-3.1>
>>> })
>>>
>>> @@ -235,7 +235,7 @@ test:do_test(
>>> return test:execsql2("SELECT length('\x7f\x80\x81') AS x")
>>> end, {
>>> -- <badutf-3.3>
>>> - "X", 3
>>> + "X", 1
>>
>> But wait, I execute
>> tarantool> SELECT length('\x7f\x80\x81') AS x
>> ---
>> - - [12]
>> ...
>>
> Maybe you got this result because ‘\x7f’ is treated by tarantool sql
> as a simple string with 4 characters?
And why in tests it is treated in other way?
>> Looks extremely strangle. What do these tests check at all?
>> Why we can’t use simple sqlexecute or smth? This test suite
>> is so broken...
> I think these tests check that DB does not crash if malformed utf8 is
> encountered.
I mean why results from console run are different from tests
results in our suite? Still can’t understand that. Now I execute this:
tarantool> box.sql.execute('select length("\x7f\x80\x81")’)
And got strange error:
---
- error: !!binary bm8gc3VjaCBjb2x1bW46IH+AgQ==
...
next prev parent reply other threads:[~2019-02-14 12:57 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-01-29 9:56 [tarantool-patches] " Ivan Koptelov
2019-01-29 16:35 ` [tarantool-patches] " n.pettik
2019-02-04 12:34 ` Ivan Koptelov
2019-02-05 13:50 ` n.pettik
2019-02-07 15:14 ` i.koptelov
2019-02-11 13:15 ` n.pettik
2019-02-13 15:46 ` i.koptelov
2019-02-14 12:57 ` n.pettik [this message]
2019-02-20 13:54 ` i.koptelov
2019-02-20 15:47 ` i.koptelov
2019-02-20 16:04 ` n.pettik
2019-02-20 18:08 ` Vladislav Shpilevoy
2019-02-20 19:24 ` i.koptelov
2019-02-22 12:59 ` n.pettik
2019-02-25 11:09 ` i.koptelov
2019-02-25 15:10 ` n.pettik
2019-02-26 13:33 ` i.koptelov
2019-02-26 17:50 ` n.pettik
2019-02-26 18:44 ` i.koptelov
2019-02-26 20:16 ` Vladislav Shpilevoy
2019-03-04 11:59 ` i.koptelov
2019-03-04 15:30 ` Kirill Yukhin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=FC2CA838-42C3-4494-BC6C-37C864D4BF79@tarantool.org \
--to=korablev@tarantool.org \
--cc=ivan.koptelov@tarantool.org \
--cc=tarantool-patches@freelists.org \
--subject='[tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\''\0'\''' \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox