[tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'

Tarantool development patches archive
 help / color / mirror / Atom feed

From: i.koptelov <ivan.koptelov@tarantool.org>
To: tarantool-patches@freelists.org
Cc: "n.pettik" <korablev@tarantool.org>
Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
Date: Wed, 13 Feb 2019 18:46:24 +0300	[thread overview]
Message-ID: <7E6CE8AA-512D-4472-9DBD-8159073386C5@tarantool.org> (raw)
In-Reply-To: <8EF5CE57-C6B5-493C-94CC-AA3C88639485@tarantool.org>



> On 11 Feb 2019, at 16:15, n.pettik <korablev@tarantool.org> wrote:
> 
>> 
>> On 7 Feb 2019, at 18:14, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>>> On 5 Feb 2019, at 16:50, n.pettik <korablev@tarantool.org> wrote:
>>>> On 29/01/2019 19:35, n.pettik wrote:
>>>>>> Fixes LIKE and LENGTH functions. '\0' now treated as
>>>>>> 
>>>>> Nit: is treated.
>>>> Fixed.
>>>>>> a usual symbol. Strings with '\0' are now processed
>>>>>> entirely. Consider examples:
>>>>>> 
>>>>>> LENGTH(CHAR(65,00,65)) == 3
>>>>>> LIKE(CHAR(65,00,65), CHAR(65,00,66)) == False
>>>>>> 
>>>>> Also, I see that smth wrong with text in this mail again
>>>> I hope now the mail text is ok.
>>> 
>>> Not quite. It is still highlighted in some way. Have no idea.
>> Oh, I am really sorry. Does it look ok now? (I changed the mailing client) 
> 
> Yes, it does.
> 
>>>> src/box/sql/func.c         |  88 +++++++++++++-----
>>>> src/box/sql/vdbeInt.h      |   2 +-
>>>> test/sql-tap/func.test.lua | 220 ++++++++++++++++++++++++++++++++++++++++++++-
>>>> 3 files changed, 284 insertions(+), 26 deletions(-)
>>>> 
>>>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>>>> index e46b162d9..2978af983 100644
>>>> --- a/src/box/sql/func.c
>>>> +++ b/src/box/sql/func.c
>>>> @@ -128,6 +128,30 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
>>>> 	sqlite3_result_text(context, z, -1, SQLITE_STATIC);
>>>> }
>>>> 
>>>> +/**
>>>> + * Return number of chars in the given string.
>>>> + *
>>>> + * Number of chars != byte size of string because some characters
>>>> + * are encoded with more than one byte. Also note that all
>>>> + * characters from 'str' to 'str + byte_len' would be counted,
>>>> + * even if there is a '\0' somewhere between them.
>>>> + * @param str String to be counted.
>>>> + * @param byte_len Byte length of given string.
>>>> + * @return
>>>> 
>>> Return what?
>> Fixed.
>>>> + */
>>>> +static int
>>>> +count_chars(const unsigned char *str, size_t byte_len)
>>>> 
>>> Quite poor naming. I would call it utf8_str_len or
>>> smth with utf8 prefix. Mb it is worth to put it some utils source file.
>>> Also, consider using native U8_NEXT function from utf8.c,
>>> instead of custom SQLITE_SKIP_UTF8. It may be not so fast
>>> but safer I suppose. I don't insist though.
>>>> +{
>>> What if str is NULL? Add at least an assertion.
>>>> +	int n_chars = 0;
>>>> +	const unsigned char *prev_z;
>>>> +	for (size_t cnt = 0; cnt < byte_len; cnt += (str - prev_z)) {
>>>> +		n_chars++;
>>>> +		prev_z = str;
>>>> +		SQLITE_SKIP_UTF8(str);
>>>> +	}
>>>> +	return n_chars;
>>>> +}
>>>> 
>>> You can rewrite this function in a simpler way without using SQLITE macroses.
>>> Read this topic: https://stackoverflow.com/questions/3911536/utf-8-unicode-whats-with-0xc0-and-0x80/3911566#3911566
>>> It is quite useful. You may borrow implementation from there.
>> Both usage of U8_NEXT and the solution from stack overflow causes sql-tap/badutf1 test to fail,
>> while func.test (and other tests) are OK. In other words, SQLITE_SKIP_UTF8 and U8_NEXT
>> proceeds Invalid byte sequences differently.
>> As far as I understand, the purpose of the test is to check that malformed UTF-8 would not cause a crash.
>> So, I just used U8_NEXT and fixed the test.
>> 
>> (Since the function is used only in func.c, I prefer to keep it there and don’t move to separate utils file)
>> 
>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>> index 2978af983..8ad75adb8 100644
>> --- a/src/box/sql/func.c
>> +++ b/src/box/sql/func.c
>> @@ -129,27 +129,29 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
>> }
>> 
>> /**
>> - * Return number of chars in the given string.
>> + * Return number of symbols in the given string.
>> *
>> - * Number of chars != byte size of string because some characters
>> + * Number of symbols != byte size of string because some symbols
>> * are encoded with more than one byte. Also note that all
>> - * characters from 'str' to 'str + byte_len' would be counted,
>> + * symbols from 'str' to 'str + byte_len' would be counted,
>> * even if there is a '\0' somewhere between them.
>> * @param str String to be counted.
>> * @param byte_len Byte length of given string.
>> - * @return
>> + * @return number of symbols in the given string.
>> */
>> static int
>> -count_chars(const unsigned char *str, size_t byte_len)
>> +char_count(const unsigned char *str, size_t byte_len)
> 
> As I pointed in previous review, use utf8_ prefix:
> utf8_strlen, utf8_str_char_count or sort of.

Sorry, overlooked this. Fixed now.


diff --git a/src/box/sql/func.c b/src/box/sql/func.c
index 8ad75adb8..bab102988 100644
--- a/src/box/sql/func.c
+++ b/src/box/sql/func.c
@@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
  * @return number of symbols in the given string.
  */
 static int
-char_count(const unsigned char *str, size_t byte_len)
+utf8_char_count(const unsigned char *str, int byte_len)
 {
> 
> Why unsigned char?
I used unsigned char because sqlite3_value_text() returns result
of this type.
> 
>> {
>> -	int n_chars = 0;
>> -	const unsigned char *prev_z;
>> -	for (size_t cnt = 0; cnt < byte_len; cnt += (str - prev_z)) {
>> -		n_chars++;
>> -		prev_z = str;
>> -		SQLITE_SKIP_UTF8(str);
>> +	int symbol_len = 0;
> 
> Name is confusing: it sounds like a length of a single symbol.
> symbol_count would be better.
Ok, thank you, fixed.

> 
>> +	int offset = 0;
>> +	UChar32 res;
>> +	while (offset < (int) byte_len) {
>> +		U8_NEXT(str, offset, (int) byte_len, res)
> 
> Instead of two explicit casts to int, lets make byte_len
> be of type int.
Ok.


diff --git a/src/box/sql/func.c b/src/box/sql/func.c
index 8ad75adb8..bab102988 100644
--- a/src/box/sql/func.c
+++ b/src/box/sql/func.c
@@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
  * @return number of symbols in the given string.
  */
 static int
-char_count(const unsigned char *str, size_t byte_len)
+utf8_char_count(const unsigned char *str, int byte_len)
 {
-	int symbol_len = 0;
+	int symbol_count = 0;
 	int offset = 0;
 	UChar32 res;
-	while (offset < (int) byte_len) {
-		U8_NEXT(str, offset, (int) byte_len, res)
+	while (offset < byte_len) {
+		U8_NEXT(str, offset, byte_len, res)
 		if (res < 0)
 			break;
-		symbol_len++;
+		symbol_count++;
 	}
-	return symbol_len;
+	return symbol_count;
 }

> 
>> +		if (res < 0)
>> +			break;
> 
> Hm, so if a sequence contains non-utf8 symbols,
> you simply cut the string. Not sure it is solid approach.
> Could you please check how other DBs behave (do they throw
> an error? Or ignore invalid symbols?) in similar situation and
> what standard says.
I’ve tested statements with LENGTH from the test badutf1 in different
DBs. PostgreSQL raised an error "invalid byte sequence for encoding “UTF8”.
MySQL, DB2 and MSSQL behaved all the same way: count each invalid byte as
a symbol. For example:

0x80, 0x7f, 0x81 are all invalid first bytes from
UTF8 point of view, 0xC03F is bad two byte seq. where
first byte is ok and second is broken, 0xE0800F is
bad three byte seq. where first two bytes are ok, but the
third is broken.

(this is MySQL SQL)

select length(X'80'); 1
select length(concat(X'7f', X'80', X'81')); 3
select length(concat(X'61', X'c0')); 2
select length(concat(X'61', X'c0', X'80', X'80', X'80',
X'80', X'80', X'80', X'80', X'80', X'80', X'80')); 12

select length(X'C03F'); 2
select length(concat(X'C03F', 'a')); 3
select length(X'E0800F'); 3
select length(concat(X'E0800F', 'a')); 4

I have not found anything in standard about dealing with
invalid UTF8 sequences.

Even before the patch test gave results different from what
all others DBs.

I propose to behave all other DBs do, as I describe above.
Is it ok?

> 
>> +		symbol_len++;
>> 	}
>> -	return n_chars;
>> +	return symbol_len;
>> }
>> 
>> diff --git a/test/sql-tap/badutf1.test.lua b/test/sql-tap/badutf1.test.lua
>> index 534c762ba..0a90d1b17 100755
>> --- a/test/sql-tap/badutf1.test.lua
>> +++ b/test/sql-tap/badutf1.test.lua
>> @@ -215,7 +215,7 @@ test:do_test(
>>        return test:execsql2("SELECT length('\x80') AS x")
>>    end, {
>>        -- <badutf-3.1>
>> -        "X", 1
>> +        "X", 0
> 
> Why this and the rest of tests below has been changed?
> I guess because now they contain invalid utf8 symbols.
They contained invalid utf8 symbols before. The tests are changed because
I changed the way we handle malformed utf8 strings.
> 
>>        -- </badutf-3.1>
>>    })
>> 
>> @@ -235,7 +235,7 @@ test:do_test(
>>        return test:execsql2("SELECT length('\x7f\x80\x81') AS x")
>>    end, {
>>        -- <badutf-3.3>
>> -        "X", 3
>> +        "X", 1
> 
> But wait, I execute
> tarantool> SELECT length('\x7f\x80\x81') AS x
> ---
> - - [12]
> ...
> 
Maybe you got this result because ‘\x7f’ is treated by tarantool sql
as a simple string with 4 characters?
> Looks extremely strangle. What do these tests check at all?
> Why we can’t use simple sqlexecute or smth? This test suite
> is so broken...
I think these tests check that DB does not crash if malformed utf8 is
encountered.
> 
>>>> @@ -388,12 +405,21 @@ substrFunc(sqlite3_context * context, int argc, sqlite3_value ** argv)
>>>> 	}
>>>> 	assert(p1 >= 0 && p2 >= 0);
>>>> 	if (p0type != SQLITE_BLOB) {
>>>> -		while (*z && p1) {
>>>> +		/*
>>>> +		 * In the code below 'cnt' and 'n_chars' is
>>>> +		 * used because '\0' is not supposed to be
>>>> +		 * end-of-string symbol.
>>>> +		 */
>>>> +		int n_chars = count_chars(z, sqlite3_value_bytes(argv[0]));
>>>> 
>>> I’d better call it char_count or symbol_count or char_count.
>> 
>>>> diff --git a/test/sql-tap/func.test.lua b/test/sql-tap/func.test.lua
>>>> index b7de1d955..8c712bd5e 100755
>>>> --- a/test/sql-tap/func.test.lua
>>>> +++ b/test/sql-tap/func.test.lua
>>>> +-- REPLACE
>>>> +test:do_execsql_test(
>>>> +    "func-62",
>>>> +    "SELECT REPLACE(CHAR(00,65,00,65), CHAR(00), CHAR(65)) LIKE 'AAAA';",
>>>> +    {1})
>>>> +
>>>> +test:do_execsql_test(
>>>> +    "func-63",
>>>> +    "SELECT REPLACE(CHAR(00,65,00,65), CHAR(65), CHAR(00)) \
>>>> +    LIKE CHAR(00,00,00,00);",
>>>> +    {1})
>>>> +
>>>> +-- SUBSTR
>>>> +test:do_execsql_test(
>>>> +    "func-64",
>>>> +    "SELECT SUBSTR(CHAR(65,00,66,67), 3, 2) LIKE CHAR(66, 67);",
>>>> +    {1})
>>>> +
>>>> +test:do_execsql_test(
>>>> +    "func-65",
>>>> +    "SELECT SUBSTR(CHAR(00,00,00,65), 1, 4) LIKE CHAR(00,00,00,65);",
>>>> +    {1})
>>>> +
>>>> 
>>> Just wondering: why do you use LIKE function almost in all tests?
>>> 
>> To compare actual result with expected one. But now I think It would be more readable like this:
> 
> Moreover, in case LIKE fails for some reason, these tests will
> fail as well. Meanwhile, their purpose to test not LIKE function.
> Glad you fixed that.
Thanks.

next prev parent reply	other threads:[~2019-02-13 15:46 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-29  9:56 [tarantool-patches] " Ivan Koptelov
2019-01-29 16:35 ` [tarantool-patches] " n.pettik
2019-02-04 12:34   ` Ivan Koptelov
2019-02-05 13:50     ` n.pettik
2019-02-07 15:14       ` i.koptelov
2019-02-11 13:15         ` n.pettik
2019-02-13 15:46           ` i.koptelov [this message]
2019-02-14 12:57             ` n.pettik
2019-02-20 13:54               ` i.koptelov
2019-02-20 15:47                 ` i.koptelov
2019-02-20 16:04                   ` n.pettik
2019-02-20 18:08                     ` Vladislav Shpilevoy
2019-02-20 19:24                     ` i.koptelov
2019-02-22 12:59                       ` n.pettik
2019-02-25 11:09                         ` i.koptelov
2019-02-25 15:10                           ` n.pettik
2019-02-26 13:33                             ` i.koptelov
2019-02-26 17:50                               ` n.pettik
2019-02-26 18:44                                 ` i.koptelov
2019-02-26 20:16                                   ` Vladislav Shpilevoy
2019-03-04 11:59                                     ` i.koptelov
2019-03-04 15:30 ` Kirill Yukhin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7E6CE8AA-512D-4472-9DBD-8159073386C5@tarantool.org \
    --to=ivan.koptelov@tarantool.org \
    --cc=korablev@tarantool.org \
    --cc=tarantool-patches@freelists.org \
    --subject='[tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\''\0'\''' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox