[tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'

Mon Feb 11 16:15:37 MSK 2019

> On 7 Feb 2019, at 18:14, i.koptelov <ivan.koptelov at tarantool.org> wrote:
>> On 5 Feb 2019, at 16:50, n.pettik <korablev at tarantool.org> wrote:
>>> On 29/01/2019 19:35, n.pettik wrote:
>>>>> Fixes LIKE and LENGTH functions. '\0' now treated as
>>>>> 
>>>> Nit: is treated.
>>> Fixed.
>>>>> a usual symbol. Strings with '\0' are now processed
>>>>> entirely. Consider examples:
>>>>> 
>>>>> LENGTH(CHAR(65,00,65)) == 3
>>>>> LIKE(CHAR(65,00,65), CHAR(65,00,66)) == False
>>>>> 
>>>> Also, I see that smth wrong with text in this mail again
>>> I hope now the mail text is ok.
>> 
>> Not quite. It is still highlighted in some way. Have no idea.
> Oh, I am really sorry. Does it look ok now? (I changed the mailing client) 

Yes, it does.

>>> src/box/sql/func.c         |  88 +++++++++++++-----
>>> src/box/sql/vdbeInt.h      |   2 +-
>>> test/sql-tap/func.test.lua | 220 ++++++++++++++++++++++++++++++++++++++++++++-
>>> 3 files changed, 284 insertions(+), 26 deletions(-)
>>> 
>>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>>> index e46b162d9..2978af983 100644
>>> --- a/src/box/sql/func.c
>>> +++ b/src/box/sql/func.c
>>> @@ -128,6 +128,30 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
>>> 	sqlite3_result_text(context, z, -1, SQLITE_STATIC);
>>> }
>>> 
>>> +/**
>>> + * Return number of chars in the given string.
>>> + *
>>> + * Number of chars != byte size of string because some characters
>>> + * are encoded with more than one byte. Also note that all
>>> + * characters from 'str' to 'str + byte_len' would be counted,
>>> + * even if there is a '\0' somewhere between them.
>>> + * @param str String to be counted.
>>> + * @param byte_len Byte length of given string.
>>> + * @return
>>> 
>> Return what?
> Fixed.
>>> + */
>>> +static int
>>> +count_chars(const unsigned char *str, size_t byte_len)
>>> 
>> Quite poor naming. I would call it utf8_str_len or
>> smth with utf8 prefix. Mb it is worth to put it some utils source file.
>> Also, consider using native U8_NEXT function from utf8.c,
>> instead of custom SQLITE_SKIP_UTF8. It may be not so fast
>> but safer I suppose. I don't insist though.
>>> +{
>> What if str is NULL? Add at least an assertion.
>>> +	int n_chars = 0;
>>> +	const unsigned char *prev_z;
>>> +	for (size_t cnt = 0; cnt < byte_len; cnt += (str - prev_z)) {
>>> +		n_chars++;
>>> +		prev_z = str;
>>> +		SQLITE_SKIP_UTF8(str);
>>> +	}
>>> +	return n_chars;
>>> +}
>>> 
>> You can rewrite this function in a simpler way without using SQLITE macroses.
>> Read this topic: https://stackoverflow.com/questions/3911536/utf-8-unicode-whats-with-0xc0-and-0x80/3911566#3911566
>> It is quite useful. You may borrow implementation from there.
> Both usage of U8_NEXT and the solution from stack overflow causes sql-tap/badutf1 test to fail,
> while func.test (and other tests) are OK. In other words, SQLITE_SKIP_UTF8 and U8_NEXT
> proceeds Invalid byte sequences differently.
> As far as I understand, the purpose of the test is to check that malformed UTF-8 would not cause a crash.
> So, I just used U8_NEXT and fixed the test.
> 
> (Since the function is used only in func.c, I prefer to keep it there and don’t move to separate utils file)
> 
> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> index 2978af983..8ad75adb8 100644
> --- a/src/box/sql/func.c
> +++ b/src/box/sql/func.c
> @@ -129,27 +129,29 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
> }
> 
> /**
> - * Return number of chars in the given string.
> + * Return number of symbols in the given string.
>  *
> - * Number of chars != byte size of string because some characters
> + * Number of symbols != byte size of string because some symbols
>  * are encoded with more than one byte. Also note that all
> - * characters from 'str' to 'str + byte_len' would be counted,
> + * symbols from 'str' to 'str + byte_len' would be counted,
>  * even if there is a '\0' somewhere between them.
>  * @param str String to be counted.
>  * @param byte_len Byte length of given string.
> - * @return
> + * @return number of symbols in the given string.
>  */
> static int
> -count_chars(const unsigned char *str, size_t byte_len)
> +char_count(const unsigned char *str, size_t byte_len)

As I pointed in previous review, use utf8_ prefix:
utf8_strlen, utf8_str_char_count or sort of.

Why unsigned char?

> {
> -	int n_chars = 0;
> -	const unsigned char *prev_z;
> -	for (size_t cnt = 0; cnt < byte_len; cnt += (str - prev_z)) {
> -		n_chars++;
> -		prev_z = str;
> -		SQLITE_SKIP_UTF8(str);
> +	int symbol_len = 0;

Name is confusing: it sounds like a length of a single symbol.
symbol_count would be better.

> +	int offset = 0;
> +	UChar32 res;
> +	while (offset < (int) byte_len) {
> +		U8_NEXT(str, offset, (int) byte_len, res)

Instead of two explicit casts to int, lets make byte_len
be of type int.

> +		if (res < 0)
> +			break;

Hm, so if a sequence contains non-utf8 symbols,
you simply cut the string. Not sure it is solid approach.
Could you please check how other DBs behave (do they throw
an error? Or ignore invalid symbols?) in similar situation and
what standard says.

> +		symbol_len++;
> 	}
> -	return n_chars;
> +	return symbol_len;
> }
> 
> diff --git a/test/sql-tap/badutf1.test.lua b/test/sql-tap/badutf1.test.lua
> index 534c762ba..0a90d1b17 100755
> --- a/test/sql-tap/badutf1.test.lua
> +++ b/test/sql-tap/badutf1.test.lua
> @@ -215,7 +215,7 @@ test:do_test(
>         return test:execsql2("SELECT length('\x80') AS x")
>     end, {
>         -- <badutf-3.1>
> -        "X", 1
> +        "X", 0

Why this and the rest of tests below has been changed?
I guess because now they contain invalid utf8 symbols.

>         -- </badutf-3.1>
>     })
> 
> @@ -235,7 +235,7 @@ test:do_test(
>         return test:execsql2("SELECT length('\x7f\x80\x81') AS x")
>     end, {
>         -- <badutf-3.3>
> -        "X", 3
> +        "X", 1

But wait, I execute
tarantool> SELECT length('\x7f\x80\x81') AS x
---
- - [12]
...

Looks extremely strangle. What do these tests check at all?
Why we can’t use simple sqlexecute or smth? This test suite
is so broken...

>>> @@ -388,12 +405,21 @@ substrFunc(sqlite3_context * context, int argc, sqlite3_value ** argv)
>>> 	}
>>> 	assert(p1 >= 0 && p2 >= 0);
>>> 	if (p0type != SQLITE_BLOB) {
>>> -		while (*z && p1) {
>>> +		/*
>>> +		 * In the code below 'cnt' and 'n_chars' is
>>> +		 * used because '\0' is not supposed to be
>>> +		 * end-of-string symbol.
>>> +		 */
>>> +		int n_chars = count_chars(z, sqlite3_value_bytes(argv[0]));
>>> 
>> I’d better call it char_count or symbol_count or char_count.
> 
>>> diff --git a/test/sql-tap/func.test.lua b/test/sql-tap/func.test.lua
>>> index b7de1d955..8c712bd5e 100755
>>> --- a/test/sql-tap/func.test.lua
>>> +++ b/test/sql-tap/func.test.lua
>>> +-- REPLACE
>>> +test:do_execsql_test(
>>> +    "func-62",
>>> +    "SELECT REPLACE(CHAR(00,65,00,65), CHAR(00), CHAR(65)) LIKE 'AAAA';",
>>> +    {1})
>>> +
>>> +test:do_execsql_test(
>>> +    "func-63",
>>> +    "SELECT REPLACE(CHAR(00,65,00,65), CHAR(65), CHAR(00)) \
>>> +    LIKE CHAR(00,00,00,00);",
>>> +    {1})
>>> +
>>> +-- SUBSTR
>>> +test:do_execsql_test(
>>> +    "func-64",
>>> +    "SELECT SUBSTR(CHAR(65,00,66,67), 3, 2) LIKE CHAR(66, 67);",
>>> +    {1})
>>> +
>>> +test:do_execsql_test(
>>> +    "func-65",
>>> +    "SELECT SUBSTR(CHAR(00,00,00,65), 1, 4) LIKE CHAR(00,00,00,65);",
>>> +    {1})
>>> +
>>> 
>> Just wondering: why do you use LIKE function almost in all tests?
>> 
> To compare actual result with expected one. But now I think It would be more readable like this:

Moreover, in case LIKE fails for some reason, these tests will
fail as well. Meanwhile, their purpose to test not LIKE function.
Glad you fixed that.