From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tarantool-patches-bounce@freelists.org>
Received: from localhost (localhost [127.0.0.1])
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 793F1266E9
	for <tarantool-patches@freelists.org>; Wed, 13 Feb 2019 10:46:28 -0500 (EST)
Received: from turing.freelists.org ([127.0.0.1])
	by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id XAIk_tCUbh3k for <tarantool-patches@freelists.org>;
	Wed, 13 Feb 2019 10:46:28 -0500 (EST)
Received: from smtp57.i.mail.ru (smtp57.i.mail.ru [217.69.128.37])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 8BE9224617
	for <tarantool-patches@freelists.org>; Wed, 13 Feb 2019 10:46:27 -0500 (EST)
Content-Type: text/plain;
	charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 12.0 \(3445.100.39\))
Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
From: i.koptelov <ivan.koptelov@tarantool.org>
In-Reply-To: <8EF5CE57-C6B5-493C-94CC-AA3C88639485@tarantool.org>
Date: Wed, 13 Feb 2019 18:46:24 +0300
Content-Transfer-Encoding: quoted-printable
Message-Id: <7E6CE8AA-512D-4472-9DBD-8159073386C5@tarantool.org>
References: <15e143f4-3ea7-c7d6-d8ac-8a0e20b76449@tarantool.org>
 <1560FF96-FECD-4368-8AF8-F8F2AE7696E3@tarantool.org>
 <fd70561d-6a0e-70dc-4c20-bdcac764040a@tarantool.org>
 <07DBA796-6DD4-41DD-8438-104FE3AE05BB@tarantool.org>
 <4F4E0A7E-199C-4647-A49C-DD0E8A216527@tarantool.org>
 <8EF5CE57-C6B5-493C-94CC-AA3C88639485@tarantool.org>
Sender: tarantool-patches-bounce@freelists.org
Errors-to: tarantool-patches-bounce@freelists.org
Reply-To: tarantool-patches@freelists.org
List-help: <mailto:ecartis@freelists.org?Subject=help>
List-unsubscribe: <tarantool-patches-request@freelists.org?Subject=unsubscribe>
List-software: Ecartis version 1.0.0
List-Id: tarantool-patches <tarantool-patches.freelists.org>
List-subscribe: <tarantool-patches-request@freelists.org?Subject=subscribe>
List-owner: <mailto:>
List-post: <mailto:tarantool-patches@freelists.org>
List-archive: <http://www.freelists.org/archives/tarantool-patches>
To: tarantool-patches@freelists.org
Cc: "n.pettik" <korablev@tarantool.org>


> On 11 Feb 2019, at 16:15, n.pettik <korablev@tarantool.org> wrote:
>=20
>>=20
>> On 7 Feb 2019, at 18:14, i.koptelov <ivan.koptelov@tarantool.org> =
wrote:
>>> On 5 Feb 2019, at 16:50, n.pettik <korablev@tarantool.org> wrote:
>>>> On 29/01/2019 19:35, n.pettik wrote:
>>>>>> Fixes LIKE and LENGTH functions. '\0' now treated as
>>>>>>=20
>>>>> Nit: is treated.
>>>> Fixed.
>>>>>> a usual symbol. Strings with '\0' are now processed
>>>>>> entirely. Consider examples:
>>>>>>=20
>>>>>> LENGTH(CHAR(65,00,65)) =3D=3D 3
>>>>>> LIKE(CHAR(65,00,65), CHAR(65,00,66)) =3D=3D False
>>>>>>=20
>>>>> Also, I see that smth wrong with text in this mail again
>>>> I hope now the mail text is ok.
>>>=20
>>> Not quite. It is still highlighted in some way. Have no idea.
>> Oh, I am really sorry. Does it look ok now? (I changed the mailing =
client)=20
>=20
> Yes, it does.
>=20
>>>> src/box/sql/func.c         |  88 +++++++++++++-----
>>>> src/box/sql/vdbeInt.h      |   2 +-
>>>> test/sql-tap/func.test.lua | 220 =
++++++++++++++++++++++++++++++++++++++++++++-
>>>> 3 files changed, 284 insertions(+), 26 deletions(-)
>>>>=20
>>>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>>>> index e46b162d9..2978af983 100644
>>>> --- a/src/box/sql/func.c
>>>> +++ b/src/box/sql/func.c
>>>> @@ -128,6 +128,30 @@ typeofFunc(sqlite3_context * context, int =
NotUsed, sqlite3_value ** argv)
>>>> 	sqlite3_result_text(context, z, -1, SQLITE_STATIC);
>>>> }
>>>>=20
>>>> +/**
>>>> + * Return number of chars in the given string.
>>>> + *
>>>> + * Number of chars !=3D byte size of string because some =
characters
>>>> + * are encoded with more than one byte. Also note that all
>>>> + * characters from 'str' to 'str + byte_len' would be counted,
>>>> + * even if there is a '\0' somewhere between them.
>>>> + * @param str String to be counted.
>>>> + * @param byte_len Byte length of given string.
>>>> + * @return
>>>>=20
>>> Return what?
>> Fixed.
>>>> + */
>>>> +static int
>>>> +count_chars(const unsigned char *str, size_t byte_len)
>>>>=20
>>> Quite poor naming. I would call it utf8_str_len or
>>> smth with utf8 prefix. Mb it is worth to put it some utils source =
file.
>>> Also, consider using native U8_NEXT function from utf8.c,
>>> instead of custom SQLITE_SKIP_UTF8. It may be not so fast
>>> but safer I suppose. I don't insist though.
>>>> +{
>>> What if str is NULL? Add at least an assertion.
>>>> +	int n_chars =3D 0;
>>>> +	const unsigned char *prev_z;
>>>> +	for (size_t cnt =3D 0; cnt < byte_len; cnt +=3D (str - prev_z)) =
{
>>>> +		n_chars++;
>>>> +		prev_z =3D str;
>>>> +		SQLITE_SKIP_UTF8(str);
>>>> +	}
>>>> +	return n_chars;
>>>> +}
>>>>=20
>>> You can rewrite this function in a simpler way without using SQLITE =
macroses.
>>> Read this topic: =
https://stackoverflow.com/questions/3911536/utf-8-unicode-whats-with-0xc0-=
and-0x80/3911566#3911566
>>> It is quite useful. You may borrow implementation from there.
>> Both usage of U8_NEXT and the solution from stack overflow causes =
sql-tap/badutf1 test to fail,
>> while func.test (and other tests) are OK. In other words, =
SQLITE_SKIP_UTF8 and U8_NEXT
>> proceeds Invalid byte sequences differently.
>> As far as I understand, the purpose of the test is to check that =
malformed UTF-8 would not cause a crash.
>> So, I just used U8_NEXT and fixed the test.
>>=20
>> (Since the function is used only in func.c, I prefer to keep it there =
and don=E2=80=99t move to separate utils file)
>>=20
>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>> index 2978af983..8ad75adb8 100644
>> --- a/src/box/sql/func.c
>> +++ b/src/box/sql/func.c
>> @@ -129,27 +129,29 @@ typeofFunc(sqlite3_context * context, int =
NotUsed, sqlite3_value ** argv)
>> }
>>=20
>> /**
>> - * Return number of chars in the given string.
>> + * Return number of symbols in the given string.
>> *
>> - * Number of chars !=3D byte size of string because some characters
>> + * Number of symbols !=3D byte size of string because some symbols
>> * are encoded with more than one byte. Also note that all
>> - * characters from 'str' to 'str + byte_len' would be counted,
>> + * symbols from 'str' to 'str + byte_len' would be counted,
>> * even if there is a '\0' somewhere between them.
>> * @param str String to be counted.
>> * @param byte_len Byte length of given string.
>> - * @return
>> + * @return number of symbols in the given string.
>> */
>> static int
>> -count_chars(const unsigned char *str, size_t byte_len)
>> +char_count(const unsigned char *str, size_t byte_len)
>=20
> As I pointed in previous review, use utf8_ prefix:
> utf8_strlen, utf8_str_char_count or sort of.

Sorry, overlooked this. Fixed now.


diff --git a/src/box/sql/func.c b/src/box/sql/func.c
index 8ad75adb8..bab102988 100644
--- a/src/box/sql/func.c
+++ b/src/box/sql/func.c
@@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, =
sqlite3_value ** argv)
  * @return number of symbols in the given string.
  */
 static int
-char_count(const unsigned char *str, size_t byte_len)
+utf8_char_count(const unsigned char *str, int byte_len)
 {
>=20
> Why unsigned char?
I used unsigned char because sqlite3_value_text() returns result
of this type.
>=20
>> {
>> -	int n_chars =3D 0;
>> -	const unsigned char *prev_z;
>> -	for (size_t cnt =3D 0; cnt < byte_len; cnt +=3D (str - prev_z)) =
{
>> -		n_chars++;
>> -		prev_z =3D str;
>> -		SQLITE_SKIP_UTF8(str);
>> +	int symbol_len =3D 0;
>=20
> Name is confusing: it sounds like a length of a single symbol.
> symbol_count would be better.
Ok, thank you, fixed.

>=20
>> +	int offset =3D 0;
>> +	UChar32 res;
>> +	while (offset < (int) byte_len) {
>> +		U8_NEXT(str, offset, (int) byte_len, res)
>=20
> Instead of two explicit casts to int, lets make byte_len
> be of type int.
Ok.


diff --git a/src/box/sql/func.c b/src/box/sql/func.c
index 8ad75adb8..bab102988 100644
--- a/src/box/sql/func.c
+++ b/src/box/sql/func.c
@@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, =
sqlite3_value ** argv)
  * @return number of symbols in the given string.
  */
 static int
-char_count(const unsigned char *str, size_t byte_len)
+utf8_char_count(const unsigned char *str, int byte_len)
 {
-	int symbol_len =3D 0;
+	int symbol_count =3D 0;
 	int offset =3D 0;
 	UChar32 res;
-	while (offset < (int) byte_len) {
-		U8_NEXT(str, offset, (int) byte_len, res)
+	while (offset < byte_len) {
+		U8_NEXT(str, offset, byte_len, res)
 		if (res < 0)
 			break;
-		symbol_len++;
+		symbol_count++;
 	}
-	return symbol_len;
+	return symbol_count;
 }

>=20
>> +		if (res < 0)
>> +			break;
>=20
> Hm, so if a sequence contains non-utf8 symbols,
> you simply cut the string. Not sure it is solid approach.
> Could you please check how other DBs behave (do they throw
> an error? Or ignore invalid symbols?) in similar situation and
> what standard says.
I=E2=80=99ve tested statements with LENGTH from the test badutf1 in =
different
DBs. PostgreSQL raised an error "invalid byte sequence for encoding =
=E2=80=9CUTF8=E2=80=9D.
MySQL, DB2 and MSSQL behaved all the same way: count each invalid byte =
as
a symbol. For example:

0x80, 0x7f, 0x81 are all invalid first bytes from
UTF8 point of view, 0xC03F is bad two byte seq. where
first byte is ok and second is broken, 0xE0800F is
bad three byte seq. where first two bytes are ok, but the
third is broken.

(this is MySQL SQL)

select length(X'80'); 1
select length(concat(X'7f', X'80', X'81')); 3
select length(concat(X'61', X'c0')); 2
select length(concat(X'61', X'c0', X'80', X'80', X'80',
X'80', X'80', X'80', X'80', X'80', X'80', X'80')); 12

select length(X'C03F'); 2
select length(concat(X'C03F', 'a')); 3
select length(X'E0800F'); 3
select length(concat(X'E0800F', 'a')); 4

I have not found anything in standard about dealing with
invalid UTF8 sequences.

Even before the patch test gave results different from what
all others DBs.

I propose to behave all other DBs do, as I describe above.
Is it ok?

>=20
>> +		symbol_len++;
>> 	}
>> -	return n_chars;
>> +	return symbol_len;
>> }
>>=20
>> diff --git a/test/sql-tap/badutf1.test.lua =
b/test/sql-tap/badutf1.test.lua
>> index 534c762ba..0a90d1b17 100755
>> --- a/test/sql-tap/badutf1.test.lua
>> +++ b/test/sql-tap/badutf1.test.lua
>> @@ -215,7 +215,7 @@ test:do_test(
>>        return test:execsql2("SELECT length('\x80') AS x")
>>    end, {
>>        -- <badutf-3.1>
>> -        "X", 1
>> +        "X", 0
>=20
> Why this and the rest of tests below has been changed?
> I guess because now they contain invalid utf8 symbols.
They contained invalid utf8 symbols before. The tests are changed =
because
I changed the way we handle malformed utf8 strings.
>=20
>>        -- </badutf-3.1>
>>    })
>>=20
>> @@ -235,7 +235,7 @@ test:do_test(
>>        return test:execsql2("SELECT length('\x7f\x80\x81') AS x")
>>    end, {
>>        -- <badutf-3.3>
>> -        "X", 3
>> +        "X", 1
>=20
> But wait, I execute
> tarantool> SELECT length('\x7f\x80\x81') AS x
> ---
> - - [12]
> ...
>=20
Maybe you got this result because =E2=80=98\x7f=E2=80=99 is treated by =
tarantool sql
as a simple string with 4 characters?
> Looks extremely strangle. What do these tests check at all?
> Why we can=E2=80=99t use simple sqlexecute or smth? This test suite
> is so broken...
I think these tests check that DB does not crash if malformed utf8 is
encountered.
>=20
>>>> @@ -388,12 +405,21 @@ substrFunc(sqlite3_context * context, int =
argc, sqlite3_value ** argv)
>>>> 	}
>>>> 	assert(p1 >=3D 0 && p2 >=3D 0);
>>>> 	if (p0type !=3D SQLITE_BLOB) {
>>>> -		while (*z && p1) {
>>>> +		/*
>>>> +		 * In the code below 'cnt' and 'n_chars' is
>>>> +		 * used because '\0' is not supposed to be
>>>> +		 * end-of-string symbol.
>>>> +		 */
>>>> +		int n_chars =3D count_chars(z, =
sqlite3_value_bytes(argv[0]));
>>>>=20
>>> I=E2=80=99d better call it char_count or symbol_count or char_count.
>>=20
>>>> diff --git a/test/sql-tap/func.test.lua =
b/test/sql-tap/func.test.lua
>>>> index b7de1d955..8c712bd5e 100755
>>>> --- a/test/sql-tap/func.test.lua
>>>> +++ b/test/sql-tap/func.test.lua
>>>> +-- REPLACE
>>>> +test:do_execsql_test(
>>>> +    "func-62",
>>>> +    "SELECT REPLACE(CHAR(00,65,00,65), CHAR(00), CHAR(65)) LIKE =
'AAAA';",
>>>> +    {1})
>>>> +
>>>> +test:do_execsql_test(
>>>> +    "func-63",
>>>> +    "SELECT REPLACE(CHAR(00,65,00,65), CHAR(65), CHAR(00)) \
>>>> +    LIKE CHAR(00,00,00,00);",
>>>> +    {1})
>>>> +
>>>> +-- SUBSTR
>>>> +test:do_execsql_test(
>>>> +    "func-64",
>>>> +    "SELECT SUBSTR(CHAR(65,00,66,67), 3, 2) LIKE CHAR(66, 67);",
>>>> +    {1})
>>>> +
>>>> +test:do_execsql_test(
>>>> +    "func-65",
>>>> +    "SELECT SUBSTR(CHAR(00,00,00,65), 1, 4) LIKE =
CHAR(00,00,00,65);",
>>>> +    {1})
>>>> +
>>>>=20
>>> Just wondering: why do you use LIKE function almost in all tests?
>>>=20
>> To compare actual result with expected one. But now I think It would =
be more readable like this:
>=20
> Moreover, in case LIKE fails for some reason, these tests will
> fail as well. Meanwhile, their purpose to test not LIKE function.
> Glad you fixed that.
Thanks.