From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tarantool-patches-bounce@freelists.org>
Received: from localhost (localhost [127.0.0.1])
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id D100820CBD
	for <tarantool-patches@freelists.org>; Thu, 14 Feb 2019 07:57:51 -0500 (EST)
Received: from turing.freelists.org ([127.0.0.1])
	by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id K1_ommcoAuja for <tarantool-patches@freelists.org>;
	Thu, 14 Feb 2019 07:57:51 -0500 (EST)
Received: from smtpng3.m.smailru.net (smtpng3.m.smailru.net [94.100.177.149])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 194FA1FB33
	for <tarantool-patches@freelists.org>; Thu, 14 Feb 2019 07:57:49 -0500 (EST)
Content-Type: text/plain;
	charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 12.2 \(3445.102.3\))
Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
From: "n.pettik" <korablev@tarantool.org>
In-Reply-To: <7E6CE8AA-512D-4472-9DBD-8159073386C5@tarantool.org>
Date: Thu, 14 Feb 2019 15:57:45 +0300
Content-Transfer-Encoding: quoted-printable
Message-Id: <FC2CA838-42C3-4494-BC6C-37C864D4BF79@tarantool.org>
References: <15e143f4-3ea7-c7d6-d8ac-8a0e20b76449@tarantool.org>
 <1560FF96-FECD-4368-8AF8-F8F2AE7696E3@tarantool.org>
 <fd70561d-6a0e-70dc-4c20-bdcac764040a@tarantool.org>
 <07DBA796-6DD4-41DD-8438-104FE3AE05BB@tarantool.org>
 <4F4E0A7E-199C-4647-A49C-DD0E8A216527@tarantool.org>
 <8EF5CE57-C6B5-493C-94CC-AA3C88639485@tarantool.org>
 <7E6CE8AA-512D-4472-9DBD-8159073386C5@tarantool.org>
Sender: tarantool-patches-bounce@freelists.org
Errors-to: tarantool-patches-bounce@freelists.org
Reply-To: tarantool-patches@freelists.org
List-help: <mailto:ecartis@freelists.org?Subject=help>
List-unsubscribe: <tarantool-patches-request@freelists.org?Subject=unsubscribe>
List-software: Ecartis version 1.0.0
List-Id: tarantool-patches <tarantool-patches.freelists.org>
List-subscribe: <tarantool-patches-request@freelists.org?Subject=subscribe>
List-owner: <mailto:>
List-post: <mailto:tarantool-patches@freelists.org>
List-archive: <http://www.freelists.org/archives/tarantool-patches>
To: tarantool-patches@freelists.org
Cc: Ivan Koptelov <ivan.koptelov@tarantool.org>

I see some strange file on your branch:

=
https://github.com/tarantool/tarantool/commit/0a34e978c8023dcb4f99965d9320=
3a99adf21520

I guess it=E2=80=99s an artefact from your investigation.
If you want to share some results/proposals, just
attach them to the this letter or write to dev list.

> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> index 8ad75adb8..bab102988 100644
> --- a/src/box/sql/func.c
> +++ b/src/box/sql/func.c
> @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int =
NotUsed, sqlite3_value ** argv)
>  * @return number of symbols in the given string.
>  */
> static int
> -char_count(const unsigned char *str, size_t byte_len)
> +utf8_char_count(const unsigned char *str, int byte_len)
> {
>>=20
>> Why unsigned char?
> I used unsigned char because sqlite3_value_text() returns result
> of this type.

Ok, let it be.

> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> index 8ad75adb8..bab102988 100644
> --- a/src/box/sql/func.c
> +++ b/src/box/sql/func.c
> @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int =
NotUsed, sqlite3_value ** argv)
>  * @return number of symbols in the given string.
>  */
> static int
> -char_count(const unsigned char *str, size_t byte_len)
> +utf8_char_count(const unsigned char *str, int byte_len)
> {
> -	int symbol_len =3D 0;
> +	int symbol_count =3D 0;
> 	int offset =3D 0;
> 	UChar32 res;
> -	while (offset < (int) byte_len) {
> -		U8_NEXT(str, offset, (int) byte_len, res)
> +	while (offset < byte_len) {
> +		U8_NEXT(str, offset, byte_len, res)
> 		if (res < 0)
> 			break;
> -		symbol_len++;
> +		symbol_count++;
> 	}
> -	return symbol_len;
> +	return symbol_count;
> }
>=20
>>=20
>>> +		if (res < 0)
>>> +			break;
>>=20
>> Hm, so if a sequence contains non-utf8 symbols,
>> you simply cut the string. Not sure it is solid approach.
>> Could you please check how other DBs behave (do they throw
>> an error? Or ignore invalid symbols?) in similar situation and
>> what standard says.
> I=E2=80=99ve tested statements with LENGTH from the test badutf1 in =
different
> DBs. PostgreSQL raised an error "invalid byte sequence for encoding =
=E2=80=9CUTF8=E2=80=9D.
> MySQL, DB2 and MSSQL behaved all the same way: count each invalid byte =
as
> a symbol. For example:
>=20
> 0x80, 0x7f, 0x81 are all invalid first bytes from
> UTF8 point of view, 0xC03F is bad two byte seq. where
> first byte is ok and second is broken, 0xE0800F is
> bad three byte seq. where first two bytes are ok, but the
> third is broken.
>=20
> (this is MySQL SQL)
>=20
> select length(X'80'); 1
> select length(concat(X'7f', X'80', X'81')); 3
> select length(concat(X'61', X'c0')); 2
> select length(concat(X'61', X'c0', X'80', X'80', X'80',
> X'80', X'80', X'80', X'80', X'80', X'80', X'80')); 12
>=20
> select length(X'C03F'); 2
> select length(concat(X'C03F', 'a')); 3
> select length(X'E0800F'); 3
> select length(concat(X'E0800F', 'a')); 4
>=20
> I have not found anything in standard about dealing with
> invalid UTF8 sequences.
>=20
> Even before the patch test gave results different from what
> all others DBs.
>=20
> I propose to behave all other DBs do, as I describe above.
> Is it ok?

PostgreSQL makes me feel uncertain concerning this question.
Anyway, I still believe that it=E2=80=99s better to raise an error.
I propose you ask other members of core team for their opinion.
Additionally, you could ask author of origin issue and our expert P. =
Gulutzan.

>>> +		symbol_len++;
>>> 	}
>>> -	return n_chars;
>>> +	return symbol_len;
>>> }
>>>=20
>>> diff --git a/test/sql-tap/badutf1.test.lua =
b/test/sql-tap/badutf1.test.lua
>>> index 534c762ba..0a90d1b17 100755
>>> --- a/test/sql-tap/badutf1.test.lua
>>> +++ b/test/sql-tap/badutf1.test.lua
>>> @@ -215,7 +215,7 @@ test:do_test(
>>>       return test:execsql2("SELECT length('\x80') AS x")
>>>   end, {
>>>       -- <badutf-3.1>
>>> -        "X", 1
>>> +        "X", 0
>>=20
>> Why this and the rest of tests below has been changed?
>> I guess because now they contain invalid utf8 symbols.
> They contained invalid utf8 symbols before. The tests are changed =
because
> I changed the way we handle malformed utf8 strings.

Ok, but at least underline this fact in commit message.

>>>       -- </badutf-3.1>
>>>   })
>>>=20
>>> @@ -235,7 +235,7 @@ test:do_test(
>>>       return test:execsql2("SELECT length('\x7f\x80\x81') AS x")
>>>   end, {
>>>       -- <badutf-3.3>
>>> -        "X", 3
>>> +        "X", 1
>>=20
>> But wait, I execute
>> tarantool> SELECT length('\x7f\x80\x81') AS x
>> ---
>> - - [12]
>> ...
>>=20
> Maybe you got this result because =E2=80=98\x7f=E2=80=99 is treated by =
tarantool sql
> as a simple string with 4 characters?

And why in tests it is treated in other way?

>> Looks extremely strangle. What do these tests check at all?
>> Why we can=E2=80=99t use simple sqlexecute or smth? This test suite
>> is so broken...
> I think these tests check that DB does not crash if malformed utf8 is
> encountered.

I mean why results from console run are different from tests
results in our suite? Still can=E2=80=99t understand that. Now I execute =
this:
tarantool> box.sql.execute('select length("\x7f\x80\x81")=E2=80=99)

And got strange error:
---
- error: !!binary bm8gc3VjaCBjb2x1bW46IH+AgQ=3D=3D
...