[tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'

Wed Feb 20 16:54:29 MSK 2019

> On 14 Feb 2019, at 15:57, n.pettik <korablev at tarantool.org> wrote:
> 
> I see some strange file on your branch:
> 
> https://github.com/tarantool/tarantool/commit/0a34e978c8023dcb4f99965d93203a99adf21520
> 
> I guess it’s an artefact from your investigation.
> If you want to share some results/proposals, just
> attach them to the this letter or write to dev list.
Sorry, I committed it by mistake. Removed now.
> 
>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>> index 8ad75adb8..bab102988 100644
>> --- a/src/box/sql/func.c
>> +++ b/src/box/sql/func.c
>> @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
>> * @return number of symbols in the given string.
>> */
>> static int
>> -char_count(const unsigned char *str, size_t byte_len)
>> +utf8_char_count(const unsigned char *str, int byte_len)
>> {
>>> 
>>> Why unsigned char?
>> I used unsigned char because sqlite3_value_text() returns result
>> of this type.
> 
> Ok, let it be.
> 
>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>> index 8ad75adb8..bab102988 100644
>> --- a/src/box/sql/func.c
>> +++ b/src/box/sql/func.c
>> @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
>> * @return number of symbols in the given string.
>> */
>> static int
>> -char_count(const unsigned char *str, size_t byte_len)
>> +utf8_char_count(const unsigned char *str, int byte_len)
>> {
>> -	int symbol_len = 0;
>> +	int symbol_count = 0;
>> 	int offset = 0;
>> 	UChar32 res;
>> -	while (offset < (int) byte_len) {
>> -		U8_NEXT(str, offset, (int) byte_len, res)
>> +	while (offset < byte_len) {
>> +		U8_NEXT(str, offset, byte_len, res)
>> 		if (res < 0)
>> 			break;
>> -		symbol_len++;
>> +		symbol_count++;
>> 	}
>> -	return symbol_len;
>> +	return symbol_count;
>> }
>> 
>>> 
>>>> +		if (res < 0)
>>>> +			break;
>>> 
>>> Hm, so if a sequence contains non-utf8 symbols,
>>> you simply cut the string. Not sure it is solid approach.
>>> Could you please check how other DBs behave (do they throw
>>> an error? Or ignore invalid symbols?) in similar situation and
>>> what standard says.
>> I’ve tested statements with LENGTH from the test badutf1 in different
>> DBs. PostgreSQL raised an error "invalid byte sequence for encoding “UTF8”.
>> MySQL, DB2 and MSSQL behaved all the same way: count each invalid byte as
>> a symbol. For example:
>> 
>> 0x80, 0x7f, 0x81 are all invalid first bytes from
>> UTF8 point of view, 0xC03F is bad two byte seq. where
>> first byte is ok and second is broken, 0xE0800F is
>> bad three byte seq. where first two bytes are ok, but the
>> third is broken.
>> 
>> (this is MySQL SQL)
>> 
>> select length(X'80'); 1
>> select length(concat(X'7f', X'80', X'81')); 3
>> select length(concat(X'61', X'c0')); 2
>> select length(concat(X'61', X'c0', X'80', X'80', X'80',
>> X'80', X'80', X'80', X'80', X'80', X'80', X'80')); 12
>> 
>> select length(X'C03F'); 2
>> select length(concat(X'C03F', 'a')); 3
>> select length(X'E0800F'); 3
>> select length(concat(X'E0800F', 'a')); 4
>> 
>> I have not found anything in standard about dealing with
>> invalid UTF8 sequences.
>> 
>> Even before the patch test gave results different from what
>> all others DBs.
>> 
>> I propose to behave all other DBs do, as I describe above.
>> Is it ok?
> 
> PostgreSQL makes me feel uncertain concerning this question.
> Anyway, I still believe that it’s better to raise an error.
> I propose you ask other members of core team for their opinion.
> Additionally, you could ask author of origin issue and our expert P. Gulutzan.

After discussions, we decided to use the following algorithm:
Starting from the first byte in string, we try to determine
what kind of byte is it: first byte of 1,2,3 or 4 byte sequence.
Then we skip corresponding amount of bytes and increment symbol length.
(e.g 2 bytes for 2 byte sequence). If current byte is not a valid
first byte of any sequence, when we skip it and increment symbol length.

diff --git a/src/box/sql/func.c b/src/box/sql/func.c
index bab102988..8406eb42f 100644
--- a/src/box/sql/func.c
+++ b/src/box/sql/func.c
@@ -135,6 +135,11 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
  * are encoded with more than one byte. Also note that all
  * symbols from 'str' to 'str + byte_len' would be counted,
  * even if there is a '\0' somewhere between them.
+ *
+ * This function is implemented to be fast and indifferent to
+ * correctness of string being processed. If input string has
+ * even one invalid utf-8 sequence, then the resulting length
+ * could be arbitary in these boundaries (0 < len < byte_len).
  * @param str String to be counted.
  * @param byte_len Byte length of given string.
  * @return number of symbols in the given string.
@@ -143,12 +148,17 @@ static int
 utf8_char_count(const unsigned char *str, int byte_len)
 {
 	int symbol_count = 0;
-	int offset = 0;
-	UChar32 res;
-	while (offset < byte_len) {
-		U8_NEXT(str, offset, byte_len, res)
-		if (res < 0)
-			break;
+	for (int i = 0; i < byte_len;) {
+		if ((str[i] & 0x80) == 0)
+			i += 1;
+		else if ((str[i] & 0xe0) == 0xc0)
+			i += 2;
+		else if ((str[i] & 0xf0) == 0xe0)
+			i += 3;
+		else if ((str[i] & 0xf8) == 0xf0)
+			i += 4;
+		else
+			i += 1;
 		symbol_count++;
 	}
 	return symbol_count;

> 
>>>> +		symbol_len++;
>>>> 	}
>>>> -	return n_chars;
>>>> +	return symbol_len;
>>>> }
>>>> 
>>>> diff --git a/test/sql-tap/badutf1.test.lua b/test/sql-tap/badutf1.test.lua
>>>> index 534c762ba..0a90d1b17 100755
>>>> --- a/test/sql-tap/badutf1.test.lua
>>>> +++ b/test/sql-tap/badutf1.test.lua
>>>> @@ -215,7 +215,7 @@ test:do_test(
>>>>      return test:execsql2("SELECT length('\x80') AS x")
>>>>  end, {
>>>>      -- <badutf-3.1>
>>>> -        "X", 1
>>>> +        "X", 0
>>> 
>>> Why this and the rest of tests below has been changed?
>>> I guess because now they contain invalid utf8 symbols.
>> They contained invalid utf8 symbols before. The tests are changed because
>> I changed the way we handle malformed utf8 strings.
> 
> Ok, but at least underline this fact in commit message.
Ok. 
> 
>>>>      -- </badutf-3.1>
>>>>  })
>>>> 
>>>> @@ -235,7 +235,7 @@ test:do_test(
>>>>      return test:execsql2("SELECT length('\x7f\x80\x81') AS x")
>>>>  end, {
>>>>      -- <badutf-3.3>
>>>> -        "X", 3
>>>> +        "X", 1
>>> 
>>> But wait, I execute
>>> tarantool> SELECT length('\x7f\x80\x81') AS x
>>> ---
>>> - - [12]
>>> ...
>>> x§
>> Maybe you got this result because ‘\x7f’ is treated by tarantool sql
>> as a simple string with 4 characters?
> 
> And why in tests it is treated in other way?
When you use SQL interface in console and type SELECT LENGTH (‘\x7f\x80\x81’) AS x;
things like ‘\x7f’ are not treated as hexadecimal literals, they are just
treated as symbols ‘\’, ‘x’, ’7’, ‘f’

(And if you want to use hexadecimal literals, you should use:

> \set language sql	
> select length(X'7f' || X'80' || X'81' );
---
- - [3]
)
> 
>>> Looks extremely strangle. What do these tests check at all?
>>> Why we can’t use simple sqlexecute or smth? This test suite
>>> is so broken...
>> I think these tests check that DB does not crash if malformed utf8 is
>> encountered.
> 
> I mean why results from console run are different from tests
> results in our suite? Still can’t understand that. Now I execute this:
> tarantool> box.sql.execute('select length("\x7f\x80\x81")’)
> 
> And got strange error:
> ---
> - error: !!binary bm8gc3VjaCBjb2x1bW46IH+AgQ==
> ...
I would like to investigate it in scope of separate task, because
it seems to be quite time consuming problem.

At the end, we got these changes in badutf1.test.lua

diff --git a/test/sql-tap/badutf1.test.lua b/test/sql-tap/badutf1.test.lua
index 534c762ba..2b9f9b209 100755
--- a/test/sql-tap/badutf1.test.lua
+++ b/test/sql-tap/badutf1.test.lua
@@ -255,7 +255,7 @@ test:do_test(
         return test:execsql2("SELECT length('\x61\xc0\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80') AS x")
     end, {
         -- <badutf-3.5>
-        "X", 2
+        "X", 11
         -- </badutf-3.5>
     })
 
@@ -265,7 +265,7 @@ test:do_test(
         return test:execsql2("SELECT length('\xc0\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80') AS x")
     end, {
         -- <badutf-3.6>
-        "X", 1
+        "X", 10
         -- </badutf-3.6>
     })
 
@@ -285,7 +285,7 @@ test:do_test(
         return test:execsql2("SELECT length('\x80\x80\x80\x80\x80\xf0\x80\x80\x80\x80') AS x")
     end, {
         -- <badutf-3.8>
-        "X", 6
+        "X", 7
         -- </badutf-3.8>
     })