Tarantool development patches archive
* [tarantool-patches] [PATCH] sql: LIKE/LENGTH process '\0'
@ 2019-01-29  9:56 Ivan Koptelov
  2019-01-29 16:35 ` [tarantool-patches] " n.pettik
  2019-03-04 15:30 ` Kirill Yukhin
  0 siblings, 2 replies; 22+ messages in thread
From: Ivan Koptelov @ 2019-01-29  9:56 UTC (permalink / raw)
  To: tarantool-patches

[-- Attachment #1: Type: text/html, Size: 8208 bytes --]

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-01-29  9:56 [tarantool-patches] [PATCH] sql: LIKE/LENGTH process '\0' Ivan Koptelov
@ 2019-01-29 16:35 ` n.pettik
  2019-02-04 12:34   ` Ivan Koptelov
  2019-03-04 15:30 ` Kirill Yukhin
  1 sibling, 1 reply; 22+ messages in thread
From: n.pettik @ 2019-01-29 16:35 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Ivan Koptelov

> Fixes LIKE and LENGTH functions. '\0' now treated as
Nit: is treated.
> a usual symbol. Strings with '\0' are now processed
> entirely. Consider examples:
> 
> LENGTH(CHAR(65,00,65)) == 3
> LIKE(CHAR(65,00,65), CHAR(65,00,66)) == False
Also, I see that something is wrong with the text in this mail again.
> 
> Closes #3542
Please check other functions that work with strings.
For instance, I now see that replace ignores \0 as well:

tarantool> box.sql.execute("select replace(CHAR(65,00,66), char(0), char(1))")
---
- - ["A\0B"]
...
> @@ -150,9 +150,13 @@ lengthFunc(sqlite3_context * context, int argc, sqlite3_value ** argv)
>  			const unsigned char *z = sqlite3_value_text(argv[0]);
>  			if (z == 0)
>  				return;
> +
Nit: don’t abuse empty lines.
>  			len = 0;
> -			while (*z) {
> +			size_t byte_len = sqlite3_value_bytes(argv[0]);
> +			const unsigned char *prev_z;
> +			for (size_t cnt = 0; cnt < byte_len; cnt += (z - prev_z)) {
Out of 80.
> diff --git a/test/sql-tap/gh-3542-like-len-null-term.test.lua b/test/sql-tap/gh-3542-like-len-null-term.test.lua
> new file mode 100755
> index 000000000..e9ea9ea30
> --- /dev/null
> +++ b/test/sql-tap/gh-3542-like-len-null-term.test.lua
> @@ -0,0 +1,97 @@
> +#!/usr/bin/env tarantool
> +test = require("sqltester")
> +test:plan(14)
> +
> +-- gh-3542 - LIKE/LENGTH do not scan if '\0' is encountered.
> +-- This test ensures that LIKE and LENGTH functions do NOT stop
> +-- string processing if '\0' is encountered.
I’d rather put these tests into sql-tap/func.test.lua

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-01-29 16:35 ` [tarantool-patches] " n.pettik
@ 2019-02-04 12:34   ` Ivan Koptelov
  2019-02-05 13:50     ` n.pettik
  0 siblings, 1 reply; 22+ messages in thread
From: Ivan Koptelov @ 2019-02-04 12:34 UTC (permalink / raw)
  To: n.pettik, tarantool-patches

[-- Attachment #1: Type: text/html, Size: 16665 bytes --]

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-02-04 12:34   ` Ivan Koptelov
@ 2019-02-05 13:50     ` n.pettik
  2019-02-07 15:14       ` i.koptelov
  0 siblings, 1 reply; 22+ messages in thread
From: n.pettik @ 2019-02-05 13:50 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Ivan Koptelov


> On 29/01/2019 19:35, n.pettik wrote:
>>> Fixes LIKE and LENGTH functions. '\0' now treated as
>> Nit: is treated.
> Fixed.
>>> a usual symbol. Strings with '\0' are now processed
>>> entirely. Consider examples:
>>> 
>>> LENGTH(CHAR(65,00,65)) == 3
>>> LIKE(CHAR(65,00,65), CHAR(65,00,66)) == False
>> Also, I see that smth wrong with text in this mail again
> I hope now the mail text is ok.

Not quite. It is still highlighted in some way; I have no idea why.
>  src/box/sql/func.c         |  88 +++++++++++++-----
>  src/box/sql/vdbeInt.h      |   2 +-
>  test/sql-tap/func.test.lua | 220 ++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 284 insertions(+), 26 deletions(-)
> 
> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> index e46b162d9..2978af983 100644
> --- a/src/box/sql/func.c
> +++ b/src/box/sql/func.c
> @@ -128,6 +128,30 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
>  	sqlite3_result_text(context, z, -1, SQLITE_STATIC);
>  }
>  
> +/**
> + * Return number of chars in the given string.
> + *
> + * Number of chars != byte size of string because some characters
> + * are encoded with more than one byte. Also note that all
> + * characters from 'str' to 'str + byte_len' would be counted,
> + * even if there is a '\0' somewhere between them.
> + * @param str String to be counted.
> + * @param byte_len Byte length of given string.
> + * @return
Return what?
> + */
> +static int
> +count_chars(const unsigned char *str, size_t byte_len)
Quite poor naming. I would call it utf8_str_len or
something with a utf8 prefix. Maybe it is worth putting it into some utils source file.
Also, consider using the native U8_NEXT function from utf8.c
instead of the custom SQLITE_SKIP_UTF8. It may not be as fast,
but it is safer, I suppose. I don't insist though.
> +{
What if str is NULL? Add at least an assertion.
> +	int n_chars = 0;
> +	const unsigned char *prev_z;
> +	for (size_t cnt = 0; cnt < byte_len; cnt += (str - prev_z)) {
> +		n_chars++;
> +		prev_z = str;
> +		SQLITE_SKIP_UTF8(str);
> +	}
> +	return n_chars;
> +}
You can rewrite this function in a simpler way without using SQLITE macros.
Read this topic: https://stackoverflow.com/questions/3911536/utf-8-unicode-whats-with-0xc0-and-0x80/3911566#3911566
It is quite useful. You may borrow the implementation from there.
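For reference, the approach from that topic boils down to counting only bytes that
are not UTF-8 continuation bytes. A minimal sketch, bounded by an explicit byte
length so an embedded '\0' is not treated as a terminator (the function name is
illustrative, not something from the patch). For valid UTF-8 it gives the same
count; how it treats malformed bytes differs from SQLITE_SKIP_UTF8, so the
badutf1 expectations would need a check:

static int
utf8_len_by_lead_bytes(const unsigned char *str, int byte_len)
{
	int count = 0;
	for (int i = 0; i < byte_len; i++) {
		/* Continuation bytes look like 10xxxxxx; count the rest. */
		if ((str[i] & 0xC0) != 0x80)
			count++;
	}
	return count;
}
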
> +
>  /*
>   * Implementation of the length() function
>   */
> @@ -150,11 +174,7 @@ lengthFunc(sqlite3_context * context, int argc, sqlite3_value ** argv)
>  			const unsigned char *z = sqlite3_value_text(argv[0]);
>  			if (z == 0)
>  				return;
> -			len = 0;
> -			while (*z) {
> -				len++;
> -				SQLITE_SKIP_UTF8(z);
> -			}
> +			len = count_chars(z, sqlite3_value_bytes(argv[0]));
>  			sqlite3_result_int(context, len);
>  			break;
>  		}
> @@ -340,11 +360,8 @@ substrFunc(sqlite3_context * context, int argc, sqlite3_value ** argv)
>  		if (z == 0)
>  			return;
>  		len = 0;
> -		if (p1 < 0) {
> -			for (z2 = z; *z2; len++) {
> -				SQLITE_SKIP_UTF8(z2);
> -			}
> -		}
> +		if (p1 < 0)
> +			len = count_chars(z, sqlite3_value_bytes(argv[0]));
>  	}
>  #ifdef SQLITE_SUBSTR_COMPATIBILITY
>  	/* If SUBSTR_COMPATIBILITY is defined then substr(X,0,N) work the same as
> @@ -388,12 +405,21 @@ substrFunc(sqlite3_context * context, int argc, sqlite3_value ** argv)
>  	}
>  	assert(p1 >= 0 && p2 >= 0);
>  	if (p0type != SQLITE_BLOB) {
> -		while (*z && p1) {
> +		/*
> +		 * In the code below 'cnt' and 'n_chars' is
> +		 * used because '\0' is not supposed to be
> +		 * end-of-string symbol.
> +		 */
> +		int n_chars = count_chars(z, sqlite3_value_bytes(argv[0]));
I’d rather call it char_count or symbol_count.
> diff --git a/test/sql-tap/func.test.lua b/test/sql-tap/func.test.lua
> index b7de1d955..8c712bd5e 100755
> --- a/test/sql-tap/func.test.lua
> +++ b/test/sql-tap/func.test.lua
> +-- REPLACE
> +test:do_execsql_test(
> +    "func-62",
> +    "SELECT REPLACE(CHAR(00,65,00,65), CHAR(00), CHAR(65)) LIKE 'AAAA';",
> +    {1})
> +
> +test:do_execsql_test(
> +    "func-63",
> +    "SELECT REPLACE(CHAR(00,65,00,65), CHAR(65), CHAR(00)) \
> +    LIKE CHAR(00,00,00,00);",
> +    {1})
> +
> +-- SUBSTR
> +test:do_execsql_test(
> +    "func-64",
> +    "SELECT SUBSTR(CHAR(65,00,66,67), 3, 2) LIKE CHAR(66, 67);",
> +    {1})
> +
> +test:do_execsql_test(
> +    "func-65",
> +    "SELECT SUBSTR(CHAR(00,00,00,65), 1, 4) LIKE CHAR(00,00,00,65);",
> +    {1})
> +
Just wondering: why do you use the LIKE function in almost all tests?


* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-02-05 13:50     ` n.pettik
@ 2019-02-07 15:14       ` i.koptelov
  2019-02-11 13:15         ` n.pettik
  0 siblings, 1 reply; 22+ messages in thread
From: i.koptelov @ 2019-02-07 15:14 UTC (permalink / raw)
  To: tarantool-patches; +Cc: n.pettik



> On 5 Feb 2019, at 16:50, n.pettik <korablev@tarantool.org> wrote:
> 
> 
>> On 29/01/2019 19:35, n.pettik wrote:
>>>> Fixes LIKE and LENGTH functions. '\0' now treated as
>>>> 
>>> Nit: is treated.
>> Fixed.
>>>> a usual symbol. Strings with '\0' are now processed
>>>> entirely. Consider examples:
>>>> 
>>>> LENGTH(CHAR(65,00,65)) == 3
>>>> LIKE(CHAR(65,00,65), CHAR(65,00,66)) == False
>>>> 
>>> Also, I see that smth wrong with text in this mail again
>> I hope now the mail text is ok.
> 
> Not quite. It is still highlighted in some way. Have no idea.
Oh, I am really sorry. Does it look ok now? (I changed the mailing client) 
>>  src/box/sql/func.c         |  88 +++++++++++++-----
>>  src/box/sql/vdbeInt.h      |   2 +-
>>  test/sql-tap/func.test.lua | 220 ++++++++++++++++++++++++++++++++++++++++++++-
>>  3 files changed, 284 insertions(+), 26 deletions(-)
>> 
>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>> index e46b162d9..2978af983 100644
>> --- a/src/box/sql/func.c
>> +++ b/src/box/sql/func.c
>> @@ -128,6 +128,30 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
>>  	sqlite3_result_text(context, z, -1, SQLITE_STATIC);
>>  }
>>  
>> +/**
>> + * Return number of chars in the given string.
>> + *
>> + * Number of chars != byte size of string because some characters
>> + * are encoded with more than one byte. Also note that all
>> + * characters from 'str' to 'str + byte_len' would be counted,
>> + * even if there is a '\0' somewhere between them.
>> + * @param str String to be counted.
>> + * @param byte_len Byte length of given string.
>> + * @return
>> 
> Return what?
Fixed.
>> + */
>> +static int
>> +count_chars(const unsigned char *str, size_t byte_len)
>> 
> Quite poor naming. I would call it utf8_str_len or
> smth with utf8 prefix. Mb it is worth to put it some utils source file.
> Also, consider using native U8_NEXT function from utf8.c,
> instead of custom SQLITE_SKIP_UTF8. It may be not so fast
> but safer I suppose. I don't insist though.
>> +{
> What if str is NULL? Add at least an assertion.
>> +	int n_chars = 0;
>> +	const unsigned char *prev_z;
>> +	for (size_t cnt = 0; cnt < byte_len; cnt += (str - prev_z)) {
>> +		n_chars++;
>> +		prev_z = str;
>> +		SQLITE_SKIP_UTF8(str);
>> +	}
>> +	return n_chars;
>> +}
>> 
> You can rewrite this function in a simpler way without using SQLITE macroses.
> Read this topic: https://stackoverflow.com/questions/3911536/utf-8-unicode-whats-with-0xc0-and-0x80/3911566#3911566
> It is quite useful. You may borrow implementation from there.
Both the usage of U8_NEXT and the solution from Stack Overflow cause the sql-tap/badutf1 test to fail,
while func.test (and the other tests) are OK. In other words, SQLITE_SKIP_UTF8 and U8_NEXT
process invalid byte sequences differently.
As far as I understand, the purpose of the test is to check that malformed UTF-8 does not cause a crash.
So, I just used U8_NEXT and fixed the test.

(Since the function is used only in func.c, I prefer to keep it there and not move it to a separate utils file.)

diff --git a/src/box/sql/func.c b/src/box/sql/func.c
index 2978af983..8ad75adb8 100644
--- a/src/box/sql/func.c
+++ b/src/box/sql/func.c
@@ -129,27 +129,29 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
 }
 
 /**
- * Return number of chars in the given string.
+ * Return number of symbols in the given string.
  *
- * Number of chars != byte size of string because some characters
+ * Number of symbols != byte size of string because some symbols
  * are encoded with more than one byte. Also note that all
- * characters from 'str' to 'str + byte_len' would be counted,
+ * symbols from 'str' to 'str + byte_len' would be counted,
  * even if there is a '\0' somewhere between them.
  * @param str String to be counted.
  * @param byte_len Byte length of given string.
- * @return
+ * @return number of symbols in the given string.
  */
 static int
-count_chars(const unsigned char *str, size_t byte_len)
+char_count(const unsigned char *str, size_t byte_len)
 {
-	int n_chars = 0;
-	const unsigned char *prev_z;
-	for (size_t cnt = 0; cnt < byte_len; cnt += (str - prev_z)) {
-		n_chars++;
-		prev_z = str;
-		SQLITE_SKIP_UTF8(str);
+	int symbol_len = 0;
+	int offset = 0;
+	UChar32 res;
+	while (offset < (int) byte_len) {
+		U8_NEXT(str, offset, (int) byte_len, res)
+		if (res < 0)
+			break;
+		symbol_len++;
 	}
-	return n_chars;
+	return symbol_len;
 }
 
 /*
@@ -174,7 +176,7 @@ lengthFunc(sqlite3_context * context, int argc, sqlite3_value ** argv)
 			const unsigned char *z = sqlite3_value_text(argv[0]);
 			if (z == 0)
 				return;
-			len = count_chars(z, sqlite3_value_bytes(argv[0]));
+			len = char_count(z, sqlite3_value_bytes(argv[0]));
 			sqlite3_result_int(context, len);
 			break;
 		}
@@ -361,7 +363,7 @@ substrFunc(sqlite3_context * context, int argc, sqlite3_value ** argv)
 			return;
 		len = 0;
 		if (p1 < 0)
-			len = count_chars(z, sqlite3_value_bytes(argv[0]));
+			len = char_count(z, sqlite3_value_bytes(argv[0]));
 	}
 #ifdef SQLITE_SUBSTR_COMPATIBILITY
 	/* If SUBSTR_COMPATIBILITY is defined then substr(X,0,N) work the same as
@@ -410,7 +412,7 @@ substrFunc(sqlite3_context * context, int argc, sqlite3_value ** argv)
 		 * used because '\0' is not supposed to be
 		 * end-of-string symbol.
 		 */
-		int n_chars = count_chars(z, sqlite3_value_bytes(argv[0]));
+		int n_chars = char_count(z, sqlite3_value_bytes(argv[0]));
 		int cnt = 0;
 		while (cnt < n_chars && p1) {
 			SQLITE_SKIP_UTF8(z);
diff --git a/test/sql-tap/badutf1.test.lua b/test/sql-tap/badutf1.test.lua
index 534c762ba..0a90d1b17 100755
--- a/test/sql-tap/badutf1.test.lua
+++ b/test/sql-tap/badutf1.test.lua
@@ -215,7 +215,7 @@ test:do_test(
         return test:execsql2("SELECT length('\x80') AS x")
     end, {
         -- <badutf-3.1>
-        "X", 1
+        "X", 0
         -- </badutf-3.1>
     })
 
@@ -235,7 +235,7 @@ test:do_test(
         return test:execsql2("SELECT length('\x7f\x80\x81') AS x")
     end, {
         -- <badutf-3.3>
-        "X", 3
+        "X", 1
         -- </badutf-3.3>
     })
 
@@ -245,7 +245,7 @@ test:do_test(
         return test:execsql2("SELECT length('\x61\xc0') AS x")
     end, {
         -- <badutf-3.4>
-        "X", 2
+        "X", 1
         -- </badutf-3.4>
     })
 
@@ -255,7 +255,7 @@ test:do_test(
         return test:execsql2("SELECT length('\x61\xc0\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80') AS x")
     end, {
         -- <badutf-3.5>
-        "X", 2
+        "X", 1
         -- </badutf-3.5>
     })
 
@@ -265,7 +265,7 @@ test:do_test(
         return test:execsql2("SELECT length('\xc0\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80') AS x")
     end, {
         -- <badutf-3.6>
-        "X", 1
+        "X", 0
         -- </badutf-3.6>
     })
 
@@ -275,7 +275,7 @@ test:do_test(
         return test:execsql2("SELECT length('\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80') AS x")
     end, {
         -- <badutf-3.7>
-        "X", 10
+        "X", 0
         -- </badutf-3.7>
     })
 
@@ -285,7 +285,7 @@ test:do_test(
         return test:execsql2("SELECT length('\x80\x80\x80\x80\x80\xf0\x80\x80\x80\x80') AS x")
     end, {
         -- <badutf-3.8>
-        "X", 6
+        "X", 0
         -- </badutf-3.8>
     })
 
@@ -295,7 +295,7 @@ test:do_test(
         return test:execsql2("SELECT length('\x80\x80\x80\x80\x80\xf0\x80\x80\x80\xff') AS x")
     end, {
         -- <badutf-3.9>
-        "X", 7
+        "X", 0
         -- </badutf-3.9>
     })

>> +
>>  /*
>>   * Implementation of the length() function
>>   */
>> @@ -150,11 +174,7 @@ lengthFunc(sqlite3_context * context, int argc, sqlite3_value ** argv)
>>  			const unsigned char *z = sqlite3_value_text(argv[0]);
>>  			if (z == 0)
>>  				return;
>> -			len = 0;
>> -			while (*z) {
>> -				len++;
>> -				SQLITE_SKIP_UTF8(z);
>> -			}
>> +			len = count_chars(z, sqlite3_value_bytes(argv[0]));
>>  			sqlite3_result_int(context, len);
>>  			break;
>>  		}
>> @@ -340,11 +360,8 @@ substrFunc(sqlite3_context * context, int argc, sqlite3_value ** argv)
>>  		if (z == 0)
>>  			return;
>>  		len = 0;
>> -		if (p1 < 0) {
>> -			for (z2 = z; *z2; len++) {
>> -				SQLITE_SKIP_UTF8(z2);
>> -			}
>> -		}
>> +		if (p1 < 0)
>> +			len = count_chars(z, sqlite3_value_bytes(argv[0]));
>>  	}
>>  #ifdef SQLITE_SUBSTR_COMPATIBILITY
>>  	/* If SUBSTR_COMPATIBILITY is defined then substr(X,0,N) work the same as
>> @@ -388,12 +405,21 @@ substrFunc(sqlite3_context * context, int argc, sqlite3_value ** argv)
>>  	}
>>  	assert(p1 >= 0 && p2 >= 0);
>>  	if (p0type != SQLITE_BLOB) {
>> -		while (*z && p1) {
>> +		/*
>> +		 * In the code below 'cnt' and 'n_chars' is
>> +		 * used because '\0' is not supposed to be
>> +		 * end-of-string symbol.
>> +		 */
>> +		int n_chars = count_chars(z, sqlite3_value_bytes(argv[0]));
>> 
> I’d better call it char_count or symbol_count or char_count.

>> diff --git a/test/sql-tap/func.test.lua b/test/sql-tap/func.test.lua
>> index b7de1d955..8c712bd5e 100755
>> --- a/test/sql-tap/func.test.lua
>> +++ b/test/sql-tap/func.test.lua
>> +-- REPLACE
>> +test:do_execsql_test(
>> +    "func-62",
>> +    "SELECT REPLACE(CHAR(00,65,00,65), CHAR(00), CHAR(65)) LIKE 'AAAA';",
>> +    {1})
>> +
>> +test:do_execsql_test(
>> +    "func-63",
>> +    "SELECT REPLACE(CHAR(00,65,00,65), CHAR(65), CHAR(00)) \
>> +    LIKE CHAR(00,00,00,00);",
>> +    {1})
>> +
>> +-- SUBSTR
>> +test:do_execsql_test(
>> +    "func-64",
>> +    "SELECT SUBSTR(CHAR(65,00,66,67), 3, 2) LIKE CHAR(66, 67);",
>> +    {1})
>> +
>> +test:do_execsql_test(
>> +    "func-65",
>> +    "SELECT SUBSTR(CHAR(00,00,00,65), 1, 4) LIKE CHAR(00,00,00,65);",
>> +    {1})
>> +
>> 
> Just wondering: why do you use LIKE function almost in all tests?
> 
To compare the actual result with the expected one. But now I think it would be more readable like this:

diff --git a/test/sql-tap/func.test.lua b/test/sql-tap/func.test.lua
index 8c712bd5e..7d1611fc2 100755
--- a/test/sql-tap/func.test.lua
+++ b/test/sql-tap/func.test.lua
@@ -2795,61 +2795,60 @@ test:do_execsql_test(
 -- REPLACE
 test:do_execsql_test(
     "func-62",
-    "SELECT REPLACE(CHAR(00,65,00,65), CHAR(00), CHAR(65)) LIKE 'AAAA';",
-    {1})
+    "SELECT REPLACE(CHAR(00,65,00,65), CHAR(00), CHAR(65));",
+    {'AAAA'})
 
 test:do_execsql_test(
     "func-63",
-    "SELECT REPLACE(CHAR(00,65,00,65), CHAR(65), CHAR(00)) \
-    LIKE CHAR(00,00,00,00);",
-    {1})
+    "SELECT REPLACE(CHAR(00,65,00,65), CHAR(65), CHAR(00));",
+    {string.char(00,00,00,00)})
 
 -- SUBSTR
 test:do_execsql_test(
     "func-64",
-    "SELECT SUBSTR(CHAR(65,00,66,67), 3, 2) LIKE CHAR(66, 67);",
-    {1})
+    "SELECT SUBSTR(CHAR(65,00,66,67), 3, 2);",
+    {string.char(66,67)})
 
 test:do_execsql_test(
     "func-65",
-    "SELECT SUBSTR(CHAR(00,00,00,65), 1, 4) LIKE CHAR(00,00,00,65);",
-    {1})
+    "SELECT SUBSTR(CHAR(00,00,00,65), 1, 4);",
+    {string.char(00,00,00,65)})
 
 -- UPPER
 test:do_execsql_test(
     "func-66",
-    "SELECT UPPER(CHAR(00,97,00,98,00)) LIKE CHAR(00,65,00,66,00);",
-    {1})
+    "SELECT UPPER(CHAR(00,97,00,98,00));",
+    {string.char(00,65,00,66,00)})
 
 -- LOWER
 test:do_execsql_test(
     "func-67",
-    "SELECT LOWER(CHAR(00,65,00,66,00)) LIKE CHAR(00,97,00,98,00);",
-    {1})
+    "SELECT LOWER(CHAR(00,65,00,66,00));",
+    {string.char(00,97,00,98,00)})
 
 -- HEX
 test:do_execsql_test(
     "func-68",
-    "SELECT HEX(CHAR(00,65,00,65,00)) LIKE ('0041004100');",
-    {1})
+    "SELECT HEX(CHAR(00,65,00,65,00));",
+    {'0041004100'})
 
 -- TRIM
 test:do_execsql_test(
     "func-69",
-    "SELECT TRIM(CHAR(32,00,32,00,32)) LIKE (CHAR(00,32,00));",
-    {1})
+    "SELECT TRIM(CHAR(32,00,32,00,32));",
+    {string.char(00,32,00)})
 
 -- LTRIM
 test:do_execsql_test(
     "func-70",
-    "SELECT LTRIM(CHAR(32,00,32,00,32)) LIKE (CHAR(00,32,00,32));",
-    {1})
+    "SELECT LTRIM(CHAR(32,00,32,00,32));",
+    {string.char(00,32,00,32)})
 
 -- RTRIM
 test:do_execsql_test(
     "func-71",
-    "SELECT RTRIM(CHAR(32,00,32,00,32)) LIKE (CHAR(32,00,32,00));",
-    {1})
+    "SELECT RTRIM(CHAR(32,00,32,00,32));",
+    {string.char(32,00,32,00)})
 
 -- GROUP_CONCAT
 test:do_execsql_test(

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-02-07 15:14       ` i.koptelov
@ 2019-02-11 13:15         ` n.pettik
  2019-02-13 15:46           ` i.koptelov
  0 siblings, 1 reply; 22+ messages in thread
From: n.pettik @ 2019-02-11 13:15 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Ivan Koptelov


> On 7 Feb 2019, at 18:14, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>> On 5 Feb 2019, at 16:50, n.pettik <korablev@tarantool.org> wrote:
>>> On 29/01/2019 19:35, n.pettik wrote:
>>>>> Fixes LIKE and LENGTH functions. '\0' now treated as
>>>>> 
>>>> Nit: is treated.
>>> Fixed.
>>>>> a usual symbol. Strings with '\0' are now processed
>>>>> entirely. Consider examples:
>>>>> 
>>>>> LENGTH(CHAR(65,00,65)) == 3
>>>>> LIKE(CHAR(65,00,65), CHAR(65,00,66)) == False
>>>>> 
>>>> Also, I see that smth wrong with text in this mail again
>>> I hope now the mail text is ok.
>> 
>> Not quite. It is still highlighted in some way. Have no idea.
> Oh, I am really sorry. Does it look ok now? (I changed the mailing client) 

Yes, it does.

>>> src/box/sql/func.c         |  88 +++++++++++++-----
>>> src/box/sql/vdbeInt.h      |   2 +-
>>> test/sql-tap/func.test.lua | 220 ++++++++++++++++++++++++++++++++++++++++++++-
>>> 3 files changed, 284 insertions(+), 26 deletions(-)
>>> 
>>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>>> index e46b162d9..2978af983 100644
>>> --- a/src/box/sql/func.c
>>> +++ b/src/box/sql/func.c
>>> @@ -128,6 +128,30 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
>>> 	sqlite3_result_text(context, z, -1, SQLITE_STATIC);
>>> }
>>> 
>>> +/**
>>> + * Return number of chars in the given string.
>>> + *
>>> + * Number of chars != byte size of string because some characters
>>> + * are encoded with more than one byte. Also note that all
>>> + * characters from 'str' to 'str + byte_len' would be counted,
>>> + * even if there is a '\0' somewhere between them.
>>> + * @param str String to be counted.
>>> + * @param byte_len Byte length of given string.
>>> + * @return
>>> 
>> Return what?
> Fixed.
>>> + */
>>> +static int
>>> +count_chars(const unsigned char *str, size_t byte_len)
>>> 
>> Quite poor naming. I would call it utf8_str_len or
>> smth with utf8 prefix. Mb it is worth to put it some utils source file.
>> Also, consider using native U8_NEXT function from utf8.c,
>> instead of custom SQLITE_SKIP_UTF8. It may be not so fast
>> but safer I suppose. I don't insist though.
>>> +{
>> What if str is NULL? Add at least an assertion.
>>> +	int n_chars = 0;
>>> +	const unsigned char *prev_z;
>>> +	for (size_t cnt = 0; cnt < byte_len; cnt += (str - prev_z)) {
>>> +		n_chars++;
>>> +		prev_z = str;
>>> +		SQLITE_SKIP_UTF8(str);
>>> +	}
>>> +	return n_chars;
>>> +}
>>> 
>> You can rewrite this function in a simpler way without using SQLITE macroses.
>> Read this topic: https://stackoverflow.com/questions/3911536/utf-8-unicode-whats-with-0xc0-and-0x80/3911566#3911566
>> It is quite useful. You may borrow implementation from there.
> Both usage of U8_NEXT and the solution from stack overflow causes sql-tap/badutf1 test to fail,
> while func.test (and other tests) are OK. In other words, SQLITE_SKIP_UTF8 and U8_NEXT
> proceeds Invalid byte sequences differently.
> As far as I understand, the purpose of the test is to check that malformed UTF-8 would not cause a crash.
> So, I just used U8_NEXT and fixed the test.
> 
> (Since the function is used only in func.c, I prefer to keep it there and don’t move to separate utils file)
> 
> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> index 2978af983..8ad75adb8 100644
> --- a/src/box/sql/func.c
> +++ b/src/box/sql/func.c
> @@ -129,27 +129,29 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
> }
> 
> /**
> - * Return number of chars in the given string.
> + * Return number of symbols in the given string.
>  *
> - * Number of chars != byte size of string because some characters
> + * Number of symbols != byte size of string because some symbols
>  * are encoded with more than one byte. Also note that all
> - * characters from 'str' to 'str + byte_len' would be counted,
> + * symbols from 'str' to 'str + byte_len' would be counted,
>  * even if there is a '\0' somewhere between them.
>  * @param str String to be counted.
>  * @param byte_len Byte length of given string.
> - * @return
> + * @return number of symbols in the given string.
>  */
> static int
> -count_chars(const unsigned char *str, size_t byte_len)
> +char_count(const unsigned char *str, size_t byte_len)

As I pointed out in the previous review, use the utf8_ prefix:
utf8_strlen, utf8_str_char_count or the like.

Why unsigned char?

> {
> -	int n_chars = 0;
> -	const unsigned char *prev_z;
> -	for (size_t cnt = 0; cnt < byte_len; cnt += (str - prev_z)) {
> -		n_chars++;
> -		prev_z = str;
> -		SQLITE_SKIP_UTF8(str);
> +	int symbol_len = 0;

The name is confusing: it sounds like the length of a single symbol.
symbol_count would be better.

> +	int offset = 0;
> +	UChar32 res;
> +	while (offset < (int) byte_len) {
> +		U8_NEXT(str, offset, (int) byte_len, res)

Instead of two explicit casts to int, let's make byte_len
be of type int.

> +		if (res < 0)
> +			break;

Hm, so if a sequence contains non-UTF-8 symbols,
you simply cut the string. I am not sure it is a solid approach.
Could you please check how other DBs behave (do they throw
an error, or ignore invalid symbols?) in a similar situation and
what the standard says.

> +		symbol_len++;
> 	}
> -	return n_chars;
> +	return symbol_len;
> }
> 
> diff --git a/test/sql-tap/badutf1.test.lua b/test/sql-tap/badutf1.test.lua
> index 534c762ba..0a90d1b17 100755
> --- a/test/sql-tap/badutf1.test.lua
> +++ b/test/sql-tap/badutf1.test.lua
> @@ -215,7 +215,7 @@ test:do_test(
>         return test:execsql2("SELECT length('\x80') AS x")
>     end, {
>         -- <badutf-3.1>
> -        "X", 1
> +        "X", 0

Why have this and the rest of the tests below been changed?
I guess because now they contain invalid utf8 symbols.

>         -- </badutf-3.1>
>     })
> 
> @@ -235,7 +235,7 @@ test:do_test(
>         return test:execsql2("SELECT length('\x7f\x80\x81') AS x")
>     end, {
>         -- <badutf-3.3>
> -        "X", 3
> +        "X", 1

But wait, I execute
tarantool> SELECT length('\x7f\x80\x81') AS x
---
- - [12]
...

Looks extremely strange. What do these tests check at all?
Why can’t we use simple sqlexecute or something? This test suite
is so broken...

>>> @@ -388,12 +405,21 @@ substrFunc(sqlite3_context * context, int argc, sqlite3_value ** argv)
>>> 	}
>>> 	assert(p1 >= 0 && p2 >= 0);
>>> 	if (p0type != SQLITE_BLOB) {
>>> -		while (*z && p1) {
>>> +		/*
>>> +		 * In the code below 'cnt' and 'n_chars' is
>>> +		 * used because '\0' is not supposed to be
>>> +		 * end-of-string symbol.
>>> +		 */
>>> +		int n_chars = count_chars(z, sqlite3_value_bytes(argv[0]));
>>> 
>> I’d better call it char_count or symbol_count or char_count.
> 
>>> diff --git a/test/sql-tap/func.test.lua b/test/sql-tap/func.test.lua
>>> index b7de1d955..8c712bd5e 100755
>>> --- a/test/sql-tap/func.test.lua
>>> +++ b/test/sql-tap/func.test.lua
>>> +-- REPLACE
>>> +test:do_execsql_test(
>>> +    "func-62",
>>> +    "SELECT REPLACE(CHAR(00,65,00,65), CHAR(00), CHAR(65)) LIKE 'AAAA';",
>>> +    {1})
>>> +
>>> +test:do_execsql_test(
>>> +    "func-63",
>>> +    "SELECT REPLACE(CHAR(00,65,00,65), CHAR(65), CHAR(00)) \
>>> +    LIKE CHAR(00,00,00,00);",
>>> +    {1})
>>> +
>>> +-- SUBSTR
>>> +test:do_execsql_test(
>>> +    "func-64",
>>> +    "SELECT SUBSTR(CHAR(65,00,66,67), 3, 2) LIKE CHAR(66, 67);",
>>> +    {1})
>>> +
>>> +test:do_execsql_test(
>>> +    "func-65",
>>> +    "SELECT SUBSTR(CHAR(00,00,00,65), 1, 4) LIKE CHAR(00,00,00,65);",
>>> +    {1})
>>> +
>>> 
>> Just wondering: why do you use LIKE function almost in all tests?
>> 
> To compare actual result with expected one. But now I think It would be more readable like this:

Moreover, in case LIKE fails for some reason, these tests will
fail as well, while their purpose is not to test the LIKE function.
Glad you fixed that.

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-02-11 13:15         ` n.pettik
@ 2019-02-13 15:46           ` i.koptelov
  2019-02-14 12:57             ` n.pettik
  0 siblings, 1 reply; 22+ messages in thread
From: i.koptelov @ 2019-02-13 15:46 UTC (permalink / raw)
  To: tarantool-patches; +Cc: n.pettik



> On 11 Feb 2019, at 16:15, n.pettik <korablev@tarantool.org> wrote:
> 
>> 
>> On 7 Feb 2019, at 18:14, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>>> On 5 Feb 2019, at 16:50, n.pettik <korablev@tarantool.org> wrote:
>>>> On 29/01/2019 19:35, n.pettik wrote:
>>>>>> Fixes LIKE and LENGTH functions. '\0' now treated as
>>>>>> 
>>>>> Nit: is treated.
>>>> Fixed.
>>>>>> a usual symbol. Strings with '\0' are now processed
>>>>>> entirely. Consider examples:
>>>>>> 
>>>>>> LENGTH(CHAR(65,00,65)) == 3
>>>>>> LIKE(CHAR(65,00,65), CHAR(65,00,66)) == False
>>>>>> 
>>>>> Also, I see that smth wrong with text in this mail again
>>>> I hope now the mail text is ok.
>>> 
>>> Not quite. It is still highlighted in some way. Have no idea.
>> Oh, I am really sorry. Does it look ok now? (I changed the mailing client) 
> 
> Yes, it does.
> 
>>>> src/box/sql/func.c         |  88 +++++++++++++-----
>>>> src/box/sql/vdbeInt.h      |   2 +-
>>>> test/sql-tap/func.test.lua | 220 ++++++++++++++++++++++++++++++++++++++++++++-
>>>> 3 files changed, 284 insertions(+), 26 deletions(-)
>>>> 
>>>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>>>> index e46b162d9..2978af983 100644
>>>> --- a/src/box/sql/func.c
>>>> +++ b/src/box/sql/func.c
>>>> @@ -128,6 +128,30 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
>>>> 	sqlite3_result_text(context, z, -1, SQLITE_STATIC);
>>>> }
>>>> 
>>>> +/**
>>>> + * Return number of chars in the given string.
>>>> + *
>>>> + * Number of chars != byte size of string because some characters
>>>> + * are encoded with more than one byte. Also note that all
>>>> + * characters from 'str' to 'str + byte_len' would be counted,
>>>> + * even if there is a '\0' somewhere between them.
>>>> + * @param str String to be counted.
>>>> + * @param byte_len Byte length of given string.
>>>> + * @return
>>>> 
>>> Return what?
>> Fixed.
>>>> + */
>>>> +static int
>>>> +count_chars(const unsigned char *str, size_t byte_len)
>>>> 
>>> Quite poor naming. I would call it utf8_str_len or
>>> smth with utf8 prefix. Mb it is worth to put it some utils source file.
>>> Also, consider using native U8_NEXT function from utf8.c,
>>> instead of custom SQLITE_SKIP_UTF8. It may be not so fast
>>> but safer I suppose. I don't insist though.
>>>> +{
>>> What if str is NULL? Add at least an assertion.
>>>> +	int n_chars = 0;
>>>> +	const unsigned char *prev_z;
>>>> +	for (size_t cnt = 0; cnt < byte_len; cnt += (str - prev_z)) {
>>>> +		n_chars++;
>>>> +		prev_z = str;
>>>> +		SQLITE_SKIP_UTF8(str);
>>>> +	}
>>>> +	return n_chars;
>>>> +}
>>>> 
>>> You can rewrite this function in a simpler way without using SQLITE macroses.
>>> Read this topic: https://stackoverflow.com/questions/3911536/utf-8-unicode-whats-with-0xc0-and-0x80/3911566#3911566
>>> It is quite useful. You may borrow implementation from there.
>> Both usage of U8_NEXT and the solution from stack overflow causes sql-tap/badutf1 test to fail,
>> while func.test (and other tests) are OK. In other words, SQLITE_SKIP_UTF8 and U8_NEXT
>> proceeds Invalid byte sequences differently.
>> As far as I understand, the purpose of the test is to check that malformed UTF-8 would not cause a crash.
>> So, I just used U8_NEXT and fixed the test.
>> 
>> (Since the function is used only in func.c, I prefer to keep it there and don’t move to separate utils file)
>> 
>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>> index 2978af983..8ad75adb8 100644
>> --- a/src/box/sql/func.c
>> +++ b/src/box/sql/func.c
>> @@ -129,27 +129,29 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
>> }
>> 
>> /**
>> - * Return number of chars in the given string.
>> + * Return number of symbols in the given string.
>> *
>> - * Number of chars != byte size of string because some characters
>> + * Number of symbols != byte size of string because some symbols
>> * are encoded with more than one byte. Also note that all
>> - * characters from 'str' to 'str + byte_len' would be counted,
>> + * symbols from 'str' to 'str + byte_len' would be counted,
>> * even if there is a '\0' somewhere between them.
>> * @param str String to be counted.
>> * @param byte_len Byte length of given string.
>> - * @return
>> + * @return number of symbols in the given string.
>> */
>> static int
>> -count_chars(const unsigned char *str, size_t byte_len)
>> +char_count(const unsigned char *str, size_t byte_len)
> 
> As I pointed in previous review, use utf8_ prefix:
> utf8_strlen, utf8_str_char_count or sort of.

Sorry, overlooked this. Fixed now.


diff --git a/src/box/sql/func.c b/src/box/sql/func.c
index 8ad75adb8..bab102988 100644
--- a/src/box/sql/func.c
+++ b/src/box/sql/func.c
@@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
  * @return number of symbols in the given string.
  */
 static int
-char_count(const unsigned char *str, size_t byte_len)
+utf8_char_count(const unsigned char *str, int byte_len)
 {
> 
> Why unsigned char?
I used unsigned char because sqlite3_value_text() returns a result
of this type.
> 
>> {
>> -	int n_chars = 0;
>> -	const unsigned char *prev_z;
>> -	for (size_t cnt = 0; cnt < byte_len; cnt += (str - prev_z)) {
>> -		n_chars++;
>> -		prev_z = str;
>> -		SQLITE_SKIP_UTF8(str);
>> +	int symbol_len = 0;
> 
> Name is confusing: it sounds like a length of a single symbol.
> symbol_count would be better.
Ok, thank you, fixed.

> 
>> +	int offset = 0;
>> +	UChar32 res;
>> +	while (offset < (int) byte_len) {
>> +		U8_NEXT(str, offset, (int) byte_len, res)
> 
> Instead of two explicit casts to int, lets make byte_len
> be of type int.
Ok.


diff --git a/src/box/sql/func.c b/src/box/sql/func.c
index 8ad75adb8..bab102988 100644
--- a/src/box/sql/func.c
+++ b/src/box/sql/func.c
@@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
  * @return number of symbols in the given string.
  */
 static int
-char_count(const unsigned char *str, size_t byte_len)
+utf8_char_count(const unsigned char *str, int byte_len)
 {
-	int symbol_len = 0;
+	int symbol_count = 0;
 	int offset = 0;
 	UChar32 res;
-	while (offset < (int) byte_len) {
-		U8_NEXT(str, offset, (int) byte_len, res)
+	while (offset < byte_len) {
+		U8_NEXT(str, offset, byte_len, res)
 		if (res < 0)
 			break;
-		symbol_len++;
+		symbol_count++;
 	}
-	return symbol_len;
+	return symbol_count;
 }

> 
>> +		if (res < 0)
>> +			break;
> 
> Hm, so if a sequence contains non-utf8 symbols,
> you simply cut the string. Not sure it is solid approach.
> Could you please check how other DBs behave (do they throw
> an error? Or ignore invalid symbols?) in similar situation and
> what standard says.
I’ve tested statements with LENGTH from the badutf1 test in different
DBs. PostgreSQL raised an error: "invalid byte sequence for encoding UTF8".
MySQL, DB2 and MSSQL all behaved the same way: each invalid byte is counted as
a symbol. For example:

0x80, 0x7f, 0x81 are all invalid first bytes from
the UTF-8 point of view, 0xC03F is a bad two-byte sequence where
the first byte is OK and the second is broken, and 0xE0800F is a
bad three-byte sequence where the first two bytes are OK but the
third is broken.

(this is MySQL SQL)

select length(X'80'); 1
select length(concat(X'7f', X'80', X'81')); 3
select length(concat(X'61', X'c0')); 2
select length(concat(X'61', X'c0', X'80', X'80', X'80',
X'80', X'80', X'80', X'80', X'80', X'80', X'80')); 12

select length(X'C03F'); 2
select length(concat(X'C03F', 'a')); 3
select length(X'E0800F'); 3
select length(concat(X'E0800F', 'a')); 4

I have not found anything in the standard about dealing with
invalid UTF8 sequences.

Even before the patch, the test gave results different from those of
all other DBs.

I propose that we behave as all other DBs do, as described above.
Is it ok?

> 
>> +		symbol_len++;
>> 	}
>> -	return n_chars;
>> +	return symbol_len;
>> }
>> 
>> diff --git a/test/sql-tap/badutf1.test.lua b/test/sql-tap/badutf1.test.lua
>> index 534c762ba..0a90d1b17 100755
>> --- a/test/sql-tap/badutf1.test.lua
>> +++ b/test/sql-tap/badutf1.test.lua
>> @@ -215,7 +215,7 @@ test:do_test(
>>        return test:execsql2("SELECT length('\x80') AS x")
>>    end, {
>>        -- <badutf-3.1>
>> -        "X", 1
>> +        "X", 0
> 
> Why this and the rest of tests below has been changed?
> I guess because now they contain invalid utf8 symbols.
They contained invalid utf8 symbols before. The tests are changed because
I changed the way we handle malformed utf8 strings.
> 
>>        -- </badutf-3.1>
>>    })
>> 
>> @@ -235,7 +235,7 @@ test:do_test(
>>        return test:execsql2("SELECT length('\x7f\x80\x81') AS x")
>>    end, {
>>        -- <badutf-3.3>
>> -        "X", 3
>> +        "X", 1
> 
> But wait, I execute
> tarantool> SELECT length('\x7f\x80\x81') AS x
> ---
> - - [12]
> ...
> 
Maybe you got this result because ‘\x7f’ is treated by Tarantool SQL
as a simple 4-character string, so the whole literal counts 12 characters?
> Looks extremely strangle. What do these tests check at all?
> Why we can’t use simple sqlexecute or smth? This test suite
> is so broken...
I think these tests check that the DB does not crash if malformed utf8 is
encountered.
> 
>>>> @@ -388,12 +405,21 @@ substrFunc(sqlite3_context * context, int argc, sqlite3_value ** argv)
>>>> 	}
>>>> 	assert(p1 >= 0 && p2 >= 0);
>>>> 	if (p0type != SQLITE_BLOB) {
>>>> -		while (*z && p1) {
>>>> +		/*
>>>> +		 * In the code below 'cnt' and 'n_chars' is
>>>> +		 * used because '\0' is not supposed to be
>>>> +		 * end-of-string symbol.
>>>> +		 */
>>>> +		int n_chars = count_chars(z, sqlite3_value_bytes(argv[0]));
>>>> 
>>> I’d better call it char_count or symbol_count or char_count.
>> 
>>>> diff --git a/test/sql-tap/func.test.lua b/test/sql-tap/func.test.lua
>>>> index b7de1d955..8c712bd5e 100755
>>>> --- a/test/sql-tap/func.test.lua
>>>> +++ b/test/sql-tap/func.test.lua
>>>> +-- REPLACE
>>>> +test:do_execsql_test(
>>>> +    "func-62",
>>>> +    "SELECT REPLACE(CHAR(00,65,00,65), CHAR(00), CHAR(65)) LIKE 'AAAA';",
>>>> +    {1})
>>>> +
>>>> +test:do_execsql_test(
>>>> +    "func-63",
>>>> +    "SELECT REPLACE(CHAR(00,65,00,65), CHAR(65), CHAR(00)) \
>>>> +    LIKE CHAR(00,00,00,00);",
>>>> +    {1})
>>>> +
>>>> +-- SUBSTR
>>>> +test:do_execsql_test(
>>>> +    "func-64",
>>>> +    "SELECT SUBSTR(CHAR(65,00,66,67), 3, 2) LIKE CHAR(66, 67);",
>>>> +    {1})
>>>> +
>>>> +test:do_execsql_test(
>>>> +    "func-65",
>>>> +    "SELECT SUBSTR(CHAR(00,00,00,65), 1, 4) LIKE CHAR(00,00,00,65);",
>>>> +    {1})
>>>> +
>>>> 
>>> Just wondering: why do you use LIKE function almost in all tests?
>>> 
>> To compare actual result with expected one. But now I think It would be more readable like this:
> 
> Moreover, in case LIKE fails for some reason, these tests will
> fail as well. Meanwhile, their purpose to test not LIKE function.
> Glad you fixed that.
Thanks.

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-02-13 15:46           ` i.koptelov
@ 2019-02-14 12:57             ` n.pettik
  2019-02-20 13:54               ` i.koptelov
  0 siblings, 1 reply; 22+ messages in thread
From: n.pettik @ 2019-02-14 12:57 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Ivan Koptelov

I see some strange file on your branch:

https://github.com/tarantool/tarantool/commit/0a34e978c8023dcb4f99965d93203a99adf21520

I guess it’s an artefact from your investigation.
If you want to share some results/proposals, just
attach them to this letter or write to the dev list.

> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> index 8ad75adb8..bab102988 100644
> --- a/src/box/sql/func.c
> +++ b/src/box/sql/func.c
> @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
>  * @return number of symbols in the given string.
>  */
> static int
> -char_count(const unsigned char *str, size_t byte_len)
> +utf8_char_count(const unsigned char *str, int byte_len)
> {
>> 
>> Why unsigned char?
> I used unsigned char because sqlite3_value_text() returns result
> of this type.

Ok, let it be.

> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> index 8ad75adb8..bab102988 100644
> --- a/src/box/sql/func.c
> +++ b/src/box/sql/func.c
> @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
>  * @return number of symbols in the given string.
>  */
> static int
> -char_count(const unsigned char *str, size_t byte_len)
> +utf8_char_count(const unsigned char *str, int byte_len)
> {
> -	int symbol_len = 0;
> +	int symbol_count = 0;
> 	int offset = 0;
> 	UChar32 res;
> -	while (offset < (int) byte_len) {
> -		U8_NEXT(str, offset, (int) byte_len, res)
> +	while (offset < byte_len) {
> +		U8_NEXT(str, offset, byte_len, res)
> 		if (res < 0)
> 			break;
> -		symbol_len++;
> +		symbol_count++;
> 	}
> -	return symbol_len;
> +	return symbol_count;
> }
> 
>> 
>>> +		if (res < 0)
>>> +			break;
>> 
>> Hm, so if a sequence contains non-utf8 symbols,
>> you simply cut the string. Not sure it is solid approach.
>> Could you please check how other DBs behave (do they throw
>> an error? Or ignore invalid symbols?) in similar situation and
>> what standard says.
> I’ve tested statements with LENGTH from the test badutf1 in different
> DBs. PostgreSQL raised an error "invalid byte sequence for encoding “UTF8”.
> MySQL, DB2 and MSSQL behaved all the same way: count each invalid byte as
> a symbol. For example:
> 
> 0x80, 0x7f, 0x81 are all invalid first bytes from
> UTF8 point of view, 0xC03F is bad two byte seq. where
> first byte is ok and second is broken, 0xE0800F is
> bad three byte seq. where first two bytes are ok, but the
> third is broken.
> 
> (this is MySQL SQL)
> 
> select length(X'80'); 1
> select length(concat(X'7f', X'80', X'81')); 3
> select length(concat(X'61', X'c0')); 2
> select length(concat(X'61', X'c0', X'80', X'80', X'80',
> X'80', X'80', X'80', X'80', X'80', X'80', X'80')); 12
> 
> select length(X'C03F'); 2
> select length(concat(X'C03F', 'a')); 3
> select length(X'E0800F'); 3
> select length(concat(X'E0800F', 'a')); 4
> 
> I have not found anything in standard about dealing with
> invalid UTF8 sequences.
> 
> Even before the patch test gave results different from what
> all others DBs.
> 
> I propose to behave all other DBs do, as I describe above.
> Is it ok?

PostgreSQL makes me feel uncertain about this question.
Anyway, I still believe that it’s better to raise an error.
I propose you ask other members of the core team for their opinion.
Additionally, you could ask the author of the original issue and our expert P. Gulutzan.

>>> +		symbol_len++;
>>> 	}
>>> -	return n_chars;
>>> +	return symbol_len;
>>> }
>>> 
>>> diff --git a/test/sql-tap/badutf1.test.lua b/test/sql-tap/badutf1.test.lua
>>> index 534c762ba..0a90d1b17 100755
>>> --- a/test/sql-tap/badutf1.test.lua
>>> +++ b/test/sql-tap/badutf1.test.lua
>>> @@ -215,7 +215,7 @@ test:do_test(
>>>       return test:execsql2("SELECT length('\x80') AS x")
>>>   end, {
>>>       -- <badutf-3.1>
>>> -        "X", 1
>>> +        "X", 0
>> 
>> Why this and the rest of tests below has been changed?
>> I guess because now they contain invalid utf8 symbols.
> They contained invalid utf8 symbols before. The tests are changed because
> I changed the way we handle malformed utf8 strings.

Ok, but at least underline this fact in the commit message.

>>>       -- </badutf-3.1>
>>>   })
>>> 
>>> @@ -235,7 +235,7 @@ test:do_test(
>>>       return test:execsql2("SELECT length('\x7f\x80\x81') AS x")
>>>   end, {
>>>       -- <badutf-3.3>
>>> -        "X", 3
>>> +        "X", 1
>> 
>> But wait, I execute
>> tarantool> SELECT length('\x7f\x80\x81') AS x
>> ---
>> - - [12]
>> ...
>> 
> Maybe you got this result because ‘\x7f’ is treated by tarantool sql
> as a simple string with 4 characters?

And why is it treated differently in the tests?

>> Looks extremely strangle. What do these tests check at all?
>> Why we can’t use simple sqlexecute or smth? This test suite
>> is so broken...
> I think these tests check that DB does not crash if malformed utf8 is
> encountered.

I mean, why are the results from a console run different from the test
results in our suite? I still can’t understand that. Now I execute this:
tarantool> box.sql.execute('select length("\x7f\x80\x81")')

And got strange error:
---
- error: !!binary bm8gc3VjaCBjb2x1bW46IH+AgQ==
...

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-02-14 12:57             ` n.pettik
@ 2019-02-20 13:54               ` i.koptelov
  2019-02-20 15:47                 ` i.koptelov
  0 siblings, 1 reply; 22+ messages in thread
From: i.koptelov @ 2019-02-20 13:54 UTC (permalink / raw)
  To: tarantool-patches; +Cc: n.pettik



> On 14 Feb 2019, at 15:57, n.pettik <korablev@tarantool.org> wrote:
> 
> I see some strange file on your branch:
> 
> https://github.com/tarantool/tarantool/commit/0a34e978c8023dcb4f99965d93203a99adf21520
> 
> I guess it’s an artefact from your investigation.
> If you want to share some results/proposals, just
> attach them to the this letter or write to dev list.
Sorry, I committed it by mistake. Removed now.
> 
>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>> index 8ad75adb8..bab102988 100644
>> --- a/src/box/sql/func.c
>> +++ b/src/box/sql/func.c
>> @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
>> * @return number of symbols in the given string.
>> */
>> static int
>> -char_count(const unsigned char *str, size_t byte_len)
>> +utf8_char_count(const unsigned char *str, int byte_len)
>> {
>>> 
>>> Why unsigned char?
>> I used unsigned char because sqlite3_value_text() returns result
>> of this type.
> 
> Ok, let it be.
> 
>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>> index 8ad75adb8..bab102988 100644
>> --- a/src/box/sql/func.c
>> +++ b/src/box/sql/func.c
>> @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
>> * @return number of symbols in the given string.
>> */
>> static int
>> -char_count(const unsigned char *str, size_t byte_len)
>> +utf8_char_count(const unsigned char *str, int byte_len)
>> {
>> -	int symbol_len = 0;
>> +	int symbol_count = 0;
>> 	int offset = 0;
>> 	UChar32 res;
>> -	while (offset < (int) byte_len) {
>> -		U8_NEXT(str, offset, (int) byte_len, res)
>> +	while (offset < byte_len) {
>> +		U8_NEXT(str, offset, byte_len, res)
>> 		if (res < 0)
>> 			break;
>> -		symbol_len++;
>> +		symbol_count++;
>> 	}
>> -	return symbol_len;
>> +	return symbol_count;
>> }
>> 
>>> 
>>>> +		if (res < 0)
>>>> +			break;
>>> 
>>> Hm, so if a sequence contains non-utf8 symbols,
>>> you simply cut the string. Not sure it is solid approach.
>>> Could you please check how other DBs behave (do they throw
>>> an error? Or ignore invalid symbols?) in similar situation and
>>> what standard says.
>> I’ve tested statements with LENGTH from the test badutf1 in different
>> DBs. PostgreSQL raised an error "invalid byte sequence for encoding “UTF8”.
>> MySQL, DB2 and MSSQL behaved all the same way: count each invalid byte as
>> a symbol. For example:
>> 
>> 0x80, 0x7f, 0x81 are all invalid first bytes from
>> UTF8 point of view, 0xC03F is bad two byte seq. where
>> first byte is ok and second is broken, 0xE0800F is
>> bad three byte seq. where first two bytes are ok, but the
>> third is broken.
>> 
>> (this is MySQL SQL)
>> 
>> select length(X'80'); 1
>> select length(concat(X'7f', X'80', X'81')); 3
>> select length(concat(X'61', X'c0')); 2
>> select length(concat(X'61', X'c0', X'80', X'80', X'80',
>> X'80', X'80', X'80', X'80', X'80', X'80', X'80')); 12
>> 
>> select length(X'C03F'); 2
>> select length(concat(X'C03F', 'a')); 3
>> select length(X'E0800F'); 3
>> select length(concat(X'E0800F', 'a')); 4
>> 
>> I have not found anything in standard about dealing with
>> invalid UTF8 sequences.
>> 
>> Even before the patch test gave results different from what
>> all others DBs.
>> 
>> I propose to behave all other DBs do, as I describe above.
>> Is it ok?
> 
> PostgreSQL makes me feel uncertain concerning this question.
> Anyway, I still believe that it’s better to raise an error.
> I propose you ask other members of core team for their opinion.
> Additionally, you could ask author of origin issue and our expert P. Gulutzan.

After discussions, we decided to use the following algorithm:
starting from the first byte of the string, we determine
what kind of byte it is: the first byte of a 1-, 2-, 3- or 4-byte sequence.
Then we skip the corresponding number of bytes and increment the symbol count
(e.g. 2 bytes for a 2-byte sequence). If the current byte is not a valid
first byte of any sequence, we skip it and increment the symbol count.

diff --git a/src/box/sql/func.c b/src/box/sql/func.c
index bab102988..8406eb42f 100644
--- a/src/box/sql/func.c
+++ b/src/box/sql/func.c
@@ -135,6 +135,11 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
  * are encoded with more than one byte. Also note that all
  * symbols from 'str' to 'str + byte_len' would be counted,
  * even if there is a '\0' somewhere between them.
+ *
+ * This function is implemented to be fast and indifferent to
+ * correctness of string being processed. If input string has
+ * even one invalid utf-8 sequence, then the resulting length
+ * could be arbitrary in these boundaries (0 < len < byte_len).
  * @param str String to be counted.
  * @param byte_len Byte length of given string.
  * @return number of symbols in the given string.
@@ -143,12 +148,17 @@ static int
 utf8_char_count(const unsigned char *str, int byte_len)
 {
 	int symbol_count = 0;
-	int offset = 0;
-	UChar32 res;
-	while (offset < byte_len) {
-		U8_NEXT(str, offset, byte_len, res)
-		if (res < 0)
-			break;
+	for (int i = 0; i < byte_len;) {
+		if ((str[i] & 0x80) == 0)
+			i += 1;
+		else if ((str[i] & 0xe0) == 0xc0)
+			i += 2;
+		else if ((str[i] & 0xf0) == 0xe0)
+			i += 3;
+		else if ((str[i] & 0xf8) == 0xf0)
+			i += 4;
+		else
+			i += 1;
 		symbol_count++;
 	}
 	return symbol_count;

> 
>>>> +		symbol_len++;
>>>> 	}
>>>> -	return n_chars;
>>>> +	return symbol_len;
>>>> }
>>>> 
>>>> diff --git a/test/sql-tap/badutf1.test.lua b/test/sql-tap/badutf1.test.lua
>>>> index 534c762ba..0a90d1b17 100755
>>>> --- a/test/sql-tap/badutf1.test.lua
>>>> +++ b/test/sql-tap/badutf1.test.lua
>>>> @@ -215,7 +215,7 @@ test:do_test(
>>>>      return test:execsql2("SELECT length('\x80') AS x")
>>>>  end, {
>>>>      -- <badutf-3.1>
>>>> -        "X", 1
>>>> +        "X", 0
>>> 
>>> Why this and the rest of tests below has been changed?
>>> I guess because now they contain invalid utf8 symbols.
>> They contained invalid utf8 symbols before. The tests are changed because
>> I changed the way we handle malformed utf8 strings.
> 
> Ok, but at least underline this fact in commit message.
Ok. 
> 
>>>>      -- </badutf-3.1>
>>>>  })
>>>> 
>>>> @@ -235,7 +235,7 @@ test:do_test(
>>>>      return test:execsql2("SELECT length('\x7f\x80\x81') AS x")
>>>>  end, {
>>>>      -- <badutf-3.3>
>>>> -        "X", 3
>>>> +        "X", 1
>>> 
>>> But wait, I execute
>>> tarantool> SELECT length('\x7f\x80\x81') AS x
>>> ---
>>> - - [12]
>>> ...
>> Maybe you got this result because ‘\x7f’ is treated by tarantool sql
>> as a simple string with 4 characters?
> 
> And why in tests it is treated in other way?
When you use the SQL interface in the console and type SELECT LENGTH('\x7f\x80\x81') AS x;
things like ‘\x7f’ are not treated as hexadecimal literals; they are just
treated as the symbols ‘\’, ‘x’, ’7’, ‘f’.

(And if you want to use hexadecimal literals, you should use:

> \set language sql	
> select length(X'7f' || X'80' || X'81' );
---
- - [3]
)
> 
>>> Looks extremely strangle. What do these tests check at all?
>>> Why we can’t use simple sqlexecute or smth? This test suite
>>> is so broken...
>> I think these tests check that DB does not crash if malformed utf8 is
>> encountered.
> 
> I mean why results from console run are different from tests
> results in our suite? Still can’t understand that. Now I execute this:
> tarantool> box.sql.execute('select length("\x7f\x80\x81")’)
> 
> And got strange error:
> ---
> - error: !!binary bm8gc3VjaCBjb2x1bW46IH+AgQ==
> ...
I would like to investigate it in the scope of a separate task, because
it seems to be quite a time-consuming problem.

In the end, we got these changes in badutf1.test.lua:

diff --git a/test/sql-tap/badutf1.test.lua b/test/sql-tap/badutf1.test.lua
index 534c762ba..2b9f9b209 100755
--- a/test/sql-tap/badutf1.test.lua
+++ b/test/sql-tap/badutf1.test.lua
@@ -255,7 +255,7 @@ test:do_test(
         return test:execsql2("SELECT length('\x61\xc0\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80') AS x")
     end, {
         -- <badutf-3.5>
-        "X", 2
+        "X", 11
         -- </badutf-3.5>
     })
 
@@ -265,7 +265,7 @@ test:do_test(
         return test:execsql2("SELECT length('\xc0\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80') AS x")
     end, {
         -- <badutf-3.6>
-        "X", 1
+        "X", 10
         -- </badutf-3.6>
     })
 
@@ -285,7 +285,7 @@ test:do_test(
         return test:execsql2("SELECT length('\x80\x80\x80\x80\x80\xf0\x80\x80\x80\x80') AS x")
     end, {
         -- <badutf-3.8>
-        "X", 6
+        "X", 7
         -- </badutf-3.8>
     })

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-02-20 13:54               ` i.koptelov
@ 2019-02-20 15:47                 ` i.koptelov
  2019-02-20 16:04                   ` n.pettik
  0 siblings, 1 reply; 22+ messages in thread
From: i.koptelov @ 2019-02-20 15:47 UTC (permalink / raw)
  To: tarantool-patches; +Cc: n.pettik

Thanks to Alexander, I fixed my patch to use a function
from icu to count the length of the string.

Changes:


diff --git a/src/box/sql/func.c b/src/box/sql/func.c
index 233ea2901..8ddb9780f 100644
--- a/src/box/sql/func.c
+++ b/src/box/sql/func.c
@@ -149,16 +149,7 @@ utf8_char_count(const unsigned char *str, int byte_len)
 {
 	int symbol_count = 0;
 	for (int i = 0; i < byte_len;) {
-		if ((str[i] & 0x80) == 0)
-			i += 1;
-		else if ((str[i] & 0xe0) == 0xc0)
-			i += 2;
-		else if ((str[i] & 0xf0) == 0xe0)
-			i += 3;
-		else if ((str[i] & 0xf8) == 0xf0)
-			i += 4;
-		else
-			i += 1;
+		U8_FWD_1_UNSAFE(str, i);
 		symbol_count++;
 	}
 	return symbol_count;
diff --git a/test/sql-tap/badutf1.test.lua b/test/sql-tap/badutf1.test.lua
index 67c1071b3..d104efaa9 100755
--- a/test/sql-tap/badutf1.test.lua
+++ b/test/sql-tap/badutf1.test.lua
@@ -255,7 +255,7 @@ test:do_test(
         return test:execsql2("SELECT length('\x61\xc0\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80') AS x")
     end, {
         -- <badutf-3.5>
-        "X", 11
+        "X", 12
         -- </badutf-3.5>
     })
 
@@ -265,7 +265,7 @@ test:do_test(
         return test:execsql2("SELECT length('\xc0\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80') AS x")
     end, {
         -- <badutf-3.6>
-        "X", 10
+        "X", 11
         -- </badutf-3.6>
     })
 

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-02-20 15:47                 ` i.koptelov
@ 2019-02-20 16:04                   ` n.pettik
  2019-02-20 18:08                     ` Vladislav Shpilevoy
  2019-02-20 19:24                     ` i.koptelov
  0 siblings, 2 replies; 22+ messages in thread
From: n.pettik @ 2019-02-20 16:04 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Ivan Koptelov



> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov@tarantool.org> wrote:
> 
> Thanks to Alexander, I fixed my patch to use a function
> from icu to count the length of the string.
> 
> Changes:
> 

Look, each new implementation changes the results of certain
tests again and again. Let's first define the exact behaviour of
the length() function and then write a function which satisfies these
requirements, not vice versa. Is this the final version?
Moreover, since Konstantin suggests as fast an implementation
as we can get, I propose to consider a sort of asm-written variant:

        .global ap_strlen_utf8_s
ap_strlen_utf8_s:
        push %esi
        cld
        mov 8(%esp), %esi
        xor %ecx, %ecx
loopa:  dec %ecx
loopb:  lodsb
        shl $1, %al
        js loopa
        jc loopb
        jnz loopa
        mov %ecx, %eax
        not %eax
        pop %esi
        ret


It is taken from http://canonical.org/~kragen/strlen-utf8
and the author claims it is quite fast (it seems it doesn’t
handle \0, but we can patch it). I didn’t benchmark it, so I am
not absolutely sure that it is ‘way faster’ than other implementations.
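
Just to make the idea concrete, here is a rough C rendering of the same
trick (count every byte that is not a UTF-8 continuation byte), already
patched to take an explicit byte length so that '\0' inside the string
is counted like any other symbol. This is only my sketch for the
discussion, the function name is made up, and it is not code from the
patch:

/* Count utf-8 symbols by counting non-continuation bytes. */
static int
utf8_char_count_sketch(const unsigned char *str, int byte_len)
{
	int symbol_count = 0;
	for (int i = 0; i < byte_len; i++) {
		/* Continuation bytes look like 10xxxxxx. */
		if ((str[i] & 0xc0) != 0x80)
			symbol_count++;
	}
	return symbol_count;
}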

> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> index 233ea2901..8ddb9780f 100644
> --- a/src/box/sql/func.c
> +++ b/src/box/sql/func.c
> @@ -149,16 +149,7 @@ utf8_char_count(const unsigned char *str, int byte_len)
> {
> 	int symbol_count = 0;
> 	for (int i = 0; i < byte_len;) {
> -		if ((str[i] & 0x80) == 0)
> -			i += 1;
> -		else if ((str[i] & 0xe0) == 0xc0)
> -			i += 2;
> -		else if ((str[i] & 0xf0) == 0xe0)
> -			i += 3;
> -		else if ((str[i] & 0xf8) == 0xf0)
> -			i += 4;
> -		else
> -			i += 1;
> +		U8_FWD_1_UNSAFE(str, i);

This function does not handle the string in the way we’ve discussed.
Furthermore, the description says that it “assumes well-formed UTF-8”,
which in our case is not true. So who knows what may happen if we pass
a malformed byte sequence. That is not even mentioning that the
behaviour of this function on invalid inputs may change later.
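
To make the concern concrete (this is only my illustration, not code
from the patch, and it assumes ICU's unicode/utf8.h is available): the
macro trusts the lead byte and blindly advances the offset, so on a
truncated sequence the offset can run past the end of the buffer and
every later use of that offset has to account for it.

#include <assert.h>
#include <stdint.h>
#include <unicode/utf8.h>

int
main(void)
{
	const uint8_t s[] = { 0xe0 };	/* lone lead byte of a 3-byte sequence */
	int i = 0;
	U8_FWD_1_UNSAFE(s, i);
	assert(i == 3);	/* the offset ran past the one-byte buffer */
	return 0;
}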

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-02-20 16:04                   ` n.pettik
@ 2019-02-20 18:08                     ` Vladislav Shpilevoy
  2019-02-20 19:24                     ` i.koptelov
  1 sibling, 0 replies; 22+ messages in thread
From: Vladislav Shpilevoy @ 2019-02-20 18:08 UTC (permalink / raw)
  To: tarantool-patches, n.pettik; +Cc: Ivan Koptelov



On 20/02/2019 19:04, n.pettik wrote:
> 
> 
>> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>>
>> Thanks to Alexander, I fixed my patch to use a function
>> from icu to count the length of the string.
>>
>> Changes:
>>
> 
> Look, each new implementation changes the results of certain tests
> again and again. Let's first define the exact behaviour of the
> length() function and then write a function that satisfies those
> requirements, not vice versa. Is this the final version?
> Moreover, since Konstantin suggests as fast an implementation
> as possible, I propose to consider a sort of asm-written variant:
> 
>          .global ap_strlen_utf8_s
> ap_strlen_utf8_s:
>          push %esi
>          cld
>          mov 8(%esp), %esi
>          xor %ecx, %ecx
> loopa:  dec %ecx
> loopb:  lodsb
>          shl $1, %al
>          js loopa
>          jc loopb
>          jnz loopa
>          mov %ecx, %eax
>          not %eax
>          pop %esi
>          ret
> 
> 
> It is taken from http://canonical.org/~kragen/strlen-utf8
> and the author claims it is quite fast (it seems it doesn’t
> handle \0, but we can patch it). I didn’t benchmark it, so I am
> not absolutely sure that it is ‘way faster’ than other implementations.

https://github.com/Gerold103/tarantool-memes/blob/master/further%20from%20god.jpg

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-02-20 16:04                   ` n.pettik
  2019-02-20 18:08                     ` Vladislav Shpilevoy
@ 2019-02-20 19:24                     ` i.koptelov
  2019-02-22 12:59                       ` n.pettik
  1 sibling, 1 reply; 22+ messages in thread
From: i.koptelov @ 2019-02-20 19:24 UTC (permalink / raw)
  To: tarantool-patches; +Cc: n.pettik


>> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>> 
>> Thanks to Alexander, I fixed my patch to use a function
>> from icu to count the length of the string.
>> 
>> Changes:
>> 
> 
> Look, each new implementation changes the results of certain tests
> again and again. Let's first define the exact behaviour of the
> length() function and then write a function that satisfies those
> requirements, not vice versa. Is this the final version?
I thought these changes in the ‘badutf’ tests were OK because we came
to an agreement that we don’t care about the results of LENGTH() on
invalid strings.
> Moreover, since Konstantin suggests as fast an implementation
> as possible, I propose to consider a sort of asm-written variant:
> 
>        .global ap_strlen_utf8_s
> ap_strlen_utf8_s:
>        push %esi
>        cld
>        mov 8(%esp), %esi
>        xor %ecx, %ecx
> loopa:  dec %ecx
> loopb:  lodsb
>        shl $1, %al
>        js loopa
>        jc loopb
>        jnz loopa
>        mov %ecx, %eax
>        not %eax
>        pop %esi
>        ret
> 
> 
> It is taken from http://canonical.org/~kragen/strlen-utf8
> and the author claims it is quite fast (it seems it doesn’t
> handle \0, but we can patch it). I didn’t benchmark it, so I am
> not absolutely sure that it is ‘way faster’ than other implementations.
I’ve also come across this solution, but I considered it to be kind of overkill.
> 
>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>> index 233ea2901..8ddb9780f 100644
>> --- a/src/box/sql/func.c
>> +++ b/src/box/sql/func.c
>> @@ -149,16 +149,7 @@ utf8_char_count(const unsigned char *str, int byte_len)
>> {
>> 	int symbol_count = 0;
>> 	for (int i = 0; i < byte_len;) {
>> -		if ((str[i] & 0x80) == 0)
>> -			i += 1;
>> -		else if ((str[i] & 0xe0) == 0xc0)
>> -			i += 2;
>> -		else if ((str[i] & 0xf0) == 0xe0)
>> -			i += 3;
>> -		else if ((str[i] & 0xf8) == 0xf0)
>> -			i += 4;
>> -		else
>> -			i += 1;
>> +		U8_FWD_1_UNSAFE(str, i);
> 
> This function does not handle the string in the way we’ve discussed.
Because it always does three comparisons?

#define U8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \
    (((uint8_t)(leadByte)>=0xc2)+((uint8_t)(leadByte)>=0xe0)+((uint8_t)(leadByte)>=0xf0))

#define U8_FWD_1_UNSAFE(s, i) { \
    (i)+=1+U8_COUNT_TRAIL_BYTES_UNSAFE((s)[i]); \
}


> Furthermore, the description says that it “assumes well-formed UTF-8”,
> which in our case is not true. So who knows what may happen if we pass
> a malformed byte sequence. That is not even mentioning that the
> behaviour of this function on invalid inputs may change later.

In its current implementation U8_FWD_1_UNSAFE satisfies our needs safely. The
returned symbol count never exceeds byte_len: each loop iteration advances the
offset by at least one byte while adding exactly one to the count.

static int
utf8_char_count(const unsigned char *str, int byte_len)
{
	int symbol_count = 0;
	for (int i = 0; i < byte_len;) {
		U8_FWD_1_UNSAFE(str, i);
		symbol_count++;
	}
	return symbol_count;
}

I agree that it is a bad idea to rely on library behaviour which may
change later. So maybe I should just inline these one-line macros?
Or use my own implementation, since it’s more efficient (but less beautiful)?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-02-20 19:24                     ` i.koptelov
@ 2019-02-22 12:59                       ` n.pettik
  2019-02-25 11:09                         ` i.koptelov
  0 siblings, 1 reply; 22+ messages in thread
From: n.pettik @ 2019-02-22 12:59 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Ivan Koptelov

[-- Attachment #1: Type: text/plain, Size: 1660 bytes --]



> On 20 Feb 2019, at 22:24, i.koptelov <ivan.koptelov@tarantool.org> wrote:
> 
>>> 
>>> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov@tarantool.org <mailto:ivan.koptelov@tarantool.org>> wrote:
>>> 
>>> Thanks to Alexander, I fixed my patch to use a function
>>> from icu to count the length of the string.
>>> 
>>> Changes:

Travis has failed. Please make sure it is OK before sending the patch.
It doesn’t fail on my local (Mac) machine, so I guess this failure appears
only on Linux systems.

>> Furthermore, description says that it “assumes well-formed UTF-8”,
>> which in our case is not true. So who knows what may happen if we pass
>> malformed byte sequence. I am not even saying that behaviour of
>> this function on invalid inputs may change later.
> 
> In its current implementation U8_FWD_1_UNSAFE satisfies our needs safely. The
> returned symbol count never exceeds byte_len.
> 
> static int
> utf8_char_count(const unsigned char *str, int byte_len)
> {
> 	int symbol_count = 0;
> 	for (int i = 0; i < byte_len;) {
> 		U8_FWD_1_UNSAFE(str, i);
> 		symbol_count++;
> 	}
> 	return symbol_count;
> }
> 
> I agree that it is a bad idea to rely on library behaviour which may
> change later. So maybe I should just inline these one-line macros?
> Or use my own implementation, since it’s more efficient (but less beautiful)?

Never mind, let's keep it as is.
My only real worry is that in other places SQL_SKIP_UTF8
is used instead. It handles only two-byte utf8 symbols, while
U8_FWD_1_UNSAFE() also accounts for three- and four-byte symbols.
Can we use the same pattern everywhere?

[-- Attachment #2: Type: text/html, Size: 17493 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-02-22 12:59                       ` n.pettik
@ 2019-02-25 11:09                         ` i.koptelov
  2019-02-25 15:10                           ` n.pettik
  0 siblings, 1 reply; 22+ messages in thread
From: i.koptelov @ 2019-02-25 11:09 UTC (permalink / raw)
  To: tarantool-patches; +Cc: n.pettik



> On 22 Feb 2019, at 15:59, n.pettik <korablev@tarantool.org> wrote:
> 
> 
> 
>> On 20 Feb 2019, at 22:24, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>> 
>>>> 
>>>> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>>>> 
>>>> Thanks to Alexander, I fixed my patch to use a function
>>>> from icu to count the length of the string.
>>>> 
>>>> Changes:
> 
> Travis has failed. Please make sure it is OK before sending the patch.
> It doesn’t fail on my local (Mac) machine, so I guess this failure appears
> only on Linux systems.
The problem is with the badutf test (the LENGTH tests).
I’ve tried to reproduce the problem on my machine (using Docker with Ubuntu),
but with no success. It seems that different versions of the icu4c lib
provide different behavior of U8_FWD_1_UNSAFE.
I propose to just inline these two lines (which we need) into
some util function. The logic of these lines is quite simple
and obvious (once you have read about utf-8 on Wikipedia), so I see no
problem.

#define U8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \
   (((uint8_t)(leadByte)>=0xc2)+((uint8_t)(leadByte)>=0xe0)+((uint8_t)(leadByte)>=0xf0))

#define U8_FWD_1_UNSAFE(s, i) { \
   (i)+=1+U8_COUNT_TRAIL_BYTES_UNSAFE((s)[i]); \
}
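
For what it's worth, here is a tiny standalone check of what the
inlined pair gives on the case from the commit message (the scaffolding
and the shortened macro name are mine, not part of the patch):

#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the two icu macros above, under a local name. */
#define COUNT_TRAIL_BYTES(b) \
    (((uint8_t)(b) >= 0xc2) + ((uint8_t)(b) >= 0xe0) + ((uint8_t)(b) >= 0xf0))

int
main(void)
{
	const uint8_t str[] = { 0x41, 0x00, 0x42 };	/* CHAR(65,00,66) */
	int symbol_count = 0;
	for (int i = 0; i < (int)sizeof(str); symbol_count++)
		i += 1 + COUNT_TRAIL_BYTES(str[i]);
	assert(symbol_count == 3);	/* '\0' is counted as an ordinary symbol */
	return 0;
}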


> 
>>> Furthermore, description says that it “assumes well-formed UTF-8”,
>>> which in our case is not true. So who knows what may happen if we pass
>>> malformed byte sequence. I am not even saying that behaviour of
>>> this function on invalid inputs may change later.
>> 
>> In it's current implementation U8_FWD_1_UNSAFE satisfy our needs safely. Returned
>> symbol length would never exceed byte_len.
>> 
>> static int
>> utf8_char_count(const unsigned char *str, int byte_len)
>> {
>> 	int symbol_count = 0;
>> 	for (int i = 0; i < byte_len;) {
>> 		U8_FWD_1_UNSAFE(str, i);
>> 		symbol_count++;
>> 	}
>> 	return symbol_count;
>> }
>> 
>> I agree that it is a bad idea to relay on lib behaviour which may
>> change lately. So maybe I would just inline these one line macros?
>> Or use my own implementation, since it’s more efficient (but less beautiful)
> 
> Nevermind, let's keep it as is.
> I really worry only about the fact that in other places SQL_SKIP_UTF8
> is used instead. It handles only two-bytes utf8 symbols, meanwhile
> U8_FWD_1_UNSAFE() accounts three and four bytes length symbols.
> Can we use everywhere the same pattern?
Yes, I think, we can.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-02-25 11:09                         ` i.koptelov
@ 2019-02-25 15:10                           ` n.pettik
  2019-02-26 13:33                             ` i.koptelov
  0 siblings, 1 reply; 22+ messages in thread
From: n.pettik @ 2019-02-25 15:10 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Ivan Koptelov

[-- Attachment #1: Type: text/plain, Size: 3112 bytes --]



> On 25 Feb 2019, at 14:09, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>> On 22 Feb 2019, at 15:59, n.pettik <korablev@tarantool.org> wrote:
>>> On 20 Feb 2019, at 22:24, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>>>>> 
>>>>> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>>>>> 
>>>>> Thanks to Alexander, I fixed my patch to use a function
>>>>> from icu to count the length of the string.
>>>>> 
>>>>> Changes:
>> 
>> Travis has failed. Please, make sure it is OK before sending the patch.
>> It doesn’t fail on my local (Mac) machine, so I guess this fail appears
>> only on Linux system.
> The problem is with the badutf test (the LENGTH tests).
> I’ve tried to reproduce the problem on my machine (using Docker with Ubuntu),
> but with no success. It seems that different versions of the icu4c lib
> provide different behavior of U8_FWD_1_UNSAFE.
> I propose to just inline these two lines (which we need) into
> some util function. The logic of these lines is quite simple
> and obvious (once you have read about utf-8 on Wikipedia), so I see no
> problem.
> 
> #define U8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \
>   (((uint8_t)(leadByte)>=0xc2)+((uint8_t)(leadByte)>=0xe0)+((uint8_t)(leadByte)>=0xf0))
> 
> #define U8_FWD_1_UNSAFE(s, i) { \
>   (i)+=1+U8_COUNT_TRAIL_BYTES_UNSAFE((s)[i]); \
> }

That’s what I was talking about. But using macros with the same
name as in the utf library doesn’t look like a good pattern. Yep, you
can use define guards like:

#ifdef  U8_COUNT_TRAIL_BYTES_UNSAFE
#undef U8_COUNT_TRAIL_BYTES_UNSAFE
#endif
#define U8_COUNT_TRAIL_BYTES_UNSAFE

But I’d rather just give it another name.
Hence, taking into account the comment below,
we are going to substitute SQL_SKIP_UTF8() with
an implementation borrowed from the icu library.

>>>> Furthermore, description says that it “assumes well-formed UTF-8”,
>>>> which in our case is not true. So who knows what may happen if we pass
>>>> malformed byte sequence. I am not even saying that behaviour of
>>>> this function on invalid inputs may change later.
>>> 
>>> In it's current implementation U8_FWD_1_UNSAFE satisfy our needs safely. Returned
>>> symbol length would never exceed byte_len.
>>> 
>>> static int
>>> utf8_char_count(const unsigned char *str, int byte_len)
>>> {
>>> 	int symbol_count = 0;
>>> 	for (int i = 0; i < byte_len;) {
>>> 		U8_FWD_1_UNSAFE(str, i);
>>> 		symbol_count++;
>>> 	}
>>> 	return symbol_count;
>>> }
>>> 
>>> I agree that it is a bad idea to relay on lib behaviour which may
>>> change lately. So maybe I would just inline these one line macros?
>>> Or use my own implementation, since it’s more efficient (but less beautiful)
>> 
>> Nevermind, let's keep it as is.
>> I really worry only about the fact that in other places SQL_SKIP_UTF8
>> is used instead. It handles only two-bytes utf8 symbols, meanwhile
>> U8_FWD_1_UNSAFE() accounts three and four bytes length symbols.
>> Can we use everywhere the same pattern?
> Yes, I think, we can.

Ok, then will be waiting for updates.


[-- Attachment #2: Type: text/html, Size: 16152 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-02-25 15:10                           ` n.pettik
@ 2019-02-26 13:33                             ` i.koptelov
  2019-02-26 17:50                               ` n.pettik
  0 siblings, 1 reply; 22+ messages in thread
From: i.koptelov @ 2019-02-26 13:33 UTC (permalink / raw)
  To: tarantool-patches; +Cc: n.pettik



> On 25 Feb 2019, at 18:10, n.pettik <korablev@tarantool.org> wrote:
> 
> That’s what I was talking about. But using macros with the same
> name as in the utf library doesn’t look like a good pattern. Yep, you
> can use define guards like:
> 
> #ifdef  U8_COUNT_TRAIL_BYTES_UNSAFE
> #undef U8_COUNT_TRAIL_BYTES_UNSAFE
> #endif
> #define U8_COUNT_TRAIL_BYTES_UNSAFE
> 
> But I’d rather just give it another name.
> Hence, taking into account the comment below,
> we are going to substitute SQL_SKIP_UTF8() with
> an implementation borrowed from the icu library.
I changed the names to SQL_UTF8_FWD_1_UNSAFE and
SQL_UTF8_COUNT_TRAIL_BYTES_UNSAFE.
> 
>>>>> Furthermore, description says that it “assumes well-formed UTF-8”,
>>>>> which in our case is not true. So who knows what may happen if we pass
>>>>> malformed byte sequence. I am not even saying that behaviour of
>>>>> this function on invalid inputs may change later.
>>>> 
>>>> In it's current implementation U8_FWD_1_UNSAFE satisfy our needs safely. Returned
>>>> symbol length would never exceed byte_len.
>>>> 
>>>> static int
>>>> utf8_char_count(const unsigned char *str, int byte_len)
>>>> {
>>>> 	int symbol_count = 0;
>>>> 	for (int i = 0; i < byte_len;) {
>>>> 		U8_FWD_1_UNSAFE(str, i);
>>>> 		symbol_count++;
>>>> 	}
>>>> 	return symbol_count;
>>>> }
>>>> 
>>>> I agree that it is a bad idea to relay on lib behaviour which may
>>>> change lately. So maybe I would just inline these one line macros?
>>>> Or use my own implementation, since it’s more efficient (but less beautiful)
>>> 
>>> Nevermind, let's keep it as is.
>>> I really worry only about the fact that in other places SQL_SKIP_UTF8
>>> is used instead. It handles only two-bytes utf8 symbols, meanwhile
>>> U8_FWD_1_UNSAFE() accounts three and four bytes length symbols.
>>> Can we use everywhere the same pattern?
>> Yes, I think, we can.
> 
> Ok, then will be waiting for updates.
So here are the updates:

sqlUtf8CharLen is replaced with utf8_char_count (which is
just moved to utf.c). sqlUtf8CharLen used SQL_SKIP_UTF8
and was itself used in only one place.

diff --git a/src/box/sql/utf.c b/src/box/sql/utf.c
index 22e0de809..5a1862a52 100644
--- a/src/box/sql/utf.c
+++ b/src/box/sql/utf.c
@@ -157,30 +157,31 @@ sqlUtf8Read(const unsigned char **pz	/* Pointer to string from which to read cha
  */
 /* #define TRANSLATE_TRACE 1 */
 
-/*
- * pZ is a UTF-8 encoded unicode string. If nByte is less than zero,
- * return the number of unicode characters in pZ up to (but not including)
- * the first 0x00 byte. If nByte is not less than zero, return the
- * number of unicode characters in the first nByte of pZ (or up to
- * the first 0x00, whichever comes first).
+/**
+ * Return number of symbols in the given string.
+ *
+ * Number of symbols != byte size of string because some symbols
+ * are encoded with more than one byte. Also note that all
+ * symbols from 'str' to 'str + byte_len' would be counted,
+ * even if there is a '\0' somewhere between them.
+ *
+ * This function is implemented to be fast and indifferent to
+ * correctness of string being processed. If input string has
+ * even one invalid utf-8 sequence, then the resulting length
+ * could be arbitary in these boundaries (0 < len < byte_len).
+ * @param str String to be counted.
+ * @param byte_len Byte length of given string.
+ * @return number of symbols in the given string.
  */
 int
-sqlUtf8CharLen(const char *zIn, int nByte)
+utf8_char_count(const unsigned char *str, int byte_len)
 {
-	int r = 0;
-	const u8 *z = (const u8 *)zIn;
-	const u8 *zTerm;
-	if (nByte >= 0) {
-		zTerm = &z[nByte];
-	} else {
-		zTerm = (const u8 *)(-1);
-	}
-	assert(z <= zTerm);
-	while (*z != 0 && z < zTerm) {
-		SQL_SKIP_UTF8(z);
-		r++;
+	int symbol_count = 0;
+	for (int i = 0; i < byte_len;) {
+		SQL_UTF8_FWD_1_UNSAFE(str, i, byte_len);
+		symbol_count++;
 	}
-	return r;
+	return symbol_count;
 }


I inlined SQL_UTF8_FWD_1_UNSAFE and
SQL_UTF8_COUNT_TRAIL_BYTES_UNSAFE from the icu
library, because different versions of
the library provide slightly different
behavior of U8_FWD_1_UNSAFE. It caused
the sql-tap/badutf test to fail on some platforms,
while on others it was working.

diff --git a/src/box/sql/sqlInt.h b/src/box/sql/sqlInt.h
index 2830ab639..859450e88 100644
--- a/src/box/sql/sqlInt.h
+++ b/src/box/sql/sqlInt.h
@@ -3071,16 +3071,6 @@ struct TreeView {
 };
 #endif				/* SQL_DEBUG */
 
-/*
- * Assuming zIn points to the first byte of a UTF-8 character,
- * advance zIn to point to the first byte of the next UTF-8 character.
- */
-#define SQL_SKIP_UTF8(zIn) {                        \
-  if( (*(zIn++))>=0xc0 ){                              \
-    while( (*zIn & 0xc0)==0x80 ){ zIn++; }             \
-  }                                                    \
-}
-
 /*
  * The sql_*_BKPT macros are substitutes for the error codes with
  * the same name but without the _BKPT suffix.  These macros invoke
@@ -4142,11 +4132,51 @@ void
 sql_drop_foreign_key(struct Parse *parse_context, struct SrcList *table,
 		     struct Token *constraint);
 
+/**
+ * Counts the trail bytes for a UTF-8 lead byte of a valid UTF-8
+ * sequence. Returns 0 for 0..0xc1. Undefined for 0xf5..0xff.
+ * leadByte might be evaluated multiple times.
+ *
+ * @param leadByte The first byte of a UTF-8 sequence.
+ *                 Must be 0..0xff.
+ */
+#define SQL_UTF8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \
+    (((uint8_t)(leadByte) >= 0xc2) + ((uint8_t)(leadByte) >= 0xe0) + \
+    ((uint8_t)(leadByte) >= 0xf0))
+
+/**
+ * Advance the string offset from one code point boundary to the
+ * next. (Post-incrementing iteration.)
+ * "Unsafe" macro, assumes well-formed UTF-8.
+ *
+ * After the whole string is traversed, (str + i) points to the
+ * position right after the last element of the string (*).
+ *
+ * If resulting offset > byte_size then resulting offset is set
+ * to byte_size. This is to provide (*) in cases where it might
+ * be violated.
+ *
+ * SQL_UTF8_FWD_1_UNSAFE is sometimes used to get the size of a utf-8
+ * character subsequence, and we don't want the cumulative size to
+ * exceed the total string size (in bytes). Consider an example:
+ *
+ * '0xE0' is an invalid utf-8 string because it consists only of
+ * the first byte of a 3-byte sequence. After the traversal the
+ * offset == 3, and we set it to 1 to keep (*).
+ *
+ * @param s const uint8_t * string.
+ * @param i string offset.
+ * @param byte_size byte size of the string.
+ */
+#define SQL_UTF8_FWD_1_UNSAFE(str, i, byte_size) \
+    (i) += 1 + SQL_UTF8_COUNT_TRAIL_BYTES_UNSAFE((str)[i]); \
+    (i) = (i) <= (byte_size) ? (i) : (byte_size);
+
 void sqlDetach(Parse *, Expr *);
 int sqlAtoF(const char *z, double *, int);
 int sqlGetInt32(const char *, int *);
 int sqlAtoi(const char *);
-int sqlUtf8CharLen(const char *pData, int nByte);
+int utf8_char_count(const unsigned char *str, int byte_len);
 u32 sqlUtf8Read(const u8 **);
 LogEst sqlLogEst(u64);
 LogEst sqlLogEstAdd(LogEst, LogEst);
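
A quick standalone illustration of the clamping described in the
SQL_UTF8_FWD_1_UNSAFE comment above (my own snippet, not part of the
patch; it assumes the two macros from this hunk plus <assert.h> and
<stdint.h> are in scope):

	const uint8_t s[] = { 0xe0 };	/* truncated 3-byte sequence */
	int i = 0;
	SQL_UTF8_FWD_1_UNSAFE(s, i, (int)sizeof(s));
	/* Without the clamp i would become 3; with it i is capped at
	 * byte_size, so i == 1 and utf8_char_count() reports exactly
	 * one symbol for this string. */
	assert(i == 1);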



Just replaced SQL_SKIP_UTF8 with SQL_UTF8_FWD_1_UNSAFE.

diff --git a/src/box/sql/func.c b/src/box/sql/func.c
index 8ddb9780f..5352ca6b8 100644
--- a/src/box/sql/func.c
+++ b/src/box/sql/func.c
@@ -128,33 +128,6 @@ typeofFunc(sql_context * context, int NotUsed, sql_value ** argv)
 	sql_result_text(context, z, -1, SQL_STATIC);
 }
 
-/**
- * Return number of symbols in the given string.
- *
- * Number of symbols != byte size of string because some symbols
- * are encoded with more than one byte. Also note that all
- * symbols from 'str' to 'str + byte_len' would be counted,
- * even if there is a '\0' somewhere between them.
- *
- * This function is implemented to be fast and indifferent to
- * correctness of string being processed. If input string has
- * even one invalid utf-8 sequence, then the resulting length
- * could be arbitary in these boundaries (0 < len < byte_len).
- * @param str String to be counted.
- * @param byte_len Byte length of given string.
- * @return number of symbols in the given string.
- */
-static int
-utf8_char_count(const unsigned char *str, int byte_len)
-{
-	int symbol_count = 0;
-	for (int i = 0; i < byte_len;) {
-		U8_FWD_1_UNSAFE(str, i);
-		symbol_count++;
-	}
-	return symbol_count;
-}
-
 /*
  * Implementation of the length() function
  */
@@ -413,17 +386,22 @@ substrFunc(sql_context * context, int argc, sql_value ** argv)
 		 * used because '\0' is not supposed to be
 		 * end-of-string symbol.
 		 */
-		int n_chars = utf8_char_count(z, sql_value_bytes(argv[0]));
+		int byte_size = sql_value_bytes(argv[0]);
+		int n_chars = utf8_char_count(z, byte_size);
 		int cnt = 0;
+		int i = 0;
 		while (cnt < n_chars && p1) {
-			SQL_SKIP_UTF8(z);
+			SQL_UTF8_FWD_1_UNSAFE(z, i, byte_size);
 			cnt++;
 			p1--;
 		}
+		z += i;
+		i = 0;
 		for (z2 = z; cnt < n_chars && p2; p2--) {
-			SQL_SKIP_UTF8(z2);
+			SQL_UTF8_FWD_1_UNSAFE(z2, i, byte_size);
 			cnt++;
 		}
+		z2 += i;
 		sql_result_text64(context, (char *)z, z2 - z,
 				      SQL_TRANSIENT);
 	} else {
@@ -890,7 +868,7 @@ likeFunc(sql_context *context, int argc, sql_value **argv)
 			return;
 		const char *const err_msg =
 			"ESCAPE expression must be a single character";
-		if (sqlUtf8CharLen((char *)zEsc, -1) != 1) {
+		if (utf8_char_count(zEsc, sql_value_bytes(argv[2])) != 1) {
 			sql_result_error(context, err_msg, -1);
 			return;
 		}
@@ -1268,8 +1246,6 @@ trimFunc(sql_context * context, int argc, sql_value ** argv)
 	} else {
 		const unsigned char *z = zCharSet;
 		int trim_set_sz = sql_value_bytes(argv[1]);
-		int handled_bytes_cnt = trim_set_sz;
-		nChar = 0;
 		/*
 		* Count the number of UTF-8 characters passing
 		* through the entire char set, but not up
@@ -1277,12 +1253,7 @@ trimFunc(sql_context * context, int argc, sql_value ** argv)
 		* to handle trimming set containing such
 		* characters.
 		*/
-		while(handled_bytes_cnt > 0) {
-		const unsigned char *prev_byte = z;
-			SQL_SKIP_UTF8(z);
-			handled_bytes_cnt -= (z - prev_byte);
-			nChar++;
-		}
+		nChar = utf8_char_count(z, trim_set_sz);
 		if (nChar > 0) {
 			azChar =
 			    contextMalloc(context,
@@ -1292,12 +1263,13 @@ trimFunc(sql_context * context, int argc, sql_value ** argv)
 			}
 			aLen = (unsigned char *)&azChar[nChar];
 			z = zCharSet;
+			i = 0;
 			nChar = 0;
-			handled_bytes_cnt = trim_set_sz;
+			int handled_bytes_cnt = trim_set_sz;
 			while(handled_bytes_cnt > 0) {
-				azChar[nChar] = (unsigned char *)z;
-				SQL_SKIP_UTF8(z);
-				aLen[nChar] = (u8) (z - azChar[nChar]);
+				azChar[nChar] = (unsigned char *)(z + i);
+				SQL_UTF8_FWD_1_UNSAFE(z, i, trim_set_sz);
+				aLen[nChar] = (u8) (z + i - azChar[nChar]);
 				handled_bytes_cnt -= aLen[nChar];
 				nChar++;
 			}
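
As a quick sanity check of the likeFunc() hunk above, this is the
behaviour change I expect for a '\0' ESCAPE character (my own snippet,
not part of the patch; it assumes utf8_char_count() and <assert.h> are
available):

	const unsigned char esc[] = { 0 };	/* ESCAPE CHAR(0) */
	/* New code: one byte, one symbol, so the value is accepted. */
	assert(utf8_char_count(esc, 1) == 1);
	/* Old code: sqlUtf8CharLen((char *)esc, -1) stopped at the NUL,
	 * returned 0 and the ESCAPE expression was rejected. */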

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-02-26 13:33                             ` i.koptelov
@ 2019-02-26 17:50                               ` n.pettik
  2019-02-26 18:44                                 ` i.koptelov
  0 siblings, 1 reply; 22+ messages in thread
From: n.pettik @ 2019-02-26 17:50 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Ivan Koptelov

[-- Attachment #1: Type: text/plain, Size: 1041 bytes --]



> On 26 Feb 2019, at 16:33, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>> On 25 Feb 2019, at 18:10, n.pettik <korablev@tarantool.org> wrote:
>> 
>> That’s I was talking about. But using the macros with the same
>> name as in utf library doesn’t look like a good pattern. Yep, you
>> can use define guards like:
>> 
>> #ifdef  U8_COUNT_TRAIL_BYTES_UNSAFE
>> #undef U8_COUNT_TRAIL_BYTES_UNSAFE
>> #endif
>> #define U8_COUNT_TRAIL_BYTES_UNSAFE
>> 
>> But I’d rather just give it another name.
>> Hence, taking into account comment below,
>> we are going to substitute SQL_SKIP_UTF8() with
>> implementation borrowed from icu library.
> I changed the names to SQL_UTF8_FWD_1_UNSAFE and
> SQL_UTF8_COUNT_TRAIL_BYTES_UNSAFE.

I won’t bother you with a portion of minor comments.
I’ve pushed them, take a look. If they are OK, just
squash them (but don’t squash the last commit) and
the patch will be OK as well.

Also, I’ve fixed the commit message a bit. Please add a
doc request which includes the user-visible changes.

[-- Attachment #2: Type: text/html, Size: 3195 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-02-26 17:50                               ` n.pettik
@ 2019-02-26 18:44                                 ` i.koptelov
  2019-02-26 20:16                                   ` Vladislav Shpilevoy
  0 siblings, 1 reply; 22+ messages in thread
From: i.koptelov @ 2019-02-26 18:44 UTC (permalink / raw)
  To: tarantool-patches; +Cc: n.pettik



> On 26 Feb 2019, at 20:50, n.pettik <korablev@tarantool.org> wrote:
> 
> I won’t bother you with a portion of minor comments.
> I’ve pushed them, take a look. If they are OK, just
> squash them (but don’t squash the last commit) and
> the patch will be OK as well.
> 
> Also, I’ve fixed the commit message a bit. Please add a
> doc request which includes the user-visible changes.
Thank you for the fixes! They are OK for me.
Squashed them and added a doc request to the
commit message.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-02-26 18:44                                 ` i.koptelov
@ 2019-02-26 20:16                                   ` Vladislav Shpilevoy
  2019-03-04 11:59                                     ` i.koptelov
  0 siblings, 1 reply; 22+ messages in thread
From: Vladislav Shpilevoy @ 2019-02-26 20:16 UTC (permalink / raw)
  To: tarantool-patches, i.koptelov; +Cc: n.pettik



On 26/02/2019 21:44, i.koptelov wrote:
> 
> 
>> On 26 Feb 2019, at 20:50, n.pettik <korablev@tarantool.org> wrote:
>>
>> Won’t bother you with portion of minor comments.
>> I’ve pushed them, take a look. If they are OK, just
>> squash (but don’t squash last commit) them and
>> patch will be OK as well.
>>
>> Also, I’ve fixed a bit commit message. Add please
>> doc request which includes user-visible changes.
> Thank you for the fixes! They are OK for me.
> Squashed them and added a doc request to the
> commit message.
> 

If by a doc request you mean this: https://github.com/tarantool/tarantool/commit/4704f531fc4af0e074410c552a4f3a9d9848c949 then it is incorrect. Please look at
tarantool/docbot for the proper syntax, and check your requests in the online
table: https://tarantool-docbot.herokuapp.com

If you do not see your request there, then it is incorrect.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-02-26 20:16                                   ` Vladislav Shpilevoy
@ 2019-03-04 11:59                                     ` i.koptelov
  0 siblings, 0 replies; 22+ messages in thread
From: i.koptelov @ 2019-03-04 11:59 UTC (permalink / raw)
  To: tarantool-patches



> On 26 Feb 2019, at 23:16, Vladislav Shpilevoy <v.shpilevoy@tarantool.org> wrote:
> 
> 
> 
> On 26/02/2019 21:44, i.koptelov wrote:
>>> On 26 Feb 2019, at 20:50, n.pettik <korablev@tarantool.org> wrote:
>>> 
>>> Won’t bother you with portion of minor comments.
>>> I’ve pushed them, take a look. If they are OK, just
>>> squash (but don’t squash last commit) them and
>>> patch will be OK as well.
>>> 
>>> Also, I’ve fixed a bit commit message. Add please
>>> doc request which includes user-visible changes.
>> Thank you for the fixes! They are OK for me.
>> Squashed them. Added doc request into the
>> commit message.
> 
> If by a doc request you mean this: https://github.com/tarantool/tarantool/commit/4704f531fc4af0e074410c552a4f3a9d9848c949 then it is incorrect. Please look at
> tarantool/docbot for the proper syntax, and check your requests in the online
> table: https://tarantool-docbot.herokuapp.com
> 
> If you do not see your request there, then it is incorrect.
> 
Thank you, I’ve changed the commit message. Now I can see my
doc request in the online table.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
  2019-01-29  9:56 [tarantool-patches] [PATCH] sql: LIKE/LENGTH process '\0' Ivan Koptelov
  2019-01-29 16:35 ` [tarantool-patches] " n.pettik
@ 2019-03-04 15:30 ` Kirill Yukhin
  1 sibling, 0 replies; 22+ messages in thread
From: Kirill Yukhin @ 2019-03-04 15:30 UTC (permalink / raw)
  To: tarantool-patches

Hello,

On 29 Jan 12:56, Ivan Koptelov wrote:
> Fixes LIKE and LENGTH functions. '\0' now treated as
> a usual symbol. Strings with '\0' are now processed
> entirely. Consider examples:
> 
> LENGTH(CHAR(65,00,65)) == 3
> LIKE(CHAR(65,00,65), CHAR(65,00,66)) == False
> 
> Closes #3542
> ---
> Branch: [1]https://github.com/tarantool/tarantool/tree/sudobobo/gh-3542-LIKE/LEN-null-term
> Issue: [2]https://github.com/tarantool/tarantool/issues/3542

I've checked in the final patch and the clean-up by Nikita into the 2.1 branch.

--
Regards, Kirill Yukhin

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2019-03-04 15:30 UTC | newest]

Thread overview: 22+ messages
2019-01-29  9:56 [tarantool-patches] [PATCH] sql: LIKE/LENGTH process '\0' Ivan Koptelov
2019-01-29 16:35 ` [tarantool-patches] " n.pettik
2019-02-04 12:34   ` Ivan Koptelov
2019-02-05 13:50     ` n.pettik
2019-02-07 15:14       ` i.koptelov
2019-02-11 13:15         ` n.pettik
2019-02-13 15:46           ` i.koptelov
2019-02-14 12:57             ` n.pettik
2019-02-20 13:54               ` i.koptelov
2019-02-20 15:47                 ` i.koptelov
2019-02-20 16:04                   ` n.pettik
2019-02-20 18:08                     ` Vladislav Shpilevoy
2019-02-20 19:24                     ` i.koptelov
2019-02-22 12:59                       ` n.pettik
2019-02-25 11:09                         ` i.koptelov
2019-02-25 15:10                           ` n.pettik
2019-02-26 13:33                             ` i.koptelov
2019-02-26 17:50                               ` n.pettik
2019-02-26 18:44                                 ` i.koptelov
2019-02-26 20:16                                   ` Vladislav Shpilevoy
2019-03-04 11:59                                     ` i.koptelov
2019-03-04 15:30 ` Kirill Yukhin
