[tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
i.koptelov
ivan.koptelov at tarantool.org
Wed Feb 20 16:54:29 MSK 2019
> On 14 Feb 2019, at 15:57, n.pettik <korablev at tarantool.org> wrote:
>
> I see some strange file on your branch:
>
> https://github.com/tarantool/tarantool/commit/0a34e978c8023dcb4f99965d93203a99adf21520
>
> I guess it’s an artefact from your investigation.
> If you want to share some results/proposals, just
> attach them to the this letter or write to dev list.
Sorry, I committed it by mistake. Removed now.
>
>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>> index 8ad75adb8..bab102988 100644
>> --- a/src/box/sql/func.c
>> +++ b/src/box/sql/func.c
>> @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
>> * @return number of symbols in the given string.
>> */
>> static int
>> -char_count(const unsigned char *str, size_t byte_len)
>> +utf8_char_count(const unsigned char *str, int byte_len)
>> {
>>>
>>> Why unsigned char?
>> I used unsigned char because sqlite3_value_text() returns result
>> of this type.
>
> Ok, let it be.
>
>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>> index 8ad75adb8..bab102988 100644
>> --- a/src/box/sql/func.c
>> +++ b/src/box/sql/func.c
>> @@ -140,18 +140,18 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
>> * @return number of symbols in the given string.
>> */
>> static int
>> -char_count(const unsigned char *str, size_t byte_len)
>> +utf8_char_count(const unsigned char *str, int byte_len)
>> {
>> - int symbol_len = 0;
>> + int symbol_count = 0;
>> int offset = 0;
>> UChar32 res;
>> - while (offset < (int) byte_len) {
>> - U8_NEXT(str, offset, (int) byte_len, res)
>> + while (offset < byte_len) {
>> + U8_NEXT(str, offset, byte_len, res)
>> if (res < 0)
>> break;
>> - symbol_len++;
>> + symbol_count++;
>> }
>> - return symbol_len;
>> + return symbol_count;
>> }
>>
>>>
>>>> + if (res < 0)
>>>> + break;
>>>
>>> Hm, so if a sequence contains non-utf8 symbols,
>>> you simply cut the string. Not sure it is solid approach.
>>> Could you please check how other DBs behave (do they throw
>>> an error? Or ignore invalid symbols?) in similar situation and
>>> what standard says.
>> I’ve tested statements with LENGTH from the test badutf1 in different
>> DBs. PostgreSQL raised an error "invalid byte sequence for encoding “UTF8”.
>> MySQL, DB2 and MSSQL behaved all the same way: count each invalid byte as
>> a symbol. For example:
>>
>> 0x80, 0x7f, 0x81 are all invalid first bytes from
>> UTF8 point of view, 0xC03F is bad two byte seq. where
>> first byte is ok and second is broken, 0xE0800F is
>> bad three byte seq. where first two bytes are ok, but the
>> third is broken.
>>
>> (this is MySQL SQL)
>>
>> select length(X'80'); 1
>> select length(concat(X'7f', X'80', X'81')); 3
>> select length(concat(X'61', X'c0')); 2
>> select length(concat(X'61', X'c0', X'80', X'80', X'80',
>> X'80', X'80', X'80', X'80', X'80', X'80', X'80')); 12
>>
>> select length(X'C03F'); 2
>> select length(concat(X'C03F', 'a')); 3
>> select length(X'E0800F'); 3
>> select length(concat(X'E0800F', 'a')); 4
>>
>> I have not found anything in standard about dealing with
>> invalid UTF8 sequences.
>>
>> Even before the patch test gave results different from what
>> all others DBs.
>>
>> I propose to behave all other DBs do, as I describe above.
>> Is it ok?
>
> PostgreSQL makes me feel uncertain concerning this question.
> Anyway, I still believe that it’s better to raise an error.
> I propose you ask other members of core team for their opinion.
> Additionally, you could ask author of origin issue and our expert P. Gulutzan.
After discussions, we decided to use the following algorithm:
Starting from the first byte in string, we try to determine
what kind of byte is it: first byte of 1,2,3 or 4 byte sequence.
Then we skip corresponding amount of bytes and increment symbol length.
(e.g 2 bytes for 2 byte sequence). If current byte is not a valid
first byte of any sequence, when we skip it and increment symbol length.
diff --git a/src/box/sql/func.c b/src/box/sql/func.c
index bab102988..8406eb42f 100644
--- a/src/box/sql/func.c
+++ b/src/box/sql/func.c
@@ -135,6 +135,11 @@ typeofFunc(sqlite3_context * context, int NotUsed, sqlite3_value ** argv)
* are encoded with more than one byte. Also note that all
* symbols from 'str' to 'str + byte_len' would be counted,
* even if there is a '\0' somewhere between them.
+ *
+ * This function is implemented to be fast and indifferent to
+ * correctness of string being processed. If input string has
+ * even one invalid utf-8 sequence, then the resulting length
+ * could be arbitary in these boundaries (0 < len < byte_len).
* @param str String to be counted.
* @param byte_len Byte length of given string.
* @return number of symbols in the given string.
@@ -143,12 +148,17 @@ static int
utf8_char_count(const unsigned char *str, int byte_len)
{
int symbol_count = 0;
- int offset = 0;
- UChar32 res;
- while (offset < byte_len) {
- U8_NEXT(str, offset, byte_len, res)
- if (res < 0)
- break;
+ for (int i = 0; i < byte_len;) {
+ if ((str[i] & 0x80) == 0)
+ i += 1;
+ else if ((str[i] & 0xe0) == 0xc0)
+ i += 2;
+ else if ((str[i] & 0xf0) == 0xe0)
+ i += 3;
+ else if ((str[i] & 0xf8) == 0xf0)
+ i += 4;
+ else
+ i += 1;
symbol_count++;
}
return symbol_count;
>
>>>> + symbol_len++;
>>>> }
>>>> - return n_chars;
>>>> + return symbol_len;
>>>> }
>>>>
>>>> diff --git a/test/sql-tap/badutf1.test.lua b/test/sql-tap/badutf1.test.lua
>>>> index 534c762ba..0a90d1b17 100755
>>>> --- a/test/sql-tap/badutf1.test.lua
>>>> +++ b/test/sql-tap/badutf1.test.lua
>>>> @@ -215,7 +215,7 @@ test:do_test(
>>>> return test:execsql2("SELECT length('\x80') AS x")
>>>> end, {
>>>> -- <badutf-3.1>
>>>> - "X", 1
>>>> + "X", 0
>>>
>>> Why this and the rest of tests below has been changed?
>>> I guess because now they contain invalid utf8 symbols.
>> They contained invalid utf8 symbols before. The tests are changed because
>> I changed the way we handle malformed utf8 strings.
>
> Ok, but at least underline this fact in commit message.
Ok.
>
>>>> -- </badutf-3.1>
>>>> })
>>>>
>>>> @@ -235,7 +235,7 @@ test:do_test(
>>>> return test:execsql2("SELECT length('\x7f\x80\x81') AS x")
>>>> end, {
>>>> -- <badutf-3.3>
>>>> - "X", 3
>>>> + "X", 1
>>>
>>> But wait, I execute
>>> tarantool> SELECT length('\x7f\x80\x81') AS x
>>> ---
>>> - - [12]
>>> ...
>>> x§
>> Maybe you got this result because ‘\x7f’ is treated by tarantool sql
>> as a simple string with 4 characters?
>
> And why in tests it is treated in other way?
When you use SQL interface in console and type SELECT LENGTH (‘\x7f\x80\x81’) AS x;
things like ‘\x7f’ are not treated as hexadecimal literals, they are just
treated as symbols ‘\’, ‘x’, ’7’, ‘f’
(And if you want to use hexadecimal literals, you should use:
> \set language sql
> select length(X'7f' || X'80' || X'81' );
---
- - [3]
)
>
>>> Looks extremely strangle. What do these tests check at all?
>>> Why we can’t use simple sqlexecute or smth? This test suite
>>> is so broken...
>> I think these tests check that DB does not crash if malformed utf8 is
>> encountered.
>
> I mean why results from console run are different from tests
> results in our suite? Still can’t understand that. Now I execute this:
> tarantool> box.sql.execute('select length("\x7f\x80\x81")’)
>
> And got strange error:
> ---
> - error: !!binary bm8gc3VjaCBjb2x1bW46IH+AgQ==
> ...
I would like to investigate it in scope of separate task, because
it seems to be quite time consuming problem.
At the end, we got these changes in badutf1.test.lua
diff --git a/test/sql-tap/badutf1.test.lua b/test/sql-tap/badutf1.test.lua
index 534c762ba..2b9f9b209 100755
--- a/test/sql-tap/badutf1.test.lua
+++ b/test/sql-tap/badutf1.test.lua
@@ -255,7 +255,7 @@ test:do_test(
return test:execsql2("SELECT length('\x61\xc0\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80') AS x")
end, {
-- <badutf-3.5>
- "X", 2
+ "X", 11
-- </badutf-3.5>
})
@@ -265,7 +265,7 @@ test:do_test(
return test:execsql2("SELECT length('\xc0\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80') AS x")
end, {
-- <badutf-3.6>
- "X", 1
+ "X", 10
-- </badutf-3.6>
})
@@ -285,7 +285,7 @@ test:do_test(
return test:execsql2("SELECT length('\x80\x80\x80\x80\x80\xf0\x80\x80\x80\x80') AS x")
end, {
-- <badutf-3.8>
- "X", 6
+ "X", 7
-- </badutf-3.8>
})
More information about the Tarantool-patches
mailing list