[tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue
Nikita Tatunov
hollow653 at gmail.com
Fri Jul 27 22:11:35 MSK 2018
Hello!
пт, 27 июл. 2018 г. в 16:06, Alexander Turenko <
alexander.turenko at tarantool.org>:
> Hi,
>
> See comments below.
>
> Don't sure we should return 'no match' situation when a left hand
> expression in not correct unicode string. How other DBs handle that?
>
>
The main point is that since pattern is always valid (otherwise we return
an error)
it cannot be matched with a string that contains an invalid utf-8 character.
- PostgreSQL doesn't allow storing invalid characters in text types and
even explicit E'\xD1' anywhere results an error.
- MySQL Doesn't allow using invalid characters in text types.
May be it's reasons for thinking about prohibiting invalid characters in
text?
WBR, Alexander Turenko.
>
> > sql: LIKE & GLOB pattern comparison issue
> >
> > Currently function that compares pattern and string for GLOB & LIKE
> > operators doesn't work properly. It uses ICU reading function which
> > perhaps was working differently before and the implementation for the
> > comparison ending isn't paying attention to some special cases, hence
> > in those cases it works improperly.
> >
>
> It used non-ICU function before, consider commit 198d59ce.
>
>
Thank you, I've got it now.
> > With the patch applied an error will be returned in case there's
> > invalid UTF-8 symbol in pattern & pattern will not be matched
> > with the string that contains it. Some minor corrections to function
> > were made as well.
> >
>
> Nitpicking: 'error will be returned' is third situatuon, other then
> 'matched' and 'not matched'.
>
>
I guess commit message is about changes. Do I really need to
mention valid cases that didn't change?
> > Сloses #3251
> > Сloses #3334
> > Part of #3572
>
> > > 2. The thing you test is not related to a table and other columns.
> > > Please, convert the tests to the next format: {[[select '' like
> > > '_B';]], {1}]]}.
> > > To make it more readable, you can do it like `like_testcases` in
> > > `sql-tap/collation.test.lua`.
> > >
> >
> > Done.
>
> You are still perform comparisons with table columns.
>
Thank you, fixed it. More tests now tho.
>
> > -/*
> > - * For LIKE and GLOB matching on EBCDIC machines, assume that every
> > - * character is exactly one byte in size. Also, provde the Utf8Read()
> > - * macro for fast reading of the next character in the common case where
> > - * the next character is ASCII.
> > +/**
> > + * Providing there are symbols in string s this
> > + * macro returns UTF-8 code of character and
> > + * promotes pointer to the next symbol in the string.
> > + * Otherwise return code is SQL_END_OF_STRING.
> > */
> > -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status)
> > +#define Utf8Read(s, e) (((s) < (e)) ?\
> > + ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status)) : 0)
>
> I suggest to add a whitespace before backslash to make it looks better.
>
>
Guess so.
> > @@ -643,51 +647,61 @@ static const struct compareInfo likeInfoAlt = {
> '%',
> > '_', 0, 0 };
> > #define SQLITE_MATCH 0
> > #define SQLITE_NOMATCH 1
> > #define SQLITE_NOWILDCARDMATCH 2
> > +#define SQL_PROHIBITED_PATTERN 3
>
> Update comment before that has a reference to 'patternMatch' function.
>
>
Didn't notice it, thnx.
> > *
> > * Globbing rules:
> > *
> > - * '*' Matches any sequence of zero or more characters.
> > + * '*' Matches any sequence of zero or more characters.
>
> Why?
>
> > *
> > - * '?' Matches exactly one character.
> > + * '?' Matches exactly one character.
> > *
> > - * [...] Matches one character from the enclosed list of
> > - * characters.
> > + * [...] Matches one character from the enclosed list of
> > + * characters.
> > *
> > - * [^...] Matches one character not in the enclosed list.
> > + * [^...] Matches one character not in the enclosed list.
> > *
>
> The same question.
>
>
It was looking like the indentation for similar table below was different.
I was wrong.
> > + * @retval SQLITE_MATCH: Match.
> > + * SQLITE_NOMATCH: No match.
> > + * SQLITE_NOWILDCARDMATCH: No match in spite of having *
> > + * or % wildcards.
> > + * SQL_PROHIBITED_PATTERN Pattern contains invalid
> > + * symbol.
>
> Nitpicking: no colon after SQL_PROHIBITED_PATTERN.
>
Fixed.
>
> > */
> > static int
> > -patternCompare(const char * pattern, /* The glob pattern */
> > - const char * string, /* The string to compare against the glob
> */
> > - const struct compareInfo *pInfo, /* Information about how to do
> > the compare */
> > - UChar32 matchOther /* The escape char (LIKE) or '[' (GLOB) */
> > - )
> > +sql_utf8_pattern_compare(const char * pattern,
> > + const char * string,
> > + const struct compareInfo *pInfo,
> > + UChar32 matchOther)
>
> Don't get why indent is tabulation + 7 spaces.
>
>
Me neither :thinking:. Thank you!
> The function itself looks good except lines longer then 80 columns.
>
As i said before I don't think breaking these lines will increase
readability.
> > @@ -853,8 +903,7 @@ patternCompare(const char * pattern, /* The glob
> > pattern */
> > int
> > sqlite3_strglob(const char *zGlobPattern, const char *zString)
> > {
> > - return patternCompare(zGlobPattern, zString, &globInfo,
> > - '[');
> > + return sql_utf8_pattern_compare(zGlobPattern, zString, &globInfo, '[');
> > }
> >
> > /*
> > @@ -864,7 +913,7 @@ sqlite3_strglob(const char *zGlobPattern, const char
> > *zString)
> > int
> > sqlite3_strlike(const char *zPattern, const char *zStr, unsigned int
> esc)
> > {
> > - return patternCompare(zPattern, zStr, &likeInfoNorm, esc);
> > + return sql_utf8_pattern_compare(zPattern, zStr, &likeInfoNorm, esc);
> > }
> >
>
> Should we add sqlite3_result_error is case of SQL_PROHIBITED_PATTERN for
> these two functions?
>
>
It doesn't matter for "sqlite3_strglob()", since it's only used internally
with valid explicitly given pattern in "analysis_loader()".
Same for "sqlite3_strlike()".
> > - {"LIKE", "like"},
> > - {"GLOB", "glob"},
> > +-- NOTE: This test needs refactoring after deletion of GLOB &
> > +-- type restrictions for LIKE.
> > +-- {"LIKE", "like"},
> > +-- {"GLOB", "glob"},
>
> I would add related issue number.
>
>
True. Done.
> > {"AND", "and"},
> > {"OR", "or"},
> > {"MATCH", "match"},
> > @@ -96,7 +98,7 @@ operations = {
> > {"+", "-"},
> > {"<<", ">>", "&", "|"},
> > {"<", "<=", ">", ">="},
> > - {"=", "==", "!=", "<>", "LIKE", "GLOB"}, --"MATCH", "REGEXP"},
> > + {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, "MATCH", "REGEXP"},
>
> Leave comment before MATCH and REGEXP to avoid surprises after removing
> your comment.
>
>
Done.
> > --- /dev/null
> > +++ b/test/sql-tap/gh-3251-string-pattern-comparison.lua
> > @@ -0,0 +1,238 @@
> > +#!/usr/bin/env tarantool
> > +test = require("sqltester")
> > +test:plan(106)
> > +
> > +local prefix = "like-test-"
> > +
> > +-- Unciode byte sequences.
> > +
>
> 1. Typo: unicode.
> 2. Extra empty line.
>
>
Fixed.
> > +-- Invalid testcases.
> > +
> > +for i, tested_string in ipairs(invalid_testcases) do
> > + local test_name = prefix .. "2." .. tostring(i)
> > + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';"
> > +
> > +-- We should raise an error if pattern contains invalid characters.
> > +
>
> 1. Trailing whitespaces.
> 2. Why comments are not indented as the other code?
> 3. test_name and test_itself before the section comment looks strange
> for me here.
>
diff --git a/src/box/sql/func.c b/src/box/sql/func.c
index 5b53076..2f989d4 100644
--- a/src/box/sql/func.c
+++ b/src/box/sql/func.c
@@ -623,7 +623,7 @@ struct compareInfo {
* promotes pointer to the next symbol in the string.
* Otherwise return code is SQL_END_OF_STRING.
*/
-#define Utf8Read(s, e) (((s) < (e)) ?\
+#define Utf8Read(s, e) (((s) < (e)) ? \
ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status)) : 0)
#define SQL_END_OF_STRING 0
@@ -642,7 +642,7 @@ static const struct compareInfo likeInfoNorm = { '%',
'_', 0, 1 };
static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 };
/*
- * Possible error returns from patternMatch()
+ * Possible error returns from sql_utf8_pattern_compare()
*/
#define SQLITE_MATCH 0
#define SQLITE_NOMATCH 1
@@ -655,14 +655,14 @@ static const struct compareInfo likeInfoAlt = { '%',
'_', 0, 0 };
*
* Globbing rules:
*
- * '*' Matches any sequence of zero or more characters.
+ * '*' Matches any sequence of zero or more characters.
*
- * '?' Matches exactly one character.
+ * '?' Matches exactly one character.
*
- * [...] Matches one character from the enclosed list of
- * characters.
+ * [...] Matches one character from the enclosed list of
+ * characters.
*
- * [^...] Matches one character not in the enclosed list.
+ * [^...] Matches one character not in the enclosed list.
*
* With the [...] and [^...] matching, a ']' character can be
* included in the list by making it the first character after
@@ -694,14 +694,14 @@ static const struct compareInfo likeInfoAlt = { '%',
'_', 0, 0 };
* SQLITE_NOMATCH: No match.
* SQLITE_NOWILDCARDMATCH: No match in spite of having *
* or % wildcards.
- * SQL_PROHIBITED_PATTERN Pattern contains invalid
+ * SQL_PROHIBITED_PATTERN: Pattern contains invalid
* symbol.
*/
static int
sql_utf8_pattern_compare(const char * pattern,
- const char * string,
- const struct compareInfo *pInfo,
- UChar32 matchOther)
+ const char * string,
+ const struct compareInfo *pInfo,
+ UChar32 matchOther)
{
UChar32 c, c2; /* Next pattern and input string chars */
UChar32 matchOne = pInfo->matchOne; /* "?" or "_" */
diff --git a/test/sql-tap/e_expr.test.lua b/test/sql-tap/e_expr.test.lua
index 051210a..9780d2c 100755
--- a/test/sql-tap/e_expr.test.lua
+++ b/test/sql-tap/e_expr.test.lua
@@ -78,7 +78,7 @@ local operations = {
{"!=", "ne2"},
{"IS", "is"},
-- NOTE: This test needs refactoring after deletion of GLOB &
--- type restrictions for LIKE.
+-- type restrictions for LIKE. (See #3572)
-- {"LIKE", "like"},
-- {"GLOB", "glob"},
{"AND", "and"},
@@ -98,7 +98,12 @@ operations = {
{"+", "-"},
{"<<", ">>", "&", "|"},
{"<", "<=", ">", ">="},
- {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, "MATCH", "REGEXP"},
+-- NOTE: This test needs refactoring after deletion of GLOB &
+-- type restrictions for LIKE. (See #3572)
+-- Another NOTE: MATCH & REGEXP aren't supported in Tarantool &
+-- are waiting for their hour, don't confuse them
+-- being commented with ticket above.
+ {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, --"MATCH", "REGEXP"},
{"AND"},
{"OR"},
}
diff --git a/test/sql-tap/gh-3251-string-pattern-comparison.test.lua
b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua
new file mode 100755
index 0000000..a823bc6
--- /dev/null
+++ b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua
@@ -0,0 +1,281 @@
+#!/usr/bin/env tarantool
+test = require("sqltester")
+test:plan(128)
+
+local prefix = "like-test-"
+
+-- Unicode byte sequences.
+local valid_testcases = {
+ '\x01',
+ '\x09',
+ '\x1F',
+ '\x7F',
+ '\xC2\x80',
+ '\xC2\x90',
+ '\xC2\x9F',
+ '\xE2\x80\xA8',
+ '\x20\x0B',
+ '\xE2\x80\xA9',
+}
+
+-- Non-Unicode byte sequences.
+local invalid_testcases = {
+ '\xE2\x80',
+ '\xFE\xFF',
+ '\xC2',
+ '\xED\xB0\x80',
+ '\xD0',
+}
+
+local like_test_cases =
+{
+ {"1.1",
+ "SELECT 'AB' LIKE '_B';",
+ {0, {1}} },
+ {"1.2",
+ "SELECT 'CD' LIKE '_B';",
+ {0, {0}} },
+ {"1.3",
+ "SELECT '' LIKE '_B';",
+ {0, {0}} },
+ {"1.4",
+ "SELECT 'AB' LIKE '%B';",
+ {0, {1}} },
+ {"1.5",
+ "SELECT 'CD' LIKE '%B';",
+ {0, {0}} },
+ {"1.6",
+ "SELECT '' LIKE '%B';",
+ {0, {0}} },
+ {"1.7",
+ "SELECT 'AB' LIKE 'A__';",
+ {0, {0}} },
+ {"1.8",
+ "SELECT 'CD' LIKE 'A__';",
+ {0, {0}} },
+ {"1.9",
+ "SELECT '' LIKE 'A__';",
+ {0, {0}} },
+ {"1.10",
+ "SELECT 'AB' LIKE 'A_';",
+ {0, {1}} },
+ {"1.11",
+ "SELECT 'CD' LIKE 'A_';",
+ {0, {0}} },
+ {"1.12",
+ "SELECT '' LIKE 'A_';",
+ {0, {0}} },
+ {"1.13",
+ "SELECT 'AB' LIKE 'A';",
+ {0, {0}} },
+ {"1.14",
+ "SELECT 'CD' LIKE 'A';",
+ {0, {0}} },
+ {"1.15",
+ "SELECT '' LIKE 'A';",
+ {0, {0}} },
+ {"1.16",
+ "SELECT 'AB' LIKE '_';",
+ {0, {0}} },
+ {"1.17",
+ "SELECT 'CD' LIKE '_';",
+ {0, {0}} },
+ {"1.18",
+ "SELECT '' LIKE '_';",
+ {0, {0}} },
+ {"1.19",
+ "SELECT 'AB' LIKE '__';",
+ {0, {1}} },
+ {"1.20",
+ "SELECT 'CD' LIKE '__';",
+ {0, {1}} },
+ {"1.21",
+ "SELECT '' LIKE '__';",
+ {0, {0}} },
+ {"1.22",
+ "SELECT 'AB' LIKE '%A';",
+ {0, {0}} },
+ {"1.23",
+ "SELECT 'AB' LIKE '%C';",
+ {0, {0}} },
+ {"1.24",
+ "SELECT 'ab' LIKE '%df';",
+ {0, {0}} },
+ {"1.25",
+ "SELECT 'abCDF' LIKE '%df';",
+ {0, {1}} },
+ {"1.26",
+ "SELECT 'CDF' LIKE '%df';",
+ {0, {1}} },
+ {"1.27",
+ "SELECT 'ab' LIKE 'a_';",
+ {0, {1}} },
+ {"1.28",
+ "SELECT 'abCDF' LIKE 'a_';",
+ {0, {0}} },
+ {"1.29",
+ "SELECT 'CDF' LIKE 'a_';",
+ {0, {0}} },
+ {"1.30",
+ "SELECT 'ab' LIKE 'ab%';",
+ {0, {1}} },
+ {"1.31",
+ "SELECT 'abCDF' LIKE 'ab%';",
+ {0, {1}} },
+ {"1.32",
+ "SELECT 'CDF' LIKE 'ab%';",
+ {0, {0}} },
+ {"1.33",
+ "SELECT 'ab' LIKE 'abC%';",
+ {0, {0}} },
+ {"1.34",
+ "SELECT 'abCDF' LIKE 'abC%';",
+ {0, {1}} },
+ {"1.35",
+ "SELECT 'CDF' LIKE 'abC%';",
+ {0, {0}} },
+ {"1.36",
+ "SELECT 'ab' LIKE 'a_%';",
+ {0, {1}} },
+ {"1.37",
+ "SELECT 'abCDF' LIKE 'a_%';",
+ {0, {1}} },
+ {"1.38",
+ "SELECT 'CDF' LIKE 'a_%';",
+ {0, {0}} },
+}
+
+test:do_catchsql_set_test(like_test_cases, prefix)
+
+-- Invalid testcases.
+
+for i, tested_string in ipairs(invalid_testcases) do
+
+-- We should raise an error if pattern contains invalid characters.
+ local test_name = prefix .. "2." .. tostring(i)
+ local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';"
+ test:do_catchsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 1, "LIKE or GLOB pattern can only contain UTF-8 characters"
+ -- <test_name>
+ })
+
+ test_name = prefix .. "3." .. tostring(i)
+ test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';"
+ test:do_catchsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 1, "LIKE or GLOB pattern can only contain UTF-8 characters"
+ -- <test_name>
+ })
+
+ test_name = prefix .. "4." .. tostring(i)
+ test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';"
+ test:do_catchsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 1, "LIKE or GLOB pattern can only contain UTF-8 characters"
+ -- <test_name>
+ })
+
+-- Just skipping if row value predicand contains invalid character.
+
+ test_name = prefix .. "5." .. tostring(i)
+ test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';"
+ test:do_execsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 0
+ -- <test_name>
+ })
+
+ test_name = prefix .. "6." .. tostring(i)
+ test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';"
+ test:do_execsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 0
+ -- <test_name>
+ })
+
+ test_name = prefix .. "7." .. tostring(i)
+ test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';"
+ test:do_execsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 0
+ -- <test_name>
+ })
+end
+
+-- Valid testcases.
+
+for i, tested_string in ipairs(valid_testcases) do
+ test_name = prefix .. "8." .. tostring(i)
+ local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';"
+ test:do_execsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 0
+ -- <test_name>
+ })
+
+ test_name = prefix .. "9." .. tostring(i)
+ test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';"
+ test:do_execsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 0
+ -- <test_name>
+ })
+
+ test_name = prefix .. "10." .. tostring(i)
+ test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';"
+ test:do_execsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 0
+ -- <test_name>
+ })
+ test_name = prefix .. "11." .. tostring(i)
+ test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';"
+ test:do_execsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 0
+ -- <test_name>
+ })
+
+ test_name = prefix .. "12." .. tostring(i)
+ test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';"
+ test:do_execsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 0
+ -- <test_name>
+ })
+
+ test_name = prefix .. "13." .. tostring(i)
+ test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';"
+ test:do_execsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 0
+ -- <test_name>
+ })
+end
+
+test:finish_test()
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.tarantool.org/pipermail/tarantool-patches/attachments/20180727/7f66b67e/attachment.html>
More information about the Tarantool-patches
mailing list