[tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue

Nikita Tatunov hollow653 at gmail.com
Fri Jul 27 22:11:35 MSK 2018


Hello!

пт, 27 июл. 2018 г. в 16:06, Alexander Turenko <
alexander.turenko at tarantool.org>:

> Hi,
>
> See comments below.
>
> Don't sure we should return 'no match' situation when a left hand
> expression in not correct unicode string. How other DBs handle that?
>
>
The main point is that since pattern is always valid (otherwise we return
an error)
it cannot be matched with a string that contains an invalid utf-8 character.

- PostgreSQL doesn't allow storing invalid characters in text types and
  even explicit E'\xD1' anywhere results an error.
- MySQL Doesn't allow using invalid characters in text types.

May be it's reasons for thinking about prohibiting invalid characters in
text?

WBR, Alexander Turenko.
>
> > sql: LIKE & GLOB pattern comparison issue
> >
> > Currently function that compares pattern and string for GLOB & LIKE
> > operators doesn't work properly. It uses ICU reading function which
> > perhaps was working differently before and the implementation for the
> > comparison ending isn't paying attention to some special cases, hence
> > in those cases it works improperly.
> >
>
> It used non-ICU function before, consider commit 198d59ce.
>
>
Thank you, I've got it now.


> > With the patch applied an error will be returned in case there's
> > invalid UTF-8 symbol in pattern & pattern will not be matched
> > with the string that contains it. Some minor corrections to function
> > were made as well.
> >
>
> Nitpicking: 'error will be returned' is third situatuon, other then
> 'matched' and 'not matched'.
>
>
I guess commit message is about changes. Do I really need to
mention valid cases that didn't change?


> > Сloses #3251
> > Сloses #3334
> > Part of #3572
>
> > > 2. The thing you test is not related to a table and other columns.
> > >     Please, convert the tests to the next format: {[[select '' like
> > > '_B';]], {1}]]}.
> > >     To make it more readable, you can do it like `like_testcases` in
> > > `sql-tap/collation.test.lua`.
> > >
> >
> > Done.
>
> You are still perform comparisons with table columns.
>

Thank you, fixed it.  More tests now tho.


>
> > -/*
> > - * For LIKE and GLOB matching on EBCDIC machines, assume that every
> > - * character is exactly one byte in size.  Also, provde the Utf8Read()
> > - * macro for fast reading of the next character in the common case where
> > - * the next character is ASCII.
> > +/**
> > + * Providing there are symbols in string s this
> > + * macro returns UTF-8 code of character and
> > + * promotes pointer to the next symbol in the string.
> > + * Otherwise return code is SQL_END_OF_STRING.
> >   */
> > -#define Utf8Read(s, e)    ucnv_getNextUChar(pUtf8conv, &s, e, &status)
> > +#define Utf8Read(s, e) (((s) < (e)) ?\
> > + ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status)) : 0)
>
> I suggest to add a whitespace before backslash to make it looks better.
>
>
Guess so.


> > @@ -643,51 +647,61 @@ static const struct compareInfo likeInfoAlt = {
> '%',
> > '_', 0, 0 };
> >  #define SQLITE_MATCH             0
> >  #define SQLITE_NOMATCH           1
> >  #define SQLITE_NOWILDCARDMATCH   2
> > +#define SQL_PROHIBITED_PATTERN   3
>
> Update comment before that has a reference to 'patternMatch' function.
>
>
Didn't notice it, thnx.


> >   *
> >   * Globbing rules:
> >   *
> > - *      '*'       Matches any sequence of zero or more characters.
> > + *      '*'      Matches any sequence of zero or more characters.
>
> Why?
>
> >   *
> > - *      '?'       Matches exactly one character.
> > + *      '?'      Matches exactly one character.
> >   *
> > - *     [...]      Matches one character from the enclosed list of
> > - *                characters.
> > + *     [...]     Matches one character from the enclosed list of
> > + *               characters.
> >   *
> > - *     [^...]     Matches one character not in the enclosed list.
> > + *     [^...]    Matches one character not in the enclosed list.
> >   *
>
> The same question.
>
>
It was looking like the indentation for similar table below was different.
I was wrong.


> > + * @retval SQLITE_MATCH:            Match.
> > + *    SQLITE_NOMATCH:          No match.
> > + *    SQLITE_NOWILDCARDMATCH:  No match in spite of having *
> > + *     or % wildcards.
> > + *    SQL_PROHIBITED_PATTERN   Pattern contains invalid
> > + *     symbol.
>
> Nitpicking: no colon after SQL_PROHIBITED_PATTERN.
>

Fixed.


>
> >   */
> >  static int
> > -patternCompare(const char * pattern, /* The glob pattern */
> > -        const char * string, /* The string to compare against the glob
> */
> > -        const struct compareInfo *pInfo, /* Information about how to do
> > the compare */
> > -        UChar32 matchOther /* The escape char (LIKE) or '[' (GLOB) */
> > -    )
> > +sql_utf8_pattern_compare(const char * pattern,
> > +        const char * string,
> > +        const struct compareInfo *pInfo,
> > +        UChar32 matchOther)
>
> Don't get why indent is tabulation + 7 spaces.
>
>
Me neither :thinking:. Thank you!


> The function itself looks good except lines longer then 80 columns.
>

As i said before I don't think breaking these lines will increase
readability.


> > @@ -853,8 +903,7 @@ patternCompare(const char * pattern, /* The glob
> > pattern */
> >  int
> >  sqlite3_strglob(const char *zGlobPattern, const char *zString)
> >  {
> > - return patternCompare(zGlobPattern, zString, &globInfo,
> > -       '[');
> > + return sql_utf8_pattern_compare(zGlobPattern, zString, &globInfo, '[');
> >  }
> >
> >  /*
> > @@ -864,7 +913,7 @@ sqlite3_strglob(const char *zGlobPattern, const char
> > *zString)
> >  int
> >  sqlite3_strlike(const char *zPattern, const char *zStr, unsigned int
> esc)
> >  {
> > - return patternCompare(zPattern, zStr, &likeInfoNorm, esc);
> > + return sql_utf8_pattern_compare(zPattern, zStr, &likeInfoNorm, esc);
> >  }
> >
>
> Should we add sqlite3_result_error is case of SQL_PROHIBITED_PATTERN for
> these two functions?
>
>
It doesn't matter for "sqlite3_strglob()", since it's only used internally
with valid explicitly given pattern in "analysis_loader()".
Same for "sqlite3_strlike()".



> > -    {"LIKE", "like"},
> > -    {"GLOB", "glob"},
> > +-- NOTE: This test needs refactoring after deletion of GLOB &
> > +-- type restrictions for LIKE.
> > +--    {"LIKE", "like"},
> > +--    {"GLOB", "glob"},
>
> I would add related issue number.
>
>
True. Done.


> >      {"AND", "and"},
> >      {"OR", "or"},
> >      {"MATCH", "match"},
> > @@ -96,7 +98,7 @@ operations = {
> >      {"+", "-"},
> >      {"<<", ">>", "&", "|"},
> >      {"<", "<=", ">", ">="},
> > -    {"=", "==", "!=", "<>", "LIKE", "GLOB"}, --"MATCH", "REGEXP"},
> > +    {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, "MATCH", "REGEXP"},
>
> Leave comment before MATCH and REGEXP to avoid surprises after removing
> your comment.
>
>
Done.


> > --- /dev/null
> > +++ b/test/sql-tap/gh-3251-string-pattern-comparison.lua
> > @@ -0,0 +1,238 @@
> > +#!/usr/bin/env tarantool
> > +test = require("sqltester")
> > +test:plan(106)
> > +
> > +local prefix = "like-test-"
> > +
> > +-- Unciode byte sequences.
> > +
>
> 1. Typo: unicode.
> 2. Extra empty line.
>
>
Fixed.


> > +-- Invalid testcases.
> > +
> > +for i, tested_string in ipairs(invalid_testcases) do
> > + local test_name = prefix .. "2." .. tostring(i)
> > + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';"
> > +
> > +-- We should raise an error if pattern contains invalid characters.
> > +
>
> 1. Trailing whitespaces.
> 2. Why comments are not indented as the other code?
> 3. test_name and test_itself before the section comment looks strange
>    for me here.
>

diff --git a/src/box/sql/func.c b/src/box/sql/func.c
index 5b53076..2f989d4 100644
--- a/src/box/sql/func.c
+++ b/src/box/sql/func.c
@@ -623,7 +623,7 @@ struct compareInfo {
  * promotes pointer to the next symbol in the string.
  * Otherwise return code is SQL_END_OF_STRING.
  */
-#define Utf8Read(s, e) (((s) < (e)) ?\
+#define Utf8Read(s, e) (((s) < (e)) ? \
  ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status)) : 0)

 #define SQL_END_OF_STRING        0
@@ -642,7 +642,7 @@ static const struct compareInfo likeInfoNorm = { '%',
'_', 0, 1 };
 static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 };

 /*
- * Possible error returns from patternMatch()
+ * Possible error returns from sql_utf8_pattern_compare()
  */
 #define SQLITE_MATCH             0
 #define SQLITE_NOMATCH           1
@@ -655,14 +655,14 @@ static const struct compareInfo likeInfoAlt = { '%',
'_', 0, 0 };
  *
  * Globbing rules:
  *
- *      '*'      Matches any sequence of zero or more characters.
+ *      '*'       Matches any sequence of zero or more characters.
  *
- *      '?'      Matches exactly one character.
+ *      '?'       Matches exactly one character.
  *
- *     [...]     Matches one character from the enclosed list of
- *               characters.
+ *     [...]      Matches one character from the enclosed list of
+ *                characters.
  *
- *     [^...]    Matches one character not in the enclosed list.
+ *     [^...]     Matches one character not in the enclosed list.
  *
  * With the [...] and [^...] matching, a ']' character can be
  * included in the list by making it the first character after
@@ -694,14 +694,14 @@ static const struct compareInfo likeInfoAlt = { '%',
'_', 0, 0 };
  *    SQLITE_NOMATCH:          No match.
  *    SQLITE_NOWILDCARDMATCH:  No match in spite of having *
  *     or % wildcards.
- *    SQL_PROHIBITED_PATTERN   Pattern contains invalid
+ *    SQL_PROHIBITED_PATTERN:  Pattern contains invalid
  *     symbol.
  */
 static int
 sql_utf8_pattern_compare(const char * pattern,
-        const char * string,
-        const struct compareInfo *pInfo,
-        UChar32 matchOther)
+ const char * string,
+ const struct compareInfo *pInfo,
+ UChar32 matchOther)
 {
  UChar32 c, c2; /* Next pattern and input string chars */
  UChar32 matchOne = pInfo->matchOne; /* "?" or "_" */
diff --git a/test/sql-tap/e_expr.test.lua b/test/sql-tap/e_expr.test.lua
index 051210a..9780d2c 100755
--- a/test/sql-tap/e_expr.test.lua
+++ b/test/sql-tap/e_expr.test.lua
@@ -78,7 +78,7 @@ local operations = {
     {"!=", "ne2"},
     {"IS", "is"},
 -- NOTE: This test needs refactoring after deletion of GLOB &
--- type restrictions for LIKE.
+-- type restrictions for LIKE. (See #3572)
 --    {"LIKE", "like"},
 --    {"GLOB", "glob"},
     {"AND", "and"},
@@ -98,7 +98,12 @@ operations = {
     {"+", "-"},
     {"<<", ">>", "&", "|"},
     {"<", "<=", ">", ">="},
-    {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, "MATCH", "REGEXP"},
+-- NOTE: This test needs refactoring after deletion of GLOB &
+-- type restrictions for LIKE. (See #3572)
+-- Another NOTE: MATCH & REGEXP aren't supported in Tarantool &
+-- are waiting for their hour, don't confuse them
+-- being commented with ticket above.
+    {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, --"MATCH", "REGEXP"},
     {"AND"},
     {"OR"},
 }
diff --git a/test/sql-tap/gh-3251-string-pattern-comparison.test.lua
b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua
new file mode 100755
index 0000000..a823bc6
--- /dev/null
+++ b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua
@@ -0,0 +1,281 @@
+#!/usr/bin/env tarantool
+test = require("sqltester")
+test:plan(128)
+
+local prefix = "like-test-"
+
+-- Unicode byte sequences.
+local valid_testcases = {
+ '\x01',
+ '\x09',
+ '\x1F',
+ '\x7F',
+ '\xC2\x80',
+ '\xC2\x90',
+ '\xC2\x9F',
+ '\xE2\x80\xA8',
+ '\x20\x0B',
+ '\xE2\x80\xA9',
+}
+
+-- Non-Unicode byte sequences.
+local invalid_testcases = {
+ '\xE2\x80',
+ '\xFE\xFF',
+ '\xC2',
+ '\xED\xB0\x80',
+ '\xD0',
+}
+
+local like_test_cases =
+{
+ {"1.1",
+ "SELECT 'AB' LIKE '_B';",
+ {0, {1}} },
+ {"1.2",
+ "SELECT 'CD' LIKE '_B';",
+ {0, {0}} },
+ {"1.3",
+ "SELECT '' LIKE '_B';",
+ {0, {0}} },
+ {"1.4",
+ "SELECT 'AB' LIKE '%B';",
+ {0, {1}} },
+ {"1.5",
+ "SELECT 'CD' LIKE '%B';",
+ {0, {0}} },
+ {"1.6",
+ "SELECT '' LIKE '%B';",
+ {0, {0}} },
+ {"1.7",
+ "SELECT 'AB' LIKE 'A__';",
+ {0, {0}} },
+ {"1.8",
+ "SELECT 'CD' LIKE 'A__';",
+ {0, {0}} },
+ {"1.9",
+ "SELECT '' LIKE 'A__';",
+ {0, {0}} },
+ {"1.10",
+ "SELECT 'AB' LIKE 'A_';",
+ {0, {1}} },
+ {"1.11",
+ "SELECT 'CD' LIKE 'A_';",
+ {0, {0}} },
+ {"1.12",
+ "SELECT '' LIKE 'A_';",
+ {0, {0}} },
+ {"1.13",
+ "SELECT 'AB' LIKE 'A';",
+ {0, {0}} },
+ {"1.14",
+ "SELECT 'CD' LIKE 'A';",
+ {0, {0}} },
+ {"1.15",
+ "SELECT '' LIKE 'A';",
+ {0, {0}} },
+ {"1.16",
+ "SELECT 'AB' LIKE '_';",
+ {0, {0}} },
+ {"1.17",
+ "SELECT 'CD' LIKE '_';",
+ {0, {0}} },
+ {"1.18",
+ "SELECT '' LIKE '_';",
+ {0, {0}} },
+ {"1.19",
+ "SELECT 'AB' LIKE '__';",
+ {0, {1}} },
+ {"1.20",
+ "SELECT 'CD' LIKE '__';",
+ {0, {1}} },
+ {"1.21",
+ "SELECT '' LIKE '__';",
+ {0, {0}} },
+ {"1.22",
+ "SELECT 'AB' LIKE '%A';",
+ {0, {0}} },
+ {"1.23",
+ "SELECT 'AB' LIKE '%C';",
+ {0, {0}} },
+ {"1.24",
+ "SELECT 'ab' LIKE '%df';",
+ {0, {0}} },
+ {"1.25",
+ "SELECT 'abCDF' LIKE '%df';",
+ {0, {1}} },
+ {"1.26",
+ "SELECT 'CDF' LIKE '%df';",
+ {0, {1}} },
+ {"1.27",
+ "SELECT 'ab' LIKE 'a_';",
+ {0, {1}} },
+ {"1.28",
+ "SELECT 'abCDF' LIKE 'a_';",
+ {0, {0}} },
+ {"1.29",
+ "SELECT 'CDF' LIKE 'a_';",
+ {0, {0}} },
+ {"1.30",
+ "SELECT 'ab' LIKE 'ab%';",
+ {0, {1}} },
+ {"1.31",
+ "SELECT 'abCDF' LIKE 'ab%';",
+ {0, {1}} },
+ {"1.32",
+ "SELECT 'CDF' LIKE 'ab%';",
+ {0, {0}} },
+ {"1.33",
+ "SELECT 'ab' LIKE 'abC%';",
+ {0, {0}} },
+ {"1.34",
+ "SELECT 'abCDF' LIKE 'abC%';",
+ {0, {1}} },
+ {"1.35",
+ "SELECT 'CDF' LIKE 'abC%';",
+ {0, {0}} },
+ {"1.36",
+ "SELECT 'ab' LIKE 'a_%';",
+ {0, {1}} },
+ {"1.37",
+ "SELECT 'abCDF' LIKE 'a_%';",
+ {0, {1}} },
+ {"1.38",
+ "SELECT 'CDF' LIKE 'a_%';",
+ {0, {0}} },
+}
+
+test:do_catchsql_set_test(like_test_cases, prefix)
+
+-- Invalid testcases.
+
+for i, tested_string in ipairs(invalid_testcases) do
+
+-- We should raise an error if pattern contains invalid characters.
+ local test_name = prefix .. "2." .. tostring(i)
+ local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';"
+ test:do_catchsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 1, "LIKE or GLOB pattern can only contain UTF-8 characters"
+ -- <test_name>
+ })
+
+ test_name = prefix .. "3." .. tostring(i)
+ test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';"
+ test:do_catchsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 1, "LIKE or GLOB pattern can only contain UTF-8 characters"
+ -- <test_name>
+ })
+
+ test_name = prefix .. "4." .. tostring(i)
+ test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';"
+ test:do_catchsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 1, "LIKE or GLOB pattern can only contain UTF-8 characters"
+ -- <test_name>
+ })
+
+-- Just skipping if row value predicand contains invalid character.
+
+ test_name = prefix .. "5." .. tostring(i)
+ test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';"
+ test:do_execsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 0
+ -- <test_name>
+ })
+
+ test_name = prefix .. "6." .. tostring(i)
+ test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';"
+ test:do_execsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 0
+ -- <test_name>
+ })
+
+ test_name = prefix .. "7." .. tostring(i)
+ test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';"
+ test:do_execsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 0
+ -- <test_name>
+ })
+end
+
+-- Valid testcases.
+
+for i, tested_string in ipairs(valid_testcases) do
+ test_name = prefix .. "8." .. tostring(i)
+ local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';"
+ test:do_execsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 0
+ -- <test_name>
+ })
+
+ test_name = prefix .. "9." .. tostring(i)
+ test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';"
+ test:do_execsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 0
+ -- <test_name>
+ })
+
+ test_name = prefix .. "10." .. tostring(i)
+ test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';"
+ test:do_execsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 0
+ -- <test_name>
+ })
+ test_name = prefix .. "11." .. tostring(i)
+ test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';"
+ test:do_execsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 0
+ -- <test_name>
+ })
+
+ test_name = prefix .. "12." .. tostring(i)
+ test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';"
+ test:do_execsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 0
+ -- <test_name>
+ })
+
+ test_name = prefix .. "13." .. tostring(i)
+ test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';"
+ test:do_execsql_test(
+ test_name,
+ test_itself, {
+ -- <test_name>
+ 0
+ -- <test_name>
+ })
+end
+
+test:finish_test()
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.tarantool.org/pipermail/tarantool-patches/attachments/20180727/7f66b67e/attachment.html>


More information about the Tarantool-patches mailing list