* [tarantool-patches] [PATCH] sql: LIKE & GLOB pattern comparison issue @ 2018-06-28 12:47 N.Tatunov 2018-06-28 12:54 ` [tarantool-patches] " Hollow111 2018-07-18 2:43 ` Alexander Turenko 0 siblings, 2 replies; 22+ messages in thread From: N.Tatunov @ 2018-06-28 12:47 UTC (permalink / raw) To: tarantool-patches; +Cc: alexander.turenko, N.Tatunov Currently function that compares pattern and string for GLOB & LIKE operators doesn't work properly. It uses ICU reading function which perhaps was working differently before and the implementation for the comparison ending isn't paying attention to some special cases, hence in those cases it works improperly. Now the checks for comparison should work fine. Сloses: #3251 Сloses: #3334 --- src/box/sql/func.c | 25 ++++---- test/sql-tap/like1.test.lua | 152 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 165 insertions(+), 12 deletions(-) create mode 100755 test/sql-tap/like1.test.lua diff --git a/src/box/sql/func.c b/src/box/sql/func.c index c06e3bd..dcbd7e0 100644 --- a/src/box/sql/func.c +++ b/src/box/sql/func.c @@ -643,6 +643,7 @@ static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; #define SQLITE_MATCH 0 #define SQLITE_NOMATCH 1 #define SQLITE_NOWILDCARDMATCH 2 +#define SQL_NO_SYMBOLS_LEFT 65535 /* * Compare two UTF-8 strings for equality where the first string is @@ -698,29 +699,28 @@ patternCompare(const char * pattern, /* The glob pattern */ const char * string_end = string + strlen(string); UErrorCode status = U_ZERO_ERROR; - while (pattern < pattern_end){ - c = Utf8Read(pattern, pattern_end); + while ((c = Utf8Read(pattern, pattern_end)) != SQL_NO_SYMBOLS_LEFT) { if (c == matchAll) { /* Match "*" */ /* Skip over multiple "*" characters in the pattern. If there * are also "?" characters, skip those as well, but consume a * single character of the input string for each "?" skipped */ - while (pattern < pattern_end){ - c = Utf8Read(pattern, pattern_end); + while ((c = Utf8Read(pattern, pattern_end)) != + SQL_NO_SYMBOLS_LEFT) { if (c != matchAll && c != matchOne) break; - if (c == matchOne - && Utf8Read(string, string_end) == 0) { + if (c == matchOne && + Utf8Read(string, string_end) == + SQL_NO_SYMBOLS_LEFT) return SQLITE_NOWILDCARDMATCH; - } } /* "*" at the end of the pattern matches */ - if (pattern == pattern_end) + if (c == SQL_NO_SYMBOLS_LEFT) return SQLITE_MATCH; if (c == matchOther) { if (pInfo->matchSet == 0) { c = Utf8Read(pattern, pattern_end); - if (c == 0) + if (c == SQL_NO_SYMBOLS_LEFT) return SQLITE_NOWILDCARDMATCH; } else { /* "[...]" immediately follows the "*". We have to do a slow @@ -782,7 +782,7 @@ patternCompare(const char * pattern, /* The glob pattern */ if (c == matchOther) { if (pInfo->matchSet == 0) { c = Utf8Read(pattern, pattern_end); - if (c == 0) + if (c == SQL_NO_SYMBOLS_LEFT) return SQLITE_NOMATCH; zEscaped = pattern; } else { @@ -802,7 +802,7 @@ patternCompare(const char * pattern, /* The glob pattern */ seen = 1; c2 = Utf8Read(pattern, pattern_end); } - while (c2 && c2 != ']') { + while (c2 != SQL_NO_SYMBOLS_LEFT && c2 != ']') { if (c2 == '-' && pattern[0] != ']' && pattern < pattern_end && prior_c > 0) { @@ -839,7 +839,8 @@ patternCompare(const char * pattern, /* The glob pattern */ c == u_tolower(c2)) continue; } - if (c == matchOne && pattern != zEscaped && c2 != 0) + if (c == matchOne && pattern != zEscaped && + c2 != SQL_NO_SYMBOLS_LEFT) continue; return SQLITE_NOMATCH; } diff --git a/test/sql-tap/like1.test.lua b/test/sql-tap/like1.test.lua new file mode 100755 index 0000000..42b4d43 --- /dev/null +++ b/test/sql-tap/like1.test.lua @@ -0,0 +1,152 @@ +#!/usr/bin/env tarantool +test = require("sqltester") +test:plan(13) + +test:do_catchsql_test( + "like-test-1.1", + [[ + CREATE TABLE t2 (column1 INTEGER, + column2 VARCHAR(100), + column3 BLOB, + column4 FLOAT, + PRIMARY KEY (column1, column2)); + INSERT INTO t2 VALUES (1, 'AB', X'4142', 5.5); + INSERT INTO t2 VALUES (1, 'CD', X'2020', 1E4); + INSERT INTO t2 VALUES (2, 'AB', X'2020', 12.34567); + INSERT INTO t2 VALUES (-1000, '', X'', 0.0); + CREATE TABLE t1 (a INT PRIMARY KEY, str VARCHAR(100)); + INSERT INTO t1 VALUES (1, 'ab'); + INSERT INTO t1 VALUES (2, 'abCDF'); + INSERT INTO t1 VALUES (3, 'CDF'); + CREATE TABLE t (s1 char(2) primary key, s2 char(2)); + INSERT INTO t VALUES ('AB', 'AB'); + ]], { + -- <like-test-1.1> + 0 + -- <like-test-1.1> + }) + +test:do_execsql_test( + "like-test-1.2", + [[ + SELECT column1, column2, column1 * column4 FROM t2 WHERE column2 LIKE '_B'; + ]], { + -- <like-test-1.2> + 1, 'AB', 5.5, 2, 'AB', 24.69134 + -- <like-test-1.2> + }) + +test:do_execsql_test( + "like-test-1.3", + [[ + SELECT column1, column2 FROM t2 WHERE column2 LIKE '%B'; + ]], { + -- <like-test-1.3> + 1, 'AB', 2, 'AB' + -- <like-test-1.3> + }) + +test:do_execsql_test( + "like-test-1.4", + [[ + SELECT column1, column2 FROM t2 WHERE column2 LIKE 'A__'; + ]], { + -- <like-test-1.4> + + -- <like-test-1.4> + }) + +test:do_execsql_test( + "like-test-1.5", + [[ + SELECT column1, column2 FROM t2 WHERE column2 LIKE 'A_'; + ]], { + -- <like-test-1.5> + 1, 'AB', 2, 'AB' + -- <like-test-1.5> + }) + +test:do_execsql_test( + "like-test-1.6", + [[ + SELECT column1, column2 FROM t2 WHERE column2 LIKE 'A'; + ]], { + -- <like-test-1.6> + + -- <like-test-1.6> + }) + +test:do_execsql_test( + "like-test-1.7", + [[ + SELECT column1, column2 FROM t2 WHERE column2 LIKE '_'; + ]], { + -- <like-test-1.7> + + -- <like-test-1.7> + }) + +test:do_execsql_test( + "like-test-1.8", + [[ + SELECT * FROM t WHERE s1 LIKE '%A'; + ]], { + -- <like-test-1.8> + + -- <like-test-1.8> + }) + +test:do_execsql_test( + "like-test-1.9", + [[ + SELECT * FROM t WHERE s1 LIKE '%C'; + ]], { + -- <like-test-1.9> + + -- <like-test-1.9> + }) + +test:do_execsql_test( + "like-test-1.10", + [[ + SELECT * FROM t1 WHERE str LIKE '%df'; + ]], { + -- <like-test-1.10> + 2, 'abCDF', 3, 'CDF' + -- <like-test-1.10> + }) + +test:do_execsql_test( + "like-test-1.11", + [[ + SELECT * FROM t1 WHERE str LIKE 'a_'; + ]], { + -- <like-test-1.11> + 1, 'ab' + -- <like-test-1.11> + }) + +test:do_execsql_test( + "like-test-1.12", + [[ + select column1, column2 from t2 where column2 like '__'; + ]], { + -- <like-test-1.12> + 1, 'AB', 1, 'CD', 2, 'AB' + -- <like-test-1.12> + }) + +test:do_execsql_test( + "like-test-1.13", + [[ + DROP TABLE t1; + DROP TABLE t2; + DROP TABLE t; + ]], { + -- <like-test-1.13> + + -- <like-test-1.13> + }) + + +test:finish_test() -- 2.7.4 ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-06-28 12:47 [tarantool-patches] [PATCH] sql: LIKE & GLOB pattern comparison issue N.Tatunov @ 2018-06-28 12:54 ` Hollow111 2018-07-18 2:43 ` Alexander Turenko 1 sibling, 0 replies; 22+ messages in thread From: Hollow111 @ 2018-06-28 12:54 UTC (permalink / raw) To: tarantool-patches; +Cc: alexander.turenko [-- Attachment #1: Type: text/plain, Size: 10194 bytes --] Branch: https://github.com/tarantool/tarantool/tree/N_Tatunov/gh-3251-where-like-hanging Issue: https://github.com/tarantool/tarantool/issues/3251 Issue: https://github.com/tarantool/tarantool/issues/3334 чт, 28 июн. 2018 г. в 15:47, N.Tatunov <hollow653@gmail.com>: > Currently function that compares pattern and > string for GLOB & LIKE operators doesn't work properly. > It uses ICU reading function which perhaps was working > differently before and the implementation for the > comparison ending isn't paying attention to some > special cases, hence in those cases it works improperly. > Now the checks for comparison should work fine. > > Сloses: #3251 > Сloses: #3334 > --- > src/box/sql/func.c | 25 ++++---- > test/sql-tap/like1.test.lua | 152 > ++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 165 insertions(+), 12 deletions(-) > create mode 100755 test/sql-tap/like1.test.lua > > diff --git a/src/box/sql/func.c b/src/box/sql/func.c > index c06e3bd..dcbd7e0 100644 > --- a/src/box/sql/func.c > +++ b/src/box/sql/func.c > @@ -643,6 +643,7 @@ static const struct compareInfo likeInfoAlt = { '%', > '_', 0, 0 }; > #define SQLITE_MATCH 0 > #define SQLITE_NOMATCH 1 > #define SQLITE_NOWILDCARDMATCH 2 > +#define SQL_NO_SYMBOLS_LEFT 65535 > > /* > * Compare two UTF-8 strings for equality where the first string is > @@ -698,29 +699,28 @@ patternCompare(const char * pattern, /* The > glob pattern */ > const char * string_end = string + strlen(string); > UErrorCode status = U_ZERO_ERROR; > > - while (pattern < pattern_end){ > - c = Utf8Read(pattern, pattern_end); > + while ((c = Utf8Read(pattern, pattern_end)) != > SQL_NO_SYMBOLS_LEFT) { > if (c == matchAll) { /* Match "*" */ > /* Skip over multiple "*" characters in the > pattern. If there > * are also "?" characters, skip those as well, > but consume a > * single character of the input string for each > "?" skipped > */ > - while (pattern < pattern_end){ > - c = Utf8Read(pattern, pattern_end); > + while ((c = Utf8Read(pattern, pattern_end)) != > + SQL_NO_SYMBOLS_LEFT) { > if (c != matchAll && c != matchOne) > break; > - if (c == matchOne > - && Utf8Read(string, string_end) == 0) { > + if (c == matchOne && > + Utf8Read(string, string_end) == > + SQL_NO_SYMBOLS_LEFT) > return SQLITE_NOWILDCARDMATCH; > - } > } > /* "*" at the end of the pattern matches */ > - if (pattern == pattern_end) > + if (c == SQL_NO_SYMBOLS_LEFT) > return SQLITE_MATCH; > if (c == matchOther) { > if (pInfo->matchSet == 0) { > c = Utf8Read(pattern, pattern_end); > - if (c == 0) > + if (c == SQL_NO_SYMBOLS_LEFT) > return > SQLITE_NOWILDCARDMATCH; > } else { > /* "[...]" immediately follows the > "*". We have to do a slow > @@ -782,7 +782,7 @@ patternCompare(const char * pattern, /* The > glob pattern */ > if (c == matchOther) { > if (pInfo->matchSet == 0) { > c = Utf8Read(pattern, pattern_end); > - if (c == 0) > + if (c == SQL_NO_SYMBOLS_LEFT) > return SQLITE_NOMATCH; > zEscaped = pattern; > } else { > @@ -802,7 +802,7 @@ patternCompare(const char * pattern, /* The > glob pattern */ > seen = 1; > c2 = Utf8Read(pattern, > pattern_end); > } > - while (c2 && c2 != ']') { > + while (c2 != SQL_NO_SYMBOLS_LEFT && c2 != > ']') { > if (c2 == '-' && pattern[0] != ']' > && pattern < pattern_end > && prior_c > 0) { > @@ -839,7 +839,8 @@ patternCompare(const char * pattern, /* The > glob pattern */ > c == u_tolower(c2)) > continue; > } > - if (c == matchOne && pattern != zEscaped && c2 != 0) > + if (c == matchOne && pattern != zEscaped && > + c2 != SQL_NO_SYMBOLS_LEFT) > continue; > return SQLITE_NOMATCH; > } > diff --git a/test/sql-tap/like1.test.lua b/test/sql-tap/like1.test.lua > new file mode 100755 > index 0000000..42b4d43 > --- /dev/null > +++ b/test/sql-tap/like1.test.lua > @@ -0,0 +1,152 @@ > +#!/usr/bin/env tarantool > +test = require("sqltester") > +test:plan(13) > + > +test:do_catchsql_test( > + "like-test-1.1", > + [[ > + CREATE TABLE t2 (column1 INTEGER, > + column2 VARCHAR(100), > + column3 BLOB, > + column4 FLOAT, > + PRIMARY KEY (column1, column2)); > + INSERT INTO t2 VALUES (1, 'AB', X'4142', 5.5); > + INSERT INTO t2 VALUES (1, 'CD', X'2020', 1E4); > + INSERT INTO t2 VALUES (2, 'AB', X'2020', 12.34567); > + INSERT INTO t2 VALUES (-1000, '', X'', 0.0); > + CREATE TABLE t1 (a INT PRIMARY KEY, str VARCHAR(100)); > + INSERT INTO t1 VALUES (1, 'ab'); > + INSERT INTO t1 VALUES (2, 'abCDF'); > + INSERT INTO t1 VALUES (3, 'CDF'); > + CREATE TABLE t (s1 char(2) primary key, s2 char(2)); > + INSERT INTO t VALUES ('AB', 'AB'); > + ]], { > + -- <like-test-1.1> > + 0 > + -- <like-test-1.1> > + }) > + > +test:do_execsql_test( > + "like-test-1.2", > + [[ > + SELECT column1, column2, column1 * column4 FROM t2 WHERE > column2 LIKE '_B'; > + ]], { > + -- <like-test-1.2> > + 1, 'AB', 5.5, 2, 'AB', 24.69134 > + -- <like-test-1.2> > + }) > + > +test:do_execsql_test( > + "like-test-1.3", > + [[ > + SELECT column1, column2 FROM t2 WHERE column2 LIKE '%B'; > + ]], { > + -- <like-test-1.3> > + 1, 'AB', 2, 'AB' > + -- <like-test-1.3> > + }) > + > +test:do_execsql_test( > + "like-test-1.4", > + [[ > + SELECT column1, column2 FROM t2 WHERE column2 LIKE 'A__'; > + ]], { > + -- <like-test-1.4> > + > + -- <like-test-1.4> > + }) > + > +test:do_execsql_test( > + "like-test-1.5", > + [[ > + SELECT column1, column2 FROM t2 WHERE column2 LIKE 'A_'; > + ]], { > + -- <like-test-1.5> > + 1, 'AB', 2, 'AB' > + -- <like-test-1.5> > + }) > + > +test:do_execsql_test( > + "like-test-1.6", > + [[ > + SELECT column1, column2 FROM t2 WHERE column2 LIKE 'A'; > + ]], { > + -- <like-test-1.6> > + > + -- <like-test-1.6> > + }) > + > +test:do_execsql_test( > + "like-test-1.7", > + [[ > + SELECT column1, column2 FROM t2 WHERE column2 LIKE '_'; > + ]], { > + -- <like-test-1.7> > + > + -- <like-test-1.7> > + }) > + > +test:do_execsql_test( > + "like-test-1.8", > + [[ > + SELECT * FROM t WHERE s1 LIKE '%A'; > + ]], { > + -- <like-test-1.8> > + > + -- <like-test-1.8> > + }) > + > +test:do_execsql_test( > + "like-test-1.9", > + [[ > + SELECT * FROM t WHERE s1 LIKE '%C'; > + ]], { > + -- <like-test-1.9> > + > + -- <like-test-1.9> > + }) > + > +test:do_execsql_test( > + "like-test-1.10", > + [[ > + SELECT * FROM t1 WHERE str LIKE '%df'; > + ]], { > + -- <like-test-1.10> > + 2, 'abCDF', 3, 'CDF' > + -- <like-test-1.10> > + }) > + > +test:do_execsql_test( > + "like-test-1.11", > + [[ > + SELECT * FROM t1 WHERE str LIKE 'a_'; > + ]], { > + -- <like-test-1.11> > + 1, 'ab' > + -- <like-test-1.11> > + }) > + > +test:do_execsql_test( > + "like-test-1.12", > + [[ > + select column1, column2 from t2 where column2 like '__'; > + ]], { > + -- <like-test-1.12> > + 1, 'AB', 1, 'CD', 2, 'AB' > + -- <like-test-1.12> > + }) > + > +test:do_execsql_test( > + "like-test-1.13", > + [[ > + DROP TABLE t1; > + DROP TABLE t2; > + DROP TABLE t; > + ]], { > + -- <like-test-1.13> > + > + -- <like-test-1.13> > + }) > + > + > +test:finish_test() > -- > 2.7.4 > > [-- Attachment #2: Type: text/html, Size: 13630 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-06-28 12:47 [tarantool-patches] [PATCH] sql: LIKE & GLOB pattern comparison issue N.Tatunov 2018-06-28 12:54 ` [tarantool-patches] " Hollow111 @ 2018-07-18 2:43 ` Alexander Turenko 2018-07-18 5:51 ` Alex Khatskevich 2018-07-18 15:24 ` Nikita Tatunov 1 sibling, 2 replies; 22+ messages in thread From: Alexander Turenko @ 2018-07-18 2:43 UTC (permalink / raw) To: Nikita Tatunov, Alex Khatskevich; +Cc: tarantool-patches i Nikita! I don't have objections against this patch in general, just several minor comments and side notes. Alexey, as I see you are familiar with the code. Can you give a review for the patch? WBR, Alexander Turenko. On Thu, Jun 28, 2018 at 03:47:16PM +0300, N.Tatunov wrote: > Currently function that compares pattern and > string for GLOB & LIKE operators doesn't work properly. > It uses ICU reading function which perhaps was working > differently before and the implementation for the > comparison ending isn't paying attention to some > special cases, hence in those cases it works improperly. > Now the checks for comparison should work fine. > > Сloses: #3251 > Сloses: #3334 Use 'Closes #xxxx' form. Don't sure the form with a colon will work. Utf8Read macro comment now completely irrelevant to its definition. Compare to sqlite3: 584 /* 585 ** For LIKE and GLOB matching on EBCDIC machines, assume that every 586 ** character is exactly one byte in size. Also, provde the Utf8Read() 587 ** macro for fast reading of the next character in the common case where 588 ** the next character is ASCII. 589 */ 590 #if defined(SQLITE_EBCDIC) 591 # define sqlite3Utf8Read(A) (*((*A)++)) 592 # define Utf8Read(A) (*(A++)) 593 #else 594 # define Utf8Read(A) (A[0]<0x80?*(A++):sqlite3Utf8Read(&A)) 595 #endif > --- > src/box/sql/func.c | 25 ++++---- > test/sql-tap/like1.test.lua | 152 ++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 165 insertions(+), 12 deletions(-) > create mode 100755 test/sql-tap/like1.test.lua > > diff --git a/src/box/sql/func.c b/src/box/sql/func.c > index c06e3bd..dcbd7e0 100644 > --- a/src/box/sql/func.c > +++ b/src/box/sql/func.c > @@ -643,6 +643,7 @@ static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; > #define SQLITE_MATCH 0 > #define SQLITE_NOMATCH 1 > #define SQLITE_NOWILDCARDMATCH 2 > +#define SQL_NO_SYMBOLS_LEFT 65535 > 0xffff looks better. I would use name like ICU_ERROR. By the way, it can mean not only end of a string, but also other errors. It is okay to give 'not matched' in the case, I think. Hmm, we can misinterpret some error and give 'matched' result, I guess... maybe we really should check start < end in the macro to distinguish this cases. Don't sure about the test case. As I see from ICU source code it can be some internal buffer overflow error. We can do the check like so: -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status) +#define Utf8Read(s, e) (((s) < (e)) ? ucnv_getNextUChar(pUtf8conv, &(s), (e), \ + &(status)) : 0) -#define SQL_NO_SYMBOLS_LEFT 65535 +#define SQL_NO_SYMBOLS_LEFT 0 > /* > * Compare two UTF-8 strings for equality where the first string is > @@ -698,29 +699,28 @@ patternCompare(const char * pattern, /* The glob pattern */ > const char * string_end = string + strlen(string); > UErrorCode status = U_ZERO_ERROR; > > - while (pattern < pattern_end){ > - c = Utf8Read(pattern, pattern_end); > + while ((c = Utf8Read(pattern, pattern_end)) != SQL_NO_SYMBOLS_LEFT) { Here and everywhere below we lean on the fact that ucnv_getNextUChar compares pointers and that is so: 1860 s=*source; 1861 if(sourceLimit<s) { 1862 *err=U_ILLEGAL_ARGUMENT_ERROR; 1863 return 0xffff; 1864 } By the way, the patch reverts partially implemented approach to preliminary check start < end introduced in 198d59ce93. I think it is okay too, because some checks were missed and now they exist again. > diff --git a/test/sql-tap/like1.test.lua b/test/sql-tap/like1.test.lua > new file mode 100755 > index 0000000..42b4d43 > --- /dev/null > +++ b/test/sql-tap/like1.test.lua > @@ -0,0 +1,152 @@ > +#!/usr/bin/env tarantool > +test = require("sqltester") > +test:plan(13) > + Maybe case with % at the end? I tried, that works, but it would be good to have a regression test. ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-07-18 2:43 ` Alexander Turenko @ 2018-07-18 5:51 ` Alex Khatskevich 2018-07-18 15:24 ` Nikita Tatunov 1 sibling, 0 replies; 22+ messages in thread From: Alex Khatskevich @ 2018-07-18 5:51 UTC (permalink / raw) To: Alexander Turenko, Nikita Tatunov; +Cc: tarantool-patches Hi. > Alexey, as I see you are familiar with the code. Can you give a review > for the patch? Sure, I can review the patch after Alexander's comments are fixed. ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-07-18 2:43 ` Alexander Turenko 2018-07-18 5:51 ` Alex Khatskevich @ 2018-07-18 15:24 ` Nikita Tatunov 2018-07-18 15:53 ` Alex Khatskevich 1 sibling, 1 reply; 22+ messages in thread From: Nikita Tatunov @ 2018-07-18 15:24 UTC (permalink / raw) To: alexander.turenko; +Cc: avkhatskevich, tarantool-patches [-- Attachment #1: Type: text/plain, Size: 4668 bytes --] ср, 18 июл. 2018 г. в 5:43, Alexander Turenko < alexander.turenko@tarantool.org>: > i Nikita! > > I don't have objections against this patch in general, just several > minor comments and side notes. > > Alexey, as I see you are familiar with the code. Can you give a review > for the patch? > > WBR, Alexander Turenko. > > On Thu, Jun 28, 2018 at 03:47:16PM +0300, N.Tatunov wrote: > > Currently function that compares pattern and > > string for GLOB & LIKE operators doesn't work properly. > > It uses ICU reading function which perhaps was working > > differently before and the implementation for the > > comparison ending isn't paying attention to some > > special cases, hence in those cases it works improperly. > > Now the checks for comparison should work fine. > > > > Сloses: #3251 > > Сloses: #3334 > > Use 'Closes #xxxx' form. Don't sure the form with a colon will work. > > Fixed it. > Utf8Read macro comment now completely irrelevant to its definition. > Compare to sqlite3: > > 584 /* > 585 ** For LIKE and GLOB matching on EBCDIC machines, assume that every > 586 ** character is exactly one byte in size. Also, provde the Utf8Read() > 587 ** macro for fast reading of the next character in the common case > where > 588 ** the next character is ASCII. > 589 */ > 590 #if defined(SQLITE_EBCDIC) > 591 # define sqlite3Utf8Read(A) (*((*A)++)) > 592 # define Utf8Read(A) (*(A++)) > 593 #else > 594 # define Utf8Read(A) > (A[0]<0x80?*(A++):sqlite3Utf8Read(&A)) > 595 #endif > > Yeah, you're right. Fixed it. > > --- > > src/box/sql/func.c | 25 ++++---- > > test/sql-tap/like1.test.lua | 152 > ++++++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 165 insertions(+), 12 deletions(-) > > create mode 100755 test/sql-tap/like1.test.lua > > > > diff --git a/src/box/sql/func.c b/src/box/sql/func.c > > index c06e3bd..dcbd7e0 100644 > > --- a/src/box/sql/func.c > > +++ b/src/box/sql/func.c > > @@ -643,6 +643,7 @@ static const struct compareInfo likeInfoAlt = { '%', > '_', 0, 0 }; > > #define SQLITE_MATCH 0 > > #define SQLITE_NOMATCH 1 > > #define SQLITE_NOWILDCARDMATCH 2 > > +#define SQL_NO_SYMBOLS_LEFT 65535 > > > > 0xffff looks better. > Isn't actual with the fix. > > I would use name like ICU_ERROR. By the way, it can mean not only end of > a string, but also other errors. It is okay to give 'not matched' in the > case, I think. > > Hmm, we can misinterpret some error and give 'matched' result, I > guess... maybe we really should check start < end in the macro to > distinguish this cases. Don't sure about the test case. As I see from > ICU source code it can be some internal buffer overflow error. We can do > the check like so: > > -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status) > +#define Utf8Read(s, e) (((s) < (e)) ? ucnv_getNextUChar(pUtf8conv, &(s), > (e), \ > + &(status)) : 0) > > Yes, nice idea, I wouldn't guess it, thank you for advice. Therefore changed SQL_NO_SYMBOLS_LEFT to SQL_END_OF_STRING, I think it is more understandable. -#define SQL_NO_SYMBOLS_LEFT 65535 > +#define SQL_NO_SYMBOLS_LEFT 0 > /* > > * Compare two UTF-8 strings for equality where the first string is > > @@ -698,29 +699,28 @@ patternCompare(const char * pattern, /* The > glob pattern */ > > const char * string_end = string + strlen(string); > > UErrorCode status = U_ZERO_ERROR; > > > > - while (pattern < pattern_end){ > > - c = Utf8Read(pattern, pattern_end); > > + while ((c = Utf8Read(pattern, pattern_end)) != > SQL_NO_SYMBOLS_LEFT) { > > Here and everywhere below we lean on the fact that ucnv_getNextUChar > compares pointers and that is so: > > 1860 s=*source; > 1861 if(sourceLimit<s) { > 1862 *err=U_ILLEGAL_ARGUMENT_ERROR; > 1863 return 0xffff; > 1864 } > > By the way, the patch reverts partially implemented approach to > preliminary check start < end introduced in 198d59ce93. I think it is > okay too, because some checks were missed and now they exist again. > > > diff --git a/test/sql-tap/like1.test.lua b/test/sql-tap/like1.test.lua > > new file mode 100755 > > index 0000000..42b4d43 > > --- /dev/null > > +++ b/test/sql-tap/like1.test.lua > > @@ -0,0 +1,152 @@ > > +#!/usr/bin/env tarantool > > +test = require("sqltester") > > +test:plan(13) > > + > > Maybe case with % at the end? I tried, that works, but it would be good > to have a regression test. > Yeah, sure. [-- Attachment #2: Type: text/html, Size: 6679 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-07-18 15:24 ` Nikita Tatunov @ 2018-07-18 15:53 ` Alex Khatskevich 2018-07-18 15:57 ` Nikita Tatunov 0 siblings, 1 reply; 22+ messages in thread From: Alex Khatskevich @ 2018-07-18 15:53 UTC (permalink / raw) To: Nikita Tatunov, alexander.turenko; +Cc: tarantool-patches [-- Attachment #1: Type: text/plain, Size: 351 bytes --] Once you have fixed comments, please apply a new full diff at the end of email, to continue review process. Answer with a full diff to this email please (just paste as a plain text output of `git diff` command) On 18.07.2018 18:24, Nikita Tatunov wrote: > > Fixed it. > > Yeah, you're right. Fixed it. > Isn't actual with the fix. > > Yeah, sure. [-- Attachment #2: Type: text/html, Size: 992 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-07-18 15:53 ` Alex Khatskevich @ 2018-07-18 15:57 ` Nikita Tatunov 2018-07-18 17:10 ` Alexander Turenko 2018-07-19 11:56 ` Alex Khatskevich 0 siblings, 2 replies; 22+ messages in thread From: Nikita Tatunov @ 2018-07-18 15:57 UTC (permalink / raw) To: avkhatskevich; +Cc: alexander.turenko, tarantool-patches [-- Attachment #1: Type: text/plain, Size: 7712 bytes --] ср, 18 июл. 2018 г. в 18:53, Alex Khatskevich <avkhatskevich@tarantool.org>: > Once you have fixed comments, please apply a new full diff at the end of > email, to > continue review process. > Answer with a full diff to this email please (just paste as a plain text > output of `git diff` command) > > On 18.07.2018 18:24, Nikita Tatunov wrote: > > > Fixed it. > > Yeah, you're right. Fixed it. > > Isn't actual with the fix. > > Yeah, sure. > > > Sorry. Here it is: diff --git a/src/box/sql/func.c b/src/box/sql/func.c index c06e3bd..a7f7c17 100644 --- a/src/box/sql/func.c +++ b/src/box/sql/func.c @@ -617,13 +617,15 @@ struct compareInfo { u8 noCase; /* true to ignore case differences */ }; -/* - * For LIKE and GLOB matching on EBCDIC machines, assume that every - * character is exactly one byte in size. Also, provde the Utf8Read() - * macro for fast reading of the next character in the common case where - * the next character is ASCII. +/** + * Providing there are symbols in string s this macro returns + * UTF-8 code of character and pushes pointer to the next symbol + * in the string. Otherwise return code is SQL_END_OF_STRING. */ -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status) +#define Utf8Read(s, e) (s < e ?\ + ucnv_getNextUChar(pUtf8conv, &s, e, &status) : 0) + +#define SQL_END_OF_STRING 0 static const struct compareInfo globInfo = { '*', '?', '[', 0 }; @@ -698,29 +700,27 @@ patternCompare(const char * pattern, /* The glob pattern */ const char * string_end = string + strlen(string); UErrorCode status = U_ZERO_ERROR; - while (pattern < pattern_end){ - c = Utf8Read(pattern, pattern_end); + while ((c = Utf8Read(pattern, pattern_end))) { if (c == matchAll) { /* Match "*" */ /* Skip over multiple "*" characters in the pattern. If there * are also "?" characters, skip those as well, but consume a * single character of the input string for each "?" skipped */ - while (pattern < pattern_end){ - c = Utf8Read(pattern, pattern_end); + while ((c = Utf8Read(pattern, pattern_end))) { if (c != matchAll && c != matchOne) break; - if (c == matchOne - && Utf8Read(string, string_end) == 0) { + if (c == matchOne && + Utf8Read(string, string_end) == + SQL_END_OF_STRING) return SQLITE_NOWILDCARDMATCH; - } } /* "*" at the end of the pattern matches */ - if (pattern == pattern_end) + if (c == SQL_END_OF_STRING) return SQLITE_MATCH; if (c == matchOther) { if (pInfo->matchSet == 0) { c = Utf8Read(pattern, pattern_end); - if (c == 0) + if (c == SQL_END_OF_STRING) return SQLITE_NOWILDCARDMATCH; } else { /* "[...]" immediately follows the "*". We have to do a slow @@ -782,7 +782,7 @@ patternCompare(const char * pattern, /* The glob pattern */ if (c == matchOther) { if (pInfo->matchSet == 0) { c = Utf8Read(pattern, pattern_end); - if (c == 0) + if (c == SQL_END_OF_STRING) return SQLITE_NOMATCH; zEscaped = pattern; } else { @@ -802,7 +802,7 @@ patternCompare(const char * pattern, /* The glob pattern */ seen = 1; c2 = Utf8Read(pattern, pattern_end); } - while (c2 && c2 != ']') { + while (c2 != SQL_END_OF_STRING && c2 != ']') { if (c2 == '-' && pattern[0] != ']' && pattern < pattern_end && prior_c > 0) { @@ -839,7 +839,8 @@ patternCompare(const char * pattern, /* The glob pattern */ c == u_tolower(c2)) continue; } - if (c == matchOne && pattern != zEscaped && c2 != 0) + if (c == matchOne && pattern != zEscaped && + c2 != SQL_END_OF_STRING) continue; return SQLITE_NOMATCH; } diff --git a/test/sql-tap/like1.test.lua b/test/sql-tap/like1.test.lua new file mode 100755 index 0000000..807ee65 --- /dev/null +++ b/test/sql-tap/like1.test.lua @@ -0,0 +1,181 @@ +#!/usr/bin/env tarantool +test = require("sqltester") +test:plan(16) + +test:do_catchsql_test( + "like-test-1.1", + [[ + CREATE TABLE t2 (column1 INTEGER, + column2 VARCHAR(100), + column3 BLOB, + column4 FLOAT, + PRIMARY KEY (column1, column2)); + INSERT INTO t2 VALUES (1, 'AB', X'4142', 5.5); + INSERT INTO t2 VALUES (1, 'CD', X'2020', 1E4); + INSERT INTO t2 VALUES (2, 'AB', X'2020', 12.34567); + INSERT INTO t2 VALUES (-1000, '', X'', 0.0); + CREATE TABLE t1 (a INT PRIMARY KEY, str VARCHAR(100)); + INSERT INTO t1 VALUES (1, 'ab'); + INSERT INTO t1 VALUES (2, 'abCDF'); + INSERT INTO t1 VALUES (3, 'CDF'); + CREATE TABLE t (s1 char(2) primary key, s2 char(2)); + INSERT INTO t VALUES ('AB', 'AB'); + ]], { + -- <like-test-1.1> + 0 + -- <like-test-1.1> + }) + +test:do_execsql_test( + "like-test-1.2", + [[ + SELECT column1, column2, column1 * column4 FROM t2 WHERE column2 LIKE '_B'; + ]], { + -- <like-test-1.2> + 1, 'AB', 5.5, 2, 'AB', 24.69134 + -- <like-test-1.2> + }) + +test:do_execsql_test( + "like-test-1.3", + [[ + SELECT column1, column2 FROM t2 WHERE column2 LIKE '%B'; + ]], { + -- <like-test-1.3> + 1, 'AB', 2, 'AB' + -- <like-test-1.3> + }) + +test:do_execsql_test( + "like-test-1.4", + [[ + SELECT column1, column2 FROM t2 WHERE column2 LIKE 'A__'; + ]], { + -- <like-test-1.4> + + -- <like-test-1.4> + }) + +test:do_execsql_test( + "like-test-1.5", + [[ + SELECT column1, column2 FROM t2 WHERE column2 LIKE 'A_'; + ]], { + -- <like-test-1.5> + 1, 'AB', 2, 'AB' + -- <like-test-1.5> + }) + +test:do_execsql_test( + "like-test-1.6", + [[ + SELECT column1, column2 FROM t2 WHERE column2 LIKE 'A'; + ]], { + -- <like-test-1.6> + + -- <like-test-1.6> + }) + +test:do_execsql_test( + "like-test-1.7", + [[ + SELECT column1, column2 FROM t2 WHERE column2 LIKE '_'; + ]], { + -- <like-test-1.7> + + -- <like-test-1.7> + }) + +test:do_execsql_test( + "like-test-1.8", + [[ + SELECT * FROM t WHERE s1 LIKE '%A'; + ]], { + -- <like-test-1.8> + + -- <like-test-1.8> + }) + +test:do_execsql_test( + "like-test-1.9", + [[ + SELECT * FROM t WHERE s1 LIKE '%C'; + ]], { + -- <like-test-1.9> + + -- <like-test-1.9> + }) + +test:do_execsql_test( + "like-test-1.10", + [[ + SELECT * FROM t1 WHERE str LIKE '%df'; + ]], { + -- <like-test-1.10> + 2, 'abCDF', 3, 'CDF' + -- <like-test-1.10> + }) + +test:do_execsql_test( + "like-test-1.11", + [[ + SELECT * FROM t1 WHERE str LIKE 'a_'; + ]], { + -- <like-test-1.11> + 1, 'ab' + -- <like-test-1.11> + }) + +test:do_execsql_test( + "like-test-1.12", + [[ + SELECT column1, column2 FROM t2 WHERE column2 LIKE '__'; + ]], { + -- <like-test-1.12> + 1, 'AB', 1, 'CD', 2, 'AB' + -- <like-test-1.12> + }) + +test:do_execsql_test( + "like-test-1.13", + [[ + SELECT str FROM t1 WHERE str LIKE 'ab%'; + ]], { + -- <like-test-1.13> + 'ab', 'abCDF' + -- <like-test-1.13> + }) + +test:do_execsql_test( + "like-test-1.14", + [[ + SELECT str FROM t1 WHERE str LIKE 'abC%'; + ]], { + -- <like-test-1.14> + 'abCDF' + -- <like-test-1.14> + }) + +test:do_execsql_test( + "like-test-1.15", + [[ + SELECT str FROM t1 WHERE str LIKE 'a_%'; + ]], { + -- <like-test-1.15> + 'ab', 'abCDF' + -- <like-test-1.15> + }) + +test:do_execsql_test( + "like-test-1.16", + [[ + DROP TABLE t1; + DROP TABLE t2; + DROP TABLE t; + ]], { + -- <like-test-1.16> + + -- <like-test-1.16> + }) + +test:finish_test() [-- Attachment #2: Type: text/html, Size: 19060 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-07-18 15:57 ` Nikita Tatunov @ 2018-07-18 17:10 ` Alexander Turenko 2018-07-19 11:14 ` Nikita Tatunov 2018-07-19 11:56 ` Alex Khatskevich 1 sibling, 1 reply; 22+ messages in thread From: Alexander Turenko @ 2018-07-18 17:10 UTC (permalink / raw) To: Nikita Tatunov; +Cc: tarantool-patches, Alex Khatskevich Two minor comments are below. WBR, Alexander Turenko. On Wed, Jul 18, 2018 at 06:57:39PM +0300, Nikita Tatunov wrote: > ср, 18 июл. 2018 г. в 18:53, Alex Khatskevich > <[1]avkhatskevich@tarantool.org>: > > Once you have fixed comments, please apply a new full diff at the end > of email, to > continue review process. > Answer with a full diff to this email please (just paste as a plain > text output of `git diff` command) > On 18.07.2018 18:24, Nikita Tatunov wrote: > > -/* > - * For LIKE and GLOB matching on EBCDIC machines, assume that every > - * character is exactly one byte in size. Also, provde the Utf8Read() > - * macro for fast reading of the next character in the common case > where > - * the next character is ASCII. > +/** > + * Providing there are symbols in string s this macro returns > + * UTF-8 code of character and pushes pointer to the next symbol > + * in the string. Otherwise return code is SQL_END_OF_STRING. Nitpicking: pushes -> promotes? > */ > -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status) > +#define Utf8Read(s, e) (s < e ?\ > + ucnv_getNextUChar(pUtf8conv, &s, e, &status) : 0) > + > +#define SQL_END_OF_STRING 0 Always wrap argument usages of a function-like macro within its body with parenthesis (consider example I give in the previous email). It is usual way to handle the fact that a macros works as text replacement, so, say #define SUB(x, y) x - y will give the wrong result for the expression SUB(10, 5 - 3), while #define SUB(x, y) ((x) - (y)) will produce the correct result. ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-07-18 17:10 ` Alexander Turenko @ 2018-07-19 11:14 ` Nikita Tatunov 0 siblings, 0 replies; 22+ messages in thread From: Nikita Tatunov @ 2018-07-19 11:14 UTC (permalink / raw) To: alexander.turenko; +Cc: tarantool-patches, avkhatskevich [-- Attachment #1: Type: text/plain, Size: 2945 bytes --] ср, 18 июл. 2018 г. в 20:10, Alexander Turenko < alexander.turenko@tarantool.org>: > Two minor comments are below. > > WBR, Alexander Turenko. > > On Wed, Jul 18, 2018 at 06:57:39PM +0300, Nikita Tatunov wrote: > > ср, 18 июл. 2018 г. в 18:53, Alex Khatskevich > > <[1]avkhatskevich@tarantool.org>: > > > > Once you have fixed comments, please apply a new full diff at the end > > of email, to > > continue review process. > > Answer with a full diff to this email please (just paste as a plain > > text output of `git diff` command) > > On 18.07.2018 18:24, Nikita Tatunov wrote: > > > > -/* > > - * For LIKE and GLOB matching on EBCDIC machines, assume that every > > - * character is exactly one byte in size. Also, provde the > Utf8Read() > > - * macro for fast reading of the next character in the common case > > where > > - * the next character is ASCII. > > +/** > > + * Providing there are symbols in string s this macro returns > > + * UTF-8 code of character and pushes pointer to the next symbol > > + * in the string. Otherwise return code is SQL_END_OF_STRING. > > Nitpicking: pushes -> promotes? > > Both should be fine, I guess. But ok, changed it. > > */ > > -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, > &status) > > +#define Utf8Read(s, e) (s < e ?\ > > + ucnv_getNextUChar(pUtf8conv, &s, e, &status) : 0) > > + > > +#define SQL_END_OF_STRING 0 > > Always wrap argument usages of a function-like macro within its body > with parenthesis (consider example I give in the previous email). It is > usual way to handle the fact that a macros works as text replacement, > so, say > > #define SUB(x, y) x - y > > will give the wrong result for the expression SUB(10, 5 - 3), while > > #define SUB(x, y) ((x) - (y)) > > will produce the correct result. > Fixed. Also made few minor codestyle fixes in tests compared to prev. patch. diff --git a/src/box/sql/func.c b/src/box/sql/func.c index c06e3bd..a9784cc 100644 --- a/src/box/sql/func.c +++ b/src/box/sql/func.c @@ -617,13 +617,16 @@ struct compareInfo { u8 noCase; /* true to ignore case differences */ }; -/* - * For LIKE and GLOB matching on EBCDIC machines, assume that every - * character is exactly one byte in size. Also, provde the Utf8Read() - * macro for fast reading of the next character in the common case where - * the next character is ASCII. +/** + * Providing there are symbols in string s this + * macro returns UTF-8 code of character and + * promotes pointer to the next symbol in the string. + * Otherwise return code is SQL_END_OF_STRING. */ -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status) +#define Utf8Read(s, e) (((s) < (e)) ?\ + ucnv_getNextUChar(pUtf8conv, &s, e, &status) : 0) + +#define SQL_END_OF_STRING 0 [-- Attachment #2: Type: text/html, Size: 4328 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-07-18 15:57 ` Nikita Tatunov 2018-07-18 17:10 ` Alexander Turenko @ 2018-07-19 11:56 ` Alex Khatskevich 2018-07-27 11:28 ` Nikita Tatunov 1 sibling, 1 reply; 22+ messages in thread From: Alex Khatskevich @ 2018-07-19 11:56 UTC (permalink / raw) To: Nikita Tatunov; +Cc: alexander.turenko, tarantool-patches [-- Attachment #1: Type: text/plain, Size: 5235 bytes --] Hi. I have some questions related to the func.c file, however before that I would ask you to fix tests. General ideas: 1. Those tests are regresson tests (it just tests that problem will not appear in the future). We name those tests in the following manear: gh-XXXX-short-description.test.lua 2. The thing you test is not related to a table and other columns. Please, convert the tests to the next format: {[[select '' like '_B';]], {1}]]}. To make it more readable, you can do it like `like_testcases` in `sql-tap/collation.test.lua`. 3. There is two extra things that should be tested: 1. When string or pattern ends with incorrect unicode symbol (e.g. half of the whole unicode symbol) 2. String or pattern contains incorrect unicode symbol. Implementing 3 point may require some additional investigations. E.g. it is ok for a blob to have invalid Unicode symbols. > diff --git a/test/sql-tap/like1.test.lua b/test/sql-tap/like1.test.lua > new file mode 100755 > index 0000000..807ee65 > --- /dev/null > +++ b/test/sql-tap/like1.test.lua > @@ -0,0 +1,181 @@ > +#!/usr/bin/env tarantool > +test = require("sqltester") > +test:plan(16) > + > +test:do_catchsql_test( > +"like-test-1.1", > +[[ > +CREATE TABLE t2 (column1 INTEGER, > + column2 VARCHAR(100), > + column3 BLOB, > + column4 FLOAT, > + PRIMARY KEY (column1, column2)); > +INSERT INTO t2 VALUES (1, 'AB', X'4142', 5.5); > +INSERT INTO t2 VALUES (1, 'CD', X'2020', 1E4); > +INSERT INTO t2 VALUES (2, 'AB', X'2020', 12.34567); > +INSERT INTO t2 VALUES (-1000, '', X'', 0.0); > +CREATE TABLE t1 (a INT PRIMARY KEY, str VARCHAR(100)); > +INSERT INTO t1 VALUES (1, 'ab'); > +INSERT INTO t1 VALUES (2, 'abCDF'); > +INSERT INTO t1 VALUES (3, 'CDF'); > +CREATE TABLE t (s1 char(2) primary key, s2 char(2)); > +INSERT INTO t VALUES ('AB', 'AB'); > +]], { > +-- <like-test-1.1> > +0 > +-- <like-test-1.1> > +}) > + > +test:do_execsql_test( > +"like-test-1.2", > +[[ > +SELECT column1, column2, column1 * column4 FROM t2 WHERE column2 LIKE > '_B'; > +]], { > +-- <like-test-1.2> > +1, 'AB', 5.5, 2, 'AB', 24.69134 > +-- <like-test-1.2> > +}) > + > +test:do_execsql_test( > +"like-test-1.3", > +[[ > +SELECT column1, column2 FROM t2 WHERE column2 LIKE '%B'; > +]], { > + -- <like-test-1.3> > + 1, 'AB', 2, 'AB' > + -- <like-test-1.3> > +}) > + > +test:do_execsql_test( > +"like-test-1.4", > +[[ > +SELECT column1, column2 FROM t2 WHERE column2 LIKE 'A__'; > +]], { > + -- <like-test-1.4> > + > + -- <like-test-1.4> > +}) > + > +test:do_execsql_test( > +"like-test-1.5", > +[[ > +SELECT column1, column2 FROM t2 WHERE column2 LIKE 'A_'; > +]], { > + -- <like-test-1.5> > + 1, 'AB', 2, 'AB' > + -- <like-test-1.5> > +}) > + > +test:do_execsql_test( > +"like-test-1.6", > +[[ > +SELECT column1, column2 FROM t2 WHERE column2 LIKE 'A'; > +]], { > + -- <like-test-1.6> > + > + -- <like-test-1.6> > +}) > + > +test:do_execsql_test( > +"like-test-1.7", > +[[ > +SELECT column1, column2 FROM t2 WHERE column2 LIKE '_'; > +]], { > + -- <like-test-1.7> > + > + -- <like-test-1.7> > +}) > + > +test:do_execsql_test( > +"like-test-1.8", > +[[ > +SELECT * FROM t WHERE s1 LIKE '%A'; > +]], { > + -- <like-test-1.8> > + > + -- <like-test-1.8> > +}) > + > +test:do_execsql_test( > +"like-test-1.9", > +[[ > +SELECT * FROM t WHERE s1 LIKE '%C'; > +]], { > + -- <like-test-1.9> > + > + -- <like-test-1.9> > +}) > + > +test:do_execsql_test( > +"like-test-1.10", > +[[ > +SELECT * FROM t1 WHERE str LIKE '%df'; > +]], { > + -- <like-test-1.10> > + 2, 'abCDF', 3, 'CDF' > + -- <like-test-1.10> > +}) > + > +test:do_execsql_test( > +"like-test-1.11", > +[[ > +SELECT * FROM t1 WHERE str LIKE 'a_'; > +]], { > + -- <like-test-1.11> > + 1, 'ab' > + -- <like-test-1.11> > +}) > + > +test:do_execsql_test( > +"like-test-1.12", > +[[ > +SELECT column1, column2 FROM t2 WHERE column2 LIKE '__'; > +]], { > + -- <like-test-1.12> > + 1, 'AB', 1, 'CD', 2, 'AB' > + -- <like-test-1.12> > +}) > + > +test:do_execsql_test( > +"like-test-1.13", > +[[ > +SELECT str FROM t1 WHERE str LIKE 'ab%'; > +]], { > +-- <like-test-1.13> > +'ab', 'abCDF' > +-- <like-test-1.13> > +}) > + > +test:do_execsql_test( > +"like-test-1.14", > +[[ > +SELECT str FROM t1 WHERE str LIKE 'abC%'; > +]], { > +-- <like-test-1.14> > +'abCDF' > +-- <like-test-1.14> > +}) > + > +test:do_execsql_test( > +"like-test-1.15", > +[[ > +SELECT str FROM t1 WHERE str LIKE 'a_%'; > +]], { > +-- <like-test-1.15> > +'ab', 'abCDF' > +-- <like-test-1.15> > +}) > + > +test:do_execsql_test( > +"like-test-1.16", > +[[ > +DROP TABLE t1; > +DROP TABLE t2; > +DROP TABLE t; > +]], { > + -- <like-test-1.16> > + > + -- <like-test-1.16> > +}) > + > +test:finish_test() [-- Attachment #2: Type: text/html, Size: 14752 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-07-19 11:56 ` Alex Khatskevich @ 2018-07-27 11:28 ` Nikita Tatunov 2018-07-27 13:06 ` Alexander Turenko 0 siblings, 1 reply; 22+ messages in thread From: Nikita Tatunov @ 2018-07-27 11:28 UTC (permalink / raw) To: avkhatskevich; +Cc: Alexander Turenko, tarantool-patches [-- Attachment #1: Type: text/plain, Size: 20263 bytes --] Hello, Alex! Thanks for the review! чт, 19 июл. 2018 г. в 14:56, Alex Khatskevich <avkhatskevich@tarantool.org>: > Hi. > > I have some questions related to the func.c file, however before that I > would ask you to fix tests. > > General ideas: > 1. Those tests are regresson tests (it just tests that problem will not > appear in the future). > We name those tests in the following manear: > gh-XXXX-short-description.test.lua > Done. > 2. The thing you test is not related to a table and other columns. > Please, convert the tests to the next format: {[[select '' like > '_B';]], {1}]]}. > To make it more readable, you can do it like `like_testcases` in > `sql-tap/collation.test.lua`. > Done. > 3. There is two extra things that should be tested: > 1. When string or pattern ends with incorrect unicode symbol (e.g. > half of the whole unicode symbol) > 2. String or pattern contains incorrect unicode symbol. > > Refactored test with this idea taken into account. Now comparing function is only supposed to work with TEXT types, which led to the part of #3572 propositions. Also added error output in case pattern contains invalid symbol & now if string contains invalid symbol it can't be matched with whatever pattern. Some minor fixes to patternCompare function as well. Here's diff: diff --git a/src/box/sql/func.c b/src/box/sql/func.c index c06e3bd..5b53076 100644 --- a/src/box/sql/func.c +++ b/src/box/sql/func.c @@ -617,13 +617,17 @@ struct compareInfo { u8 noCase; /* true to ignore case differences */ }; -/* - * For LIKE and GLOB matching on EBCDIC machines, assume that every - * character is exactly one byte in size. Also, provde the Utf8Read() - * macro for fast reading of the next character in the common case where - * the next character is ASCII. +/** + * Providing there are symbols in string s this + * macro returns UTF-8 code of character and + * promotes pointer to the next symbol in the string. + * Otherwise return code is SQL_END_OF_STRING. */ -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status) +#define Utf8Read(s, e) (((s) < (e)) ?\ + ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status)) : 0) + +#define SQL_END_OF_STRING 0 +#define SQL_INVALID_UTF8_SYMBOL 0xfffd static const struct compareInfo globInfo = { '*', '?', '[', 0 }; @@ -643,51 +647,61 @@ static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; #define SQLITE_MATCH 0 #define SQLITE_NOMATCH 1 #define SQLITE_NOWILDCARDMATCH 2 +#define SQL_PROHIBITED_PATTERN 3 -/* - * Compare two UTF-8 strings for equality where the first string is - * a GLOB or LIKE expression. Return values: - * - * SQLITE_MATCH: Match - * SQLITE_NOMATCH: No match - * SQLITE_NOWILDCARDMATCH: No match in spite of having * or % wildcards. +/** + * Compare two UTF-8 strings for equality where the first string + * is a GLOB or LIKE expression. * * Globbing rules: * - * '*' Matches any sequence of zero or more characters. + * '*' Matches any sequence of zero or more characters. * - * '?' Matches exactly one character. + * '?' Matches exactly one character. * - * [...] Matches one character from the enclosed list of - * characters. + * [...] Matches one character from the enclosed list of + * characters. * - * [^...] Matches one character not in the enclosed list. + * [^...] Matches one character not in the enclosed list. * - * With the [...] and [^...] matching, a ']' character can be included - * in the list by making it the first character after '[' or '^'. A - * range of characters can be specified using '-'. Example: - * "[a-z]" matches any single lower-case letter. To match a '-', make - * it the last character in the list. + * With the [...] and [^...] matching, a ']' character can be + * included in the list by making it the first character after + * '[' or '^'. A range of characters can be specified using '-'. + * Example: "[a-z]" matches any single lower-case letter. + * To match a '-', make it the last character in the list. * * Like matching rules: * - * '%' Matches any sequence of zero or more characters + * '%' Matches any sequence of zero or more characters. * - ** '_' Matches any one character + ** '_' Matches any one character. * * Ec Where E is the "esc" character and c is any other - * character, including '%', '_', and esc, match exactly c. + * character, including '%', '_', and esc, match + * exactly c. * * The comments within this routine usually assume glob matching. * - * This routine is usually quick, but can be N**2 in the worst case. + * This routine is usually quick, but can be N**2 in the worst + * case. + * + * @param pattern String containing comparison pattern. + * @param string String being compared. + * @param compareInfo Information about how to compare. + * @param matchOther The escape char (LIKE) or '[' (GLOB). + * + * @retval SQLITE_MATCH: Match. + * SQLITE_NOMATCH: No match. + * SQLITE_NOWILDCARDMATCH: No match in spite of having * + * or % wildcards. + * SQL_PROHIBITED_PATTERN Pattern contains invalid + * symbol. */ static int -patternCompare(const char * pattern, /* The glob pattern */ - const char * string, /* The string to compare against the glob */ - const struct compareInfo *pInfo, /* Information about how to do the compare */ - UChar32 matchOther /* The escape char (LIKE) or '[' (GLOB) */ - ) +sql_utf8_pattern_compare(const char * pattern, + const char * string, + const struct compareInfo *pInfo, + UChar32 matchOther) { UChar32 c, c2; /* Next pattern and input string chars */ UChar32 matchOne = pInfo->matchOne; /* "?" or "_" */ @@ -698,29 +712,41 @@ patternCompare(const char * pattern, /* The glob pattern */ const char * string_end = string + strlen(string); UErrorCode status = U_ZERO_ERROR; - while (pattern < pattern_end){ - c = Utf8Read(pattern, pattern_end); + while ((c = Utf8Read(pattern, pattern_end)) != SQL_END_OF_STRING) { + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; if (c == matchAll) { /* Match "*" */ /* Skip over multiple "*" characters in the pattern. If there * are also "?" characters, skip those as well, but consume a * single character of the input string for each "?" skipped */ - while (pattern < pattern_end){ - c = Utf8Read(pattern, pattern_end); + while ((c = Utf8Read(pattern, pattern_end)) != + SQL_END_OF_STRING) { + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; if (c != matchAll && c != matchOne) break; - if (c == matchOne - && Utf8Read(string, string_end) == 0) { + if (c == matchOne && + (c2 = Utf8Read(string, string_end)) == + SQL_END_OF_STRING) return SQLITE_NOWILDCARDMATCH; - } + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; } /* "*" at the end of the pattern matches */ - if (pattern == pattern_end) + if (c == SQL_END_OF_STRING) { + while ((c2 = Utf8Read(string, string_end)) != + SQL_END_OF_STRING) + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; return SQLITE_MATCH; + } if (c == matchOther) { if (pInfo->matchSet == 0) { c = Utf8Read(pattern, pattern_end); - if (c == 0) + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; + if (c == SQL_END_OF_STRING) return SQLITE_NOWILDCARDMATCH; } else { /* "[...]" immediately follows the "*". We have to do a slow @@ -729,13 +755,16 @@ patternCompare(const char * pattern, /* The glob pattern */ assert(matchOther < 0x80); /* '[' is a single-byte character */ while (string < string_end) { int bMatch = - patternCompare(&pattern[-1], - string, - pInfo, - matchOther); + sql_utf8_pattern_compare( + &pattern[-1], + string, + pInfo, + matchOther); if (bMatch != SQLITE_NOMATCH) return bMatch; - Utf8Read(string, string_end); + c = Utf8Read(string, string_end); + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; } return SQLITE_NOWILDCARDMATCH; } @@ -764,6 +793,8 @@ patternCompare(const char * pattern, /* The glob pattern */ * languages. */ c2 = Utf8Read(string, string_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; if (!noCase) { if (c2 != c) continue; @@ -771,9 +802,10 @@ patternCompare(const char * pattern, /* The glob pattern */ if (c2 != c && u_tolower(c2) != c) continue; } - bMatch = - patternCompare(pattern, string, - pInfo, matchOther); + bMatch = sql_utf8_pattern_compare(pattern, + string, + pInfo, + matchOther); if (bMatch != SQLITE_NOMATCH) return bMatch; } @@ -782,7 +814,9 @@ patternCompare(const char * pattern, /* The glob pattern */ if (c == matchOther) { if (pInfo->matchSet == 0) { c = Utf8Read(pattern, pattern_end); - if (c == 0) + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; + if (c == SQL_END_OF_STRING) return SQLITE_NOMATCH; zEscaped = pattern; } else { @@ -790,23 +824,33 @@ patternCompare(const char * pattern, /* The glob pattern */ int seen = 0; int invert = 0; c = Utf8Read(string, string_end); + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; if (string == string_end) return SQLITE_NOMATCH; c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; if (c2 == '^') { invert = 1; c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; } if (c2 == ']') { if (c == ']') seen = 1; c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; } - while (c2 && c2 != ']') { + while (c2 != SQL_END_OF_STRING && c2 != ']') { if (c2 == '-' && pattern[0] != ']' && pattern < pattern_end && prior_c > 0) { c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; if (c >= prior_c && c <= c2) seen = 1; prior_c = 0; @@ -817,14 +861,19 @@ patternCompare(const char * pattern, /* The glob pattern */ prior_c = c2; } c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; } - if (pattern == pattern_end || (seen ^ invert) == 0) { + if (pattern == pattern_end || + (seen ^ invert) == 0) { return SQLITE_NOMATCH; } continue; } } c2 = Utf8Read(string, string_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; if (c == c2) continue; if (noCase){ @@ -839,7 +888,8 @@ patternCompare(const char * pattern, /* The glob pattern */ c == u_tolower(c2)) continue; } - if (c == matchOne && pattern != zEscaped && c2 != 0) + if (c == matchOne && pattern != zEscaped && + c2 != SQL_END_OF_STRING) continue; return SQLITE_NOMATCH; } @@ -853,8 +903,7 @@ patternCompare(const char * pattern, /* The glob pattern */ int sqlite3_strglob(const char *zGlobPattern, const char *zString) { - return patternCompare(zGlobPattern, zString, &globInfo, - '['); + return sql_utf8_pattern_compare(zGlobPattern, zString, &globInfo, '['); } /* @@ -864,7 +913,7 @@ sqlite3_strglob(const char *zGlobPattern, const char *zString) int sqlite3_strlike(const char *zPattern, const char *zStr, unsigned int esc) { - return patternCompare(zPattern, zStr, &likeInfoNorm, esc); + return sql_utf8_pattern_compare(zPattern, zStr, &likeInfoNorm, esc); } /* @@ -910,8 +959,9 @@ likeFunc(sqlite3_context * context, int argc, sqlite3_value ** argv) zB = (const char *) sqlite3_value_text(argv[0]); zA = (const char *) sqlite3_value_text(argv[1]); - /* Limit the length of the LIKE or GLOB pattern to avoid problems - * of deep recursion and N*N behavior in patternCompare(). + /* Limit the length of the LIKE or GLOB pattern to avoid + * problems of deep recursion and N*N behavior in + * sql_utf8_pattern_compare(). */ nPat = sqlite3_value_bytes(argv[0]); testcase(nPat == db->aLimit[SQLITE_LIMIT_LIKE_PATTERN_LENGTH]); @@ -947,7 +997,12 @@ likeFunc(sqlite3_context * context, int argc, sqlite3_value ** argv) sqlite3_like_count++; #endif int res; - res = patternCompare(zB, zA, pInfo, escape); + res = sql_utf8_pattern_compare(zB, zA, pInfo, escape); + if (res == SQL_PROHIBITED_PATTERN) { + sqlite3_result_error(context, "LIKE or GLOB pattern can only" + " contain UTF-8 characters", -1); + return; + } sqlite3_result_int(context, res == SQLITE_MATCH); } diff --git a/test/sql-tap/e_expr.test.lua b/test/sql-tap/e_expr.test.lua index 13d3a96..051210a 100755 --- a/test/sql-tap/e_expr.test.lua +++ b/test/sql-tap/e_expr.test.lua @@ -1,6 +1,6 @@ #!/usr/bin/env tarantool test = require("sqltester") -test:plan(12431) +test:plan(10665) --!./tcltestrunner.lua -- 2010 July 16 @@ -77,8 +77,10 @@ local operations = { {"<>", "ne1"}, {"!=", "ne2"}, {"IS", "is"}, - {"LIKE", "like"}, - {"GLOB", "glob"}, +-- NOTE: This test needs refactoring after deletion of GLOB & +-- type restrictions for LIKE. +-- {"LIKE", "like"}, +-- {"GLOB", "glob"}, {"AND", "and"}, {"OR", "or"}, {"MATCH", "match"}, @@ -96,7 +98,7 @@ operations = { {"+", "-"}, {"<<", ">>", "&", "|"}, {"<", "<=", ">", ">="}, - {"=", "==", "!=", "<>", "LIKE", "GLOB"}, --"MATCH", "REGEXP"}, + {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, "MATCH", "REGEXP"}, {"AND"}, {"OR"}, } @@ -475,6 +477,7 @@ for _, op in ipairs(oplist) do end end end + --------------------------------------------------------------------------- -- Test the IS and IS NOT operators. -- diff --git a/test/sql-tap/gh-3251-string-pattern-comparison.lua b/test/sql-tap/gh-3251-string-pattern-comparison.lua new file mode 100755 index 0000000..0202efc --- /dev/null +++ b/test/sql-tap/gh-3251-string-pattern-comparison.lua @@ -0,0 +1,238 @@ +#!/usr/bin/env tarantool +test = require("sqltester") +test:plan(106) + +local prefix = "like-test-" + +-- Unciode byte sequences. + +local valid_testcases = { + '\x01', + '\x09', + '\x1F', + '\x7F', + '\xC2\x80', + '\xC2\x90', + '\xC2\x9F', + '\xE2\x80\xA8', + '\x20\x0B', + '\xE2\x80\xA9', +} + +-- Non-Unicode byte sequences. +local invalid_testcases = { + '\xE2\x80', + '\xFE\xFF', + '\xC2', + '\xED\xB0\x80', + '\xD0', +} + +local like_test_cases = +{ + {"1.1", + [[ + CREATE TABLE t2 (column1 INTEGER, + column2 VARCHAR(100), + column3 BLOB, + column4 FLOAT, + PRIMARY KEY (column1, column2)); + INSERT INTO t2 VALUES (1, 'AB', X'4142', 5.5); + INSERT INTO t2 VALUES (1, 'CD', X'2020', 1E4); + INSERT INTO t2 VALUES (2, 'AB', X'2020', 12.34567); + INSERT INTO t2 VALUES (-1000, '', X'', 0.0); + CREATE TABLE t1 (a INT PRIMARY KEY, str VARCHAR(100)); + INSERT INTO t1 VALUES (1, 'ab'); + INSERT INTO t1 VALUES (2, 'abCDF'); + INSERT INTO t1 VALUES (3, 'CDF'); + CREATE TABLE t (s1 CHAR(2) PRIMARY KEY, s2 CHAR(2)); + INSERT INTO t VALUES ('AB', 'AB'); + ]], {0}}, + {"1.2", + [[ + SELECT column1, column2, column1 * column4 FROM + t2 WHERE column2 LIKE '_B'; + ]], + {0, {1, 'AB', 5.5, 2, 'AB', 24.69134}} }, + {"1.3", + "SELECT column1, column2 FROM t2 WHERE column2 LIKE '%B';", + {0, {1, 'AB', 2, 'AB'}} }, + {"1.4", + "SELECT column1, column2 FROM t2 WHERE column2 LIKE 'A__';", + {0, {}} }, + {"1.5", + "SELECT column1, column2 FROM t2 WHERE column2 LIKE 'A_';", + {0, {1, 'AB', 2, 'AB'}} }, + {"1.6", + "SELECT column1, column2 FROM t2 WHERE column2 LIKE 'A';", + {0, {}} }, + {"1.7", + "SELECT column1, column2 FROM t2 WHERE column2 LIKE '_';", + {0, {}} }, + {"1.8", + "SELECT * FROM t WHERE s1 LIKE '%A';", + {0, {}} }, + {"1.9", + "SELECT * FROM t WHERE s1 LIKE '%C';", + {0, {}} }, + {"1.10", + "SELECT * FROM t1 WHERE str LIKE '%df';", + {0, {2, 'abCDF', 3, 'CDF'}} }, + {"1.11", + "SELECT * FROM t1 WHERE str LIKE 'a_';", + {0, {1, 'ab'}} }, + {"1.12", + "SELECT column1, column2 FROM t2 WHERE column2 LIKE '__';", + {0, {1, 'AB', 1, 'CD', 2, 'AB'}} }, + {"1.13", + "SELECT str FROM t1 WHERE str LIKE 'ab%';", + {0, {'ab', 'abCDF'}} }, + {"1.14", + "SELECT str FROM t1 WHERE str LIKE 'abC%';", + {0, {'abCDF'}} }, + {"1.15", + "SELECT str FROM t1 WHERE str LIKE 'a_%';", + {0, {'ab', 'abCDF'}} }, + {"1.16", + [[ + DROP TABLE t1; + DROP TABLE t2; + DROP TABLE t; + ]], {0}}, +} + +test:do_catchsql_set_test(like_test_cases, prefix) + +-- Invalid testcases. + +for i, tested_string in ipairs(invalid_testcases) do + local test_name = prefix .. "2." .. tostring(i) + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" + +-- We should raise an error if pattern contains invalid characters. + + test:do_catchsql_test( + test_name, + test_itself, { + -- <test_name> + 1, "LIKE or GLOB pattern can only contain UTF-8 characters" + -- <test_name> + }) + + test_name = prefix .. "3." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" + test:do_catchsql_test( + test_name, + test_itself, { + -- <test_name> + 1, "LIKE or GLOB pattern can only contain UTF-8 characters" + -- <test_name> + }) + + test_name = prefix .. "4." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" + test:do_catchsql_test( + test_name, + test_itself, { + -- <test_name> + 1, "LIKE or GLOB pattern can only contain UTF-8 characters" + -- <test_name> + }) + +-- Just skipping if row value predicand contains invalid character. + + test_name = prefix .. "5." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + + test_name = prefix .. "6." .. tostring(i) + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + + test_name = prefix .. "7." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) +end + +-- Valid testcases. + +for i, tested_string in ipairs(valid_testcases) do + test_name = prefix .. "8." .. tostring(i) + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + + test_name = prefix .. "9." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + + test_name = prefix .. "10." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + test_name = prefix .. "11." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + + test_name = prefix .. "12." .. tostring(i) + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + + test_name = prefix .. "13." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) +end + +test:finish_test() [-- Attachment #2: Type: text/html, Size: 46147 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-07-27 11:28 ` Nikita Tatunov @ 2018-07-27 13:06 ` Alexander Turenko 2018-07-27 19:11 ` Nikita Tatunov 0 siblings, 1 reply; 22+ messages in thread From: Alexander Turenko @ 2018-07-27 13:06 UTC (permalink / raw) To: Nikita Tatunov; +Cc: tarantool-patches Hi, See comments below. Don't sure we should return 'no match' situation when a left hand expression in not correct unicode string. How other DBs handle that? WBR, Alexander Turenko. > sql: LIKE & GLOB pattern comparison issue > > Currently function that compares pattern and string for GLOB & LIKE > operators doesn't work properly. It uses ICU reading function which > perhaps was working differently before and the implementation for the > comparison ending isn't paying attention to some special cases, hence > in those cases it works improperly. > It used non-ICU function before, consider commit 198d59ce. > With the patch applied an error will be returned in case there's > invalid UTF-8 symbol in pattern & pattern will not be matched > with the string that contains it. Some minor corrections to function > were made as well. > Nitpicking: 'error will be returned' is third situatuon, other then 'matched' and 'not matched'. > Сloses #3251 > Сloses #3334 > Part of #3572 > > 2. The thing you test is not related to a table and other columns. > > Please, convert the tests to the next format: {[[select '' like > > '_B';]], {1}]]}. > > To make it more readable, you can do it like `like_testcases` in > > `sql-tap/collation.test.lua`. > > > > Done. You are still perform comparisons with table columns. > -/* > - * For LIKE and GLOB matching on EBCDIC machines, assume that every > - * character is exactly one byte in size. Also, provde the Utf8Read() > - * macro for fast reading of the next character in the common case where > - * the next character is ASCII. > +/** > + * Providing there are symbols in string s this > + * macro returns UTF-8 code of character and > + * promotes pointer to the next symbol in the string. > + * Otherwise return code is SQL_END_OF_STRING. > */ > -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status) > +#define Utf8Read(s, e) (((s) < (e)) ?\ > + ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status)) : 0) I suggest to add a whitespace before backslash to make it looks better. > @@ -643,51 +647,61 @@ static const struct compareInfo likeInfoAlt = { '%', > '_', 0, 0 }; > #define SQLITE_MATCH 0 > #define SQLITE_NOMATCH 1 > #define SQLITE_NOWILDCARDMATCH 2 > +#define SQL_PROHIBITED_PATTERN 3 Update comment before that has a reference to 'patternMatch' function. > * > * Globbing rules: > * > - * '*' Matches any sequence of zero or more characters. > + * '*' Matches any sequence of zero or more characters. Why? > * > - * '?' Matches exactly one character. > + * '?' Matches exactly one character. > * > - * [...] Matches one character from the enclosed list of > - * characters. > + * [...] Matches one character from the enclosed list of > + * characters. > * > - * [^...] Matches one character not in the enclosed list. > + * [^...] Matches one character not in the enclosed list. > * The same question. > + * @retval SQLITE_MATCH: Match. > + * SQLITE_NOMATCH: No match. > + * SQLITE_NOWILDCARDMATCH: No match in spite of having * > + * or % wildcards. > + * SQL_PROHIBITED_PATTERN Pattern contains invalid > + * symbol. Nitpicking: no colon after SQL_PROHIBITED_PATTERN. > */ > static int > -patternCompare(const char * pattern, /* The glob pattern */ > - const char * string, /* The string to compare against the glob */ > - const struct compareInfo *pInfo, /* Information about how to do > the compare */ > - UChar32 matchOther /* The escape char (LIKE) or '[' (GLOB) */ > - ) > +sql_utf8_pattern_compare(const char * pattern, > + const char * string, > + const struct compareInfo *pInfo, > + UChar32 matchOther) Don't get why indent is tabulation + 7 spaces. The function itself looks good except lines longer then 80 columns. > @@ -853,8 +903,7 @@ patternCompare(const char * pattern, /* The glob > pattern */ > int > sqlite3_strglob(const char *zGlobPattern, const char *zString) > { > - return patternCompare(zGlobPattern, zString, &globInfo, > - '['); > + return sql_utf8_pattern_compare(zGlobPattern, zString, &globInfo, '['); > } > > /* > @@ -864,7 +913,7 @@ sqlite3_strglob(const char *zGlobPattern, const char > *zString) > int > sqlite3_strlike(const char *zPattern, const char *zStr, unsigned int esc) > { > - return patternCompare(zPattern, zStr, &likeInfoNorm, esc); > + return sql_utf8_pattern_compare(zPattern, zStr, &likeInfoNorm, esc); > } > Should we add sqlite3_result_error is case of SQL_PROHIBITED_PATTERN for these two functions? > - {"LIKE", "like"}, > - {"GLOB", "glob"}, > +-- NOTE: This test needs refactoring after deletion of GLOB & > +-- type restrictions for LIKE. > +-- {"LIKE", "like"}, > +-- {"GLOB", "glob"}, I would add related issue number. > {"AND", "and"}, > {"OR", "or"}, > {"MATCH", "match"}, > @@ -96,7 +98,7 @@ operations = { > {"+", "-"}, > {"<<", ">>", "&", "|"}, > {"<", "<=", ">", ">="}, > - {"=", "==", "!=", "<>", "LIKE", "GLOB"}, --"MATCH", "REGEXP"}, > + {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, "MATCH", "REGEXP"}, Leave comment before MATCH and REGEXP to avoid surprises after removing your comment. > --- /dev/null > +++ b/test/sql-tap/gh-3251-string-pattern-comparison.lua > @@ -0,0 +1,238 @@ > +#!/usr/bin/env tarantool > +test = require("sqltester") > +test:plan(106) > + > +local prefix = "like-test-" > + > +-- Unciode byte sequences. > + 1. Typo: unicode. 2. Extra empty line. > +-- Invalid testcases. > + > +for i, tested_string in ipairs(invalid_testcases) do > + local test_name = prefix .. "2." .. tostring(i) > + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" > + > +-- We should raise an error if pattern contains invalid characters. > + 1. Trailing whitespaces. 2. Why comments are not indented as the other code? 3. test_name and test_itself before the section comment looks strange for me here. ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-07-27 13:06 ` Alexander Turenko @ 2018-07-27 19:11 ` Nikita Tatunov 2018-07-27 20:22 ` Alexander Turenko 0 siblings, 1 reply; 22+ messages in thread From: Nikita Tatunov @ 2018-07-27 19:11 UTC (permalink / raw) To: Alexander Turenko; +Cc: tarantool-patches [-- Attachment #1: Type: text/plain, Size: 17465 bytes --] Hello! пт, 27 июл. 2018 г. в 16:06, Alexander Turenko < alexander.turenko@tarantool.org>: > Hi, > > See comments below. > > Don't sure we should return 'no match' situation when a left hand > expression in not correct unicode string. How other DBs handle that? > > The main point is that since pattern is always valid (otherwise we return an error) it cannot be matched with a string that contains an invalid utf-8 character. - PostgreSQL doesn't allow storing invalid characters in text types and even explicit E'\xD1' anywhere results an error. - MySQL Doesn't allow using invalid characters in text types. May be it's reasons for thinking about prohibiting invalid characters in text? WBR, Alexander Turenko. > > > sql: LIKE & GLOB pattern comparison issue > > > > Currently function that compares pattern and string for GLOB & LIKE > > operators doesn't work properly. It uses ICU reading function which > > perhaps was working differently before and the implementation for the > > comparison ending isn't paying attention to some special cases, hence > > in those cases it works improperly. > > > > It used non-ICU function before, consider commit 198d59ce. > > Thank you, I've got it now. > > With the patch applied an error will be returned in case there's > > invalid UTF-8 symbol in pattern & pattern will not be matched > > with the string that contains it. Some minor corrections to function > > were made as well. > > > > Nitpicking: 'error will be returned' is third situatuon, other then > 'matched' and 'not matched'. > > I guess commit message is about changes. Do I really need to mention valid cases that didn't change? > > Сloses #3251 > > Сloses #3334 > > Part of #3572 > > > > 2. The thing you test is not related to a table and other columns. > > > Please, convert the tests to the next format: {[[select '' like > > > '_B';]], {1}]]}. > > > To make it more readable, you can do it like `like_testcases` in > > > `sql-tap/collation.test.lua`. > > > > > > > Done. > > You are still perform comparisons with table columns. > Thank you, fixed it. More tests now tho. > > > -/* > > - * For LIKE and GLOB matching on EBCDIC machines, assume that every > > - * character is exactly one byte in size. Also, provde the Utf8Read() > > - * macro for fast reading of the next character in the common case where > > - * the next character is ASCII. > > +/** > > + * Providing there are symbols in string s this > > + * macro returns UTF-8 code of character and > > + * promotes pointer to the next symbol in the string. > > + * Otherwise return code is SQL_END_OF_STRING. > > */ > > -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status) > > +#define Utf8Read(s, e) (((s) < (e)) ?\ > > + ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status)) : 0) > > I suggest to add a whitespace before backslash to make it looks better. > > Guess so. > > @@ -643,51 +647,61 @@ static const struct compareInfo likeInfoAlt = { > '%', > > '_', 0, 0 }; > > #define SQLITE_MATCH 0 > > #define SQLITE_NOMATCH 1 > > #define SQLITE_NOWILDCARDMATCH 2 > > +#define SQL_PROHIBITED_PATTERN 3 > > Update comment before that has a reference to 'patternMatch' function. > > Didn't notice it, thnx. > > * > > * Globbing rules: > > * > > - * '*' Matches any sequence of zero or more characters. > > + * '*' Matches any sequence of zero or more characters. > > Why? > > > * > > - * '?' Matches exactly one character. > > + * '?' Matches exactly one character. > > * > > - * [...] Matches one character from the enclosed list of > > - * characters. > > + * [...] Matches one character from the enclosed list of > > + * characters. > > * > > - * [^...] Matches one character not in the enclosed list. > > + * [^...] Matches one character not in the enclosed list. > > * > > The same question. > > It was looking like the indentation for similar table below was different. I was wrong. > > + * @retval SQLITE_MATCH: Match. > > + * SQLITE_NOMATCH: No match. > > + * SQLITE_NOWILDCARDMATCH: No match in spite of having * > > + * or % wildcards. > > + * SQL_PROHIBITED_PATTERN Pattern contains invalid > > + * symbol. > > Nitpicking: no colon after SQL_PROHIBITED_PATTERN. > Fixed. > > > */ > > static int > > -patternCompare(const char * pattern, /* The glob pattern */ > > - const char * string, /* The string to compare against the glob > */ > > - const struct compareInfo *pInfo, /* Information about how to do > > the compare */ > > - UChar32 matchOther /* The escape char (LIKE) or '[' (GLOB) */ > > - ) > > +sql_utf8_pattern_compare(const char * pattern, > > + const char * string, > > + const struct compareInfo *pInfo, > > + UChar32 matchOther) > > Don't get why indent is tabulation + 7 spaces. > > Me neither :thinking:. Thank you! > The function itself looks good except lines longer then 80 columns. > As i said before I don't think breaking these lines will increase readability. > > @@ -853,8 +903,7 @@ patternCompare(const char * pattern, /* The glob > > pattern */ > > int > > sqlite3_strglob(const char *zGlobPattern, const char *zString) > > { > > - return patternCompare(zGlobPattern, zString, &globInfo, > > - '['); > > + return sql_utf8_pattern_compare(zGlobPattern, zString, &globInfo, '['); > > } > > > > /* > > @@ -864,7 +913,7 @@ sqlite3_strglob(const char *zGlobPattern, const char > > *zString) > > int > > sqlite3_strlike(const char *zPattern, const char *zStr, unsigned int > esc) > > { > > - return patternCompare(zPattern, zStr, &likeInfoNorm, esc); > > + return sql_utf8_pattern_compare(zPattern, zStr, &likeInfoNorm, esc); > > } > > > > Should we add sqlite3_result_error is case of SQL_PROHIBITED_PATTERN for > these two functions? > > It doesn't matter for "sqlite3_strglob()", since it's only used internally with valid explicitly given pattern in "analysis_loader()". Same for "sqlite3_strlike()". > > - {"LIKE", "like"}, > > - {"GLOB", "glob"}, > > +-- NOTE: This test needs refactoring after deletion of GLOB & > > +-- type restrictions for LIKE. > > +-- {"LIKE", "like"}, > > +-- {"GLOB", "glob"}, > > I would add related issue number. > > True. Done. > > {"AND", "and"}, > > {"OR", "or"}, > > {"MATCH", "match"}, > > @@ -96,7 +98,7 @@ operations = { > > {"+", "-"}, > > {"<<", ">>", "&", "|"}, > > {"<", "<=", ">", ">="}, > > - {"=", "==", "!=", "<>", "LIKE", "GLOB"}, --"MATCH", "REGEXP"}, > > + {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, "MATCH", "REGEXP"}, > > Leave comment before MATCH and REGEXP to avoid surprises after removing > your comment. > > Done. > > --- /dev/null > > +++ b/test/sql-tap/gh-3251-string-pattern-comparison.lua > > @@ -0,0 +1,238 @@ > > +#!/usr/bin/env tarantool > > +test = require("sqltester") > > +test:plan(106) > > + > > +local prefix = "like-test-" > > + > > +-- Unciode byte sequences. > > + > > 1. Typo: unicode. > 2. Extra empty line. > > Fixed. > > +-- Invalid testcases. > > + > > +for i, tested_string in ipairs(invalid_testcases) do > > + local test_name = prefix .. "2." .. tostring(i) > > + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" > > + > > +-- We should raise an error if pattern contains invalid characters. > > + > > 1. Trailing whitespaces. > 2. Why comments are not indented as the other code? > 3. test_name and test_itself before the section comment looks strange > for me here. > diff --git a/src/box/sql/func.c b/src/box/sql/func.c index 5b53076..2f989d4 100644 --- a/src/box/sql/func.c +++ b/src/box/sql/func.c @@ -623,7 +623,7 @@ struct compareInfo { * promotes pointer to the next symbol in the string. * Otherwise return code is SQL_END_OF_STRING. */ -#define Utf8Read(s, e) (((s) < (e)) ?\ +#define Utf8Read(s, e) (((s) < (e)) ? \ ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status)) : 0) #define SQL_END_OF_STRING 0 @@ -642,7 +642,7 @@ static const struct compareInfo likeInfoNorm = { '%', '_', 0, 1 }; static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; /* - * Possible error returns from patternMatch() + * Possible error returns from sql_utf8_pattern_compare() */ #define SQLITE_MATCH 0 #define SQLITE_NOMATCH 1 @@ -655,14 +655,14 @@ static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; * * Globbing rules: * - * '*' Matches any sequence of zero or more characters. + * '*' Matches any sequence of zero or more characters. * - * '?' Matches exactly one character. + * '?' Matches exactly one character. * - * [...] Matches one character from the enclosed list of - * characters. + * [...] Matches one character from the enclosed list of + * characters. * - * [^...] Matches one character not in the enclosed list. + * [^...] Matches one character not in the enclosed list. * * With the [...] and [^...] matching, a ']' character can be * included in the list by making it the first character after @@ -694,14 +694,14 @@ static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; * SQLITE_NOMATCH: No match. * SQLITE_NOWILDCARDMATCH: No match in spite of having * * or % wildcards. - * SQL_PROHIBITED_PATTERN Pattern contains invalid + * SQL_PROHIBITED_PATTERN: Pattern contains invalid * symbol. */ static int sql_utf8_pattern_compare(const char * pattern, - const char * string, - const struct compareInfo *pInfo, - UChar32 matchOther) + const char * string, + const struct compareInfo *pInfo, + UChar32 matchOther) { UChar32 c, c2; /* Next pattern and input string chars */ UChar32 matchOne = pInfo->matchOne; /* "?" or "_" */ diff --git a/test/sql-tap/e_expr.test.lua b/test/sql-tap/e_expr.test.lua index 051210a..9780d2c 100755 --- a/test/sql-tap/e_expr.test.lua +++ b/test/sql-tap/e_expr.test.lua @@ -78,7 +78,7 @@ local operations = { {"!=", "ne2"}, {"IS", "is"}, -- NOTE: This test needs refactoring after deletion of GLOB & --- type restrictions for LIKE. +-- type restrictions for LIKE. (See #3572) -- {"LIKE", "like"}, -- {"GLOB", "glob"}, {"AND", "and"}, @@ -98,7 +98,12 @@ operations = { {"+", "-"}, {"<<", ">>", "&", "|"}, {"<", "<=", ">", ">="}, - {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, "MATCH", "REGEXP"}, +-- NOTE: This test needs refactoring after deletion of GLOB & +-- type restrictions for LIKE. (See #3572) +-- Another NOTE: MATCH & REGEXP aren't supported in Tarantool & +-- are waiting for their hour, don't confuse them +-- being commented with ticket above. + {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, --"MATCH", "REGEXP"}, {"AND"}, {"OR"}, } diff --git a/test/sql-tap/gh-3251-string-pattern-comparison.test.lua b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua new file mode 100755 index 0000000..a823bc6 --- /dev/null +++ b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua @@ -0,0 +1,281 @@ +#!/usr/bin/env tarantool +test = require("sqltester") +test:plan(128) + +local prefix = "like-test-" + +-- Unicode byte sequences. +local valid_testcases = { + '\x01', + '\x09', + '\x1F', + '\x7F', + '\xC2\x80', + '\xC2\x90', + '\xC2\x9F', + '\xE2\x80\xA8', + '\x20\x0B', + '\xE2\x80\xA9', +} + +-- Non-Unicode byte sequences. +local invalid_testcases = { + '\xE2\x80', + '\xFE\xFF', + '\xC2', + '\xED\xB0\x80', + '\xD0', +} + +local like_test_cases = +{ + {"1.1", + "SELECT 'AB' LIKE '_B';", + {0, {1}} }, + {"1.2", + "SELECT 'CD' LIKE '_B';", + {0, {0}} }, + {"1.3", + "SELECT '' LIKE '_B';", + {0, {0}} }, + {"1.4", + "SELECT 'AB' LIKE '%B';", + {0, {1}} }, + {"1.5", + "SELECT 'CD' LIKE '%B';", + {0, {0}} }, + {"1.6", + "SELECT '' LIKE '%B';", + {0, {0}} }, + {"1.7", + "SELECT 'AB' LIKE 'A__';", + {0, {0}} }, + {"1.8", + "SELECT 'CD' LIKE 'A__';", + {0, {0}} }, + {"1.9", + "SELECT '' LIKE 'A__';", + {0, {0}} }, + {"1.10", + "SELECT 'AB' LIKE 'A_';", + {0, {1}} }, + {"1.11", + "SELECT 'CD' LIKE 'A_';", + {0, {0}} }, + {"1.12", + "SELECT '' LIKE 'A_';", + {0, {0}} }, + {"1.13", + "SELECT 'AB' LIKE 'A';", + {0, {0}} }, + {"1.14", + "SELECT 'CD' LIKE 'A';", + {0, {0}} }, + {"1.15", + "SELECT '' LIKE 'A';", + {0, {0}} }, + {"1.16", + "SELECT 'AB' LIKE '_';", + {0, {0}} }, + {"1.17", + "SELECT 'CD' LIKE '_';", + {0, {0}} }, + {"1.18", + "SELECT '' LIKE '_';", + {0, {0}} }, + {"1.19", + "SELECT 'AB' LIKE '__';", + {0, {1}} }, + {"1.20", + "SELECT 'CD' LIKE '__';", + {0, {1}} }, + {"1.21", + "SELECT '' LIKE '__';", + {0, {0}} }, + {"1.22", + "SELECT 'AB' LIKE '%A';", + {0, {0}} }, + {"1.23", + "SELECT 'AB' LIKE '%C';", + {0, {0}} }, + {"1.24", + "SELECT 'ab' LIKE '%df';", + {0, {0}} }, + {"1.25", + "SELECT 'abCDF' LIKE '%df';", + {0, {1}} }, + {"1.26", + "SELECT 'CDF' LIKE '%df';", + {0, {1}} }, + {"1.27", + "SELECT 'ab' LIKE 'a_';", + {0, {1}} }, + {"1.28", + "SELECT 'abCDF' LIKE 'a_';", + {0, {0}} }, + {"1.29", + "SELECT 'CDF' LIKE 'a_';", + {0, {0}} }, + {"1.30", + "SELECT 'ab' LIKE 'ab%';", + {0, {1}} }, + {"1.31", + "SELECT 'abCDF' LIKE 'ab%';", + {0, {1}} }, + {"1.32", + "SELECT 'CDF' LIKE 'ab%';", + {0, {0}} }, + {"1.33", + "SELECT 'ab' LIKE 'abC%';", + {0, {0}} }, + {"1.34", + "SELECT 'abCDF' LIKE 'abC%';", + {0, {1}} }, + {"1.35", + "SELECT 'CDF' LIKE 'abC%';", + {0, {0}} }, + {"1.36", + "SELECT 'ab' LIKE 'a_%';", + {0, {1}} }, + {"1.37", + "SELECT 'abCDF' LIKE 'a_%';", + {0, {1}} }, + {"1.38", + "SELECT 'CDF' LIKE 'a_%';", + {0, {0}} }, +} + +test:do_catchsql_set_test(like_test_cases, prefix) + +-- Invalid testcases. + +for i, tested_string in ipairs(invalid_testcases) do + +-- We should raise an error if pattern contains invalid characters. + local test_name = prefix .. "2." .. tostring(i) + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" + test:do_catchsql_test( + test_name, + test_itself, { + -- <test_name> + 1, "LIKE or GLOB pattern can only contain UTF-8 characters" + -- <test_name> + }) + + test_name = prefix .. "3." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" + test:do_catchsql_test( + test_name, + test_itself, { + -- <test_name> + 1, "LIKE or GLOB pattern can only contain UTF-8 characters" + -- <test_name> + }) + + test_name = prefix .. "4." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" + test:do_catchsql_test( + test_name, + test_itself, { + -- <test_name> + 1, "LIKE or GLOB pattern can only contain UTF-8 characters" + -- <test_name> + }) + +-- Just skipping if row value predicand contains invalid character. + + test_name = prefix .. "5." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + + test_name = prefix .. "6." .. tostring(i) + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + + test_name = prefix .. "7." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) +end + +-- Valid testcases. + +for i, tested_string in ipairs(valid_testcases) do + test_name = prefix .. "8." .. tostring(i) + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + + test_name = prefix .. "9." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + + test_name = prefix .. "10." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + test_name = prefix .. "11." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + + test_name = prefix .. "12." .. tostring(i) + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + + test_name = prefix .. "13." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) +end + +test:finish_test() [-- Attachment #2: Type: text/html, Size: 38152 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-07-27 19:11 ` Nikita Tatunov @ 2018-07-27 20:22 ` Alexander Turenko 2018-07-31 13:27 ` Nikita Tatunov 0 siblings, 1 reply; 22+ messages in thread From: Alexander Turenko @ 2018-07-27 20:22 UTC (permalink / raw) To: Nikita Tatunov; +Cc: tarantool-patches Minor comments are below. Looks good for me in general. Please, review the patch with Lesha after the fixes. WBR, Alexander Turenko. > > > With the patch applied an error will be returned in case there's > > > invalid UTF-8 symbol in pattern & pattern will not be matched > > > with the string that contains it. Some minor corrections to function > > > were made as well. > > > > > > > Nitpicking: 'error will be returned' is third situatuon, other then > > 'matched' and 'not matched'. > > > > > I guess commit message is about changes. Do I really need to > mention valid cases that didn't change? > Ok, now I got that 'it' is 'invalid symbol'. Misunderstood it before as description of the one case when a pattern has invalid UTF-8 symbols. The sentence is correct. Maybe it can be clarified like so: '& correct UTF-8 pattern will not be matched with a string with invalid UTF-8 symbols'. Or in another way you like. 'Some minor corrections to function were made as well' -- describe certain behaviour or code changes (like 'fixed infinite loop during pattern matching and incorrect matching as well') or remove this abstract sentence. > > The function itself looks good except lines longer then 80 columns. > > > > As i said before I don't think breaking these lines will increase > readability. > There are two little length exceed place in the code, it needs some refactoring to decrease code blocks nesting level. I think we can lease it as is for now (if other reviewer don't ask for that). But comments within the function are worth to fix to fit the limit. > +test:do_catchsql_set_test(like_test_cases, prefix) > + > +-- Invalid testcases. > + > +for i, tested_string in ipairs(invalid_testcases) do > + > +-- We should raise an error if pattern contains invalid characters. Again, indent comments to the same level as code in the block. If a comment is related to several code blocks (with empty lines btw), then separate it with the empty line from the code (to don't misinterpret it as related to the first block). > + local test_name = prefix .. "2." .. tostring(i) > + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" > + test:do_catchsql_test( > + test_name, > + test_itself, { > + -- <test_name> > + 1, "LIKE or GLOB pattern can only contain UTF-8 characters" > + -- <test_name> > + }) > + Are there reason to format it like so? Maybe just: test:do_catchsql_test(test_name, test_itself, {1, "LIKE or GLOB pattern can only contain UTF-8 characters"}) ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-07-27 20:22 ` Alexander Turenko @ 2018-07-31 13:27 ` Nikita Tatunov 2018-07-31 13:47 ` Alexander Turenko 0 siblings, 1 reply; 22+ messages in thread From: Nikita Tatunov @ 2018-07-31 13:27 UTC (permalink / raw) To: Alexander Turenko; +Cc: tarantool-patches [-- Attachment #1: Type: text/plain, Size: 18694 bytes --] Hello, thank you for review! Diff in the end. пт, 27 июл. 2018 г. в 23:22, Alexander Turenko < alexander.turenko@tarantool.org>: > Minor comments are below. Looks good for me in general. > > Please, review the patch with Lesha after the fixes. > > WBR, Alexander Turenko. > > > > > With the patch applied an error will be returned in case there's > > > > invalid UTF-8 symbol in pattern & pattern will not be matched > > > > with the string that contains it. Some minor corrections to function > > > > were made as well. > > > > > > > > > > Nitpicking: 'error will be returned' is third situatuon, other then > > > 'matched' and 'not matched'. > > > > > > > > I guess commit message is about changes. Do I really need to > > mention valid cases that didn't change? > > > > Ok, now I got that 'it' is 'invalid symbol'. Misunderstood it before as > description of the one case when a pattern has invalid UTF-8 symbols. > The sentence is correct. Maybe it can be clarified like so: '& correct > UTF-8 pattern will not be matched with a string with invalid UTF-8 > symbols'. Or in another way you like. > > 'Some minor corrections to function were made as well' -- describe > certain behaviour or code changes (like 'fixed infinite loop during > pattern matching and incorrect matching as well') or remove this > abstract sentence. > > Fixed. > > > The function itself looks good except lines longer then 80 columns. > > > > > > > As i said before I don't think breaking these lines will increase > > readability. > > > > There are two little length exceed place in the code, it needs some > refactoring to decrease code blocks nesting level. I think we can lease > it as is for now (if other reviewer don't ask for that). > > But comments within the function are worth to fix to fit the limit. > > Fixed. > > +test:do_catchsql_set_test(like_test_cases, prefix) > > + > > +-- Invalid testcases. > > + > > +for i, tested_string in ipairs(invalid_testcases) do > > + > > +-- We should raise an error if pattern contains invalid characters. > > Again, indent comments to the same level as code in the block. > > If a comment is related to several code blocks (with empty lines btw), > then separate it with the empty line from the code (to don't > misinterpret it as related to the first block). > > Fixed > > + local test_name = prefix .. "2." .. tostring(i) > > + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" > > + test:do_catchsql_test( > > + test_name, > > + test_itself, { > > + -- <test_name> > > + 1, "LIKE or GLOB pattern can only contain UTF-8 characters" > > + -- <test_name> > > + }) > > + > > Are there reason to format it like so? Maybe just: > > test:do_catchsql_test(test_name, test_itself, > {1, "LIKE or GLOB pattern can only contain UTF-8 characters"}) > Fixed. diff --git a/src/box/sql/func.c b/src/box/sql/func.c index 2f989d4..7f93ef6 100644 --- a/src/box/sql/func.c +++ b/src/box/sql/func.c @@ -703,11 +703,16 @@ sql_utf8_pattern_compare(const char * pattern, const struct compareInfo *pInfo, UChar32 matchOther) { - UChar32 c, c2; /* Next pattern and input string chars */ - UChar32 matchOne = pInfo->matchOne; /* "?" or "_" */ - UChar32 matchAll = pInfo->matchAll; /* "*" or "%" */ - UChar32 noCase = pInfo->noCase; /* True if uppercase==lowercase */ - const char *zEscaped = 0; /* One past the last escaped input char */ + /* Next pattern and input string chars */ + UChar32 c, c2; + /* "?" or "_" */ + UChar32 matchOne = pInfo->matchOne; + /* "*" or "%" */ + UChar32 matchAll = pInfo->matchAll; + /* True if uppercase==lowercase */ + UChar32 noCase = pInfo->noCase; + /* One past the last escaped input char */ + const char *zEscaped = 0; const char * pattern_end = pattern + strlen(pattern); const char * string_end = string + strlen(string); UErrorCode status = U_ZERO_ERROR; @@ -716,9 +721,11 @@ sql_utf8_pattern_compare(const char * pattern, if (c == SQL_INVALID_UTF8_SYMBOL) return SQL_PROHIBITED_PATTERN; if (c == matchAll) { /* Match "*" */ - /* Skip over multiple "*" characters in the pattern. If there - * are also "?" characters, skip those as well, but consume a - * single character of the input string for each "?" skipped + /* Skip over multiple "*" characters in + * the pattern. If there are also "?" + * characters, skip those as well, but + * consume a single character of the + * input string for each "?" skipped. */ while ((c = Utf8Read(pattern, pattern_end)) != SQL_END_OF_STRING) { @@ -733,7 +740,9 @@ sql_utf8_pattern_compare(const char * pattern, if (c2 == SQL_INVALID_UTF8_SYMBOL) return SQLITE_NOMATCH; } - /* "*" at the end of the pattern matches */ + /* + * "*" at the end of the pattern matches. + */ if (c == SQL_END_OF_STRING) { while ((c2 = Utf8Read(string, string_end)) != SQL_END_OF_STRING) @@ -749,10 +758,14 @@ sql_utf8_pattern_compare(const char * pattern, if (c == SQL_END_OF_STRING) return SQLITE_NOWILDCARDMATCH; } else { - /* "[...]" immediately follows the "*". We have to do a slow - * recursive search in this case, but it is an unusual case. + /* "[...]" immediately + * follows the "*". We + * have to do a slow + * recursive search in + * this case, but it is + * an unusual case. */ - assert(matchOther < 0x80); /* '[' is a single-byte character */ + assert(matchOther < 0x80); while (string < string_end) { int bMatch = sql_utf8_pattern_compare( @@ -770,14 +783,17 @@ sql_utf8_pattern_compare(const char * pattern, } } - /* At this point variable c contains the first character of the - * pattern string past the "*". Search in the input string for the - * first matching character and recursively continue the match from - * that point. + /* At this point variable c contains the + * first character of the pattern string + * past the "*". Search in the input + * string for the first matching + * character and recursively continue the + * match from that point. * - * For a case-insensitive search, set variable cx to be the same as - * c but in the other case and search the input string for either - * c or cx. + * For a case-insensitive search, set + * variable cx to be the same as c but in + * the other case and search the input + * string for either c or cx. */ int bMatch; @@ -785,12 +801,14 @@ sql_utf8_pattern_compare(const char * pattern, c = u_tolower(c); while (string < string_end){ /** - * This loop could have been implemented - * without if converting c2 to lower case - * (by holding c_upper and c_lower), however - * it is implemented this way because lower - * works better with German and Turkish - * languages. + * This loop could have been + * implemented without if + * converting c2 to lower case + * by holding c_upper and + * c_lower,however it is + * implemented this way because + * lower works better with German + * and Turkish languages. */ c2 = Utf8Read(string, string_end); if (c2 == SQL_INVALID_UTF8_SYMBOL) @@ -878,11 +896,12 @@ sql_utf8_pattern_compare(const char * pattern, continue; if (noCase){ /** - * Small optimisation. Reduce number of calls - * to u_tolower function. - * SQL standards suggest use to_upper for symbol - * normalisation. However, using to_lower allows to - * respect Turkish 'İ' in default locale. + * Small optimisation. Reduce number of + * calls to u_tolower function. SQL + * standards suggest use to_upper for + * symbol normalisation. However, using + * to_lower allows to respect Turkish 'İ' + * in default locale. */ if (u_tolower(c) == c2 || c == u_tolower(c2)) diff --git a/test/sql-tap/gh-3251-string-pattern-comparison.test.lua b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua index a823bc6..a8e4a12 100755 --- a/test/sql-tap/gh-3251-string-pattern-comparison.test.lua +++ b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua @@ -6,276 +6,208 @@ local prefix = "like-test-" -- Unicode byte sequences. local valid_testcases = { - '\x01', - '\x09', - '\x1F', - '\x7F', - '\xC2\x80', - '\xC2\x90', - '\xC2\x9F', - '\xE2\x80\xA8', - '\x20\x0B', - '\xE2\x80\xA9', + '\x01', + '\x09', + '\x1F', + '\x7F', + '\xC2\x80', + '\xC2\x90', + '\xC2\x9F', + '\xE2\x80\xA8', + '\x20\x0B', + '\xE2\x80\xA9', } -- Non-Unicode byte sequences. local invalid_testcases = { - '\xE2\x80', - '\xFE\xFF', - '\xC2', - '\xED\xB0\x80', - '\xD0', + '\xE2\x80', + '\xFE\xFF', + '\xC2', + '\xED\xB0\x80', + '\xD0', } local like_test_cases = { - {"1.1", - "SELECT 'AB' LIKE '_B';", - {0, {1}} }, - {"1.2", - "SELECT 'CD' LIKE '_B';", - {0, {0}} }, - {"1.3", - "SELECT '' LIKE '_B';", - {0, {0}} }, - {"1.4", - "SELECT 'AB' LIKE '%B';", - {0, {1}} }, - {"1.5", - "SELECT 'CD' LIKE '%B';", - {0, {0}} }, - {"1.6", - "SELECT '' LIKE '%B';", - {0, {0}} }, - {"1.7", - "SELECT 'AB' LIKE 'A__';", - {0, {0}} }, - {"1.8", - "SELECT 'CD' LIKE 'A__';", - {0, {0}} }, - {"1.9", - "SELECT '' LIKE 'A__';", - {0, {0}} }, - {"1.10", - "SELECT 'AB' LIKE 'A_';", - {0, {1}} }, - {"1.11", - "SELECT 'CD' LIKE 'A_';", - {0, {0}} }, - {"1.12", - "SELECT '' LIKE 'A_';", - {0, {0}} }, - {"1.13", - "SELECT 'AB' LIKE 'A';", - {0, {0}} }, - {"1.14", - "SELECT 'CD' LIKE 'A';", - {0, {0}} }, - {"1.15", - "SELECT '' LIKE 'A';", - {0, {0}} }, - {"1.16", - "SELECT 'AB' LIKE '_';", - {0, {0}} }, - {"1.17", - "SELECT 'CD' LIKE '_';", - {0, {0}} }, - {"1.18", - "SELECT '' LIKE '_';", - {0, {0}} }, - {"1.19", - "SELECT 'AB' LIKE '__';", - {0, {1}} }, - {"1.20", - "SELECT 'CD' LIKE '__';", - {0, {1}} }, - {"1.21", - "SELECT '' LIKE '__';", - {0, {0}} }, - {"1.22", - "SELECT 'AB' LIKE '%A';", - {0, {0}} }, - {"1.23", - "SELECT 'AB' LIKE '%C';", - {0, {0}} }, - {"1.24", - "SELECT 'ab' LIKE '%df';", - {0, {0}} }, - {"1.25", - "SELECT 'abCDF' LIKE '%df';", - {0, {1}} }, - {"1.26", - "SELECT 'CDF' LIKE '%df';", - {0, {1}} }, - {"1.27", - "SELECT 'ab' LIKE 'a_';", - {0, {1}} }, - {"1.28", - "SELECT 'abCDF' LIKE 'a_';", - {0, {0}} }, - {"1.29", - "SELECT 'CDF' LIKE 'a_';", - {0, {0}} }, - {"1.30", - "SELECT 'ab' LIKE 'ab%';", - {0, {1}} }, - {"1.31", - "SELECT 'abCDF' LIKE 'ab%';", - {0, {1}} }, - {"1.32", - "SELECT 'CDF' LIKE 'ab%';", - {0, {0}} }, - {"1.33", - "SELECT 'ab' LIKE 'abC%';", - {0, {0}} }, - {"1.34", - "SELECT 'abCDF' LIKE 'abC%';", - {0, {1}} }, - {"1.35", - "SELECT 'CDF' LIKE 'abC%';", - {0, {0}} }, - {"1.36", - "SELECT 'ab' LIKE 'a_%';", - {0, {1}} }, - {"1.37", - "SELECT 'abCDF' LIKE 'a_%';", - {0, {1}} }, - {"1.38", - "SELECT 'CDF' LIKE 'a_%';", - {0, {0}} }, + {"1.1", + "SELECT 'AB' LIKE '_B';", + {0, {1}} }, + {"1.2", + "SELECT 'CD' LIKE '_B';", + {0, {0}} }, + {"1.3", + "SELECT '' LIKE '_B';", + {0, {0}} }, + {"1.4", + "SELECT 'AB' LIKE '%B';", + {0, {1}} }, + {"1.5", + "SELECT 'CD' LIKE '%B';", + {0, {0}} }, + {"1.6", + "SELECT '' LIKE '%B';", + {0, {0}} }, + {"1.7", + "SELECT 'AB' LIKE 'A__';", + {0, {0}} }, + {"1.8", + "SELECT 'CD' LIKE 'A__';", + {0, {0}} }, + {"1.9", + "SELECT '' LIKE 'A__';", + {0, {0}} }, + {"1.10", + "SELECT 'AB' LIKE 'A_';", + {0, {1}} }, + {"1.11", + "SELECT 'CD' LIKE 'A_';", + {0, {0}} }, + {"1.12", + "SELECT '' LIKE 'A_';", + {0, {0}} }, + {"1.13", + "SELECT 'AB' LIKE 'A';", + {0, {0}} }, + {"1.14", + "SELECT 'CD' LIKE 'A';", + {0, {0}} }, + {"1.15", + "SELECT '' LIKE 'A';", + {0, {0}} }, + {"1.16", + "SELECT 'AB' LIKE '_';", + {0, {0}} }, + {"1.17", + "SELECT 'CD' LIKE '_';", + {0, {0}} }, + {"1.18", + "SELECT '' LIKE '_';", + {0, {0}} }, + {"1.19", + "SELECT 'AB' LIKE '__';", + {0, {1}} }, + {"1.20", + "SELECT 'CD' LIKE '__';", + {0, {1}} }, + {"1.21", + "SELECT '' LIKE '__';", + {0, {0}} }, + {"1.22", + "SELECT 'AB' LIKE '%A';", + {0, {0}} }, + {"1.23", + "SELECT 'AB' LIKE '%C';", + {0, {0}} }, + {"1.24", + "SELECT 'ab' LIKE '%df';", + {0, {0}} }, + {"1.25", + "SELECT 'abCDF' LIKE '%df';", + {0, {1}} }, + {"1.26", + "SELECT 'CDF' LIKE '%df';", + {0, {1}} }, + {"1.27", + "SELECT 'ab' LIKE 'a_';", + {0, {1}} }, + {"1.28", + "SELECT 'abCDF' LIKE 'a_';", + {0, {0}} }, + {"1.29", + "SELECT 'CDF' LIKE 'a_';", + {0, {0}} }, + {"1.30", + "SELECT 'ab' LIKE 'ab%';", + {0, {1}} }, + {"1.31", + "SELECT 'abCDF' LIKE 'ab%';", + {0, {1}} }, + {"1.32", + "SELECT 'CDF' LIKE 'ab%';", + {0, {0}} }, + {"1.33", + "SELECT 'ab' LIKE 'abC%';", + {0, {0}} }, + {"1.34", + "SELECT 'abCDF' LIKE 'abC%';", + {0, {1}} }, + {"1.35", + "SELECT 'CDF' LIKE 'abC%';", + {0, {0}} }, + {"1.36", + "SELECT 'ab' LIKE 'a_%';", + {0, {1}} }, + {"1.37", + "SELECT 'abCDF' LIKE 'a_%';", + {0, {1}} }, + {"1.38", + "SELECT 'CDF' LIKE 'a_%';", + {0, {0}} }, } test:do_catchsql_set_test(like_test_cases, prefix) -- Invalid testcases. - for i, tested_string in ipairs(invalid_testcases) do --- We should raise an error if pattern contains invalid characters. - local test_name = prefix .. "2." .. tostring(i) - local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" - test:do_catchsql_test( - test_name, - test_itself, { - -- <test_name> - 1, "LIKE or GLOB pattern can only contain UTF-8 characters" - -- <test_name> - }) + -- We should raise an error in case + -- pattern contains invalid characters. - test_name = prefix .. "3." .. tostring(i) - test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" - test:do_catchsql_test( - test_name, - test_itself, { - -- <test_name> - 1, "LIKE or GLOB pattern can only contain UTF-8 characters" - -- <test_name> - }) + local test_name = prefix .. "2." .. tostring(i) + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" + test:do_catchsql_test(test_name, test_itself, + {1, "LIKE or GLOB pattern can only contain UTF-8 characters"}) - test_name = prefix .. "4." .. tostring(i) - test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" - test:do_catchsql_test( - test_name, - test_itself, { - -- <test_name> - 1, "LIKE or GLOB pattern can only contain UTF-8 characters" - -- <test_name> - }) + test_name = prefix .. "3." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" + test:do_catchsql_test(test_name, test_itself, + {1, "LIKE or GLOB pattern can only contain UTF-8 characters"}) --- Just skipping if row value predicand contains invalid character. + test_name = prefix .. "4." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" + test:do_catchsql_test(test_name, test_itself, + {1, "LIKE or GLOB pattern can only contain UTF-8 characters"}) - test_name = prefix .. "5." .. tostring(i) - test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" - test:do_execsql_test( - test_name, - test_itself, { - -- <test_name> - 0 - -- <test_name> - }) + -- Just skipping if row value predicand contains invalid character. - test_name = prefix .. "6." .. tostring(i) - test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" - test:do_execsql_test( - test_name, - test_itself, { - -- <test_name> - 0 - -- <test_name> - }) + test_name = prefix .. "5." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) - test_name = prefix .. "7." .. tostring(i) - test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" - test:do_execsql_test( - test_name, - test_itself, { - -- <test_name> - 0 - -- <test_name> - }) + test_name = prefix .. "6." .. tostring(i) + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "7." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) end -- Valid testcases. - for i, tested_string in ipairs(valid_testcases) do test_name = prefix .. "8." .. tostring(i) local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" - test:do_execsql_test( - test_name, - test_itself, { - -- <test_name> - 0 - -- <test_name> - }) + test:do_execsql_test(test_name, test_itself, {0}) test_name = prefix .. "9." .. tostring(i) test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" - test:do_execsql_test( - test_name, - test_itself, { - -- <test_name> - 0 - -- <test_name> - }) + test:do_execsql_test(test_name, test_itself, {0}) test_name = prefix .. "10." .. tostring(i) test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" - test:do_execsql_test( - test_name, - test_itself, { - -- <test_name> - 0 - -- <test_name> - }) + test:do_execsql_test(test_name, test_itself, {0}) + test_name = prefix .. "11." .. tostring(i) test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" - test:do_execsql_test( - test_name, - test_itself, { - -- <test_name> - 0 - -- <test_name> - }) + test:do_execsql_test(test_name, test_itself, {0}) test_name = prefix .. "12." .. tostring(i) test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" - test:do_execsql_test( - test_name, - test_itself, { - -- <test_name> - 0 - -- <test_name> - }) + test:do_execsql_test(test_name, test_itself, {0}) test_name = prefix .. "13." .. tostring(i) test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" - test:do_execsql_test( - test_name, - test_itself, { - -- <test_name> - 0 - -- <test_name> - }) + test:do_execsql_test(test_name, test_itself, {0}) end test:finish_test() [-- Attachment #2: Type: text/html, Size: 44406 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-07-31 13:27 ` Nikita Tatunov @ 2018-07-31 13:47 ` Alexander Turenko 2018-08-01 10:35 ` Nikita Tatunov 0 siblings, 1 reply; 22+ messages in thread From: Alexander Turenko @ 2018-07-31 13:47 UTC (permalink / raw) To: Nikita Tatunov, Alex Khatskevich; +Cc: tarantool-patches Hi! LGTM. Alex, can you look again into the patch? WBR, Alexander Turenko. On Tue, Jul 31, 2018 at 04:27:21PM +0300, Nikita Tatunov wrote: > Hello, thank you for review! > Diff in the end. > > <...> ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-07-31 13:47 ` Alexander Turenko @ 2018-08-01 10:35 ` Nikita Tatunov 2018-08-01 10:51 ` Nikita Tatunov 0 siblings, 1 reply; 22+ messages in thread From: Nikita Tatunov @ 2018-08-01 10:35 UTC (permalink / raw) To: Alexander Turenko; +Cc: avkhatskevich, tarantool-patches Alexey requested full diff: diff --git a/src/box/sql/func.c b/src/box/sql/func.c index c06e3bd..7f93ef6 100644 --- a/src/box/sql/func.c +++ b/src/box/sql/func.c @@ -617,13 +617,17 @@ struct compareInfo { u8 noCase; /* true to ignore case differences */ }; -/* - * For LIKE and GLOB matching on EBCDIC machines, assume that every - * character is exactly one byte in size. Also, provde the Utf8Read() - * macro for fast reading of the next character in the common case where - * the next character is ASCII. +/** + * Providing there are symbols in string s this + * macro returns UTF-8 code of character and + * promotes pointer to the next symbol in the string. + * Otherwise return code is SQL_END_OF_STRING. */ -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status) +#define Utf8Read(s, e) (((s) < (e)) ? \ + ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status)) : 0) + +#define SQL_END_OF_STRING 0 +#define SQL_INVALID_UTF8_SYMBOL 0xfffd static const struct compareInfo globInfo = { '*', '?', '[', 0 }; @@ -638,19 +642,16 @@ static const struct compareInfo likeInfoNorm = { '%', '_', 0, 1 }; static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; /* - * Possible error returns from patternMatch() + * Possible error returns from sql_utf8_pattern_compare() */ #define SQLITE_MATCH 0 #define SQLITE_NOMATCH 1 #define SQLITE_NOWILDCARDMATCH 2 +#define SQL_PROHIBITED_PATTERN 3 -/* - * Compare two UTF-8 strings for equality where the first string is - * a GLOB or LIKE expression. Return values: - * - * SQLITE_MATCH: Match - * SQLITE_NOMATCH: No match - * SQLITE_NOWILDCARDMATCH: No match in spite of having * or % wildcards. +/** + * Compare two UTF-8 strings for equality where the first string + * is a GLOB or LIKE expression. * * Globbing rules: * @@ -663,92 +664,136 @@ static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; * * [^...] Matches one character not in the enclosed list. * - * With the [...] and [^...] matching, a ']' character can be included - * in the list by making it the first character after '[' or '^'. A - * range of characters can be specified using '-'. Example: - * "[a-z]" matches any single lower-case letter. To match a '-', make - * it the last character in the list. + * With the [...] and [^...] matching, a ']' character can be + * included in the list by making it the first character after + * '[' or '^'. A range of characters can be specified using '-'. + * Example: "[a-z]" matches any single lower-case letter. + * To match a '-', make it the last character in the list. * * Like matching rules: * - * '%' Matches any sequence of zero or more characters + * '%' Matches any sequence of zero or more characters. * - ** '_' Matches any one character + ** '_' Matches any one character. * * Ec Where E is the "esc" character and c is any other - * character, including '%', '_', and esc, match exactly c. + * character, including '%', '_', and esc, match + * exactly c. * * The comments within this routine usually assume glob matching. * - * This routine is usually quick, but can be N**2 in the worst case. + * This routine is usually quick, but can be N**2 in the worst + * case. + * + * @param pattern String containing comparison pattern. + * @param string String being compared. + * @param compareInfo Information about how to compare. + * @param matchOther The escape char (LIKE) or '[' (GLOB). + * + * @retval SQLITE_MATCH: Match. + * SQLITE_NOMATCH: No match. + * SQLITE_NOWILDCARDMATCH: No match in spite of having * + * or % wildcards. + * SQL_PROHIBITED_PATTERN: Pattern contains invalid + * symbol. */ static int -patternCompare(const char * pattern, /* The glob pattern */ - const char * string, /* The string to compare against the glob */ - const struct compareInfo *pInfo, /* Information about how to do the compare */ - UChar32 matchOther /* The escape char (LIKE) or '[' (GLOB) */ - ) +sql_utf8_pattern_compare(const char * pattern, + const char * string, + const struct compareInfo *pInfo, + UChar32 matchOther) { - UChar32 c, c2; /* Next pattern and input string chars */ - UChar32 matchOne = pInfo->matchOne; /* "?" or "_" */ - UChar32 matchAll = pInfo->matchAll; /* "*" or "%" */ - UChar32 noCase = pInfo->noCase; /* True if uppercase==lowercase */ - const char *zEscaped = 0; /* One past the last escaped input char */ + /* Next pattern and input string chars */ + UChar32 c, c2; + /* "?" or "_" */ + UChar32 matchOne = pInfo->matchOne; + /* "*" or "%" */ + UChar32 matchAll = pInfo->matchAll; + /* True if uppercase==lowercase */ + UChar32 noCase = pInfo->noCase; + /* One past the last escaped input char */ + const char *zEscaped = 0; const char * pattern_end = pattern + strlen(pattern); const char * string_end = string + strlen(string); UErrorCode status = U_ZERO_ERROR; - while (pattern < pattern_end){ - c = Utf8Read(pattern, pattern_end); + while ((c = Utf8Read(pattern, pattern_end)) != SQL_END_OF_STRING) { + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; if (c == matchAll) { /* Match "*" */ - /* Skip over multiple "*" characters in the pattern. If there - * are also "?" characters, skip those as well, but consume a - * single character of the input string for each "?" skipped + /* Skip over multiple "*" characters in + * the pattern. If there are also "?" + * characters, skip those as well, but + * consume a single character of the + * input string for each "?" skipped. */ - while (pattern < pattern_end){ - c = Utf8Read(pattern, pattern_end); + while ((c = Utf8Read(pattern, pattern_end)) != + SQL_END_OF_STRING) { + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; if (c != matchAll && c != matchOne) break; - if (c == matchOne - && Utf8Read(string, string_end) == 0) { + if (c == matchOne && + (c2 = Utf8Read(string, string_end)) == + SQL_END_OF_STRING) return SQLITE_NOWILDCARDMATCH; - } + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; } - /* "*" at the end of the pattern matches */ - if (pattern == pattern_end) + /* + * "*" at the end of the pattern matches. + */ + if (c == SQL_END_OF_STRING) { + while ((c2 = Utf8Read(string, string_end)) != + SQL_END_OF_STRING) + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; return SQLITE_MATCH; + } if (c == matchOther) { if (pInfo->matchSet == 0) { c = Utf8Read(pattern, pattern_end); - if (c == 0) + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; + if (c == SQL_END_OF_STRING) return SQLITE_NOWILDCARDMATCH; } else { - /* "[...]" immediately follows the "*". We have to do a slow - * recursive search in this case, but it is an unusual case. + /* "[...]" immediately + * follows the "*". We + * have to do a slow + * recursive search in + * this case, but it is + * an unusual case. */ - assert(matchOther < 0x80); /* '[' is a single-byte character */ + assert(matchOther < 0x80); while (string < string_end) { int bMatch = - patternCompare(&pattern[-1], - string, - pInfo, - matchOther); + sql_utf8_pattern_compare( + &pattern[-1], + string, + pInfo, + matchOther); if (bMatch != SQLITE_NOMATCH) return bMatch; - Utf8Read(string, string_end); + c = Utf8Read(string, string_end); + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; } return SQLITE_NOWILDCARDMATCH; } } - /* At this point variable c contains the first character of the - * pattern string past the "*". Search in the input string for the - * first matching character and recursively continue the match from - * that point. + /* At this point variable c contains the + * first character of the pattern string + * past the "*". Search in the input + * string for the first matching + * character and recursively continue the + * match from that point. * - * For a case-insensitive search, set variable cx to be the same as - * c but in the other case and search the input string for either - * c or cx. + * For a case-insensitive search, set + * variable cx to be the same as c but in + * the other case and search the input + * string for either c or cx. */ int bMatch; @@ -756,14 +801,18 @@ patternCompare(const char * pattern, /* The glob pattern */ c = u_tolower(c); while (string < string_end){ /** - * This loop could have been implemented - * without if converting c2 to lower case - * (by holding c_upper and c_lower), however - * it is implemented this way because lower - * works better with German and Turkish - * languages. + * This loop could have been + * implemented without if + * converting c2 to lower case + * by holding c_upper and + * c_lower,however it is + * implemented this way because + * lower works better with German + * and Turkish languages. */ c2 = Utf8Read(string, string_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; if (!noCase) { if (c2 != c) continue; @@ -771,9 +820,10 @@ patternCompare(const char * pattern, /* The glob pattern */ if (c2 != c && u_tolower(c2) != c) continue; } - bMatch = - patternCompare(pattern, string, - pInfo, matchOther); + bMatch = sql_utf8_pattern_compare(pattern, + string, + pInfo, + matchOther); if (bMatch != SQLITE_NOMATCH) return bMatch; } @@ -782,7 +832,9 @@ patternCompare(const char * pattern, /* The glob pattern */ if (c == matchOther) { if (pInfo->matchSet == 0) { c = Utf8Read(pattern, pattern_end); - if (c == 0) + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; + if (c == SQL_END_OF_STRING) return SQLITE_NOMATCH; zEscaped = pattern; } else { @@ -790,23 +842,33 @@ patternCompare(const char * pattern, /* The glob pattern */ int seen = 0; int invert = 0; c = Utf8Read(string, string_end); + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; if (string == string_end) return SQLITE_NOMATCH; c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; if (c2 == '^') { invert = 1; c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; } if (c2 == ']') { if (c == ']') seen = 1; c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; } - while (c2 && c2 != ']') { + while (c2 != SQL_END_OF_STRING && c2 != ']') { if (c2 == '-' && pattern[0] != ']' && pattern < pattern_end && prior_c > 0) { c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; if (c >= prior_c && c <= c2) seen = 1; prior_c = 0; @@ -817,29 +879,36 @@ patternCompare(const char * pattern, /* The glob pattern */ prior_c = c2; } c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; } - if (pattern == pattern_end || (seen ^ invert) == 0) { + if (pattern == pattern_end || + (seen ^ invert) == 0) { return SQLITE_NOMATCH; } continue; } } c2 = Utf8Read(string, string_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; if (c == c2) continue; if (noCase){ /** - * Small optimisation. Reduce number of calls - * to u_tolower function. - * SQL standards suggest use to_upper for symbol - * normalisation. However, using to_lower allows to - * respect Turkish 'İ' in default locale. + * Small optimisation. Reduce number of + * calls to u_tolower function. SQL + * standards suggest use to_upper for + * symbol normalisation. However, using + * to_lower allows to respect Turkish 'İ' + * in default locale. */ if (u_tolower(c) == c2 || c == u_tolower(c2)) continue; } - if (c == matchOne && pattern != zEscaped && c2 != 0) + if (c == matchOne && pattern != zEscaped && + c2 != SQL_END_OF_STRING) continue; return SQLITE_NOMATCH; } @@ -853,8 +922,7 @@ patternCompare(const char * pattern, /* The glob pattern */ int sqlite3_strglob(const char *zGlobPattern, const char *zString) { - return patternCompare(zGlobPattern, zString, &globInfo, - '['); + return sql_utf8_pattern_compare(zGlobPattern, zString, &globInfo, '['); } /* @@ -864,7 +932,7 @@ sqlite3_strglob(const char *zGlobPattern, const char *zString) int sqlite3_strlike(const char *zPattern, const char *zStr, unsigned int esc) { - return patternCompare(zPattern, zStr, &likeInfoNorm, esc); + return sql_utf8_pattern_compare(zPattern, zStr, &likeInfoNorm, esc); } /* @@ -910,8 +978,9 @@ likeFunc(sqlite3_context * context, int argc, sqlite3_value ** argv) zB = (const char *) sqlite3_value_text(argv[0]); zA = (const char *) sqlite3_value_text(argv[1]); - /* Limit the length of the LIKE or GLOB pattern to avoid problems - * of deep recursion and N*N behavior in patternCompare(). + /* Limit the length of the LIKE or GLOB pattern to avoid + * problems of deep recursion and N*N behavior in + * sql_utf8_pattern_compare(). */ nPat = sqlite3_value_bytes(argv[0]); testcase(nPat == db->aLimit[SQLITE_LIMIT_LIKE_PATTERN_LENGTH]); @@ -947,7 +1016,12 @@ likeFunc(sqlite3_context * context, int argc, sqlite3_value ** argv) sqlite3_like_count++; #endif int res; - res = patternCompare(zB, zA, pInfo, escape); + res = sql_utf8_pattern_compare(zB, zA, pInfo, escape); + if (res == SQL_PROHIBITED_PATTERN) { + sqlite3_result_error(context, "LIKE or GLOB pattern can only" + " contain UTF-8 characters", -1); + return; + } sqlite3_result_int(context, res == SQLITE_MATCH); } diff --git a/test-run b/test-run index 77e9327..95562e9 160000 --- a/test-run +++ b/test-run @@ -1 +1 @@ -Subproject commit 77e93279210f8c5c1fd0ed03416fa19a184f0b6d +Subproject commit 95562e95401fef4e0b755ab0bb430974b5d1a29a diff --git a/test/sql-tap/e_expr.test.lua b/test/sql-tap/e_expr.test.lua index 13d3a96..9780d2c 100755 --- a/test/sql-tap/e_expr.test.lua +++ b/test/sql-tap/e_expr.test.lua @@ -1,6 +1,6 @@ #!/usr/bin/env tarantool test = require("sqltester") -test:plan(12431) +test:plan(10665) --!./tcltestrunner.lua -- 2010 July 16 @@ -77,8 +77,10 @@ local operations = { {"<>", "ne1"}, {"!=", "ne2"}, {"IS", "is"}, - {"LIKE", "like"}, - {"GLOB", "glob"}, +-- NOTE: This test needs refactoring after deletion of GLOB & +-- type restrictions for LIKE. (See #3572) +-- {"LIKE", "like"}, +-- {"GLOB", "glob"}, {"AND", "and"}, {"OR", "or"}, {"MATCH", "match"}, @@ -96,7 +98,12 @@ operations = { {"+", "-"}, {"<<", ">>", "&", "|"}, {"<", "<=", ">", ">="}, - {"=", "==", "!=", "<>", "LIKE", "GLOB"}, --"MATCH", "REGEXP"}, +-- NOTE: This test needs refactoring after deletion of GLOB & +-- type restrictions for LIKE. (See #3572) +-- Another NOTE: MATCH & REGEXP aren't supported in Tarantool & +-- are waiting for their hour, don't confuse them +-- being commented with ticket above. + {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, --"MATCH", "REGEXP"}, {"AND"}, {"OR"}, } @@ -475,6 +482,7 @@ for _, op in ipairs(oplist) do end end end + --------------------------------------------------------------------------- -- Test the IS and IS NOT operators. -- diff --git a/test/sql-tap/gh-3251-string-pattern-comparison.test.lua b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua new file mode 100755 index 0000000..2a787f2 --- /dev/null +++ b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua @@ -0,0 +1,213 @@ +#!/usr/bin/env tarantool +test = require("sqltester") +test:plan(128) + +local prefix = "like-test-" + +-- Unicode byte sequences. +local valid_testcases = { + '\x01', + '\x09', + '\x1F', + '\x7F', + '\xC2\x80', + '\xC2\x90', + '\xC2\x9F', + '\xE2\x80\xA8', + '\x20\x0B', + '\xE2\x80\xA9', +} + +-- Non-Unicode byte sequences. +local invalid_testcases = { + '\xE2\x80', + '\xFE\xFF', + '\xC2', + '\xED\xB0\x80', + '\xD0', +} + +local like_test_cases = +{ + {"1.1", + "SELECT 'AB' LIKE '_B';", + {0, {1}} }, + {"1.2", + "SELECT 'CD' LIKE '_B';", + {0, {0}} }, + {"1.3", + "SELECT '' LIKE '_B';", + {0, {0}} }, + {"1.4", + "SELECT 'AB' LIKE '%B';", + {0, {1}} }, + {"1.5", + "SELECT 'CD' LIKE '%B';", + {0, {0}} }, + {"1.6", + "SELECT '' LIKE '%B';", + {0, {0}} }, + {"1.7", + "SELECT 'AB' LIKE 'A__';", + {0, {0}} }, + {"1.8", + "SELECT 'CD' LIKE 'A__';", + {0, {0}} }, + {"1.9", + "SELECT '' LIKE 'A__';", + {0, {0}} }, + {"1.10", + "SELECT 'AB' LIKE 'A_';", + {0, {1}} }, + {"1.11", + "SELECT 'CD' LIKE 'A_';", + {0, {0}} }, + {"1.12", + "SELECT '' LIKE 'A_';", + {0, {0}} }, + {"1.13", + "SELECT 'AB' LIKE 'A';", + {0, {0}} }, + {"1.14", + "SELECT 'CD' LIKE 'A';", + {0, {0}} }, + {"1.15", + "SELECT '' LIKE 'A';", + {0, {0}} }, + {"1.16", + "SELECT 'AB' LIKE '_';", + {0, {0}} }, + {"1.17", + "SELECT 'CD' LIKE '_';", + {0, {0}} }, + {"1.18", + "SELECT '' LIKE '_';", + {0, {0}} }, + {"1.19", + "SELECT 'AB' LIKE '__';", + {0, {1}} }, + {"1.20", + "SELECT 'CD' LIKE '__';", + {0, {1}} }, + {"1.21", + "SELECT '' LIKE '__';", + {0, {0}} }, + {"1.22", + "SELECT 'AB' LIKE '%A';", + {0, {0}} }, + {"1.23", + "SELECT 'AB' LIKE '%C';", + {0, {0}} }, + {"1.24", + "SELECT 'ab' LIKE '%df';", + {0, {0}} }, + {"1.25", + "SELECT 'abCDF' LIKE '%df';", + {0, {1}} }, + {"1.26", + "SELECT 'CDF' LIKE '%df';", + {0, {1}} }, + {"1.27", + "SELECT 'ab' LIKE 'a_';", + {0, {1}} }, + {"1.28", + "SELECT 'abCDF' LIKE 'a_';", + {0, {0}} }, + {"1.29", + "SELECT 'CDF' LIKE 'a_';", + {0, {0}} }, + {"1.30", + "SELECT 'ab' LIKE 'ab%';", + {0, {1}} }, + {"1.31", + "SELECT 'abCDF' LIKE 'ab%';", + {0, {1}} }, + {"1.32", + "SELECT 'CDF' LIKE 'ab%';", + {0, {0}} }, + {"1.33", + "SELECT 'ab' LIKE 'abC%';", + {0, {0}} }, + {"1.34", + "SELECT 'abCDF' LIKE 'abC%';", + {0, {1}} }, + {"1.35", + "SELECT 'CDF' LIKE 'abC%';", + {0, {0}} }, + {"1.36", + "SELECT 'ab' LIKE 'a_%';", + {0, {1}} }, + {"1.37", + "SELECT 'abCDF' LIKE 'a_%';", + {0, {1}} }, + {"1.38", + "SELECT 'CDF' LIKE 'a_%';", + {0, {0}} }, +} + +test:do_catchsql_set_test(like_test_cases, prefix) + +-- Invalid testcases. +for i, tested_string in ipairs(invalid_testcases) do + + -- We should raise an error in case + -- pattern contains invalid characters. + + local test_name = prefix .. "2." .. tostring(i) + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" + test:do_catchsql_test(test_name, test_itself, + {1, "LIKE or GLOB pattern can only contain UTF-8 characters"}) + + test_name = prefix .. "3." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" + test:do_catchsql_test(test_name, test_itself, + {1, "LIKE or GLOB pattern can only contain UTF-8 characters"}) + + test_name = prefix .. "4." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" + test:do_catchsql_test(test_name, test_itself, + {1, "LIKE or GLOB pattern can only contain UTF-8 characters"}) + + -- Just skipping if row value predicand contains invalid character. + + test_name = prefix .. "5." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "6." .. tostring(i) + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "7." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) +end + +-- Valid testcases. +for i, tested_string in ipairs(valid_testcases) do + test_name = prefix .. "8." .. tostring(i) + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "9." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "10." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "11." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "12." .. tostring(i) + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "13." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) +end + +test:finish_test() ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-08-01 10:35 ` Nikita Tatunov @ 2018-08-01 10:51 ` Nikita Tatunov 2018-08-01 13:56 ` Alex Khatskevich 0 siblings, 1 reply; 22+ messages in thread From: Nikita Tatunov @ 2018-08-01 10:51 UTC (permalink / raw) To: Alexander Turenko; +Cc: avkhatskevich, tarantool-patches [-- Attachment #1: Type: text/plain, Size: 22234 bytes --] diff --git a/src/box/sql/func.c b/src/box/sql/func.c index c06e3bd..7f93ef6 100644 --- a/src/box/sql/func.c +++ b/src/box/sql/func.c @@ -617,13 +617,17 @@ struct compareInfo { u8 noCase; /* true to ignore case differences */ }; -/* - * For LIKE and GLOB matching on EBCDIC machines, assume that every - * character is exactly one byte in size. Also, provde the Utf8Read() - * macro for fast reading of the next character in the common case where - * the next character is ASCII. +/** + * Providing there are symbols in string s this + * macro returns UTF-8 code of character and + * promotes pointer to the next symbol in the string. + * Otherwise return code is SQL_END_OF_STRING. */ -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status) +#define Utf8Read(s, e) (((s) < (e)) ? \ + ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status)) : 0) + +#define SQL_END_OF_STRING 0 +#define SQL_INVALID_UTF8_SYMBOL 0xfffd static const struct compareInfo globInfo = { '*', '?', '[', 0 }; @@ -638,19 +642,16 @@ static const struct compareInfo likeInfoNorm = { '%', '_', 0, 1 }; static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; /* - * Possible error returns from patternMatch() + * Possible error returns from sql_utf8_pattern_compare() */ #define SQLITE_MATCH 0 #define SQLITE_NOMATCH 1 #define SQLITE_NOWILDCARDMATCH 2 +#define SQL_PROHIBITED_PATTERN 3 -/* - * Compare two UTF-8 strings for equality where the first string is - * a GLOB or LIKE expression. Return values: - * - * SQLITE_MATCH: Match - * SQLITE_NOMATCH: No match - * SQLITE_NOWILDCARDMATCH: No match in spite of having * or % wildcards. +/** + * Compare two UTF-8 strings for equality where the first string + * is a GLOB or LIKE expression. * * Globbing rules: * @@ -663,92 +664,136 @@ static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; * * [^...] Matches one character not in the enclosed list. * - * With the [...] and [^...] matching, a ']' character can be included - * in the list by making it the first character after '[' or '^'. A - * range of characters can be specified using '-'. Example: - * "[a-z]" matches any single lower-case letter. To match a '-', make - * it the last character in the list. + * With the [...] and [^...] matching, a ']' character can be + * included in the list by making it the first character after + * '[' or '^'. A range of characters can be specified using '-'. + * Example: "[a-z]" matches any single lower-case letter. + * To match a '-', make it the last character in the list. * * Like matching rules: * - * '%' Matches any sequence of zero or more characters + * '%' Matches any sequence of zero or more characters. * - ** '_' Matches any one character + ** '_' Matches any one character. * * Ec Where E is the "esc" character and c is any other - * character, including '%', '_', and esc, match exactly c. + * character, including '%', '_', and esc, match + * exactly c. * * The comments within this routine usually assume glob matching. * - * This routine is usually quick, but can be N**2 in the worst case. + * This routine is usually quick, but can be N**2 in the worst + * case. + * + * @param pattern String containing comparison pattern. + * @param string String being compared. + * @param compareInfo Information about how to compare. + * @param matchOther The escape char (LIKE) or '[' (GLOB). + * + * @retval SQLITE_MATCH: Match. + * SQLITE_NOMATCH: No match. + * SQLITE_NOWILDCARDMATCH: No match in spite of having * + * or % wildcards. + * SQL_PROHIBITED_PATTERN: Pattern contains invalid + * symbol. */ static int -patternCompare(const char * pattern, /* The glob pattern */ - const char * string, /* The string to compare against the glob */ - const struct compareInfo *pInfo, /* Information about how to do the compare */ - UChar32 matchOther /* The escape char (LIKE) or '[' (GLOB) */ - ) +sql_utf8_pattern_compare(const char * pattern, + const char * string, + const struct compareInfo *pInfo, + UChar32 matchOther) { - UChar32 c, c2; /* Next pattern and input string chars */ - UChar32 matchOne = pInfo->matchOne; /* "?" or "_" */ - UChar32 matchAll = pInfo->matchAll; /* "*" or "%" */ - UChar32 noCase = pInfo->noCase; /* True if uppercase==lowercase */ - const char *zEscaped = 0; /* One past the last escaped input char */ + /* Next pattern and input string chars */ + UChar32 c, c2; + /* "?" or "_" */ + UChar32 matchOne = pInfo->matchOne; + /* "*" or "%" */ + UChar32 matchAll = pInfo->matchAll; + /* True if uppercase==lowercase */ + UChar32 noCase = pInfo->noCase; + /* One past the last escaped input char */ + const char *zEscaped = 0; const char * pattern_end = pattern + strlen(pattern); const char * string_end = string + strlen(string); UErrorCode status = U_ZERO_ERROR; - while (pattern < pattern_end){ - c = Utf8Read(pattern, pattern_end); + while ((c = Utf8Read(pattern, pattern_end)) != SQL_END_OF_STRING) { + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; if (c == matchAll) { /* Match "*" */ - /* Skip over multiple "*" characters in the pattern. If there - * are also "?" characters, skip those as well, but consume a - * single character of the input string for each "?" skipped + /* Skip over multiple "*" characters in + * the pattern. If there are also "?" + * characters, skip those as well, but + * consume a single character of the + * input string for each "?" skipped. */ - while (pattern < pattern_end){ - c = Utf8Read(pattern, pattern_end); + while ((c = Utf8Read(pattern, pattern_end)) != + SQL_END_OF_STRING) { + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; if (c != matchAll && c != matchOne) break; - if (c == matchOne - && Utf8Read(string, string_end) == 0) { + if (c == matchOne && + (c2 = Utf8Read(string, string_end)) == + SQL_END_OF_STRING) return SQLITE_NOWILDCARDMATCH; - } + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; } - /* "*" at the end of the pattern matches */ - if (pattern == pattern_end) + /* + * "*" at the end of the pattern matches. + */ + if (c == SQL_END_OF_STRING) { + while ((c2 = Utf8Read(string, string_end)) != + SQL_END_OF_STRING) + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; return SQLITE_MATCH; + } if (c == matchOther) { if (pInfo->matchSet == 0) { c = Utf8Read(pattern, pattern_end); - if (c == 0) + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; + if (c == SQL_END_OF_STRING) return SQLITE_NOWILDCARDMATCH; } else { - /* "[...]" immediately follows the "*". We have to do a slow - * recursive search in this case, but it is an unusual case. + /* "[...]" immediately + * follows the "*". We + * have to do a slow + * recursive search in + * this case, but it is + * an unusual case. */ - assert(matchOther < 0x80); /* '[' is a single-byte character */ + assert(matchOther < 0x80); while (string < string_end) { int bMatch = - patternCompare(&pattern[-1], - string, - pInfo, - matchOther); + sql_utf8_pattern_compare( + &pattern[-1], + string, + pInfo, + matchOther); if (bMatch != SQLITE_NOMATCH) return bMatch; - Utf8Read(string, string_end); + c = Utf8Read(string, string_end); + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; } return SQLITE_NOWILDCARDMATCH; } } - /* At this point variable c contains the first character of the - * pattern string past the "*". Search in the input string for the - * first matching character and recursively continue the match from - * that point. + /* At this point variable c contains the + * first character of the pattern string + * past the "*". Search in the input + * string for the first matching + * character and recursively continue the + * match from that point. * - * For a case-insensitive search, set variable cx to be the same as - * c but in the other case and search the input string for either - * c or cx. + * For a case-insensitive search, set + * variable cx to be the same as c but in + * the other case and search the input + * string for either c or cx. */ int bMatch; @@ -756,14 +801,18 @@ patternCompare(const char * pattern, /* The glob pattern */ c = u_tolower(c); while (string < string_end){ /** - * This loop could have been implemented - * without if converting c2 to lower case - * (by holding c_upper and c_lower), however - * it is implemented this way because lower - * works better with German and Turkish - * languages. + * This loop could have been + * implemented without if + * converting c2 to lower case + * by holding c_upper and + * c_lower,however it is + * implemented this way because + * lower works better with German + * and Turkish languages. */ c2 = Utf8Read(string, string_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; if (!noCase) { if (c2 != c) continue; @@ -771,9 +820,10 @@ patternCompare(const char * pattern, /* The glob pattern */ if (c2 != c && u_tolower(c2) != c) continue; } - bMatch = - patternCompare(pattern, string, - pInfo, matchOther); + bMatch = sql_utf8_pattern_compare(pattern, + string, + pInfo, + matchOther); if (bMatch != SQLITE_NOMATCH) return bMatch; } @@ -782,7 +832,9 @@ patternCompare(const char * pattern, /* The glob pattern */ if (c == matchOther) { if (pInfo->matchSet == 0) { c = Utf8Read(pattern, pattern_end); - if (c == 0) + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; + if (c == SQL_END_OF_STRING) return SQLITE_NOMATCH; zEscaped = pattern; } else { @@ -790,23 +842,33 @@ patternCompare(const char * pattern, /* The glob pattern */ int seen = 0; int invert = 0; c = Utf8Read(string, string_end); + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; if (string == string_end) return SQLITE_NOMATCH; c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; if (c2 == '^') { invert = 1; c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; } if (c2 == ']') { if (c == ']') seen = 1; c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; } - while (c2 && c2 != ']') { + while (c2 != SQL_END_OF_STRING && c2 != ']') { if (c2 == '-' && pattern[0] != ']' && pattern < pattern_end && prior_c > 0) { c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; if (c >= prior_c && c <= c2) seen = 1; prior_c = 0; @@ -817,29 +879,36 @@ patternCompare(const char * pattern, /* The glob pattern */ prior_c = c2; } c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; } - if (pattern == pattern_end || (seen ^ invert) == 0) { + if (pattern == pattern_end || + (seen ^ invert) == 0) { return SQLITE_NOMATCH; } continue; } } c2 = Utf8Read(string, string_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; if (c == c2) continue; if (noCase){ /** - * Small optimisation. Reduce number of calls - * to u_tolower function. - * SQL standards suggest use to_upper for symbol - * normalisation. However, using to_lower allows to - * respect Turkish 'İ' in default locale. + * Small optimisation. Reduce number of + * calls to u_tolower function. SQL + * standards suggest use to_upper for + * symbol normalisation. However, using + * to_lower allows to respect Turkish 'İ' + * in default locale. */ if (u_tolower(c) == c2 || c == u_tolower(c2)) continue; } - if (c == matchOne && pattern != zEscaped && c2 != 0) + if (c == matchOne && pattern != zEscaped && + c2 != SQL_END_OF_STRING) continue; return SQLITE_NOMATCH; } @@ -853,8 +922,7 @@ patternCompare(const char * pattern, /* The glob pattern */ int sqlite3_strglob(const char *zGlobPattern, const char *zString) { - return patternCompare(zGlobPattern, zString, &globInfo, - '['); + return sql_utf8_pattern_compare(zGlobPattern, zString, &globInfo, '['); } /* @@ -864,7 +932,7 @@ sqlite3_strglob(const char *zGlobPattern, const char *zString) int sqlite3_strlike(const char *zPattern, const char *zStr, unsigned int esc) { - return patternCompare(zPattern, zStr, &likeInfoNorm, esc); + return sql_utf8_pattern_compare(zPattern, zStr, &likeInfoNorm, esc); } /* @@ -910,8 +978,9 @@ likeFunc(sqlite3_context * context, int argc, sqlite3_value ** argv) zB = (const char *) sqlite3_value_text(argv[0]); zA = (const char *) sqlite3_value_text(argv[1]); - /* Limit the length of the LIKE or GLOB pattern to avoid problems - * of deep recursion and N*N behavior in patternCompare(). + /* Limit the length of the LIKE or GLOB pattern to avoid + * problems of deep recursion and N*N behavior in + * sql_utf8_pattern_compare(). */ nPat = sqlite3_value_bytes(argv[0]); testcase(nPat == db->aLimit[SQLITE_LIMIT_LIKE_PATTERN_LENGTH]); @@ -947,7 +1016,12 @@ likeFunc(sqlite3_context * context, int argc, sqlite3_value ** argv) sqlite3_like_count++; #endif int res; - res = patternCompare(zB, zA, pInfo, escape); + res = sql_utf8_pattern_compare(zB, zA, pInfo, escape); + if (res == SQL_PROHIBITED_PATTERN) { + sqlite3_result_error(context, "LIKE or GLOB pattern can only" + " contain UTF-8 characters", -1); + return; + } sqlite3_result_int(context, res == SQLITE_MATCH); } diff --git a/test-run b/test-run index 77e9327..95562e9 160000 --- a/test-run +++ b/test-run @@ -1 +1 @@ -Subproject commit 77e93279210f8c5c1fd0ed03416fa19a184f0b6d +Subproject commit 95562e95401fef4e0b755ab0bb430974b5d1a29a diff --git a/test/sql-tap/e_expr.test.lua b/test/sql-tap/e_expr.test.lua index 13d3a96..9780d2c 100755 --- a/test/sql-tap/e_expr.test.lua +++ b/test/sql-tap/e_expr.test.lua @@ -1,6 +1,6 @@ #!/usr/bin/env tarantool test = require("sqltester") -test:plan(12431) +test:plan(10665) --!./tcltestrunner.lua -- 2010 July 16 @@ -77,8 +77,10 @@ local operations = { {"<>", "ne1"}, {"!=", "ne2"}, {"IS", "is"}, - {"LIKE", "like"}, - {"GLOB", "glob"}, +-- NOTE: This test needs refactoring after deletion of GLOB & +-- type restrictions for LIKE. (See #3572) +-- {"LIKE", "like"}, +-- {"GLOB", "glob"}, {"AND", "and"}, {"OR", "or"}, {"MATCH", "match"}, @@ -96,7 +98,12 @@ operations = { {"+", "-"}, {"<<", ">>", "&", "|"}, {"<", "<=", ">", ">="}, - {"=", "==", "!=", "<>", "LIKE", "GLOB"}, --"MATCH", "REGEXP"}, +-- NOTE: This test needs refactoring after deletion of GLOB & +-- type restrictions for LIKE. (See #3572) +-- Another NOTE: MATCH & REGEXP aren't supported in Tarantool & +-- are waiting for their hour, don't confuse them +-- being commented with ticket above. + {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, --"MATCH", "REGEXP"}, {"AND"}, {"OR"}, } @@ -475,6 +482,7 @@ for _, op in ipairs(oplist) do end end end + --------------------------------------------------------------------------- -- Test the IS and IS NOT operators. -- diff --git a/test/sql-tap/gh-3251-string-pattern-comparison.test.lua b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua new file mode 100755 index 0000000..2a787f2 --- /dev/null +++ b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua @@ -0,0 +1,213 @@ +#!/usr/bin/env tarantool +test = require("sqltester") +test:plan(128) + +local prefix = "like-test-" + +-- Unicode byte sequences. +local valid_testcases = { + '\x01', + '\x09', + '\x1F', + '\x7F', + '\xC2\x80', + '\xC2\x90', + '\xC2\x9F', + '\xE2\x80\xA8', + '\x20\x0B', + '\xE2\x80\xA9', +} + +-- Non-Unicode byte sequences. +local invalid_testcases = { + '\xE2\x80', + '\xFE\xFF', + '\xC2', + '\xED\xB0\x80', + '\xD0', +} + +local like_test_cases = +{ + {"1.1", + "SELECT 'AB' LIKE '_B';", + {0, {1}} }, + {"1.2", + "SELECT 'CD' LIKE '_B';", + {0, {0}} }, + {"1.3", + "SELECT '' LIKE '_B';", + {0, {0}} }, + {"1.4", + "SELECT 'AB' LIKE '%B';", + {0, {1}} }, + {"1.5", + "SELECT 'CD' LIKE '%B';", + {0, {0}} }, + {"1.6", + "SELECT '' LIKE '%B';", + {0, {0}} }, + {"1.7", + "SELECT 'AB' LIKE 'A__';", + {0, {0}} }, + {"1.8", + "SELECT 'CD' LIKE 'A__';", + {0, {0}} }, + {"1.9", + "SELECT '' LIKE 'A__';", + {0, {0}} }, + {"1.10", + "SELECT 'AB' LIKE 'A_';", + {0, {1}} }, + {"1.11", + "SELECT 'CD' LIKE 'A_';", + {0, {0}} }, + {"1.12", + "SELECT '' LIKE 'A_';", + {0, {0}} }, + {"1.13", + "SELECT 'AB' LIKE 'A';", + {0, {0}} }, + {"1.14", + "SELECT 'CD' LIKE 'A';", + {0, {0}} }, + {"1.15", + "SELECT '' LIKE 'A';", + {0, {0}} }, + {"1.16", + "SELECT 'AB' LIKE '_';", + {0, {0}} }, + {"1.17", + "SELECT 'CD' LIKE '_';", + {0, {0}} }, + {"1.18", + "SELECT '' LIKE '_';", + {0, {0}} }, + {"1.19", + "SELECT 'AB' LIKE '__';", + {0, {1}} }, + {"1.20", + "SELECT 'CD' LIKE '__';", + {0, {1}} }, + {"1.21", + "SELECT '' LIKE '__';", + {0, {0}} }, + {"1.22", + "SELECT 'AB' LIKE '%A';", + {0, {0}} }, + {"1.23", + "SELECT 'AB' LIKE '%C';", + {0, {0}} }, + {"1.24", + "SELECT 'ab' LIKE '%df';", + {0, {0}} }, + {"1.25", + "SELECT 'abCDF' LIKE '%df';", + {0, {1}} }, + {"1.26", + "SELECT 'CDF' LIKE '%df';", + {0, {1}} }, + {"1.27", + "SELECT 'ab' LIKE 'a_';", + {0, {1}} }, + {"1.28", + "SELECT 'abCDF' LIKE 'a_';", + {0, {0}} }, + {"1.29", + "SELECT 'CDF' LIKE 'a_';", + {0, {0}} }, + {"1.30", + "SELECT 'ab' LIKE 'ab%';", + {0, {1}} }, + {"1.31", + "SELECT 'abCDF' LIKE 'ab%';", + {0, {1}} }, + {"1.32", + "SELECT 'CDF' LIKE 'ab%';", + {0, {0}} }, + {"1.33", + "SELECT 'ab' LIKE 'abC%';", + {0, {0}} }, + {"1.34", + "SELECT 'abCDF' LIKE 'abC%';", + {0, {1}} }, + {"1.35", + "SELECT 'CDF' LIKE 'abC%';", + {0, {0}} }, + {"1.36", + "SELECT 'ab' LIKE 'a_%';", + {0, {1}} }, + {"1.37", + "SELECT 'abCDF' LIKE 'a_%';", + {0, {1}} }, + {"1.38", + "SELECT 'CDF' LIKE 'a_%';", + {0, {0}} }, +} + +test:do_catchsql_set_test(like_test_cases, prefix) + +-- Invalid testcases. +for i, tested_string in ipairs(invalid_testcases) do + + -- We should raise an error in case + -- pattern contains invalid characters. + + local test_name = prefix .. "2." .. tostring(i) + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" + test:do_catchsql_test(test_name, test_itself, + {1, "LIKE or GLOB pattern can only contain UTF-8 characters"}) + + test_name = prefix .. "3." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" + test:do_catchsql_test(test_name, test_itself, + {1, "LIKE or GLOB pattern can only contain UTF-8 characters"}) + + test_name = prefix .. "4." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" + test:do_catchsql_test(test_name, test_itself, + {1, "LIKE or GLOB pattern can only contain UTF-8 characters"}) + + -- Just skipping if row value predicand contains invalid character. + + test_name = prefix .. "5." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "6." .. tostring(i) + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "7." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) +end + +-- Valid testcases. +for i, tested_string in ipairs(valid_testcases) do + test_name = prefix .. "8." .. tostring(i) + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "9." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "10." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "11." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "12." .. tostring(i) + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "13." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) +end + +test:finish_test() [-- Attachment #2: Type: text/html, Size: 45127 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-08-01 10:51 ` Nikita Tatunov @ 2018-08-01 13:56 ` Alex Khatskevich 2018-08-01 18:10 ` Nikita Tatunov 0 siblings, 1 reply; 22+ messages in thread From: Alex Khatskevich @ 2018-08-01 13:56 UTC (permalink / raw) To: Nikita Tatunov, Alexander Turenko; +Cc: tarantool-patches [-- Attachment #1: Type: text/plain, Size: 25986 bytes --] On 01.08.2018 13:51, Nikita Tatunov wrote: > diff --git a/src/box/sql/func.c b/src/box/sql/func.c > index c06e3bd..7f93ef6 100644 > --- a/src/box/sql/func.c > +++ b/src/box/sql/func.c > @@ -617,13 +617,17 @@ struct compareInfo { > u8 noCase;/* true to ignore case differences */ > }; > -/* > - * For LIKE and GLOB matching on EBCDIC machines, assume that every > - * character is exactly one byte in size. Also, provde the Utf8Read() > - * macro for fast reading of the next character in the common case where > - * the next character is ASCII. > +/** > + * Providing there are symbols in string s this > + * macro returns UTF-8 code of character and > + * promotes pointer to the next symbol in the string. > + * Otherwise return code is SQL_END_OF_STRING. > */ > -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status) > +#define Utf8Read(s, e) (((s) < (e)) ? \ > +ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status)) : 0) [Later I will ask you to return this macro back, so, you may not do this] As I understand, you are returning `0` from Utf8Read in case of end of the string. Let's return `SQL_END_OF_STRING` instead of just `0`. > + > +#define SQL_END_OF_STRING 0 > +#define SQL_INVALID_UTF8_SYMBOL 0xfffd > static const struct compareInfo globInfo = { '*', '?', '[', 0 }; > @@ -638,19 +642,16 @@ static const struct compareInfo likeInfoNorm = { > '%', '_', 0, 1 }; > static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; > /* > - * Possible error returns from patternMatch() > + * Possible error returns from sql_utf8_pattern_compare() > */ > #define SQLITE_MATCH 0 > #define SQLITE_NOMATCH 1 > #define SQLITE_NOWILDCARDMATCH 2 > +#define SQL_PROHIBITED_PATTERN 3 I am not sure that the invalid (with invalid symbols) pattern can be called `prohibited`. Rename somehow? My proposal: SQL_INVALID_PATTERN. Moreover, You have named this definition with the `SQL` prefix, which is good, however, similar definitions are still prefixed with `SQLITE`. I would like you to rename those in this (preferred) or in a separate commit for consistency. > -/* > - * Compare two UTF-8 strings for equality where the first string is > - * a GLOB or LIKE expression. Return values: > - * > - * SQLITE_MATCH: Match > - * SQLITE_NOMATCH: No match > - * SQLITE_NOWILDCARDMATCH: No match in spite of having * or % > wildcards. > +/** > + * Compare two UTF-8 strings for equality where the first string > + * is a GLOB or LIKE expression. > * > * Globbing rules: > * > @@ -663,92 +664,136 @@ static const struct compareInfo likeInfoAlt = { > '%', '_', 0, 0 }; > * > * [^...] Matches one character not in the enclosed list. > * > - * With the [...] and [^...] matching, a ']' character can be included > - * in the list by making it the first character after '[' or '^'. A > - * range of characters can be specified using '-'. Example: > - * "[a-z]" matches any single lower-case letter. To match a '-', make > - * it the last character in the list. > + * With the [...] and [^...] matching, a ']' character can be > + * included in the list by making it the first character after > + * '[' or '^'. A range of characters can be specified using '-'. > + * Example: "[a-z]" matches any single lower-case letter. > + * To match a '-', make it the last character in the list. Does it work for UTF characters? I suppose no. Let's write about it here + let's file an issue to make it work with UTF characters. > * > * Like matching rules: > * > - * '%' Matches any sequence of zero or more characters > + * '%' Matches any sequence of zero or more characters. > * > - ** '_' Matches any one character > + ** '_' Matches any one character. > * > * Ec Where E is the "esc" character and c is any other > - * character, including '%', '_', and esc, match > exactly c. > + * character, including '%', '_', and esc, match > + * exactly c. > * > * The comments within this routine usually assume glob matching. > * > - * This routine is usually quick, but can be N**2 in the worst case. > + * This routine is usually quick, but can be N**2 in the worst > + * case. > + * > + * @param pattern String containing comparison pattern. > + * @param string String being compared. > + * @param compareInfo Information about how to compare. > + * @param matchOther The escape char (LIKE) or '[' (GLOB). > + * > + * @retval SQLITE_MATCH: Match. > + * SQLITE_NOMATCH: No match. > + * SQLITE_NOWILDCARDMATCH: No match in spite of having * > + * or % wildcards. > + * SQL_PROHIBITED_PATTERN: Pattern contains invalid > + * symbol. Minor: It is not very good that you use symbol and character interchangeably. I suppose that `character` should be used everywhere. > */ > static int > -patternCompare(const char * pattern,/* The glob pattern */ > - const char * string,/* The string to compare against the glob */ > - const struct compareInfo *pInfo,/* Information about how to do > the compare */ > - UChar32 matchOther/* The escape char (LIKE) or '[' (GLOB) */ > - ) > +sql_utf8_pattern_compare(const char * pattern, > +const char * string, > +const struct compareInfo *pInfo, > +UChar32 matchOther) "star" sign should stick to the attribute name. https://tarantool.io/en/doc/1.9/dev_guide/c_style_guide/#chapter-3-1-spaces To prevent such typos in the future, you can use special perl script which checks coding style in Linux kernel patches. > { > -UChar32 c, c2;/* Next pattern and input string chars */ > -UChar32 matchOne = pInfo->matchOne;/* "?" or "_" */ > -UChar32 matchAll = pInfo->matchAll;/* "*" or "%" */ > -UChar32 noCase = pInfo->noCase;/* True if uppercase==lowercase */ > -const char *zEscaped = 0;/* One past the last escaped input char */ > +/* Next pattern and input string chars */ > +UChar32 c, c2; > +/* "?" or "_" */ > +UChar32 matchOne = pInfo->matchOne; > +/* "*" or "%" */ > +UChar32 matchAll = pInfo->matchAll; > +/* True if uppercase==lowercase */ > +UChar32 noCase = pInfo->noCase; > +/* One past the last escaped input char */ > +const char *zEscaped = 0; > const char * pattern_end = pattern + strlen(pattern); > const char * string_end = string + strlen(string); > UErrorCode status = U_ZERO_ERROR; > -while (pattern < pattern_end){ > -c = Utf8Read(pattern, pattern_end); > +while ((c = Utf8Read(pattern, pattern_end)) != SQL_END_OF_STRING) { REMEMBER THIS POINT #1 > +if (c == SQL_INVALID_UTF8_SYMBOL) > +return SQL_PROHIBITED_PATTERN; > if (c == matchAll) {/* Match "*" */ > -/* Skip over multiple "*" characters in the pattern. If there > -* are also "?" characters, skip those as well, but consume a > -* single character of the input string for each "?" skipped > +/* Skip over multiple "*" characters in > +* the pattern. If there are also "?" > +* characters, skip those as well, but > +* consume a single character of the > +* input string for each "?" skipped. > */ > -while (pattern < pattern_end){ > -c = Utf8Read(pattern, pattern_end); > +while ((c = Utf8Read(pattern, pattern_end)) != > + SQL_END_OF_STRING) { > +if (c == SQL_INVALID_UTF8_SYMBOL) > +return SQL_PROHIBITED_PATTERN; > if (c != matchAll && c != matchOne) > break; > -if (c == matchOne > - && Utf8Read(string, string_end) == 0) { > +if (c == matchOne && > + (c2 = Utf8Read(string, string_end)) == > +SQL_END_OF_STRING) > return SQLITE_NOWILDCARDMATCH; > -} > +if (c2 == SQL_INVALID_UTF8_SYMBOL) > +return SQLITE_NOMATCH; > } > -/* "*" at the end of the pattern matches */ > -if (pattern == pattern_end) > +/* > +* "*" at the end of the pattern matches. > +*/ > +if (c == SQL_END_OF_STRING) { > +while ((c2 = Utf8Read(string, string_end)) != > + SQL_END_OF_STRING) > +if (c2 == SQL_INVALID_UTF8_SYMBOL) > +return SQLITE_NOMATCH; > return SQLITE_MATCH; > +} > if (c == matchOther) { > if (pInfo->matchSet == 0) { > c = Utf8Read(pattern, pattern_end); > -if (c == 0) > +if (c == SQL_INVALID_UTF8_SYMBOL) > +return SQL_PROHIBITED_PATTERN; > +if (c == SQL_END_OF_STRING) > return SQLITE_NOWILDCARDMATCH; > } else { > -/* "[...]" immediately follows the "*". We have to do a slow > -* recursive search in this case, but it is an unusual case. > +/* "[...]" immediately > +* follows the "*". We > +* have to do a slow > +* recursive search in > +* this case, but it is > +* an unusual case. > */ > -assert(matchOther < 0x80);/* '[' is a single-byte character */ > +assert(matchOther < 0x80); > while (string < string_end) { REMEMBER THIS POINT #2 > int bMatch = > -patternCompare(&pattern[-1], > - string, > - pInfo, > - matchOther); > +sql_utf8_pattern_compare( > +&pattern[-1], > +string, > +pInfo, > +matchOther); > if (bMatch != SQLITE_NOMATCH) > return bMatch; > -Utf8Read(string, string_end); > +c = Utf8Read(string, string_end); > +if (c == SQL_INVALID_UTF8_SYMBOL) > +return SQLITE_NOMATCH; look at <REMEMBER THIS POINT #1,2> and other `Utf8Read` usages. You have introduced SQL_END_OF_STRING and changed `Utf8Read` pattern to use it in half of cases? Moreover,in that place you do check `string < string_end` implicitly inside of `Utf8Read` but you never use that result. I suppose you should return old iteration style and `Utf8Read` macro. ``` while (string < string_end) { c = Utf8Read(string, string_end); if (c == SQL_INVALID_UTF8_SYMBOL) return SQLITE_NOMATCH; ``` > } > return SQLITE_NOWILDCARDMATCH; > } > } > -/* At this point variable c contains the first character of the > -* pattern string past the "*". Search in the input string for the > -* first matching character and recursively continue the match from > -* that point. > +/* At this point variable c contains the > +* first character of the pattern string > +* past the "*". Search in the input > +* string for the first matching > +* character and recursively continue the > +* match from that point. > * > -* For a case-insensitive search, set variable cx to be the same as > -* c but in the other case and search the input string for either > -* c or cx. > +* For a case-insensitive search, set > +* variable cx to be the same as c but in > +* the other case and search the input > +* string for either c or cx. > */ > int bMatch; > @@ -756,14 +801,18 @@ patternCompare(const char * pattern,/* The glob > pattern */ > c = u_tolower(c); > while (string < string_end){ > /** > -* This loop could have been implemented > -* without if converting c2 to lower case > -* (by holding c_upper and c_lower), however > -* it is implemented this way because lower > -* works better with German and Turkish > -* languages. > +* This loop could have been > +* implemented without if > +* converting c2 to lower case > +* by holding c_upper and > +* c_lower,however it is > +* implemented this way because > +* lower works better with German > +* and Turkish languages. > */ > c2 = Utf8Read(string, string_end); > +if (c2 == SQL_INVALID_UTF8_SYMBOL) > +return SQLITE_NOMATCH; > if (!noCase) { > if (c2 != c) > continue; > @@ -771,9 +820,10 @@ patternCompare(const char * pattern,/* The glob > pattern */ > if (c2 != c && u_tolower(c2) != c) > continue; > } > -bMatch = > -patternCompare(pattern, string, > - pInfo, matchOther); > +bMatch = sql_utf8_pattern_compare(pattern, > + string, > + pInfo, > +matchOther); > if (bMatch != SQLITE_NOMATCH) > return bMatch; > } > @@ -782,7 +832,9 @@ patternCompare(const char * pattern,/* The glob > pattern */ > if (c == matchOther) { > if (pInfo->matchSet == 0) { > c = Utf8Read(pattern, pattern_end); > -if (c == 0) > +if (c == SQL_INVALID_UTF8_SYMBOL) > +return SQL_PROHIBITED_PATTERN; > +if (c == SQL_END_OF_STRING) > return SQLITE_NOMATCH; > zEscaped = pattern; > } else { > @@ -790,23 +842,33 @@ patternCompare(const char * pattern,/* The glob > pattern */ > int seen = 0; > int invert = 0; > c = Utf8Read(string, string_end); > +if (c == SQL_INVALID_UTF8_SYMBOL) > +return SQLITE_NOMATCH; > if (string == string_end) > return SQLITE_NOMATCH; > c2 = Utf8Read(pattern, pattern_end); > +if (c2 == SQL_INVALID_UTF8_SYMBOL) > +return SQL_PROHIBITED_PATTERN; > if (c2 == '^') { > invert = 1; > c2 = Utf8Read(pattern, pattern_end); > +if (c2 == SQL_INVALID_UTF8_SYMBOL) > +return SQL_PROHIBITED_PATTERN; > } > if (c2 == ']') { > if (c == ']') > seen = 1; > c2 = Utf8Read(pattern, pattern_end); > +if (c2 == SQL_INVALID_UTF8_SYMBOL) > +return SQL_PROHIBITED_PATTERN; > } > -while (c2 && c2 != ']') { > +while (c2 != SQL_END_OF_STRING && c2 != ']') { > if (c2 == '-' && pattern[0] != ']' > && pattern < pattern_end > && prior_c > 0) { > c2 = Utf8Read(pattern, pattern_end); > +if (c2 == SQL_INVALID_UTF8_SYMBOL) > +return SQL_PROHIBITED_PATTERN; > if (c >= prior_c && c <= c2) > seen = 1; > prior_c = 0; > @@ -817,29 +879,36 @@ patternCompare(const char * pattern,/* The glob > pattern */ > prior_c = c2; > } > c2 = Utf8Read(pattern, pattern_end); > +if (c2 == SQL_INVALID_UTF8_SYMBOL) > +return SQL_PROHIBITED_PATTERN; > } > -if (pattern == pattern_end || (seen ^ invert) == 0) { > +if (pattern == pattern_end || > + (seen ^ invert) == 0) { > return SQLITE_NOMATCH; > } > continue; > } > } > c2 = Utf8Read(string, string_end); > +if (c2 == SQL_INVALID_UTF8_SYMBOL) > +return SQLITE_NOMATCH; > if (c == c2) > continue; > if (noCase){ > /** > -* Small optimisation. Reduce number of calls > -* to u_tolower function. > -* SQL standards suggest use to_upper for symbol > -* normalisation. However, using to_lower allows to > -* respect Turkish 'İ' in default locale. > +* Small optimisation. Reduce number of > +* calls to u_tolower function. SQL > +* standards suggest use to_upper for > +* symbol normalisation. However, using > +* to_lower allows to respect Turkish 'İ' > +* in default locale. > */ > if (u_tolower(c) == c2 || > c == u_tolower(c2)) > continue; > } > -if (c == matchOne && pattern != zEscaped && c2 != 0) > +if (c == matchOne && pattern != zEscaped && > + c2 != SQL_END_OF_STRING) > continue; > return SQLITE_NOMATCH; > } > @@ -853,8 +922,7 @@ patternCompare(const char * pattern,/* The glob > pattern */ > int > sqlite3_strglob(const char *zGlobPattern, const char *zString) > { > -return patternCompare(zGlobPattern, zString, &globInfo, > - '['); > +return sql_utf8_pattern_compare(zGlobPattern, zString, &globInfo, '['); > } > /* > @@ -864,7 +932,7 @@ sqlite3_strglob(const char *zGlobPattern, const > char *zString) > int > sqlite3_strlike(const char *zPattern, const char *zStr, unsigned int esc) > { > -return patternCompare(zPattern, zStr, &likeInfoNorm, esc); > +return sql_utf8_pattern_compare(zPattern, zStr, &likeInfoNorm, esc); > } > /* > @@ -910,8 +978,9 @@ likeFunc(sqlite3_context * context, int argc, > sqlite3_value ** argv) > zB = (const char *) sqlite3_value_text(argv[0]); > zA = (const char *) sqlite3_value_text(argv[1]); > -/* Limit the length of the LIKE or GLOB pattern to avoid problems > -* of deep recursion and N*N behavior in patternCompare(). > +/* Limit the length of the LIKE or GLOB pattern to avoid > +* problems of deep recursion and N*N behavior in > +* sql_utf8_pattern_compare(). > */ > nPat = sqlite3_value_bytes(argv[0]); > testcase(nPat == db->aLimit[SQLITE_LIMIT_LIKE_PATTERN_LENGTH]); > @@ -947,7 +1016,12 @@ likeFunc(sqlite3_context * context, int argc, > sqlite3_value ** argv) > sqlite3_like_count++; > #endif > int res; > -res = patternCompare(zB, zA, pInfo, escape); > +res = sql_utf8_pattern_compare(zB, zA, pInfo, escape); > +if (res == SQL_PROHIBITED_PATTERN) { > +sqlite3_result_error(context, "LIKE or GLOB pattern can only" > + " contain UTF-8 characters", -1); > +return; > +} > sqlite3_result_int(context, res == SQLITE_MATCH); > } > diff --git a/test-run b/test-run > index 77e9327..95562e9 160000 > --- a/test-run > +++ b/test-run > @@ -1 +1 @@ > -Subproject commit 77e93279210f8c5c1fd0ed03416fa19a184f0b6d > +Subproject commit 95562e95401fef4e0b755ab0bb430974b5d1a29a > diff --git a/test/sql-tap/e_expr.test.lua b/test/sql-tap/e_expr.test.lua > index 13d3a96..9780d2c 100755 > --- a/test/sql-tap/e_expr.test.lua > +++ b/test/sql-tap/e_expr.test.lua > @@ -1,6 +1,6 @@ > #!/usr/bin/env tarantool > test = require("sqltester") > -test:plan(12431) > +test:plan(10665) > --!./tcltestrunner.lua > -- 2010 July 16 > @@ -77,8 +77,10 @@ local operations = { > {"<>", "ne1"}, > {"!=", "ne2"}, > {"IS", "is"}, > - {"LIKE", "like"}, > - {"GLOB", "glob"}, > +-- NOTE: This test needs refactoring after deletion of GLOB & > +--type restrictions for LIKE. (See #3572) > +-- {"LIKE", "like"}, > +-- {"GLOB", "glob"}, Yes, this behavior is not valid anymore. To make sure that likes and globs will be tested in the future, please, delete this commented lines and add your own simple test, which tries to call `like` and `glob` with inappropriate types. It is important to have a functional tests for any possible behavior. > {"AND", "and"}, > {"OR", "or"}, > {"MATCH", "match"}, > @@ -96,7 +98,12 @@ operations = { > {"+", "-"}, > {"<<", ">>", "&", "|"}, > {"<", "<=", ">", ">="}, > - {"=", "==", "!=", "<>", "LIKE", "GLOB"}, --"MATCH", "REGEXP"}, > +-- NOTE: This test needs refactoring after deletion of GLOB & > +--type restrictions for LIKE. (See #3572) > +-- Another NOTE: MATCH & REGEXP aren't supported in Tarantool & > +-- are waiting for their hour, don't confuse them > +--being commented with ticket above. > + {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, --"MATCH", "REGEXP"}, > {"AND"}, > {"OR"}, > } > @@ -475,6 +482,7 @@ for _, op in ipairs(oplist) do > end > end > end > + > --------------------------------------------------------------------------- > -- Test the IS and IS NOT operators. > -- > diff --git a/test/sql-tap/gh-3251-string-pattern-comparison.test.lua > b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua > new file mode 100755 > index 0000000..2a787f2 > --- /dev/null > +++ b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua > @@ -0,0 +1,213 @@ > +#!/usr/bin/env tarantool > +test = require("sqltester") > +test:plan(128) > + > +local prefix = "like-test-" > + > +-- Unicode byte sequences. > +local valid_testcases = { > + '\x01', > + '\x09', > + '\x1F', > + '\x7F', > + '\xC2\x80', > + '\xC2\x90', > + '\xC2\x9F', > + '\xE2\x80\xA8', > + '\x20\x0B', > + '\xE2\x80\xA9', > +} optional: add descriptions to those byte sequences (what it is). > + > +-- Non-Unicode byte sequences. > +local invalid_testcases = { > + '\xE2\x80', > + '\xFE\xFF', > + '\xC2', > + '\xED\xB0\x80', > + '\xD0', > +} Place that after like_test_cases, just before it is used. > + > +local like_test_cases = > +{ > + {"1.1", > + "SELECT 'AB' LIKE '_B';", > + {0, {1}} }, > + {"1.2", > + "SELECT 'CD' LIKE '_B';", > + {0, {0}} }, > + {"1.3", > + "SELECT '' LIKE '_B';", > + {0, {0}} }, > + {"1.4", > + "SELECT 'AB' LIKE '%B';", > + {0, {1}} }, > + {"1.5", > + "SELECT 'CD' LIKE '%B';", > + {0, {0}} }, > + {"1.6", > + "SELECT '' LIKE '%B';", > + {0, {0}} }, > + {"1.7", > + "SELECT 'AB' LIKE 'A__';", > + {0, {0}} }, > + {"1.8", > + "SELECT 'CD' LIKE 'A__';", > + {0, {0}} }, > + {"1.9", > + "SELECT '' LIKE 'A__';", > + {0, {0}} }, > + {"1.10", > + "SELECT 'AB' LIKE 'A_';", > + {0, {1}} }, > + {"1.11", > + "SELECT 'CD' LIKE 'A_';", > + {0, {0}} }, > + {"1.12", > + "SELECT '' LIKE 'A_';", > + {0, {0}} }, > + {"1.13", > + "SELECT 'AB' LIKE 'A';", > + {0, {0}} }, > + {"1.14", > + "SELECT 'CD' LIKE 'A';", > + {0, {0}} }, > + {"1.15", > + "SELECT '' LIKE 'A';", > + {0, {0}} }, > + {"1.16", > + "SELECT 'AB' LIKE '_';", > + {0, {0}} }, > + {"1.17", > + "SELECT 'CD' LIKE '_';", > + {0, {0}} }, > + {"1.18", > + "SELECT '' LIKE '_';", > + {0, {0}} }, > + {"1.19", > + "SELECT 'AB' LIKE '__';", > + {0, {1}} }, > + {"1.20", > + "SELECT 'CD' LIKE '__';", > + {0, {1}} }, > + {"1.21", > + "SELECT '' LIKE '__';", > + {0, {0}} }, > + {"1.22", > + "SELECT 'AB' LIKE '%A';", > + {0, {0}} }, > + {"1.23", > + "SELECT 'AB' LIKE '%C';", > + {0, {0}} }, > + {"1.24", > + "SELECT 'ab' LIKE '%df';", > + {0, {0}} }, > + {"1.25", > + "SELECT 'abCDF' LIKE '%df';", > + {0, {1}} }, > + {"1.26", > + "SELECT 'CDF' LIKE '%df';", > + {0, {1}} }, > + {"1.27", > + "SELECT 'ab' LIKE 'a_';", > + {0, {1}} }, > + {"1.28", > + "SELECT 'abCDF' LIKE 'a_';", > + {0, {0}} }, > + {"1.29", > + "SELECT 'CDF' LIKE 'a_';", > + {0, {0}} }, > + {"1.30", > + "SELECT 'ab' LIKE 'ab%';", > + {0, {1}} }, > + {"1.31", > + "SELECT 'abCDF' LIKE 'ab%';", > + {0, {1}} }, > + {"1.32", > + "SELECT 'CDF' LIKE 'ab%';", > + {0, {0}} }, > + {"1.33", > + "SELECT 'ab' LIKE 'abC%';", > + {0, {0}} }, > + {"1.34", > + "SELECT 'abCDF' LIKE 'abC%';", > + {0, {1}} }, > + {"1.35", > + "SELECT 'CDF' LIKE 'abC%';", > + {0, {0}} }, > + {"1.36", > + "SELECT 'ab' LIKE 'a_%';", > + {0, {1}} }, > + {"1.37", > + "SELECT 'abCDF' LIKE 'a_%';", > + {0, {1}} }, > + {"1.38", > + "SELECT 'CDF' LIKE 'a_%';", > + {0, {0}} }, > +} Please, add some tests for unicode strings. (or replace letters in those tests with unicode letters) > + > +test:do_catchsql_set_test(like_test_cases, prefix) > + > +-- Invalid testcases. > +for i, tested_string in ipairs(invalid_testcases) do > + > + -- We should raise an error in case > + -- pattern contains invalid characters. > + > + local test_name = prefix .. "2." .. tostring(i) > + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" > + test:do_catchsql_test(test_name, test_itself, > + {1, "LIKE or GLOB pattern can only contain > UTF-8 characters"}) > + > + test_name = prefix .. "3." .. tostring(i) > + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" > + test:do_catchsql_test(test_name, test_itself, > + {1, "LIKE or GLOB pattern can only contain > UTF-8 characters"}) > + > + test_name = prefix .. "4." .. tostring(i) > + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" > + test:do_catchsql_test(test_name, test_itself, > + {1, "LIKE or GLOB pattern can only contain > UTF-8 characters"}) > + > + -- Just skipping if row value predicand contains invalid character. What the predicand is? Is it a typo? > + > + test_name = prefix .. "5." .. tostring(i) > + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" > + test:do_execsql_test(test_name, test_itself, {0}) > + > + test_name = prefix .. "6." .. tostring(i) > + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" > + test:do_execsql_test(test_name, test_itself, {0}) > + > + test_name = prefix .. "7." .. tostring(i) > + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" > + test:do_execsql_test(test_name, test_itself, {0}) > +end > + > +-- Valid testcases. > +for i, tested_string in ipairs(valid_testcases) do > + test_name = prefix .. "8." .. tostring(i) > + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" > + test:do_execsql_test(test_name, test_itself, {0}) > + > + test_name = prefix .. "9." .. tostring(i) > + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" > + test:do_execsql_test(test_name, test_itself, {0}) > + > + test_name = prefix .. "10." .. tostring(i) > + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" > + test:do_execsql_test(test_name,test_itself, {0}) > + > + test_name = prefix .. "11." .. tostring(i) > + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" > + test:do_execsql_test(test_name,test_itself, {0}) > + > + test_name = prefix .. "12." .. tostring(i) > + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" > + test:do_execsql_test(test_name, test_itself, {0}) > + > + test_name = prefix .. "13." .. tostring(i) > + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" > + test:do_execsql_test(test_name, test_itself, {0}) > +end > + > +test:finish_test() Why I cannot find a test of `GLOB`? Even if we delete it in the future, it should be tested. You can write much less tests for glob. E.g. this ``` select '1' glob '[0-4]'; ``` somewhy returns 0. Sorry, some of the tests I ask you to write are a little out of scope of the ticket and they should already have been written. But I suppose most of ambiguity should be clarified now. This ticket has raised important questions related to those functions. [-- Attachment #2: Type: text/html, Size: 56912 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-08-01 13:56 ` Alex Khatskevich @ 2018-08-01 18:10 ` Nikita Tatunov 2018-08-01 18:14 ` Nikita Tatunov 2018-08-08 12:38 ` Alex Khatskevich 0 siblings, 2 replies; 22+ messages in thread From: Nikita Tatunov @ 2018-08-01 18:10 UTC (permalink / raw) To: avkhatskevich; +Cc: Alexander Turenko, tarantool-patches [-- Attachment #1: Type: text/plain, Size: 50993 bytes --] ср, 1 авг. 2018 г. в 16:56, Alex Khatskevich <avkhatskevich@tarantool.org>: > > > On 01.08.2018 13:51, Nikita Tatunov wrote: > > diff --git a/src/box/sql/func.c b/src/box/sql/func.c > index c06e3bd..7f93ef6 100644 > --- a/src/box/sql/func.c > +++ b/src/box/sql/func.c > @@ -617,13 +617,17 @@ struct compareInfo { > u8 noCase; /* true to ignore case differences */ > }; > > -/* > - * For LIKE and GLOB matching on EBCDIC machines, assume that every > - * character is exactly one byte in size. Also, provde the Utf8Read() > - * macro for fast reading of the next character in the common case where > - * the next character is ASCII. > +/** > + * Providing there are symbols in string s this > + * macro returns UTF-8 code of character and > + * promotes pointer to the next symbol in the string. > + * Otherwise return code is SQL_END_OF_STRING. > */ > -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status) > +#define Utf8Read(s, e) (((s) < (e)) ? \ > + ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status)) : 0) > > [Later I will ask you to return this macro back, so, you may not do this] > As I understand, you are returning `0` from Utf8Read in case of end of the > string. > Let's return `SQL_END_OF_STRING` instead of just `0`. > Yeah, thank you, didn't notice it. > + > +#define SQL_END_OF_STRING 0 > +#define SQL_INVALID_UTF8_SYMBOL 0xfffd > > static const struct compareInfo globInfo = { '*', '?', '[', 0 }; > > @@ -638,19 +642,16 @@ static const struct compareInfo likeInfoNorm = { > '%', '_', 0, 1 }; > static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; > > /* > - * Possible error returns from patternMatch() > + * Possible error returns from sql_utf8_pattern_compare() > */ > #define SQLITE_MATCH 0 > #define SQLITE_NOMATCH 1 > #define SQLITE_NOWILDCARDMATCH 2 > +#define SQL_PROHIBITED_PATTERN 3 > > I am not sure that the invalid (with invalid symbols) pattern can be > called `prohibited`. > Rename somehow? My proposal: SQL_INVALID_PATTERN. > Probably you're right, I was also thinking of changing it somehow. > Moreover, You have named this definition with the `SQL` prefix, which is > good, however, > similar definitions are still prefixed with `SQLITE`. I would like you to > rename those in > this (preferred) or in a separate commit for consistency. > > > > Tarantool != SQLite and I think we should get away from this approach. The thing that names haven't been refactored yet isn't in my control. You can ask Nikita's opinion on it, I guess he will tell you almost the same. > -/* > - * Compare two UTF-8 strings for equality where the first string is > - * a GLOB or LIKE expression. Return values: > - * > - * SQLITE_MATCH: Match > - * SQLITE_NOMATCH: No match > - * SQLITE_NOWILDCARDMATCH: No match in spite of having * or % > wildcards. > +/** > + * Compare two UTF-8 strings for equality where the first string > + * is a GLOB or LIKE expression. > * > * Globbing rules: > * > @@ -663,92 +664,136 @@ static const struct compareInfo likeInfoAlt = { > '%', '_', 0, 0 }; > * > * [^...] Matches one character not in the enclosed list. > * > - * With the [...] and [^...] matching, a ']' character can be included > - * in the list by making it the first character after '[' or '^'. A > - * range of characters can be specified using '-'. Example: > - * "[a-z]" matches any single lower-case letter. To match a '-', make > - * it the last character in the list. > + * With the [...] and [^...] matching, a ']' character can be > + * included in the list by making it the first character after > + * '[' or '^'. A range of characters can be specified using '-'. > + * Example: "[a-z]" matches any single lower-case letter. > + * To match a '-', make it the last character in the list. > > Does it work for UTF characters? I suppose no. > Let's write about it here + let's file an issue to make it > work with UTF characters. > Soon this function is gonna be refactored & GLOB is gonna be removed anyways. > * > * Like matching rules: > * > - * '%' Matches any sequence of zero or more characters > + * '%' Matches any sequence of zero or more characters. > * > - ** '_' Matches any one character > + ** '_' Matches any one character. > * > * Ec Where E is the "esc" character and c is any other > - * character, including '%', '_', and esc, match exactly c. > + * character, including '%', '_', and esc, match > + * exactly c. > * > * The comments within this routine usually assume glob matching. > * > - * This routine is usually quick, but can be N**2 in the worst case. > + * This routine is usually quick, but can be N**2 in the worst > + * case. > + * > + * @param pattern String containing comparison pattern. > + * @param string String being compared. > + * @param compareInfo Information about how to compare. > + * @param matchOther The escape char (LIKE) or '[' (GLOB). > + * > + * @retval SQLITE_MATCH: Match. > + * SQLITE_NOMATCH: No match. > + * SQLITE_NOWILDCARDMATCH: No match in spite of having * > + * or % wildcards. > + * SQL_PROHIBITED_PATTERN: Pattern contains invalid > + * symbol. > > Minor: It is not very good that you use symbol and character > interchangeably. > I suppose that `character` should be used everywhere. > They're synonyms, aren't they? > */ > static int > -patternCompare(const char * pattern, /* The glob pattern */ > - const char * string, /* The string to compare against the glob */ > - const struct compareInfo *pInfo, /* Information about how to do > the compare */ > - UChar32 matchOther /* The escape char (LIKE) or '[' (GLOB) */ > - ) > +sql_utf8_pattern_compare(const char * pattern, > + const char * string, > + const struct compareInfo *pInfo, > + UChar32 matchOther) > > "star" sign should stick to the attribute name. > https://tarantool.io/en/doc/1.9/dev_guide/c_style_guide/#chapter-3-1-spaces > > To prevent such typos in the future, you can use special perl > script which checks coding style in Linux kernel patches. > Subject of the ticket wasn't about refactoring function, but thnx, changed it. REMEMBER THIS POINT #1 > { > - UChar32 c, c2; /* Next pattern and input string chars */ > - UChar32 matchOne = pInfo->matchOne; /* "?" or "_" */ > - UChar32 matchAll = pInfo->matchAll; /* "*" or "%" */ > - UChar32 noCase = pInfo->noCase; /* True if uppercase==lowercase */ > - const char *zEscaped = 0; /* One past the last escaped input char */ > + /* Next pattern and input string chars */ > + UChar32 c, c2; > + /* "?" or "_" */ > + UChar32 matchOne = pInfo->matchOne; > + /* "*" or "%" */ > + UChar32 matchAll = pInfo->matchAll; > + /* True if uppercase==lowercase */ > + UChar32 noCase = pInfo->noCase; > + /* One past the last escaped input char */ > + const char *zEscaped = 0; > const char * pattern_end = pattern + strlen(pattern); > const char * string_end = string + strlen(string); > UErrorCode status = U_ZERO_ERROR; > > - while (pattern < pattern_end){ > - c = Utf8Read(pattern, pattern_end); > + while ((c = Utf8Read(pattern, pattern_end)) != SQL_END_OF_STRING) { > > REMEMBER THIS POINT #1 > > + if (c == SQL_INVALID_UTF8_SYMBOL) > + return SQL_PROHIBITED_PATTERN; > if (c == matchAll) { /* Match "*" */ > - /* Skip over multiple "*" characters in the pattern. If there > - * are also "?" characters, skip those as well, but consume a > - * single character of the input string for each "?" skipped > + /* Skip over multiple "*" characters in > + * the pattern. If there are also "?" > + * characters, skip those as well, but > + * consume a single character of the > + * input string for each "?" skipped. > */ > - while (pattern < pattern_end){ > - c = Utf8Read(pattern, pattern_end); > + while ((c = Utf8Read(pattern, pattern_end)) != > + SQL_END_OF_STRING) { > + if (c == SQL_INVALID_UTF8_SYMBOL) > + return SQL_PROHIBITED_PATTERN; > if (c != matchAll && c != matchOne) > break; > - if (c == matchOne > - && Utf8Read(string, string_end) == 0) { > + if (c == matchOne && > + (c2 = Utf8Read(string, string_end)) == > + SQL_END_OF_STRING) > return SQLITE_NOWILDCARDMATCH; > - } > + if (c2 == SQL_INVALID_UTF8_SYMBOL) > + return SQLITE_NOMATCH; > } > - /* "*" at the end of the pattern matches */ > - if (pattern == pattern_end) > + /* > + * "*" at the end of the pattern matches. > + */ > + if (c == SQL_END_OF_STRING) { > + while ((c2 = Utf8Read(string, string_end)) != > + SQL_END_OF_STRING) > + if (c2 == SQL_INVALID_UTF8_SYMBOL) > + return SQLITE_NOMATCH; > return SQLITE_MATCH; > + } > if (c == matchOther) { > if (pInfo->matchSet == 0) { > c = Utf8Read(pattern, pattern_end); > - if (c == 0) > + if (c == SQL_INVALID_UTF8_SYMBOL) > + return SQL_PROHIBITED_PATTERN; > + if (c == SQL_END_OF_STRING) > return SQLITE_NOWILDCARDMATCH; > } else { > - /* "[...]" immediately follows the "*". We have to do a slow > - * recursive search in this case, but it is an unusual case. > + /* "[...]" immediately > + * follows the "*". We > + * have to do a slow > + * recursive search in > + * this case, but it is > + * an unusual case. > */ > - assert(matchOther < 0x80); /* '[' is a single-byte character */ > + assert(matchOther < 0x80); > while (string < string_end) { > > REMEMBER THIS POINT #2 > > int bMatch = > - patternCompare(&pattern[-1], > - string, > - pInfo, > - matchOther); > + sql_utf8_pattern_compare( > + &pattern[-1], > + string, > + pInfo, > + matchOther); > if (bMatch != SQLITE_NOMATCH) > return bMatch; > - Utf8Read(string, string_end); > + c = Utf8Read(string, string_end); > + if (c == SQL_INVALID_UTF8_SYMBOL) > + return SQLITE_NOMATCH; > > look at <REMEMBER THIS POINT #1,2> and other `Utf8Read` usages. > You have introduced SQL_END_OF_STRING and changed `Utf8Read` pattern to > use it in > half of cases? > > 1) Look at <REMEMBER THIS POINT #1>. 2) I have introduced it not to use explicit 0 and to fix the bug. 3) Fixed it though. > Moreover,in that place you do check `string < string_end` implicitly > inside of > `Utf8Read` but you never use that result. > I suppose you should return old iteration style and `Utf8Read` macro. > ``` > while (string < string_end) { > c = Utf8Read(string, string_end); > if (c == SQL_INVALID_UTF8_SYMBOL) > return SQLITE_NOMATCH; > ``` > I think it's better to use this approach, but yes: return prev. version of macro: 1) 'zero byte' ticket will be partially fixed. 2) 0xffff is non-unicode anyways. > } > return SQLITE_NOWILDCARDMATCH; > } > } > > - /* At this point variable c contains the first character of the > - * pattern string past the "*". Search in the input string for the > - * first matching character and recursively continue the match from > - * that point. > + /* At this point variable c contains the > + * first character of the pattern string > + * past the "*". Search in the input > + * string for the first matching > + * character and recursively continue the > + * match from that point. > * > - * For a case-insensitive search, set variable cx to be the same as > - * c but in the other case and search the input string for either > - * c or cx. > + * For a case-insensitive search, set > + * variable cx to be the same as c but in > + * the other case and search the input > + * string for either c or cx. > */ > > int bMatch; > @@ -756,14 +801,18 @@ patternCompare(const char * pattern, /* The glob > pattern */ > c = u_tolower(c); > while (string < string_end){ > /** > - * This loop could have been implemented > - * without if converting c2 to lower case > - * (by holding c_upper and c_lower), however > - * it is implemented this way because lower > - * works better with German and Turkish > - * languages. > + * This loop could have been > + * implemented without if > + * converting c2 to lower case > + * by holding c_upper and > + * c_lower,however it is > + * implemented this way because > + * lower works better with German > + * and Turkish languages. > */ > c2 = Utf8Read(string, string_end); > + if (c2 == SQL_INVALID_UTF8_SYMBOL) > + return SQLITE_NOMATCH; > if (!noCase) { > if (c2 != c) > continue; > @@ -771,9 +820,10 @@ patternCompare(const char * pattern, /* The glob > pattern */ > if (c2 != c && u_tolower(c2) != c) > continue; > } > - bMatch = > - patternCompare(pattern, string, > - pInfo, matchOther); > + bMatch = sql_utf8_pattern_compare(pattern, > + string, > + pInfo, > + matchOther); > if (bMatch != SQLITE_NOMATCH) > return bMatch; > } > @@ -782,7 +832,9 @@ patternCompare(const char * pattern, /* The glob > pattern */ > if (c == matchOther) { > if (pInfo->matchSet == 0) { > c = Utf8Read(pattern, pattern_end); > - if (c == 0) > + if (c == SQL_INVALID_UTF8_SYMBOL) > + return SQL_PROHIBITED_PATTERN; > + if (c == SQL_END_OF_STRING) > return SQLITE_NOMATCH; > zEscaped = pattern; > } else { > @@ -790,23 +842,33 @@ patternCompare(const char * pattern, /* The glob > pattern */ > int seen = 0; > int invert = 0; > c = Utf8Read(string, string_end); > + if (c == SQL_INVALID_UTF8_SYMBOL) > + return SQLITE_NOMATCH; > if (string == string_end) > return SQLITE_NOMATCH; > c2 = Utf8Read(pattern, pattern_end); > + if (c2 == SQL_INVALID_UTF8_SYMBOL) > + return SQL_PROHIBITED_PATTERN; > if (c2 == '^') { > invert = 1; > c2 = Utf8Read(pattern, pattern_end); > + if (c2 == SQL_INVALID_UTF8_SYMBOL) > + return SQL_PROHIBITED_PATTERN; > } > if (c2 == ']') { > if (c == ']') > seen = 1; > c2 = Utf8Read(pattern, pattern_end); > + if (c2 == SQL_INVALID_UTF8_SYMBOL) > + return SQL_PROHIBITED_PATTERN; > } > - while (c2 && c2 != ']') { > + while (c2 != SQL_END_OF_STRING && c2 != ']') { > if (c2 == '-' && pattern[0] != ']' > && pattern < pattern_end > && prior_c > 0) { > c2 = Utf8Read(pattern, pattern_end); > + if (c2 == SQL_INVALID_UTF8_SYMBOL) > + return SQL_PROHIBITED_PATTERN; > if (c >= prior_c && c <= c2) > seen = 1; > prior_c = 0; > @@ -817,29 +879,36 @@ patternCompare(const char * pattern, /* The glob > pattern */ > prior_c = c2; > } > c2 = Utf8Read(pattern, pattern_end); > + if (c2 == SQL_INVALID_UTF8_SYMBOL) > + return SQL_PROHIBITED_PATTERN; > } > - if (pattern == pattern_end || (seen ^ invert) == 0) { > + if (pattern == pattern_end || > + (seen ^ invert) == 0) { > return SQLITE_NOMATCH; > } > continue; > } > } > c2 = Utf8Read(string, string_end); > + if (c2 == SQL_INVALID_UTF8_SYMBOL) > + return SQLITE_NOMATCH; > if (c == c2) > continue; > if (noCase){ > /** > - * Small optimisation. Reduce number of calls > - * to u_tolower function. > - * SQL standards suggest use to_upper for symbol > - * normalisation. However, using to_lower allows to > - * respect Turkish 'İ' in default locale. > + * Small optimisation. Reduce number of > + * calls to u_tolower function. SQL > + * standards suggest use to_upper for > + * symbol normalisation. However, using > + * to_lower allows to respect Turkish 'İ' > + * in default locale. > */ > if (u_tolower(c) == c2 || > c == u_tolower(c2)) > continue; > } > - if (c == matchOne && pattern != zEscaped && c2 != 0) > + if (c == matchOne && pattern != zEscaped && > + c2 != SQL_END_OF_STRING) > continue; > return SQLITE_NOMATCH; > } > @@ -853,8 +922,7 @@ patternCompare(const char * pattern, /* The glob > pattern */ > int > sqlite3_strglob(const char *zGlobPattern, const char *zString) > { > - return patternCompare(zGlobPattern, zString, &globInfo, > - '['); > + return sql_utf8_pattern_compare(zGlobPattern, zString, &globInfo, '['); > } > > /* > @@ -864,7 +932,7 @@ sqlite3_strglob(const char *zGlobPattern, const char > *zString) > int > sqlite3_strlike(const char *zPattern, const char *zStr, unsigned int esc) > { > - return patternCompare(zPattern, zStr, &likeInfoNorm, esc); > + return sql_utf8_pattern_compare(zPattern, zStr, &likeInfoNorm, esc); > } > > /* > @@ -910,8 +978,9 @@ likeFunc(sqlite3_context * context, int argc, > sqlite3_value ** argv) > zB = (const char *) sqlite3_value_text(argv[0]); > zA = (const char *) sqlite3_value_text(argv[1]); > > - /* Limit the length of the LIKE or GLOB pattern to avoid problems > - * of deep recursion and N*N behavior in patternCompare(). > + /* Limit the length of the LIKE or GLOB pattern to avoid > + * problems of deep recursion and N*N behavior in > + * sql_utf8_pattern_compare(). > */ > nPat = sqlite3_value_bytes(argv[0]); > testcase(nPat == db->aLimit[SQLITE_LIMIT_LIKE_PATTERN_LENGTH]); > @@ -947,7 +1016,12 @@ likeFunc(sqlite3_context * context, int argc, > sqlite3_value ** argv) > sqlite3_like_count++; > #endif > int res; > - res = patternCompare(zB, zA, pInfo, escape); > + res = sql_utf8_pattern_compare(zB, zA, pInfo, escape); > + if (res == SQL_PROHIBITED_PATTERN) { > + sqlite3_result_error(context, "LIKE or GLOB pattern can only" > + " contain UTF-8 characters", -1); > + return; > + } > sqlite3_result_int(context, res == SQLITE_MATCH); > } > > diff --git a/test-run b/test-run > index 77e9327..95562e9 160000 > --- a/test-run > +++ b/test-run > @@ -1 +1 @@ > -Subproject commit 77e93279210f8c5c1fd0ed03416fa19a184f0b6d > +Subproject commit 95562e95401fef4e0b755ab0bb430974b5d1a29a > diff --git a/test/sql-tap/e_expr.test.lua b/test/sql-tap/e_expr.test.lua > index 13d3a96..9780d2c 100755 > --- a/test/sql-tap/e_expr.test.lua > +++ b/test/sql-tap/e_expr.test.lua > @@ -1,6 +1,6 @@ > #!/usr/bin/env tarantool > test = require("sqltester") > -test:plan(12431) > +test:plan(10665) > > --!./tcltestrunner.lua > -- 2010 July 16 > @@ -77,8 +77,10 @@ local operations = { > {"<>", "ne1"}, > {"!=", "ne2"}, > {"IS", "is"}, > - {"LIKE", "like"}, > - {"GLOB", "glob"}, > +-- NOTE: This test needs refactoring after deletion of GLOB & > +-- type restrictions for LIKE. (See #3572) > +-- {"LIKE", "like"}, > +-- {"GLOB", "glob"}, > > Yes, this behavior is not valid anymore. > To make sure that likes and globs will be tested in the future, please, > delete this > commented lines and add your own simple test, which tries to call `like` > and `glob` > with inappropriate types. > It is important to have a functional tests for any possible behavior. > 1) Globs are going to be deleted as you can see few lines above. 2) I'm gonna refactor that when LIKE is in its final state (basically after #3572 is closed & static build is into 2.1). > {"AND", "and"}, > {"OR", "or"}, > {"MATCH", "match"}, > @@ -96,7 +98,12 @@ operations = { > {"+", "-"}, > {"<<", ">>", "&", "|"}, > {"<", "<=", ">", ">="}, > - {"=", "==", "!=", "<>", "LIKE", "GLOB"}, --"MATCH", "REGEXP"}, > +-- NOTE: This test needs refactoring after deletion of GLOB & > +-- type restrictions for LIKE. (See #3572) > +-- Another NOTE: MATCH & REGEXP aren't supported in Tarantool & > +-- are waiting for their hour, don't confuse them > +-- being commented with ticket above. > + {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, --"MATCH", "REGEXP"}, > {"AND"}, > {"OR"}, > } > @@ -475,6 +482,7 @@ for _, op in ipairs(oplist) do > end > end > end > + > > --------------------------------------------------------------------------- > -- Test the IS and IS NOT operators. > -- > diff --git a/test/sql-tap/gh-3251-string-pattern-comparison.test.lua > b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua > new file mode 100755 > index 0000000..2a787f2 > --- /dev/null > +++ b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua > @@ -0,0 +1,213 @@ > +#!/usr/bin/env tarantool > +test = require("sqltester") > +test:plan(128) > + > +local prefix = "like-test-" > + > +-- Unicode byte sequences. > +local valid_testcases = { > + '\x01', > + '\x09', > + '\x1F', > + '\x7F', > + '\xC2\x80', > + '\xC2\x90', > + '\xC2\x9F', > + '\xE2\x80\xA8', > + '\x20\x0B', > + '\xE2\x80\xA9', > +} > > optional: add descriptions to those byte sequences (what it is). > Some valid unicode symbols which is only that matters for LIKE operator. > + > +-- Non-Unicode byte sequences. > +local invalid_testcases = { > + '\xE2\x80', > + '\xFE\xFF', > + '\xC2', > + '\xED\xB0\x80', > + '\xD0', > +} > > Place that after like_test_cases, just before it is used. > Changed it. > + > +local like_test_cases = > +{ > + {"1.1", > + "SELECT 'AB' LIKE '_B';", > + {0, {1}} }, > + {"1.2", > + "SELECT 'CD' LIKE '_B';", > + {0, {0}} }, > + {"1.3", > + "SELECT '' LIKE '_B';", > + {0, {0}} }, > + {"1.4", > + "SELECT 'AB' LIKE '%B';", > + {0, {1}} }, > + {"1.5", > + "SELECT 'CD' LIKE '%B';", > + {0, {0}} }, > + {"1.6", > + "SELECT '' LIKE '%B';", > + {0, {0}} }, > + {"1.7", > + "SELECT 'AB' LIKE 'A__';", > + {0, {0}} }, > + {"1.8", > + "SELECT 'CD' LIKE 'A__';", > + {0, {0}} }, > + {"1.9", > + "SELECT '' LIKE 'A__';", > + {0, {0}} }, > + {"1.10", > + "SELECT 'AB' LIKE 'A_';", > + {0, {1}} }, > + {"1.11", > + "SELECT 'CD' LIKE 'A_';", > + {0, {0}} }, > + {"1.12", > + "SELECT '' LIKE 'A_';", > + {0, {0}} }, > + {"1.13", > + "SELECT 'AB' LIKE 'A';", > + {0, {0}} }, > + {"1.14", > + "SELECT 'CD' LIKE 'A';", > + {0, {0}} }, > + {"1.15", > + "SELECT '' LIKE 'A';", > + {0, {0}} }, > + {"1.16", > + "SELECT 'AB' LIKE '_';", > + {0, {0}} }, > + {"1.17", > + "SELECT 'CD' LIKE '_';", > + {0, {0}} }, > + {"1.18", > + "SELECT '' LIKE '_';", > + {0, {0}} }, > + {"1.19", > + "SELECT 'AB' LIKE '__';", > + {0, {1}} }, > + {"1.20", > + "SELECT 'CD' LIKE '__';", > + {0, {1}} }, > + {"1.21", > + "SELECT '' LIKE '__';", > + {0, {0}} }, > + {"1.22", > + "SELECT 'AB' LIKE '%A';", > + {0, {0}} }, > + {"1.23", > + "SELECT 'AB' LIKE '%C';", > + {0, {0}} }, > + {"1.24", > + "SELECT 'ab' LIKE '%df';", > + {0, {0}} }, > + {"1.25", > + "SELECT 'abCDF' LIKE '%df';", > + {0, {1}} }, > + {"1.26", > + "SELECT 'CDF' LIKE '%df';", > + {0, {1}} }, > + {"1.27", > + "SELECT 'ab' LIKE 'a_';", > + {0, {1}} }, > + {"1.28", > + "SELECT 'abCDF' LIKE 'a_';", > + {0, {0}} }, > + {"1.29", > + "SELECT 'CDF' LIKE 'a_';", > + {0, {0}} }, > + {"1.30", > + "SELECT 'ab' LIKE 'ab%';", > + {0, {1}} }, > + {"1.31", > + "SELECT 'abCDF' LIKE 'ab%';", > + {0, {1}} }, > + {"1.32", > + "SELECT 'CDF' LIKE 'ab%';", > + {0, {0}} }, > + {"1.33", > + "SELECT 'ab' LIKE 'abC%';", > + {0, {0}} }, > + {"1.34", > + "SELECT 'abCDF' LIKE 'abC%';", > + {0, {1}} }, > + {"1.35", > + "SELECT 'CDF' LIKE 'abC%';", > + {0, {0}} }, > + {"1.36", > + "SELECT 'ab' LIKE 'a_%';", > + {0, {1}} }, > + {"1.37", > + "SELECT 'abCDF' LIKE 'a_%';", > + {0, {1}} }, > + {"1.38", > + "SELECT 'CDF' LIKE 'a_%';", > + {0, {0}} }, > +} > > Please, add some tests for unicode strings. (or replace letters in those > tests with unicode letters) > Changed existing tests a little bit. > + > +test:do_catchsql_set_test(like_test_cases, prefix) > + > +-- Invalid testcases. > +for i, tested_string in ipairs(invalid_testcases) do > + > + -- We should raise an error in case > + -- pattern contains invalid characters. > + > + local test_name = prefix .. "2." .. tostring(i) > + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" > + test:do_catchsql_test(test_name, test_itself, > + {1, "LIKE or GLOB pattern can only contain > UTF-8 characters"}) > + > + test_name = prefix .. "3." .. tostring(i) > + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" > + test:do_catchsql_test(test_name, test_itself, > + {1, "LIKE or GLOB pattern can only contain > UTF-8 characters"}) > + > + test_name = prefix .. "4." .. tostring(i) > + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" > + test:do_catchsql_test(test_name, test_itself, > + {1, "LIKE or GLOB pattern can only contain > UTF-8 characters"}) > + > + -- Just skipping if row value predicand contains invalid character. > > What the predicand is? Is it a typo? > You can find it in ANSI SQL: <row value predicand>. Basically it's just operand inside of a predicate. > + > + test_name = prefix .. "5." .. tostring(i) > + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" > + test:do_execsql_test(test_name, test_itself, {0}) > + > + test_name = prefix .. "6." .. tostring(i) > + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" > + test:do_execsql_test(test_name, test_itself, {0}) > + > + test_name = prefix .. "7." .. tostring(i) > + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" > + test:do_execsql_test(test_name, test_itself, {0}) > +end > + > +-- Valid testcases. > +for i, tested_string in ipairs(valid_testcases) do > + test_name = prefix .. "8." .. tostring(i) > + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" > + test:do_execsql_test(test_name, test_itself, {0}) > + > + test_name = prefix .. "9." .. tostring(i) > + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" > + test:do_execsql_test(test_name, test_itself, {0}) > + > + test_name = prefix .. "10." .. tostring(i) > + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" > + test:do_execsql_test(test_name, test_itself, {0}) > + > + test_name = prefix .. "11." .. tostring(i) > + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" > + test:do_execsql_test(test_name, test_itself, {0}) > + > + test_name = prefix .. "12." .. tostring(i) > + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" > + test:do_execsql_test(test_name, test_itself, {0}) > + > + test_name = prefix .. "13." .. tostring(i) > + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" > + test:do_execsql_test(test_name, test_itself, {0}) > +end > + > +test:finish_test() > > Why I cannot find a test of `GLOB`? Even if we delete it in the future, it > should be tested. You can write much less tests for glob. > E.g. this > ``` > select '1' glob '[0-4]'; > ``` > somewhy returns 0. > I actually don't think that operator that is going to be deleted in a few days should be tested. It's just useless and redundant code. > > Sorry, some of the tests I ask you to write are a little out of scope of > the ticket and they should already have been written. > But I suppose most of ambiguity should be clarified now. This ticket has > raised important questions related to those functions. > commit f15fbaef2182084dec7cdc6c661509cb908df892 Author: N.Tatunov <hollow653@gmail.com> Date: Thu Jun 28 15:17:32 2018 +0300 sql: LIKE & GLOB pattern comparison issue Currently function that compares pattern and string for GLOB & LIKE operators doesn't work properly. It uses ICU reading function which was assumed having other return codes and the implementation for the comparison ending isn't paying attention to some special cases, hence in those cases it works improperly. With the patch applied an error will be returned in case there's an invalid UTF-8 symbol in pattern & pattern containing only valid UTF-8 symbols will not be matched with the string that contains invalid symbol. Сloses #3251 Сloses #3334 Part of #3572 diff --git a/src/box/sql/func.c b/src/box/sql/func.c index c06e3bd..7f93ef6 100644 --- a/src/box/sql/func.c +++ b/src/box/sql/func.c @@ -617,13 +617,17 @@ struct compareInfo { u8 noCase; /* true to ignore case differences */ }; -/* - * For LIKE and GLOB matching on EBCDIC machines, assume that every - * character is exactly one byte in size. Also, provde the Utf8Read() - * macro for fast reading of the next character in the common case where - * the next character is ASCII. +/** + * Providing there are symbols in string s this + * macro returns UTF-8 code of character and + * promotes pointer to the next symbol in the string. + * Otherwise return code is SQL_END_OF_STRING. */ -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status) +#define Utf8Read(s, e) (((s) < (e)) ? \ + ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status)) : 0) + +#define SQL_END_OF_STRING 0 +#define SQL_INVALID_UTF8_SYMBOL 0xfffd static const struct compareInfo globInfo = { '*', '?', '[', 0 }; @@ -638,19 +642,16 @@ static const struct compareInfo likeInfoNorm = { '%', '_', 0, 1 }; static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; /* - * Possible error returns from patternMatch() + * Possible error returns from sql_utf8_pattern_compare() */ #define SQLITE_MATCH 0 #define SQLITE_NOMATCH 1 #define SQLITE_NOWILDCARDMATCH 2 +#define SQL_PROHIBITED_PATTERN 3 -/* - * Compare two UTF-8 strings for equality where the first string is - * a GLOB or LIKE expression. Return values: - * - * SQLITE_MATCH: Match - * SQLITE_NOMATCH: No match - * SQLITE_NOWILDCARDMATCH: No match in spite of having * or % wildcards. +/** + * Compare two UTF-8 strings for equality where the first string + * is a GLOB or LIKE expression. * * Globbing rules: * @@ -663,92 +664,136 @@ static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; * * [^...] Matches one character not in the enclosed list. * - * With the [...] and [^...] matching, a ']' character can be included - * in the list by making it the first character after '[' or '^'. A - * range of characters can be specified using '-'. Example: - * "[a-z]" matches any single lower-case letter. To match a '-', make - * it the last character in the list. + * With the [...] and [^...] matching, a ']' character can be + * included in the list by making it the first character after + * '[' or '^'. A range of characters can be specified using '-'. + * Example: "[a-z]" matches any single lower-case letter. + * To match a '-', make it the last character in the list. * * Like matching rules: * - * '%' Matches any sequence of zero or more characters + * '%' Matches any sequence of zero or more characters. * - ** '_' Matches any one character + ** '_' Matches any one character. * * Ec Where E is the "esc" character and c is any other - * character, including '%', '_', and esc, match exactly c. + * character, including '%', '_', and esc, match + * exactly c. * * The comments within this routine usually assume glob matching. * - * This routine is usually quick, but can be N**2 in the worst case. + * This routine is usually quick, but can be N**2 in the worst + * case. + * + * @param pattern String containing comparison pattern. + * @param string String being compared. + * @param compareInfo Information about how to compare. + * @param matchOther The escape char (LIKE) or '[' (GLOB). + * + * @retval SQLITE_MATCH: Match. + * SQLITE_NOMATCH: No match. + * SQLITE_NOWILDCARDMATCH: No match in spite of having * + * or % wildcards. + * SQL_PROHIBITED_PATTERN: Pattern contains invalid + * symbol. */ static int -patternCompare(const char * pattern, /* The glob pattern */ - const char * string, /* The string to compare against the glob */ - const struct compareInfo *pInfo, /* Information about how to do the compare */ - UChar32 matchOther /* The escape char (LIKE) or '[' (GLOB) */ - ) +sql_utf8_pattern_compare(const char * pattern, + const char * string, + const struct compareInfo *pInfo, + UChar32 matchOther) { - UChar32 c, c2; /* Next pattern and input string chars */ - UChar32 matchOne = pInfo->matchOne; /* "?" or "_" */ - UChar32 matchAll = pInfo->matchAll; /* "*" or "%" */ - UChar32 noCase = pInfo->noCase; /* True if uppercase==lowercase */ - const char *zEscaped = 0; /* One past the last escaped input char */ + /* Next pattern and input string chars */ + UChar32 c, c2; + /* "?" or "_" */ + UChar32 matchOne = pInfo->matchOne; + /* "*" or "%" */ + UChar32 matchAll = pInfo->matchAll; + /* True if uppercase==lowercase */ + UChar32 noCase = pInfo->noCase; + /* One past the last escaped input char */ + const char *zEscaped = 0; const char * pattern_end = pattern + strlen(pattern); const char * string_end = string + strlen(string); UErrorCode status = U_ZERO_ERROR; - while (pattern < pattern_end){ - c = Utf8Read(pattern, pattern_end); + while ((c = Utf8Read(pattern, pattern_end)) != SQL_END_OF_STRING) { + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; if (c == matchAll) { /* Match "*" */ - /* Skip over multiple "*" characters in the pattern. If there - * are also "?" characters, skip those as well, but consume a - * single character of the input string for each "?" skipped + /* Skip over multiple "*" characters in + * the pattern. If there are also "?" + * characters, skip those as well, but + * consume a single character of the + * input string for each "?" skipped. */ - while (pattern < pattern_end){ - c = Utf8Read(pattern, pattern_end); + while ((c = Utf8Read(pattern, pattern_end)) != + SQL_END_OF_STRING) { + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; if (c != matchAll && c != matchOne) break; - if (c == matchOne - && Utf8Read(string, string_end) == 0) { + if (c == matchOne && + (c2 = Utf8Read(string, string_end)) == + SQL_END_OF_STRING) return SQLITE_NOWILDCARDMATCH; - } + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; } - /* "*" at the end of the pattern matches */ - if (pattern == pattern_end) + /* + * "*" at the end of the pattern matches. + */ + if (c == SQL_END_OF_STRING) { + while ((c2 = Utf8Read(string, string_end)) != + SQL_END_OF_STRING) + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; return SQLITE_MATCH; + } if (c == matchOther) { if (pInfo->matchSet == 0) { c = Utf8Read(pattern, pattern_end); - if (c == 0) + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; + if (c == SQL_END_OF_STRING) return SQLITE_NOWILDCARDMATCH; } else { - /* "[...]" immediately follows the "*". We have to do a slow - * recursive search in this case, but it is an unusual case. + /* "[...]" immediately + * follows the "*". We + * have to do a slow + * recursive search in + * this case, but it is + * an unusual case. */ - assert(matchOther < 0x80); /* '[' is a single-byte character */ + assert(matchOther < 0x80); while (string < string_end) { int bMatch = - patternCompare(&pattern[-1], - string, - pInfo, - matchOther); + sql_utf8_pattern_compare( + &pattern[-1], + string, + pInfo, + matchOther); if (bMatch != SQLITE_NOMATCH) return bMatch; - Utf8Read(string, string_end); + c = Utf8Read(string, string_end); + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; } return SQLITE_NOWILDCARDMATCH; } } - /* At this point variable c contains the first character of the - * pattern string past the "*". Search in the input string for the - * first matching character and recursively continue the match from - * that point. + /* At this point variable c contains the + * first character of the pattern string + * past the "*". Search in the input + * string for the first matching + * character and recursively continue the + * match from that point. * - * For a case-insensitive search, set variable cx to be the same as - * c but in the other case and search the input string for either - * c or cx. + * For a case-insensitive search, set + * variable cx to be the same as c but in + * the other case and search the input + * string for either c or cx. */ int bMatch; @@ -756,14 +801,18 @@ patternCompare(const char * pattern, /* The glob pattern */ c = u_tolower(c); while (string < string_end){ /** - * This loop could have been implemented - * without if converting c2 to lower case - * (by holding c_upper and c_lower), however - * it is implemented this way because lower - * works better with German and Turkish - * languages. + * This loop could have been + * implemented without if + * converting c2 to lower case + * by holding c_upper and + * c_lower,however it is + * implemented this way because + * lower works better with German + * and Turkish languages. */ c2 = Utf8Read(string, string_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; if (!noCase) { if (c2 != c) continue; @@ -771,9 +820,10 @@ patternCompare(const char * pattern, /* The glob pattern */ if (c2 != c && u_tolower(c2) != c) continue; } - bMatch = - patternCompare(pattern, string, - pInfo, matchOther); + bMatch = sql_utf8_pattern_compare(pattern, + string, + pInfo, + matchOther); if (bMatch != SQLITE_NOMATCH) return bMatch; } @@ -782,7 +832,9 @@ patternCompare(const char * pattern, /* The glob pattern */ if (c == matchOther) { if (pInfo->matchSet == 0) { c = Utf8Read(pattern, pattern_end); - if (c == 0) + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; + if (c == SQL_END_OF_STRING) return SQLITE_NOMATCH; zEscaped = pattern; } else { @@ -790,23 +842,33 @@ patternCompare(const char * pattern, /* The glob pattern */ int seen = 0; int invert = 0; c = Utf8Read(string, string_end); + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; if (string == string_end) return SQLITE_NOMATCH; c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; if (c2 == '^') { invert = 1; c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; } if (c2 == ']') { if (c == ']') seen = 1; c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; } - while (c2 && c2 != ']') { + while (c2 != SQL_END_OF_STRING && c2 != ']') { if (c2 == '-' && pattern[0] != ']' && pattern < pattern_end && prior_c > 0) { c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; if (c >= prior_c && c <= c2) seen = 1; prior_c = 0; @@ -817,29 +879,36 @@ patternCompare(const char * pattern, /* The glob pattern */ prior_c = c2; } c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_PROHIBITED_PATTERN; } - if (pattern == pattern_end || (seen ^ invert) == 0) { + if (pattern == pattern_end || + (seen ^ invert) == 0) { return SQLITE_NOMATCH; } continue; } } c2 = Utf8Read(string, string_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; if (c == c2) continue; if (noCase){ /** - * Small optimisation. Reduce number of calls - * to u_tolower function. - * SQL standards suggest use to_upper for symbol - * normalisation. However, using to_lower allows to - * respect Turkish 'İ' in default locale. + * Small optimisation. Reduce number of + * calls to u_tolower function. SQL + * standards suggest use to_upper for + * symbol normalisation. However, using + * to_lower allows to respect Turkish 'İ' + * in default locale. */ if (u_tolower(c) == c2 || c == u_tolower(c2)) continue; } - if (c == matchOne && pattern != zEscaped && c2 != 0) + if (c == matchOne && pattern != zEscaped && + c2 != SQL_END_OF_STRING) continue; return SQLITE_NOMATCH; } @@ -853,8 +922,7 @@ patternCompare(const char * pattern, /* The glob pattern */ int sqlite3_strglob(const char *zGlobPattern, const char *zString) { - return patternCompare(zGlobPattern, zString, &globInfo, - '['); + return sql_utf8_pattern_compare(zGlobPattern, zString, &globInfo, '['); } /* @@ -864,7 +932,7 @@ sqlite3_strglob(const char *zGlobPattern, const char *zString) int sqlite3_strlike(const char *zPattern, const char *zStr, unsigned int esc) { - return patternCompare(zPattern, zStr, &likeInfoNorm, esc); + return sql_utf8_pattern_compare(zPattern, zStr, &likeInfoNorm, esc); } /* @@ -910,8 +978,9 @@ likeFunc(sqlite3_context * context, int argc, sqlite3_value ** argv) zB = (const char *) sqlite3_value_text(argv[0]); zA = (const char *) sqlite3_value_text(argv[1]); - /* Limit the length of the LIKE or GLOB pattern to avoid problems - * of deep recursion and N*N behavior in patternCompare(). + /* Limit the length of the LIKE or GLOB pattern to avoid + * problems of deep recursion and N*N behavior in + * sql_utf8_pattern_compare(). */ nPat = sqlite3_value_bytes(argv[0]); testcase(nPat == db->aLimit[SQLITE_LIMIT_LIKE_PATTERN_LENGTH]); @@ -947,7 +1016,12 @@ likeFunc(sqlite3_context * context, int argc, sqlite3_value ** argv) sqlite3_like_count++; #endif int res; - res = patternCompare(zB, zA, pInfo, escape); + res = sql_utf8_pattern_compare(zB, zA, pInfo, escape); + if (res == SQL_PROHIBITED_PATTERN) { + sqlite3_result_error(context, "LIKE or GLOB pattern can only" + " contain UTF-8 characters", -1); + return; + } sqlite3_result_int(context, res == SQLITE_MATCH); } diff --git a/test/sql-tap/e_expr.test.lua b/test/sql-tap/e_expr.test.lua index 13d3a96..9780d2c 100755 --- a/test/sql-tap/e_expr.test.lua +++ b/test/sql-tap/e_expr.test.lua @@ -1,6 +1,6 @@ #!/usr/bin/env tarantool test = require("sqltester") -test:plan(12431) +test:plan(10665) --!./tcltestrunner.lua -- 2010 July 16 @@ -77,8 +77,10 @@ local operations = { {"<>", "ne1"}, {"!=", "ne2"}, {"IS", "is"}, - {"LIKE", "like"}, - {"GLOB", "glob"}, +-- NOTE: This test needs refactoring after deletion of GLOB & +-- type restrictions for LIKE. (See #3572) +-- {"LIKE", "like"}, +-- {"GLOB", "glob"}, {"AND", "and"}, {"OR", "or"}, {"MATCH", "match"}, @@ -96,7 +98,12 @@ operations = { {"+", "-"}, {"<<", ">>", "&", "|"}, {"<", "<=", ">", ">="}, - {"=", "==", "!=", "<>", "LIKE", "GLOB"}, --"MATCH", "REGEXP"}, +-- NOTE: This test needs refactoring after deletion of GLOB & +-- type restrictions for LIKE. (See #3572) +-- Another NOTE: MATCH & REGEXP aren't supported in Tarantool & +-- are waiting for their hour, don't confuse them +-- being commented with ticket above. + {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, --"MATCH", "REGEXP"}, {"AND"}, {"OR"}, } @@ -475,6 +482,7 @@ for _, op in ipairs(oplist) do end end end + --------------------------------------------------------------------------- -- Test the IS and IS NOT operators. -- diff --git a/test/sql-tap/gh-3251-string-pattern-comparison.test.lua b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua new file mode 100755 index 0000000..2a787f2 --- /dev/null +++ b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua @@ -0,0 +1,213 @@ +#!/usr/bin/env tarantool +test = require("sqltester") +test:plan(128) + +local prefix = "like-test-" + +-- Unicode byte sequences. +local valid_testcases = { + '\x01', + '\x09', + '\x1F', + '\x7F', + '\xC2\x80', + '\xC2\x90', + '\xC2\x9F', + '\xE2\x80\xA8', + '\x20\x0B', + '\xE2\x80\xA9', +} + +-- Non-Unicode byte sequences. +local invalid_testcases = { + '\xE2\x80', + '\xFE\xFF', + '\xC2', + '\xED\xB0\x80', + '\xD0', +} + +local like_test_cases = +{ + {"1.1", + "SELECT 'AB' LIKE '_B';", + {0, {1}} }, + {"1.2", + "SELECT 'CD' LIKE '_B';", + {0, {0}} }, + {"1.3", + "SELECT '' LIKE '_B';", + {0, {0}} }, + {"1.4", + "SELECT 'AB' LIKE '%B';", + {0, {1}} }, + {"1.5", + "SELECT 'CD' LIKE '%B';", + {0, {0}} }, + {"1.6", + "SELECT '' LIKE '%B';", + {0, {0}} }, + {"1.7", + "SELECT 'AB' LIKE 'A__';", + {0, {0}} }, + {"1.8", + "SELECT 'CD' LIKE 'A__';", + {0, {0}} }, + {"1.9", + "SELECT '' LIKE 'A__';", + {0, {0}} }, + {"1.10", + "SELECT 'AB' LIKE 'A_';", + {0, {1}} }, + {"1.11", + "SELECT 'CD' LIKE 'A_';", + {0, {0}} }, + {"1.12", + "SELECT '' LIKE 'A_';", + {0, {0}} }, + {"1.13", + "SELECT 'AB' LIKE 'A';", + {0, {0}} }, + {"1.14", + "SELECT 'CD' LIKE 'A';", + {0, {0}} }, + {"1.15", + "SELECT '' LIKE 'A';", + {0, {0}} }, + {"1.16", + "SELECT 'AB' LIKE '_';", + {0, {0}} }, + {"1.17", + "SELECT 'CD' LIKE '_';", + {0, {0}} }, + {"1.18", + "SELECT '' LIKE '_';", + {0, {0}} }, + {"1.19", + "SELECT 'AB' LIKE '__';", + {0, {1}} }, + {"1.20", + "SELECT 'CD' LIKE '__';", + {0, {1}} }, + {"1.21", + "SELECT '' LIKE '__';", + {0, {0}} }, + {"1.22", + "SELECT 'AB' LIKE '%A';", + {0, {0}} }, + {"1.23", + "SELECT 'AB' LIKE '%C';", + {0, {0}} }, + {"1.24", + "SELECT 'ab' LIKE '%df';", + {0, {0}} }, + {"1.25", + "SELECT 'abCDF' LIKE '%df';", + {0, {1}} }, + {"1.26", + "SELECT 'CDF' LIKE '%df';", + {0, {1}} }, + {"1.27", + "SELECT 'ab' LIKE 'a_';", + {0, {1}} }, + {"1.28", + "SELECT 'abCDF' LIKE 'a_';", + {0, {0}} }, + {"1.29", + "SELECT 'CDF' LIKE 'a_';", + {0, {0}} }, + {"1.30", + "SELECT 'ab' LIKE 'ab%';", + {0, {1}} }, + {"1.31", + "SELECT 'abCDF' LIKE 'ab%';", + {0, {1}} }, + {"1.32", + "SELECT 'CDF' LIKE 'ab%';", + {0, {0}} }, + {"1.33", + "SELECT 'ab' LIKE 'abC%';", + {0, {0}} }, + {"1.34", + "SELECT 'abCDF' LIKE 'abC%';", + {0, {1}} }, + {"1.35", + "SELECT 'CDF' LIKE 'abC%';", + {0, {0}} }, + {"1.36", + "SELECT 'ab' LIKE 'a_%';", + {0, {1}} }, + {"1.37", + "SELECT 'abCDF' LIKE 'a_%';", + {0, {1}} }, + {"1.38", + "SELECT 'CDF' LIKE 'a_%';", + {0, {0}} }, +} + +test:do_catchsql_set_test(like_test_cases, prefix) + +-- Invalid testcases. +for i, tested_string in ipairs(invalid_testcases) do + + -- We should raise an error in case + -- pattern contains invalid characters. + + local test_name = prefix .. "2." .. tostring(i) + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" + test:do_catchsql_test(test_name, test_itself, + {1, "LIKE or GLOB pattern can only contain UTF-8 characters"}) + + test_name = prefix .. "3." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" + test:do_catchsql_test(test_name, test_itself, + {1, "LIKE or GLOB pattern can only contain UTF-8 characters"}) + + test_name = prefix .. "4." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" + test:do_catchsql_test(test_name, test_itself, + {1, "LIKE or GLOB pattern can only contain UTF-8 characters"}) + + -- Just skipping if row value predicand contains invalid character. + + test_name = prefix .. "5." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "6." .. tostring(i) + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "7." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) +end + +-- Valid testcases. +for i, tested_string in ipairs(valid_testcases) do + test_name = prefix .. "8." .. tostring(i) + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "9." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "10." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "11." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "12." .. tostring(i) + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "13." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) +end + +test:finish_test() [-- Attachment #2: Type: text/html, Size: 113918 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-08-01 18:10 ` Nikita Tatunov @ 2018-08-01 18:14 ` Nikita Tatunov 2018-08-08 12:38 ` Alex Khatskevich 1 sibling, 0 replies; 22+ messages in thread From: Nikita Tatunov @ 2018-08-01 18:14 UTC (permalink / raw) To: avkhatskevich; +Cc: Alexander Turenko, tarantool-patches [-- Attachment #1: Type: text/plain, Size: 22887 bytes --] commit 5f2c0b8aa78d3638473a2a12b6d1059657144d58 Author: N.Tatunov <hollow653@gmail.com> Date: Thu Jun 28 15:17:32 2018 +0300 sql: LIKE & GLOB pattern comparison issue Currently function that compares pattern and string for GLOB & LIKE operators doesn't work properly. It uses ICU reading function which was assumed having other return codes and the implementation for the comparison ending isn't paying attention to some special cases, hence in those cases it works improperly. With the patch applied an error will be returned in case there's an invalid UTF-8 symbol in pattern & pattern containing only valid UTF-8 symbols will not be matched with the string that contains invalid symbol. Сloses #3251 Сloses #3334 Part of #3572 diff --git a/src/box/sql/func.c b/src/box/sql/func.c index c06e3bd..27c6f42 100644 --- a/src/box/sql/func.c +++ b/src/box/sql/func.c @@ -617,13 +617,16 @@ struct compareInfo { u8 noCase; /* true to ignore case differences */ }; -/* - * For LIKE and GLOB matching on EBCDIC machines, assume that every - * character is exactly one byte in size. Also, provde the Utf8Read() - * macro for fast reading of the next character in the common case where - * the next character is ASCII. +/** + * Providing there are symbols in string s this + * macro returns UTF-8 code of character and + * promotes pointer to the next symbol in the string. + * Otherwise return code is SQL_END_OF_STRING. */ -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status) +#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status)) + +#define SQL_END_OF_STRING 0xffff +#define SQL_INVALID_UTF8_SYMBOL 0xfffd static const struct compareInfo globInfo = { '*', '?', '[', 0 }; @@ -638,19 +641,16 @@ static const struct compareInfo likeInfoNorm = { '%', '_', 0, 1 }; static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; /* - * Possible error returns from patternMatch() + * Possible error returns from sql_utf8_pattern_compare() */ #define SQLITE_MATCH 0 #define SQLITE_NOMATCH 1 #define SQLITE_NOWILDCARDMATCH 2 +#define SQL_INVALID_PATTERN 3 -/* - * Compare two UTF-8 strings for equality where the first string is - * a GLOB or LIKE expression. Return values: - * - * SQLITE_MATCH: Match - * SQLITE_NOMATCH: No match - * SQLITE_NOWILDCARDMATCH: No match in spite of having * or % wildcards. +/** + * Compare two UTF-8 strings for equality where the first string + * is a GLOB or LIKE expression. * * Globbing rules: * @@ -663,92 +663,136 @@ static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; * * [^...] Matches one character not in the enclosed list. * - * With the [...] and [^...] matching, a ']' character can be included - * in the list by making it the first character after '[' or '^'. A - * range of characters can be specified using '-'. Example: - * "[a-z]" matches any single lower-case letter. To match a '-', make - * it the last character in the list. + * With the [...] and [^...] matching, a ']' character can be + * included in the list by making it the first character after + * '[' or '^'. A range of characters can be specified using '-'. + * Example: "[a-z]" matches any single lower-case letter. + * To match a '-', make it the last character in the list. * * Like matching rules: * - * '%' Matches any sequence of zero or more characters + * '%' Matches any sequence of zero or more characters. * - ** '_' Matches any one character + ** '_' Matches any one character. * * Ec Where E is the "esc" character and c is any other - * character, including '%', '_', and esc, match exactly c. + * character, including '%', '_', and esc, match + * exactly c. * * The comments within this routine usually assume glob matching. * - * This routine is usually quick, but can be N**2 in the worst case. + * This routine is usually quick, but can be N**2 in the worst + * case. + * + * @param pattern String containing comparison pattern. + * @param string String being compared. + * @param compareInfo Information about how to compare. + * @param matchOther The escape char (LIKE) or '[' (GLOB). + * + * @retval SQLITE_MATCH: Match. + * SQLITE_NOMATCH: No match. + * SQLITE_NOWILDCARDMATCH: No match in spite of having * + * or % wildcards. + * SQL_INVALID_PATTERN: Pattern contains invalid + * symbol. */ static int -patternCompare(const char * pattern, /* The glob pattern */ - const char * string, /* The string to compare against the glob */ - const struct compareInfo *pInfo, /* Information about how to do the compare */ - UChar32 matchOther /* The escape char (LIKE) or '[' (GLOB) */ - ) +sql_utf8_pattern_compare(const char *pattern, + const char *string, + const struct compareInfo *pInfo, + UChar32 matchOther) { - UChar32 c, c2; /* Next pattern and input string chars */ - UChar32 matchOne = pInfo->matchOne; /* "?" or "_" */ - UChar32 matchAll = pInfo->matchAll; /* "*" or "%" */ - UChar32 noCase = pInfo->noCase; /* True if uppercase==lowercase */ - const char *zEscaped = 0; /* One past the last escaped input char */ + /* Next pattern and input string chars */ + UChar32 c, c2; + /* "?" or "_" */ + UChar32 matchOne = pInfo->matchOne; + /* "*" or "%" */ + UChar32 matchAll = pInfo->matchAll; + /* True if uppercase==lowercase */ + UChar32 noCase = pInfo->noCase; + /* One past the last escaped input char */ + const char *zEscaped = 0; const char * pattern_end = pattern + strlen(pattern); const char * string_end = string + strlen(string); UErrorCode status = U_ZERO_ERROR; - while (pattern < pattern_end){ - c = Utf8Read(pattern, pattern_end); + while ((c = Utf8Read(pattern, pattern_end)) != SQL_END_OF_STRING) { + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_INVALID_PATTERN; if (c == matchAll) { /* Match "*" */ - /* Skip over multiple "*" characters in the pattern. If there - * are also "?" characters, skip those as well, but consume a - * single character of the input string for each "?" skipped + /* Skip over multiple "*" characters in + * the pattern. If there are also "?" + * characters, skip those as well, but + * consume a single character of the + * input string for each "?" skipped. */ - while (pattern < pattern_end){ - c = Utf8Read(pattern, pattern_end); + while ((c = Utf8Read(pattern, pattern_end)) != + SQL_END_OF_STRING) { + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_INVALID_PATTERN; if (c != matchAll && c != matchOne) break; - if (c == matchOne - && Utf8Read(string, string_end) == 0) { + if (c == matchOne && + (c2 = Utf8Read(string, string_end)) == + SQL_END_OF_STRING) return SQLITE_NOWILDCARDMATCH; - } + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; } - /* "*" at the end of the pattern matches */ - if (pattern == pattern_end) + /* + * "*" at the end of the pattern matches. + */ + if (c == SQL_END_OF_STRING) { + while ((c2 = Utf8Read(string, string_end)) != + SQL_END_OF_STRING) + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; return SQLITE_MATCH; + } if (c == matchOther) { if (pInfo->matchSet == 0) { c = Utf8Read(pattern, pattern_end); - if (c == 0) + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_INVALID_PATTERN; + if (c == SQL_END_OF_STRING) return SQLITE_NOWILDCARDMATCH; } else { - /* "[...]" immediately follows the "*". We have to do a slow - * recursive search in this case, but it is an unusual case. + /* "[...]" immediately + * follows the "*". We + * have to do a slow + * recursive search in + * this case, but it is + * an unusual case. */ - assert(matchOther < 0x80); /* '[' is a single-byte character */ - while (string < string_end) { + assert(matchOther < 0x80); + while (c != SQL_END_OF_STRING) { int bMatch = - patternCompare(&pattern[-1], - string, - pInfo, - matchOther); + sql_utf8_pattern_compare( + &pattern[-1], + string, + pInfo, + matchOther); if (bMatch != SQLITE_NOMATCH) return bMatch; - Utf8Read(string, string_end); + c = Utf8Read(string, string_end); + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; } return SQLITE_NOWILDCARDMATCH; } } - /* At this point variable c contains the first character of the - * pattern string past the "*". Search in the input string for the - * first matching character and recursively continue the match from - * that point. + /* At this point variable c contains the + * first character of the pattern string + * past the "*". Search in the input + * string for the first matching + * character and recursively continue the + * match from that point. * - * For a case-insensitive search, set variable cx to be the same as - * c but in the other case and search the input string for either - * c or cx. + * For a case-insensitive search, set + * variable cx to be the same as c but in + * the other case and search the input + * string for either c or cx. */ int bMatch; @@ -756,14 +800,18 @@ patternCompare(const char * pattern, /* The glob pattern */ c = u_tolower(c); while (string < string_end){ /** - * This loop could have been implemented - * without if converting c2 to lower case - * (by holding c_upper and c_lower), however - * it is implemented this way because lower - * works better with German and Turkish - * languages. + * This loop could have been + * implemented without if + * converting c2 to lower case + * by holding c_upper and + * c_lower,however it is + * implemented this way because + * lower works better with German + * and Turkish languages. */ c2 = Utf8Read(string, string_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; if (!noCase) { if (c2 != c) continue; @@ -771,9 +819,10 @@ patternCompare(const char * pattern, /* The glob pattern */ if (c2 != c && u_tolower(c2) != c) continue; } - bMatch = - patternCompare(pattern, string, - pInfo, matchOther); + bMatch = sql_utf8_pattern_compare(pattern, + string, + pInfo, + matchOther); if (bMatch != SQLITE_NOMATCH) return bMatch; } @@ -782,7 +831,9 @@ patternCompare(const char * pattern, /* The glob pattern */ if (c == matchOther) { if (pInfo->matchSet == 0) { c = Utf8Read(pattern, pattern_end); - if (c == 0) + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQL_INVALID_PATTERN; + if (c == SQL_END_OF_STRING) return SQLITE_NOMATCH; zEscaped = pattern; } else { @@ -790,23 +841,33 @@ patternCompare(const char * pattern, /* The glob pattern */ int seen = 0; int invert = 0; c = Utf8Read(string, string_end); + if (c == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; if (string == string_end) return SQLITE_NOMATCH; c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_INVALID_PATTERN; if (c2 == '^') { invert = 1; c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_INVALID_PATTERN; } if (c2 == ']') { if (c == ']') seen = 1; c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_INVALID_PATTERN; } - while (c2 && c2 != ']') { + while (c2 != SQL_END_OF_STRING && c2 != ']') { if (c2 == '-' && pattern[0] != ']' && pattern < pattern_end && prior_c > 0) { c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_INVALID_PATTERN; if (c >= prior_c && c <= c2) seen = 1; prior_c = 0; @@ -817,29 +878,36 @@ patternCompare(const char * pattern, /* The glob pattern */ prior_c = c2; } c2 = Utf8Read(pattern, pattern_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQL_INVALID_PATTERN; } - if (pattern == pattern_end || (seen ^ invert) == 0) { + if (pattern == pattern_end || + (seen ^ invert) == 0) { return SQLITE_NOMATCH; } continue; } } c2 = Utf8Read(string, string_end); + if (c2 == SQL_INVALID_UTF8_SYMBOL) + return SQLITE_NOMATCH; if (c == c2) continue; if (noCase){ /** - * Small optimisation. Reduce number of calls - * to u_tolower function. - * SQL standards suggest use to_upper for symbol - * normalisation. However, using to_lower allows to - * respect Turkish 'İ' in default locale. + * Small optimisation. Reduce number of + * calls to u_tolower function. SQL + * standards suggest use to_upper for + * symbol normalisation. However, using + * to_lower allows to respect Turkish 'İ' + * in default locale. */ if (u_tolower(c) == c2 || c == u_tolower(c2)) continue; } - if (c == matchOne && pattern != zEscaped && c2 != 0) + if (c == matchOne && pattern != zEscaped && + c2 != SQL_END_OF_STRING) continue; return SQLITE_NOMATCH; } @@ -853,8 +921,7 @@ patternCompare(const char * pattern, /* The glob pattern */ int sqlite3_strglob(const char *zGlobPattern, const char *zString) { - return patternCompare(zGlobPattern, zString, &globInfo, - '['); + return sql_utf8_pattern_compare(zGlobPattern, zString, &globInfo, '['); } /* @@ -864,7 +931,7 @@ sqlite3_strglob(const char *zGlobPattern, const char *zString) int sqlite3_strlike(const char *zPattern, const char *zStr, unsigned int esc) { - return patternCompare(zPattern, zStr, &likeInfoNorm, esc); + return sql_utf8_pattern_compare(zPattern, zStr, &likeInfoNorm, esc); } /* @@ -910,8 +977,9 @@ likeFunc(sqlite3_context * context, int argc, sqlite3_value ** argv) zB = (const char *) sqlite3_value_text(argv[0]); zA = (const char *) sqlite3_value_text(argv[1]); - /* Limit the length of the LIKE or GLOB pattern to avoid problems - * of deep recursion and N*N behavior in patternCompare(). + /* Limit the length of the LIKE or GLOB pattern to avoid + * problems of deep recursion and N*N behavior in + * sql_utf8_pattern_compare(). */ nPat = sqlite3_value_bytes(argv[0]); testcase(nPat == db->aLimit[SQLITE_LIMIT_LIKE_PATTERN_LENGTH]); @@ -947,7 +1015,12 @@ likeFunc(sqlite3_context * context, int argc, sqlite3_value ** argv) sqlite3_like_count++; #endif int res; - res = patternCompare(zB, zA, pInfo, escape); + res = sql_utf8_pattern_compare(zB, zA, pInfo, escape); + if (res == SQL_INVALID_PATTERN) { + sqlite3_result_error(context, "LIKE or GLOB pattern can only" + " contain UTF-8 characters", -1); + return; + } sqlite3_result_int(context, res == SQLITE_MATCH); } diff --git a/test/sql-tap/e_expr.test.lua b/test/sql-tap/e_expr.test.lua index 13d3a96..9780d2c 100755 --- a/test/sql-tap/e_expr.test.lua +++ b/test/sql-tap/e_expr.test.lua @@ -1,6 +1,6 @@ #!/usr/bin/env tarantool test = require("sqltester") -test:plan(12431) +test:plan(10665) --!./tcltestrunner.lua -- 2010 July 16 @@ -77,8 +77,10 @@ local operations = { {"<>", "ne1"}, {"!=", "ne2"}, {"IS", "is"}, - {"LIKE", "like"}, - {"GLOB", "glob"}, +-- NOTE: This test needs refactoring after deletion of GLOB & +-- type restrictions for LIKE. (See #3572) +-- {"LIKE", "like"}, +-- {"GLOB", "glob"}, {"AND", "and"}, {"OR", "or"}, {"MATCH", "match"}, @@ -96,7 +98,12 @@ operations = { {"+", "-"}, {"<<", ">>", "&", "|"}, {"<", "<=", ">", ">="}, - {"=", "==", "!=", "<>", "LIKE", "GLOB"}, --"MATCH", "REGEXP"}, +-- NOTE: This test needs refactoring after deletion of GLOB & +-- type restrictions for LIKE. (See #3572) +-- Another NOTE: MATCH & REGEXP aren't supported in Tarantool & +-- are waiting for their hour, don't confuse them +-- being commented with ticket above. + {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, --"MATCH", "REGEXP"}, {"AND"}, {"OR"}, } @@ -475,6 +482,7 @@ for _, op in ipairs(oplist) do end end end + --------------------------------------------------------------------------- -- Test the IS and IS NOT operators. -- diff --git a/test/sql-tap/gh-3251-string-pattern-comparison.test.lua b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua new file mode 100755 index 0000000..addf0e3 --- /dev/null +++ b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua @@ -0,0 +1,213 @@ +#!/usr/bin/env tarantool +test = require("sqltester") +test:plan(128) + +local prefix = "like-test-" + +local like_test_cases = +{ + {"1.1", + "SELECT 'AB' LIKE '_B';", + {0, {1}} }, + {"1.2", + "SELECT 'CD' LIKE '_B';", + {0, {0}} }, + {"1.3", + "SELECT '' LIKE '_B';", + {0, {0}} }, + {"1.4", + "SELECT 'AB' LIKE '%B';", + {0, {1}} }, + {"1.5", + "SELECT 'CD' LIKE '%B';", + {0, {0}} }, + {"1.6", + "SELECT '' LIKE '%B';", + {0, {0}} }, + {"1.7", + "SELECT 'AB' LIKE 'A__';", + {0, {0}} }, + {"1.8", + "SELECT 'CD' LIKE 'A__';", + {0, {0}} }, + {"1.9", + "SELECT '' LIKE 'A__';", + {0, {0}} }, + {"1.10", + "SELECT 'AB' LIKE 'A_';", + {0, {1}} }, + {"1.11", + "SELECT 'CD' LIKE 'A_';", + {0, {0}} }, + {"1.12", + "SELECT '' LIKE 'A_';", + {0, {0}} }, + {"1.13", + "SELECT 'AB' LIKE 'A';", + {0, {0}} }, + {"1.14", + "SELECT 'CD' LIKE 'A';", + {0, {0}} }, + {"1.15", + "SELECT '' LIKE 'A';", + {0, {0}} }, + {"1.16", + "SELECT 'AB' LIKE '_';", + {0, {0}} }, + {"1.17", + "SELECT 'CD' LIKE '_';", + {0, {0}} }, + {"1.18", + "SELECT '' LIKE '_';", + {0, {0}} }, + {"1.19", + "SELECT 'AB' LIKE '__';", + {0, {1}} }, + {"1.20", + "SELECT 'CD' LIKE '__';", + {0, {1}} }, + {"1.21", + "SELECT '' LIKE '__';", + {0, {0}} }, + {"1.22", + "SELECT 'AB' LIKE '%A';", + {0, {0}} }, + {"1.23", + "SELECT 'AB' LIKE '%C';", + {0, {0}} }, + {"1.24", + "SELECT 'ёф' LIKE '%œش';", + {0, {0}} }, + {"1.25", + "SELECT 'ёфÅŒش' LIKE '%œش';", + {0, {1}} }, + {"1.26", + "SELECT 'ÅŒش' LIKE '%œش';", + {0, {1}} }, + {"1.27", + "SELECT 'ёф' LIKE 'ё_';", + {0, {1}} }, + {"1.28", + "SELECT 'ёфÅŒش' LIKE 'ё_';", + {0, {0}} }, + {"1.29", + "SELECT 'ÅŒش' LIKE 'ё_';", + {0, {0}} }, + {"1.30", + "SELECT 'ёф' LIKE 'ёф%';", + {0, {1}} }, + {"1.31", + "SELECT 'ёфÅŒش' LIKE 'ёф%';", + {0, {1}} }, + {"1.32", + "SELECT 'ÅŒش' LIKE 'ёф%';", + {0, {0}} }, + {"1.33", + "SELECT 'ёф' LIKE 'ёфÅ%';", + {0, {0}} }, + {"1.34", + "SELECT 'ёфÅŒش' LIKE 'ёфÅ%';", + {0, {1}} }, + {"1.35", + "SELECT 'ÅŒش' LIKE 'ёфش%';", + {0, {0}} }, + {"1.36", + "SELECT 'ёф' LIKE 'ё_%';", + {0, {1}} }, + {"1.37", + "SELECT 'ёфÅŒش' LIKE 'ё_%';", + {0, {1}} }, + {"1.38", + "SELECT 'ÅŒش' LIKE 'ё_%';", + {0, {0}} }, +} + +test:do_catchsql_set_test(like_test_cases, prefix) + +-- Non-Unicode byte sequences. +local invalid_testcases = { + '\xE2\x80', + '\xFE\xFF', + '\xC2', + '\xED\xB0\x80', + '\xD0', +} + +-- Invalid testcases. +for i, tested_string in ipairs(invalid_testcases) do + + -- We should raise an error in case + -- pattern contains invalid characters. + + local test_name = prefix .. "2." .. tostring(i) + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" + test:do_catchsql_test(test_name, test_itself, + {1, "LIKE or GLOB pattern can only contain UTF-8 characters"}) + + test_name = prefix .. "3." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" + test:do_catchsql_test(test_name, test_itself, + {1, "LIKE or GLOB pattern can only contain UTF-8 characters"}) + + test_name = prefix .. "4." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" + test:do_catchsql_test(test_name, test_itself, + {1, "LIKE or GLOB pattern can only contain UTF-8 characters"}) + + -- Just skipping if row value predicand contains invalid character. + + test_name = prefix .. "5." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "6." .. tostring(i) + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "7." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) +end + +-- Unicode byte sequences. +local valid_testcases = { + '\x01', + '\x09', + '\x1F', + '\x7F', + '\xC2\x80', + '\xC2\x90', + '\xC2\x9F', + '\xE2\x80\xA8', + '\x20\x0B', + '\xE2\x80\xA9', +} + +-- Valid testcases. +for i, tested_string in ipairs(valid_testcases) do + test_name = prefix .. "8." .. tostring(i) + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "9." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "10." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "11." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "12." .. tostring(i) + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) + + test_name = prefix .. "13." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" + test:do_execsql_test(test_name, test_itself, {0}) +end + +test:finish_test() [-- Attachment #2: Type: text/html, Size: 46048 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue 2018-08-01 18:10 ` Nikita Tatunov 2018-08-01 18:14 ` Nikita Tatunov @ 2018-08-08 12:38 ` Alex Khatskevich 1 sibling, 0 replies; 22+ messages in thread From: Alex Khatskevich @ 2018-08-08 12:38 UTC (permalink / raw) To: Nikita Tatunov; +Cc: Alexander Turenko, tarantool-patches [-- Attachment #1: Type: text/plain, Size: 31853 bytes --] On 01.08.2018 21:10, Nikita Tatunov wrote: > > > ср, 1 авг. 2018 г. в 16:56, Alex Khatskevich > <avkhatskevich@tarantool.org <mailto:avkhatskevich@tarantool.org>>: > > > > On 01.08.2018 13:51, Nikita Tatunov wrote: >> diff --git a/src/box/sql/func.c b/src/box/sql/func.c >> index c06e3bd..7f93ef6 100644 >> --- a/src/box/sql/func.c >> +++ b/src/box/sql/func.c >> @@ -617,13 +617,17 @@ struct compareInfo { >> u8 noCase;/* true to ignore case differences */ >> }; >> -/* >> - * For LIKE and GLOB matching on EBCDIC machines, assume that every >> - * character is exactly one byte in size. Also, provde the >> Utf8Read() >> - * macro for fast reading of the next character in the common >> case where >> - * the next character is ASCII. >> +/** >> + * Providing there are symbols in string s this >> + * macro returns UTF-8 code of character and >> + * promotes pointer to the next symbol in the string. >> + * Otherwise return code is SQL_END_OF_STRING. >> */ >> -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status) >> +#define Utf8Read(s, e) (((s) < (e)) ? \ >> +ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status)) : 0) > [Later I will ask you to return this macro back, so, you may not > do this] > As I understand, you are returning `0` from Utf8Read in case of > end of the string. > Let's return `SQL_END_OF_STRING` instead of just `0`. > > > Yeah, thank you, didn't notice it. > >> + >> +#define SQL_END_OF_STRING 0 >> +#define SQL_INVALID_UTF8_SYMBOL 0xfffd >> static const struct compareInfo globInfo = { '*', '?', '[', 0 }; >> @@ -638,19 +642,16 @@ static const struct compareInfo >> likeInfoNorm = { '%', '_', 0, 1 }; >> static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; >> /* >> - * Possible error returns from patternMatch() >> + * Possible error returns from sql_utf8_pattern_compare() >> */ >> #define SQLITE_MATCH 0 >> #define SQLITE_NOMATCH 1 >> #define SQLITE_NOWILDCARDMATCH 2 >> +#define SQL_PROHIBITED_PATTERN 3 > I am not sure that the invalid (with invalid symbols) pattern can > be called `prohibited`. > Rename somehow? My proposal: SQL_INVALID_PATTERN. > > > Probably you're right, I was also thinking of changing it somehow. > > Moreover, You have named this definition with the `SQL` prefix, > which is good, however, > similar definitions are still prefixed with `SQLITE`. I would like > you to rename those in > this (preferred) or in a separate commit for consistency. > > Tarantool != SQLite and I think we should get away from this approach. > The thing that names haven't been refactored yet isn't in my control. > You can ask Nikita's opinion on it, I guess he will tell you almost > the same. I have consult @nikita and he asked to rename those SQLITE to SQL. > >> -/* >> - * Compare two UTF-8 strings for equality where the first string is >> - * a GLOB or LIKE expression. Return values: >> - * >> - * SQLITE_MATCH: Match >> - * SQLITE_NOMATCH: No match >> - * SQLITE_NOWILDCARDMATCH: No match in spite of having * or >> % wildcards. >> +/** >> + * Compare two UTF-8 strings for equality where the first string >> + * is a GLOB or LIKE expression. >> * >> * Globbing rules: >> * >> @@ -663,92 +664,136 @@ static const struct compareInfo >> likeInfoAlt = { '%', '_', 0, 0 }; >> * >> * [^...] Matches one character not in the enclosed list. >> * >> - * With the [...] and [^...] matching, a ']' character can be >> included >> - * in the list by making it the first character after '[' or '^'. A >> - * range of characters can be specified using '-'. Example: >> - * "[a-z]" matches any single lower-case letter. To match a >> '-', make >> - * it the last character in the list. >> + * With the [...] and [^...] matching, a ']' character can be >> + * included in the list by making it the first character after >> + * '[' or '^'. A range of characters can be specified using '-'. >> + * Example: "[a-z]" matches any single lower-case letter. >> + * To match a '-', make it the last character in the list. > Does it work for UTF characters? I suppose no. > Let's write about it here + let's file an issue to make it > work with UTF characters. > > > Soon this function is gonna be refactored & GLOB is gonna be removed > anyways. Yes. you will find comment on it below. > >> * >> * Like matching rules: >> * >> - * '%' Matches any sequence of zero or more characters >> + * '%' Matches any sequence of zero or more characters. >> * >> - ** '_' Matches any one character >> + ** '_' Matches any one character. >> * >> * Ec Where E is the "esc" character and c is any other >> - * character, including '%', '_', and esc, match >> exactly c. >> + * character, including '%', '_', and esc, match >> + * exactly c. >> * >> * The comments within this routine usually assume glob matching. >> * >> - * This routine is usually quick, but can be N**2 in the worst case. >> + * This routine is usually quick, but can be N**2 in the worst >> + * case. >> + * >> + * @param pattern String containing comparison pattern. >> + * @param string String being compared. >> + * @param compareInfo Information about how to compare. >> + * @param matchOther The escape char (LIKE) or '[' (GLOB). >> + * >> + * @retval SQLITE_MATCH: Match. >> + * SQLITE_NOMATCH: No match. >> + * SQLITE_NOWILDCARDMATCH: No match in spite of having * >> + * or % wildcards. >> + * SQL_PROHIBITED_PATTERN: Pattern contains invalid >> + * symbol. > Minor: It is not very good that you use symbol and character > interchangeably. > I suppose that `character` should be used everywhere. > > > They're synonyms, aren't they? > >> */ >> static int >> -patternCompare(const char * pattern,/* The glob pattern */ >> - const char * string,/* The string to compare against the glob */ >> - const struct compareInfo *pInfo,/* Information about how to do >> the compare */ >> - UChar32 matchOther/* The escape char (LIKE) or '[' (GLOB) */ >> - ) >> +sql_utf8_pattern_compare(const char * pattern, >> +const char * string, >> +const struct compareInfo *pInfo, >> +UChar32 matchOther) > "star" sign should stick to the attribute name. > https://tarantool.io/en/doc/1.9/dev_guide/c_style_guide/#chapter-3-1-spaces > > To prevent such typos in the future, you can use special perl > script which checks coding style in Linux kernel patches. > > > Subject of the ticket wasn't about refactoring function, but thnx, > changed it. > REMEMBER THIS POINT #1 > >> { >> -UChar32 c, c2;/* Next pattern and input string chars */ >> -UChar32 matchOne = pInfo->matchOne;/* "?" or "_" */ >> -UChar32 matchAll = pInfo->matchAll;/* "*" or "%" */ >> -UChar32 noCase = pInfo->noCase;/* True if uppercase==lowercase */ >> -const char *zEscaped = 0;/* One past the last escaped input char */ >> +/* Next pattern and input string chars */ >> +UChar32 c, c2; >> +/* "?" or "_" */ >> +UChar32 matchOne = pInfo->matchOne; >> +/* "*" or "%" */ >> +UChar32 matchAll = pInfo->matchAll; >> +/* True if uppercase==lowercase */ >> +UChar32 noCase = pInfo->noCase; >> +/* One past the last escaped input char */ >> +const char *zEscaped = 0; >> const char * pattern_end = pattern + strlen(pattern); >> const char * string_end = string + strlen(string); >> UErrorCode status = U_ZERO_ERROR; >> -while (pattern < pattern_end){ >> -c = Utf8Read(pattern, pattern_end); >> +while ((c = Utf8Read(pattern, pattern_end)) != SQL_END_OF_STRING) { > REMEMBER THIS POINT #1 >> +if (c == SQL_INVALID_UTF8_SYMBOL) >> +return SQL_PROHIBITED_PATTERN; >> if (c == matchAll) {/* Match "*" */ >> -/* Skip over multiple "*" characters in the pattern. If there >> -* are also "?" characters, skip those as well, but consume a >> -* single character of the input string for each "?" skipped >> +/* Skip over multiple "*" characters in >> +* the pattern. If there are also "?" >> +* characters, skip those as well, but >> +* consume a single character of the >> +* input string for each "?" skipped. >> */ >> -while (pattern < pattern_end){ >> -c = Utf8Read(pattern, pattern_end); >> +while ((c = Utf8Read(pattern, pattern_end)) != >> + SQL_END_OF_STRING) { >> +if (c == SQL_INVALID_UTF8_SYMBOL) >> +return SQL_PROHIBITED_PATTERN; >> if (c != matchAll && c != matchOne) >> break; >> -if (c == matchOne >> - && Utf8Read(string, string_end) == 0) { >> +if (c == matchOne && >> + (c2 = Utf8Read(string, string_end)) == >> + SQL_END_OF_STRING) >> return SQLITE_NOWILDCARDMATCH; >> -} >> +if (c2 == SQL_INVALID_UTF8_SYMBOL) >> +return SQLITE_NOMATCH; >> } >> -/* "*" at the end of the pattern matches */ >> -if (pattern == pattern_end) >> +/* >> +* "*" at the end of the pattern matches. >> +*/ >> +if (c == SQL_END_OF_STRING) { >> +while ((c2 = Utf8Read(string, string_end)) != >> + SQL_END_OF_STRING) >> +if (c2 == SQL_INVALID_UTF8_SYMBOL) >> +return SQLITE_NOMATCH; >> return SQLITE_MATCH; >> +} >> if (c == matchOther) { >> if (pInfo->matchSet == 0) { >> c = Utf8Read(pattern, pattern_end); >> -if (c == 0) >> +if (c == SQL_INVALID_UTF8_SYMBOL) >> +return SQL_PROHIBITED_PATTERN; >> +if (c == SQL_END_OF_STRING) >> return SQLITE_NOWILDCARDMATCH; >> } else { >> -/* "[...]" immediately follows the "*". We have to do a slow >> -* recursive search in this case, but it is an unusual case. >> +/* "[...]" immediately >> +* follows the "*". We >> +* have to do a slow >> +* recursive search in >> +* this case, but it is >> +* an unusual case. >> */ >> -assert(matchOther < 0x80);/* '[' is a single-byte character */ >> +assert(matchOther < 0x80); >> while (string < string_end) { > REMEMBER THIS POINT #2 > >> int bMatch = >> - patternCompare(&pattern[-1], >> - string, >> - pInfo, >> - matchOther); >> + sql_utf8_pattern_compare( >> +&pattern[-1], >> +string, >> +pInfo, >> +matchOther); >> if (bMatch != SQLITE_NOMATCH) >> return bMatch; >> -Utf8Read(string, string_end); >> +c = Utf8Read(string, string_end); >> +if (c == SQL_INVALID_UTF8_SYMBOL) >> +return SQLITE_NOMATCH; > look at <REMEMBER THIS POINT #1,2> and other `Utf8Read` usages. > You have introduced SQL_END_OF_STRING and changed `Utf8Read` > pattern to use it in > half of cases? > > > 1) Look at <REMEMBER THIS POINT #1>. > 2) I have introduced it not to use explicit 0 and to fix the bug. > 3) Fixed it though. > > Moreover,in that place you do check `string < string_end` > implicitly inside of > `Utf8Read` but you never use that result. > > I suppose you should return old iteration style and `Utf8Read` macro. > ``` > while (string < string_end) { > c = Utf8Read(string, string_end); > if (c == SQL_INVALID_UTF8_SYMBOL) > return SQLITE_NOMATCH; > ``` > > > I think it's better to use this approach, but yes: return prev. > version of macro: > 1) 'zero byte' ticket will be partially fixed. > 2) 0xffffis non-unicodeanyways. As we concluded verbally, please, return the old style iteration. (`while (string < string_end)`) > >> } >> return SQLITE_NOWILDCARDMATCH; >> } >> } >> -/* At this point variable c contains the first character of the >> -* pattern string past the "*". Search in the input string for the >> -* first matching character and recursively continue the match from >> -* that point. >> +/* At this point variable c contains the >> +* first character of the pattern string >> +* past the "*". Search in the input >> +* string for the first matching >> +* character and recursively continue the >> +* match from that point. >> * >> -* For a case-insensitive search, set variable cx to be the same as >> -* c but in the other case and search the input string for either >> -* c or cx. >> +* For a case-insensitive search, set >> +* variable cx to be the same as c but in >> +* the other case and search the input >> +* string for either c or cx. >> */ >> int bMatch; >> @@ -756,14 +801,18 @@ patternCompare(const char * pattern,/* The >> glob pattern */ >> c = u_tolower(c); >> while (string < string_end){ >> /** >> -* This loop could have been implemented >> -* without if converting c2 to lower case >> -* (by holding c_upper and c_lower), however >> -* it is implemented this way because lower >> -* works better with German and Turkish >> -* languages. >> +* This loop could have been >> +* implemented without if >> +* converting c2 to lower case >> +* by holding c_upper and >> +* c_lower,however it is >> +* implemented this way because >> +* lower works better with German >> +* and Turkish languages. >> */ >> c2 = Utf8Read(string, string_end); >> +if (c2 == SQL_INVALID_UTF8_SYMBOL) >> +return SQLITE_NOMATCH; >> if (!noCase) { >> if (c2 != c) >> continue; >> @@ -771,9 +820,10 @@ patternCompare(const char * pattern,/* The >> glob pattern */ >> if (c2 != c && u_tolower(c2) != c) >> continue; >> } >> -bMatch = >> - patternCompare(pattern, string, >> - pInfo, matchOther); >> +bMatch = sql_utf8_pattern_compare(pattern, >> +string, >> +pInfo, >> +matchOther); >> if (bMatch != SQLITE_NOMATCH) >> return bMatch; >> } >> @@ -782,7 +832,9 @@ patternCompare(const char * pattern,/* The >> glob pattern */ >> if (c == matchOther) { >> if (pInfo->matchSet == 0) { >> c = Utf8Read(pattern, pattern_end); >> -if (c == 0) >> +if (c == SQL_INVALID_UTF8_SYMBOL) >> +return SQL_PROHIBITED_PATTERN; >> +if (c == SQL_END_OF_STRING) >> return SQLITE_NOMATCH; >> zEscaped = pattern; >> } else { >> @@ -790,23 +842,33 @@ patternCompare(const char * pattern,/* The >> glob pattern */ >> int seen = 0; >> int invert = 0; >> c = Utf8Read(string, string_end); >> +if (c == SQL_INVALID_UTF8_SYMBOL) >> +return SQLITE_NOMATCH; >> if (string == string_end) >> return SQLITE_NOMATCH; >> c2 = Utf8Read(pattern, pattern_end); >> +if (c2 == SQL_INVALID_UTF8_SYMBOL) >> +return SQL_PROHIBITED_PATTERN; >> if (c2 == '^') { >> invert = 1; >> c2 = Utf8Read(pattern, pattern_end); >> +if (c2 == SQL_INVALID_UTF8_SYMBOL) >> +return SQL_PROHIBITED_PATTERN; >> } >> if (c2 == ']') { >> if (c == ']') >> seen = 1; >> c2 = Utf8Read(pattern, pattern_end); >> +if (c2 == SQL_INVALID_UTF8_SYMBOL) >> +return SQL_PROHIBITED_PATTERN; >> } >> -while (c2 && c2 != ']') { >> +while (c2 != SQL_END_OF_STRING && c2 != ']') { >> if (c2 == '-' && pattern[0] != ']' >> && pattern < pattern_end >> && prior_c > 0) { >> c2 = Utf8Read(pattern, pattern_end); >> +if (c2 == SQL_INVALID_UTF8_SYMBOL) >> +return SQL_PROHIBITED_PATTERN; >> if (c >= prior_c && c <= c2) >> seen = 1; >> prior_c = 0; >> @@ -817,29 +879,36 @@ patternCompare(const char * pattern,/* The >> glob pattern */ >> prior_c = c2; >> } >> c2 = Utf8Read(pattern, pattern_end); >> +if (c2 == SQL_INVALID_UTF8_SYMBOL) >> +return SQL_PROHIBITED_PATTERN; >> } >> -if (pattern == pattern_end || (seen ^ invert) == 0) { >> +if (pattern == pattern_end || >> + (seen ^ invert) == 0) { >> return SQLITE_NOMATCH; >> } >> continue; >> } >> } >> c2 = Utf8Read(string, string_end); >> +if (c2 == SQL_INVALID_UTF8_SYMBOL) >> +return SQLITE_NOMATCH; >> if (c == c2) >> continue; >> if (noCase){ >> /** >> -* Small optimisation. Reduce number of calls >> -* to u_tolower function. >> -* SQL standards suggest use to_upper for symbol >> -* normalisation. However, using to_lower allows to >> -* respect Turkish 'İ' in default locale. >> +* Small optimisation. Reduce number of >> +* calls to u_tolower function. SQL >> +* standards suggest use to_upper for >> +* symbol normalisation. However, using >> +* to_lower allows to respect Turkish 'İ' >> +* in default locale. >> */ >> if (u_tolower(c) == c2 || >> c == u_tolower(c2)) >> continue; >> } >> -if (c == matchOne && pattern != zEscaped && c2 != 0) >> +if (c == matchOne && pattern != zEscaped && >> +c2 != SQL_END_OF_STRING) >> continue; >> return SQLITE_NOMATCH; >> } >> @@ -853,8 +922,7 @@ patternCompare(const char * pattern,/* The >> glob pattern */ >> int >> sqlite3_strglob(const char *zGlobPattern, const char *zString) >> { >> -return patternCompare(zGlobPattern, zString, &globInfo, >> - '['); >> +return sql_utf8_pattern_compare(zGlobPattern, zString, >> &globInfo, '['); >> } >> /* >> @@ -864,7 +932,7 @@ sqlite3_strglob(const char *zGlobPattern, >> const char *zString) >> int >> sqlite3_strlike(const char *zPattern, const char *zStr, unsigned >> int esc) >> { >> -return patternCompare(zPattern, zStr, &likeInfoNorm, esc); >> +return sql_utf8_pattern_compare(zPattern, zStr, &likeInfoNorm, esc); >> } >> /* >> @@ -910,8 +978,9 @@ likeFunc(sqlite3_context * context, int argc, >> sqlite3_value ** argv) >> zB = (const char *) sqlite3_value_text(argv[0]); >> zA = (const char *) sqlite3_value_text(argv[1]); >> -/* Limit the length of the LIKE or GLOB pattern to avoid problems >> -* of deep recursion and N*N behavior in patternCompare(). >> +/* Limit the length of the LIKE or GLOB pattern to avoid >> +* problems of deep recursion and N*N behavior in >> +* sql_utf8_pattern_compare(). >> */ >> nPat = sqlite3_value_bytes(argv[0]); >> testcase(nPat == db->aLimit[SQLITE_LIMIT_LIKE_PATTERN_LENGTH]); >> @@ -947,7 +1016,12 @@ likeFunc(sqlite3_context * context, int >> argc, sqlite3_value ** argv) >> sqlite3_like_count++; >> #endif >> int res; >> -res = patternCompare(zB, zA, pInfo, escape); >> +res = sql_utf8_pattern_compare(zB, zA, pInfo, escape); >> +if (res == SQL_PROHIBITED_PATTERN) { >> +sqlite3_result_error(context, "LIKE or GLOB pattern can only" >> + " contain UTF-8 characters", -1); >> +return; >> +} >> sqlite3_result_int(context, res == SQLITE_MATCH); >> } >> diff --git a/test-run b/test-run >> index 77e9327..95562e9 160000 >> --- a/test-run >> +++ b/test-run >> @@ -1 +1 @@ >> -Subproject commit 77e93279210f8c5c1fd0ed03416fa19a184f0b6d >> +Subproject commit 95562e95401fef4e0b755ab0bb430974b5d1a29a >> diff --git a/test/sql-tap/e_expr.test.lua >> b/test/sql-tap/e_expr.test.lua >> index 13d3a96..9780d2c 100755 >> --- a/test/sql-tap/e_expr.test.lua >> +++ b/test/sql-tap/e_expr.test.lua >> @@ -1,6 +1,6 @@ >> #!/usr/bin/env tarantool >> test = require("sqltester") >> -test:plan(12431) >> +test:plan(10665) >> --!./tcltestrunner.lua >> -- 2010 July 16 >> @@ -77,8 +77,10 @@ local operations = { >> {"<>", "ne1"}, >> {"!=", "ne2"}, >> {"IS", "is"}, >> - {"LIKE", "like"}, >> - {"GLOB", "glob"}, >> +-- NOTE: This test needs refactoring after deletion of GLOB & >> +--type restrictions for LIKE. (See #3572) >> +-- {"LIKE", "like"}, >> +-- {"GLOB", "glob"}, > Yes, this behavior is not valid anymore. > To make sure that likes and globs will be tested in the future, > please, delete this > commented lines and add your own simple test, which tries to call > `like` and `glob` > with inappropriate types. > It is important to have a functional tests for any possible behavior. > > > 1) Globs are going to be deleted as you can see few lines above. I had a talk with @kirill, we decided to not maintain those tests and to not leave glob w/o tests. So, please, delete glob in a different commit and place this commit on top of it. > 2) I'm gonna refactor that when LIKE is in its final state (basically > after #3572 is closed > & static build is into 2.1). There are 2 options: 1. You implement those tests now and we are pushing this commit 2. You rebase this branch on static types and add those tests and we push this commit after static types. Those tests = test behavior of like with different types passed. > >> {"AND", "and"}, >> {"OR", "or"}, >> {"MATCH", "match"}, >> @@ -96,7 +98,12 @@ operations = { >> {"+", "-"}, >> {"<<", ">>", "&", "|"}, >> {"<", "<=", ">", ">="}, >> - {"=", "==", "!=", "<>", "LIKE", "GLOB"}, --"MATCH", "REGEXP"}, >> +-- NOTE: This test needs refactoring after deletion of GLOB & >> +--type restrictions for LIKE. (See #3572) >> +-- Another NOTE: MATCH & REGEXP aren't supported in Tarantool & >> +-- are waiting for their hour, don't confuse them >> +--being commented with ticket above. >> + {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, --"MATCH", >> "REGEXP"}, >> {"AND"}, >> {"OR"}, >> } >> @@ -475,6 +482,7 @@ for _, op in ipairs(oplist) do >> end >> end >> end >> + >> --------------------------------------------------------------------------- >> -- Test the IS and IS NOT operators. >> -- >> diff --git >> a/test/sql-tap/gh-3251-string-pattern-comparison.test.lua >> b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua >> new file mode 100755 >> index 0000000..2a787f2 >> --- /dev/null >> +++ b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua >> @@ -0,0 +1,213 @@ >> +#!/usr/bin/env tarantool >> +test = require("sqltester") >> +test:plan(128) >> + >> +local prefix = "like-test-" >> + >> +-- Unicode byte sequences. >> +local valid_testcases = { >> + '\x01', >> + '\x09', >> + '\x1F', >> + '\x7F', >> + '\xC2\x80', >> + '\xC2\x90', >> + '\xC2\x9F', >> + '\xE2\x80\xA8', >> + '\x20\x0B', >> + '\xE2\x80\xA9', >> +} > optional: add descriptions to those byte sequences (what it is). > > > Some valid unicode symbols which is only that matters for LIKE operator. > >> + >> +-- Non-Unicode byte sequences. >> +local invalid_testcases = { >> + '\xE2\x80', >> + '\xFE\xFF', >> + '\xC2', >> + '\xED\xB0\x80', >> + '\xD0', >> +} > Place that after like_test_cases, just before it is used. > > > Changed it. > >> + >> +local like_test_cases = >> +{ >> + {"1.1", >> + "SELECT 'AB' LIKE '_B';", >> + {0, {1}} }, >> + {"1.2", >> + "SELECT 'CD' LIKE '_B';", >> + {0, {0}} }, >> + {"1.3", >> + "SELECT '' LIKE '_B';", >> + {0, {0}} }, >> + {"1.4", >> + "SELECT 'AB' LIKE '%B';", >> + {0, {1}} }, >> + {"1.5", >> + "SELECT 'CD' LIKE '%B';", >> + {0, {0}} }, >> + {"1.6", >> + "SELECT '' LIKE '%B';", >> + {0, {0}} }, >> + {"1.7", >> + "SELECT 'AB' LIKE 'A__';", >> + {0, {0}} }, >> + {"1.8", >> + "SELECT 'CD' LIKE 'A__';", >> + {0, {0}} }, >> + {"1.9", >> + "SELECT '' LIKE 'A__';", >> + {0, {0}} }, >> + {"1.10", >> + "SELECT 'AB' LIKE 'A_';", >> + {0, {1}} }, >> + {"1.11", >> + "SELECT 'CD' LIKE 'A_';", >> + {0, {0}} }, >> + {"1.12", >> + "SELECT '' LIKE 'A_';", >> + {0, {0}} }, >> + {"1.13", >> + "SELECT 'AB' LIKE 'A';", >> + {0, {0}} }, >> + {"1.14", >> + "SELECT 'CD' LIKE 'A';", >> + {0, {0}} }, >> + {"1.15", >> + "SELECT '' LIKE 'A';", >> + {0, {0}} }, >> + {"1.16", >> + "SELECT 'AB' LIKE '_';", >> + {0, {0}} }, >> + {"1.17", >> + "SELECT 'CD' LIKE '_';", >> + {0, {0}} }, >> + {"1.18", >> + "SELECT '' LIKE '_';", >> + {0, {0}} }, >> + {"1.19", >> + "SELECT 'AB' LIKE '__';", >> + {0, {1}} }, >> + {"1.20", >> + "SELECT 'CD' LIKE '__';", >> + {0, {1}} }, >> + {"1.21", >> + "SELECT '' LIKE '__';", >> + {0, {0}} }, >> + {"1.22", >> + "SELECT 'AB' LIKE '%A';", >> + {0, {0}} }, >> + {"1.23", >> + "SELECT 'AB' LIKE '%C';", >> + {0, {0}} }, >> + {"1.24", >> + "SELECT 'ab' LIKE '%df';", >> + {0, {0}} }, >> + {"1.25", >> + "SELECT 'abCDF' LIKE '%df';", >> + {0, {1}} }, >> + {"1.26", >> + "SELECT 'CDF' LIKE '%df';", >> + {0, {1}} }, >> + {"1.27", >> + "SELECT 'ab' LIKE 'a_';", >> + {0, {1}} }, >> + {"1.28", >> + "SELECT 'abCDF' LIKE 'a_';", >> + {0, {0}} }, >> + {"1.29", >> + "SELECT 'CDF' LIKE 'a_';", >> + {0, {0}} }, >> + {"1.30", >> + "SELECT 'ab' LIKE 'ab%';", >> + {0, {1}} }, >> + {"1.31", >> + "SELECT 'abCDF' LIKE 'ab%';", >> + {0, {1}} }, >> + {"1.32", >> + "SELECT 'CDF' LIKE 'ab%';", >> + {0, {0}} }, >> + {"1.33", >> + "SELECT 'ab' LIKE 'abC%';", >> + {0, {0}} }, >> + {"1.34", >> + "SELECT 'abCDF' LIKE 'abC%';", >> + {0, {1}} }, >> + {"1.35", >> + "SELECT 'CDF' LIKE 'abC%';", >> + {0, {0}} }, >> + {"1.36", >> + "SELECT 'ab' LIKE 'a_%';", >> + {0, {1}} }, >> + {"1.37", >> + "SELECT 'abCDF' LIKE 'a_%';", >> + {0, {1}} }, >> + {"1.38", >> + "SELECT 'CDF' LIKE 'a_%';", >> + {0, {0}} }, >> +} > Please, add some tests for unicode strings. (or replace letters in > those tests with unicode letters) > > > Changed existing tests a little bit. > >> + >> +test:do_catchsql_set_test(like_test_cases, prefix) >> + >> +-- Invalid testcases. >> +for i, tested_string in ipairs(invalid_testcases) do >> + >> + -- We should raise an error in case >> + -- pattern contains invalid characters. >> + >> + local test_name = prefix .. "2." .. tostring(i) >> + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string >> .. "';" >> + test:do_catchsql_test(test_name, test_itself, >> + {1, "LIKE or GLOB pattern can only >> contain UTF-8 characters"}) >> + >> + test_name = prefix .. "3." .. tostring(i) >> + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" >> + test:do_catchsql_test(test_name, test_itself, >> + {1, "LIKE or GLOB pattern can only >> contain UTF-8 characters"}) >> + >> + test_name = prefix .. "4." .. tostring(i) >> + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" >> + test:do_catchsql_test(test_name, test_itself, >> + {1, "LIKE or GLOB pattern can only >> contain UTF-8 characters"}) >> + >> + -- Just skipping if row value predicand contains invalid >> character. > What the predicand is? Is it a typo? > > > You can find it in ANSI SQL: <row value predicand>. > Basically it's just operand inside of a predicate. > >> + >> + test_name = prefix .. "5." .. tostring(i) >> + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" >> + test:do_execsql_test(test_name, test_itself, {0}) >> + >> + test_name = prefix .. "6." .. tostring(i) >> + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" >> + test:do_execsql_test(test_name, test_itself, {0}) >> + >> + test_name = prefix .. "7." .. tostring(i) >> + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" >> + test:do_execsql_test(test_name, test_itself, {0}) >> +end >> + >> +-- Valid testcases. >> +for i, tested_string in ipairs(valid_testcases) do >> + test_name = prefix .. "8." .. tostring(i) >> + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string >> .. "';" >> + test:do_execsql_test(test_name, test_itself, {0}) >> + >> + test_name = prefix .. "9." .. tostring(i) >> + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" >> + test:do_execsql_test(test_name, test_itself, {0}) >> + >> + test_name = prefix .. "10." .. tostring(i) >> + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" >> + test:do_execsql_test(test_name,test_itself, {0}) >> + >> + test_name = prefix .. "11." .. tostring(i) >> + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" >> + test:do_execsql_test(test_name,test_itself, {0}) >> + >> + test_name = prefix .. "12." .. tostring(i) >> + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" >> + test:do_execsql_test(test_name, test_itself, {0}) >> + >> + test_name = prefix .. "13." .. tostring(i) >> + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" >> + test:do_execsql_test(test_name, test_itself, {0}) >> +end >> + >> +test:finish_test() > Why I cannot find a test of `GLOB`? Even if we delete it in the > future, it should be tested. You can write much less tests for glob. > > E.g. this > ``` > select '1' glob '[0-4]'; > ``` > somewhy returns 0. > > > I actually don't think that operator that is going to be deleted in a > few days should be tested. > It's just useless and redundant code. [-- Attachment #2: Type: text/html, Size: 80368 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2018-08-08 12:38 UTC | newest] Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-06-28 12:47 [tarantool-patches] [PATCH] sql: LIKE & GLOB pattern comparison issue N.Tatunov 2018-06-28 12:54 ` [tarantool-patches] " Hollow111 2018-07-18 2:43 ` Alexander Turenko 2018-07-18 5:51 ` Alex Khatskevich 2018-07-18 15:24 ` Nikita Tatunov 2018-07-18 15:53 ` Alex Khatskevich 2018-07-18 15:57 ` Nikita Tatunov 2018-07-18 17:10 ` Alexander Turenko 2018-07-19 11:14 ` Nikita Tatunov 2018-07-19 11:56 ` Alex Khatskevich 2018-07-27 11:28 ` Nikita Tatunov 2018-07-27 13:06 ` Alexander Turenko 2018-07-27 19:11 ` Nikita Tatunov 2018-07-27 20:22 ` Alexander Turenko 2018-07-31 13:27 ` Nikita Tatunov 2018-07-31 13:47 ` Alexander Turenko 2018-08-01 10:35 ` Nikita Tatunov 2018-08-01 10:51 ` Nikita Tatunov 2018-08-01 13:56 ` Alex Khatskevich 2018-08-01 18:10 ` Nikita Tatunov 2018-08-01 18:14 ` Nikita Tatunov 2018-08-08 12:38 ` Alex Khatskevich
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox