From: Nikita Tatunov <hollow653@gmail.com> To: Alexander Turenko <alexander.turenko@tarantool.org> Cc: tarantool-patches@freelists.org Subject: [tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue Date: Fri, 27 Jul 2018 22:11:35 +0300 [thread overview] Message-ID: <CAEi+_arMh=rLm7jNAX-ERHjf6GQ5D7RqQK=+wSXCwJjZouoyeA@mail.gmail.com> (raw) In-Reply-To: <20180727130601.b2oby7dleapd5upg@tkn_work_nb> [-- Attachment #1: Type: text/plain, Size: 17465 bytes --] Hello! пт, 27 июл. 2018 г. в 16:06, Alexander Turenko < alexander.turenko@tarantool.org>: > Hi, > > See comments below. > > Don't sure we should return 'no match' situation when a left hand > expression in not correct unicode string. How other DBs handle that? > > The main point is that since pattern is always valid (otherwise we return an error) it cannot be matched with a string that contains an invalid utf-8 character. - PostgreSQL doesn't allow storing invalid characters in text types and even explicit E'\xD1' anywhere results an error. - MySQL Doesn't allow using invalid characters in text types. May be it's reasons for thinking about prohibiting invalid characters in text? WBR, Alexander Turenko. > > > sql: LIKE & GLOB pattern comparison issue > > > > Currently function that compares pattern and string for GLOB & LIKE > > operators doesn't work properly. It uses ICU reading function which > > perhaps was working differently before and the implementation for the > > comparison ending isn't paying attention to some special cases, hence > > in those cases it works improperly. > > > > It used non-ICU function before, consider commit 198d59ce. > > Thank you, I've got it now. > > With the patch applied an error will be returned in case there's > > invalid UTF-8 symbol in pattern & pattern will not be matched > > with the string that contains it. Some minor corrections to function > > were made as well. > > > > Nitpicking: 'error will be returned' is third situatuon, other then > 'matched' and 'not matched'. > > I guess commit message is about changes. Do I really need to mention valid cases that didn't change? > > Сloses #3251 > > Сloses #3334 > > Part of #3572 > > > > 2. The thing you test is not related to a table and other columns. > > > Please, convert the tests to the next format: {[[select '' like > > > '_B';]], {1}]]}. > > > To make it more readable, you can do it like `like_testcases` in > > > `sql-tap/collation.test.lua`. > > > > > > > Done. > > You are still perform comparisons with table columns. > Thank you, fixed it. More tests now tho. > > > -/* > > - * For LIKE and GLOB matching on EBCDIC machines, assume that every > > - * character is exactly one byte in size. Also, provde the Utf8Read() > > - * macro for fast reading of the next character in the common case where > > - * the next character is ASCII. > > +/** > > + * Providing there are symbols in string s this > > + * macro returns UTF-8 code of character and > > + * promotes pointer to the next symbol in the string. > > + * Otherwise return code is SQL_END_OF_STRING. > > */ > > -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status) > > +#define Utf8Read(s, e) (((s) < (e)) ?\ > > + ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status)) : 0) > > I suggest to add a whitespace before backslash to make it looks better. > > Guess so. > > @@ -643,51 +647,61 @@ static const struct compareInfo likeInfoAlt = { > '%', > > '_', 0, 0 }; > > #define SQLITE_MATCH 0 > > #define SQLITE_NOMATCH 1 > > #define SQLITE_NOWILDCARDMATCH 2 > > +#define SQL_PROHIBITED_PATTERN 3 > > Update comment before that has a reference to 'patternMatch' function. > > Didn't notice it, thnx. > > * > > * Globbing rules: > > * > > - * '*' Matches any sequence of zero or more characters. > > + * '*' Matches any sequence of zero or more characters. > > Why? > > > * > > - * '?' Matches exactly one character. > > + * '?' Matches exactly one character. > > * > > - * [...] Matches one character from the enclosed list of > > - * characters. > > + * [...] Matches one character from the enclosed list of > > + * characters. > > * > > - * [^...] Matches one character not in the enclosed list. > > + * [^...] Matches one character not in the enclosed list. > > * > > The same question. > > It was looking like the indentation for similar table below was different. I was wrong. > > + * @retval SQLITE_MATCH: Match. > > + * SQLITE_NOMATCH: No match. > > + * SQLITE_NOWILDCARDMATCH: No match in spite of having * > > + * or % wildcards. > > + * SQL_PROHIBITED_PATTERN Pattern contains invalid > > + * symbol. > > Nitpicking: no colon after SQL_PROHIBITED_PATTERN. > Fixed. > > > */ > > static int > > -patternCompare(const char * pattern, /* The glob pattern */ > > - const char * string, /* The string to compare against the glob > */ > > - const struct compareInfo *pInfo, /* Information about how to do > > the compare */ > > - UChar32 matchOther /* The escape char (LIKE) or '[' (GLOB) */ > > - ) > > +sql_utf8_pattern_compare(const char * pattern, > > + const char * string, > > + const struct compareInfo *pInfo, > > + UChar32 matchOther) > > Don't get why indent is tabulation + 7 spaces. > > Me neither :thinking:. Thank you! > The function itself looks good except lines longer then 80 columns. > As i said before I don't think breaking these lines will increase readability. > > @@ -853,8 +903,7 @@ patternCompare(const char * pattern, /* The glob > > pattern */ > > int > > sqlite3_strglob(const char *zGlobPattern, const char *zString) > > { > > - return patternCompare(zGlobPattern, zString, &globInfo, > > - '['); > > + return sql_utf8_pattern_compare(zGlobPattern, zString, &globInfo, '['); > > } > > > > /* > > @@ -864,7 +913,7 @@ sqlite3_strglob(const char *zGlobPattern, const char > > *zString) > > int > > sqlite3_strlike(const char *zPattern, const char *zStr, unsigned int > esc) > > { > > - return patternCompare(zPattern, zStr, &likeInfoNorm, esc); > > + return sql_utf8_pattern_compare(zPattern, zStr, &likeInfoNorm, esc); > > } > > > > Should we add sqlite3_result_error is case of SQL_PROHIBITED_PATTERN for > these two functions? > > It doesn't matter for "sqlite3_strglob()", since it's only used internally with valid explicitly given pattern in "analysis_loader()". Same for "sqlite3_strlike()". > > - {"LIKE", "like"}, > > - {"GLOB", "glob"}, > > +-- NOTE: This test needs refactoring after deletion of GLOB & > > +-- type restrictions for LIKE. > > +-- {"LIKE", "like"}, > > +-- {"GLOB", "glob"}, > > I would add related issue number. > > True. Done. > > {"AND", "and"}, > > {"OR", "or"}, > > {"MATCH", "match"}, > > @@ -96,7 +98,7 @@ operations = { > > {"+", "-"}, > > {"<<", ">>", "&", "|"}, > > {"<", "<=", ">", ">="}, > > - {"=", "==", "!=", "<>", "LIKE", "GLOB"}, --"MATCH", "REGEXP"}, > > + {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, "MATCH", "REGEXP"}, > > Leave comment before MATCH and REGEXP to avoid surprises after removing > your comment. > > Done. > > --- /dev/null > > +++ b/test/sql-tap/gh-3251-string-pattern-comparison.lua > > @@ -0,0 +1,238 @@ > > +#!/usr/bin/env tarantool > > +test = require("sqltester") > > +test:plan(106) > > + > > +local prefix = "like-test-" > > + > > +-- Unciode byte sequences. > > + > > 1. Typo: unicode. > 2. Extra empty line. > > Fixed. > > +-- Invalid testcases. > > + > > +for i, tested_string in ipairs(invalid_testcases) do > > + local test_name = prefix .. "2." .. tostring(i) > > + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" > > + > > +-- We should raise an error if pattern contains invalid characters. > > + > > 1. Trailing whitespaces. > 2. Why comments are not indented as the other code? > 3. test_name and test_itself before the section comment looks strange > for me here. > diff --git a/src/box/sql/func.c b/src/box/sql/func.c index 5b53076..2f989d4 100644 --- a/src/box/sql/func.c +++ b/src/box/sql/func.c @@ -623,7 +623,7 @@ struct compareInfo { * promotes pointer to the next symbol in the string. * Otherwise return code is SQL_END_OF_STRING. */ -#define Utf8Read(s, e) (((s) < (e)) ?\ +#define Utf8Read(s, e) (((s) < (e)) ? \ ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status)) : 0) #define SQL_END_OF_STRING 0 @@ -642,7 +642,7 @@ static const struct compareInfo likeInfoNorm = { '%', '_', 0, 1 }; static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; /* - * Possible error returns from patternMatch() + * Possible error returns from sql_utf8_pattern_compare() */ #define SQLITE_MATCH 0 #define SQLITE_NOMATCH 1 @@ -655,14 +655,14 @@ static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; * * Globbing rules: * - * '*' Matches any sequence of zero or more characters. + * '*' Matches any sequence of zero or more characters. * - * '?' Matches exactly one character. + * '?' Matches exactly one character. * - * [...] Matches one character from the enclosed list of - * characters. + * [...] Matches one character from the enclosed list of + * characters. * - * [^...] Matches one character not in the enclosed list. + * [^...] Matches one character not in the enclosed list. * * With the [...] and [^...] matching, a ']' character can be * included in the list by making it the first character after @@ -694,14 +694,14 @@ static const struct compareInfo likeInfoAlt = { '%', '_', 0, 0 }; * SQLITE_NOMATCH: No match. * SQLITE_NOWILDCARDMATCH: No match in spite of having * * or % wildcards. - * SQL_PROHIBITED_PATTERN Pattern contains invalid + * SQL_PROHIBITED_PATTERN: Pattern contains invalid * symbol. */ static int sql_utf8_pattern_compare(const char * pattern, - const char * string, - const struct compareInfo *pInfo, - UChar32 matchOther) + const char * string, + const struct compareInfo *pInfo, + UChar32 matchOther) { UChar32 c, c2; /* Next pattern and input string chars */ UChar32 matchOne = pInfo->matchOne; /* "?" or "_" */ diff --git a/test/sql-tap/e_expr.test.lua b/test/sql-tap/e_expr.test.lua index 051210a..9780d2c 100755 --- a/test/sql-tap/e_expr.test.lua +++ b/test/sql-tap/e_expr.test.lua @@ -78,7 +78,7 @@ local operations = { {"!=", "ne2"}, {"IS", "is"}, -- NOTE: This test needs refactoring after deletion of GLOB & --- type restrictions for LIKE. +-- type restrictions for LIKE. (See #3572) -- {"LIKE", "like"}, -- {"GLOB", "glob"}, {"AND", "and"}, @@ -98,7 +98,12 @@ operations = { {"+", "-"}, {"<<", ">>", "&", "|"}, {"<", "<=", ">", ">="}, - {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, "MATCH", "REGEXP"}, +-- NOTE: This test needs refactoring after deletion of GLOB & +-- type restrictions for LIKE. (See #3572) +-- Another NOTE: MATCH & REGEXP aren't supported in Tarantool & +-- are waiting for their hour, don't confuse them +-- being commented with ticket above. + {"=", "==", "!=", "<>"}, --"LIKE", "GLOB"}, --"MATCH", "REGEXP"}, {"AND"}, {"OR"}, } diff --git a/test/sql-tap/gh-3251-string-pattern-comparison.test.lua b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua new file mode 100755 index 0000000..a823bc6 --- /dev/null +++ b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua @@ -0,0 +1,281 @@ +#!/usr/bin/env tarantool +test = require("sqltester") +test:plan(128) + +local prefix = "like-test-" + +-- Unicode byte sequences. +local valid_testcases = { + '\x01', + '\x09', + '\x1F', + '\x7F', + '\xC2\x80', + '\xC2\x90', + '\xC2\x9F', + '\xE2\x80\xA8', + '\x20\x0B', + '\xE2\x80\xA9', +} + +-- Non-Unicode byte sequences. +local invalid_testcases = { + '\xE2\x80', + '\xFE\xFF', + '\xC2', + '\xED\xB0\x80', + '\xD0', +} + +local like_test_cases = +{ + {"1.1", + "SELECT 'AB' LIKE '_B';", + {0, {1}} }, + {"1.2", + "SELECT 'CD' LIKE '_B';", + {0, {0}} }, + {"1.3", + "SELECT '' LIKE '_B';", + {0, {0}} }, + {"1.4", + "SELECT 'AB' LIKE '%B';", + {0, {1}} }, + {"1.5", + "SELECT 'CD' LIKE '%B';", + {0, {0}} }, + {"1.6", + "SELECT '' LIKE '%B';", + {0, {0}} }, + {"1.7", + "SELECT 'AB' LIKE 'A__';", + {0, {0}} }, + {"1.8", + "SELECT 'CD' LIKE 'A__';", + {0, {0}} }, + {"1.9", + "SELECT '' LIKE 'A__';", + {0, {0}} }, + {"1.10", + "SELECT 'AB' LIKE 'A_';", + {0, {1}} }, + {"1.11", + "SELECT 'CD' LIKE 'A_';", + {0, {0}} }, + {"1.12", + "SELECT '' LIKE 'A_';", + {0, {0}} }, + {"1.13", + "SELECT 'AB' LIKE 'A';", + {0, {0}} }, + {"1.14", + "SELECT 'CD' LIKE 'A';", + {0, {0}} }, + {"1.15", + "SELECT '' LIKE 'A';", + {0, {0}} }, + {"1.16", + "SELECT 'AB' LIKE '_';", + {0, {0}} }, + {"1.17", + "SELECT 'CD' LIKE '_';", + {0, {0}} }, + {"1.18", + "SELECT '' LIKE '_';", + {0, {0}} }, + {"1.19", + "SELECT 'AB' LIKE '__';", + {0, {1}} }, + {"1.20", + "SELECT 'CD' LIKE '__';", + {0, {1}} }, + {"1.21", + "SELECT '' LIKE '__';", + {0, {0}} }, + {"1.22", + "SELECT 'AB' LIKE '%A';", + {0, {0}} }, + {"1.23", + "SELECT 'AB' LIKE '%C';", + {0, {0}} }, + {"1.24", + "SELECT 'ab' LIKE '%df';", + {0, {0}} }, + {"1.25", + "SELECT 'abCDF' LIKE '%df';", + {0, {1}} }, + {"1.26", + "SELECT 'CDF' LIKE '%df';", + {0, {1}} }, + {"1.27", + "SELECT 'ab' LIKE 'a_';", + {0, {1}} }, + {"1.28", + "SELECT 'abCDF' LIKE 'a_';", + {0, {0}} }, + {"1.29", + "SELECT 'CDF' LIKE 'a_';", + {0, {0}} }, + {"1.30", + "SELECT 'ab' LIKE 'ab%';", + {0, {1}} }, + {"1.31", + "SELECT 'abCDF' LIKE 'ab%';", + {0, {1}} }, + {"1.32", + "SELECT 'CDF' LIKE 'ab%';", + {0, {0}} }, + {"1.33", + "SELECT 'ab' LIKE 'abC%';", + {0, {0}} }, + {"1.34", + "SELECT 'abCDF' LIKE 'abC%';", + {0, {1}} }, + {"1.35", + "SELECT 'CDF' LIKE 'abC%';", + {0, {0}} }, + {"1.36", + "SELECT 'ab' LIKE 'a_%';", + {0, {1}} }, + {"1.37", + "SELECT 'abCDF' LIKE 'a_%';", + {0, {1}} }, + {"1.38", + "SELECT 'CDF' LIKE 'a_%';", + {0, {0}} }, +} + +test:do_catchsql_set_test(like_test_cases, prefix) + +-- Invalid testcases. + +for i, tested_string in ipairs(invalid_testcases) do + +-- We should raise an error if pattern contains invalid characters. + local test_name = prefix .. "2." .. tostring(i) + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" + test:do_catchsql_test( + test_name, + test_itself, { + -- <test_name> + 1, "LIKE or GLOB pattern can only contain UTF-8 characters" + -- <test_name> + }) + + test_name = prefix .. "3." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" + test:do_catchsql_test( + test_name, + test_itself, { + -- <test_name> + 1, "LIKE or GLOB pattern can only contain UTF-8 characters" + -- <test_name> + }) + + test_name = prefix .. "4." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" + test:do_catchsql_test( + test_name, + test_itself, { + -- <test_name> + 1, "LIKE or GLOB pattern can only contain UTF-8 characters" + -- <test_name> + }) + +-- Just skipping if row value predicand contains invalid character. + + test_name = prefix .. "5." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + + test_name = prefix .. "6." .. tostring(i) + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + + test_name = prefix .. "7." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) +end + +-- Valid testcases. + +for i, tested_string in ipairs(valid_testcases) do + test_name = prefix .. "8." .. tostring(i) + local test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + + test_name = prefix .. "9." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'abc" .. tested_string .. "';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + + test_name = prefix .. "10." .. tostring(i) + test_itself = "SELECT 'abc' LIKE 'ab" .. tested_string .. "c';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + test_name = prefix .. "11." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + + test_name = prefix .. "12." .. tostring(i) + test_itself = "SELECT 'abc" .. tested_string .. "' LIKE 'abc';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) + + test_name = prefix .. "13." .. tostring(i) + test_itself = "SELECT 'ab" .. tested_string .. "c' LIKE 'abc';" + test:do_execsql_test( + test_name, + test_itself, { + -- <test_name> + 0 + -- <test_name> + }) +end + +test:finish_test() [-- Attachment #2: Type: text/html, Size: 38152 bytes --]
next prev parent reply other threads:[~2018-07-27 19:11 UTC|newest] Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top 2018-06-28 12:47 [tarantool-patches] " N.Tatunov 2018-06-28 12:54 ` [tarantool-patches] " Hollow111 2018-07-18 2:43 ` Alexander Turenko 2018-07-18 5:51 ` Alex Khatskevich 2018-07-18 15:24 ` Nikita Tatunov 2018-07-18 15:53 ` Alex Khatskevich 2018-07-18 15:57 ` Nikita Tatunov 2018-07-18 17:10 ` Alexander Turenko 2018-07-19 11:14 ` Nikita Tatunov 2018-07-19 11:56 ` Alex Khatskevich 2018-07-27 11:28 ` Nikita Tatunov 2018-07-27 13:06 ` Alexander Turenko 2018-07-27 19:11 ` Nikita Tatunov [this message] 2018-07-27 20:22 ` Alexander Turenko 2018-07-31 13:27 ` Nikita Tatunov 2018-07-31 13:47 ` Alexander Turenko 2018-08-01 10:35 ` Nikita Tatunov 2018-08-01 10:51 ` Nikita Tatunov 2018-08-01 13:56 ` Alex Khatskevich 2018-08-01 18:10 ` Nikita Tatunov 2018-08-01 18:14 ` Nikita Tatunov 2018-08-08 12:38 ` Alex Khatskevich
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to='CAEi+_arMh=rLm7jNAX-ERHjf6GQ5D7RqQK=+wSXCwJjZouoyeA@mail.gmail.com' \ --to=hollow653@gmail.com \ --cc=alexander.turenko@tarantool.org \ --cc=tarantool-patches@freelists.org \ --subject='[tarantool-patches] Re: [PATCH] sql: LIKE & GLOB pattern comparison issue' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox