[tarantool-patches] Re: [PATCH 2/2] sql: make LIKE predicate dependent on collation
n.pettik
korablev at tarantool.org
Wed Jul 17 19:19:18 MSK 2019
> On 14 Jul 2019, at 01:51, Roman Khabibov <roman.habibov at tarantool.org> wrote:
>
> According to ANSI, LIKE should match characters taking into
> account passed collation.
>
> ISO/IEC JTC 1/SC 32 2011, Part 2: Foundation, page 445
Nit: it’s likely to be meaningless to indicate exact page;
I’d better outline number of section.
Moreover, I guess this ticket deserves doc request
explaining user-visible changes.
I’ve pushed code-style fixes at the top of your branch.
> 7 files changed, 187 insertions(+), 111 deletions(-)
>
> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> index 4e4d14cf7..b660d05c5 100644
> --- a/src/box/sql/func.c
> +++ b/src/box/sql/func.c
>
> @@ -699,7 +723,7 @@ enum pattern_match_status {
> * @param string String being compared.
> * @param pattern_end Ptr to pattern last symbol.
> * @param string_end Ptr to string last symbol.
> - * @param is_like_ci true if LIKE is case insensitive.
> + * @param coll Pointer to collation.
> * @param match_other The escape char for LIKE.
> *
> * @retval One of pattern_match_status values.
> @@ -709,7 +733,7 @@ sql_utf8_pattern_compare(const char *pattern,
> const char *string,
> const char *pattern_end,
> const char *string_end,
> - const int is_like_ci,
> + struct coll *coll,
> UChar32 match_other)
> {
> /* Next pattern and input string chars. */
> @@ -717,9 +741,13 @@ sql_utf8_pattern_compare(const char *pattern,
> /* One past the last escaped input char. */
> const char *zEscaped = 0;
> UErrorCode status = U_ZERO_ERROR;
> + const char *pat_char_ptr;
> + const char *str_char_ptr;
> + int pat_char_len;
> + int str_char_len;
Declaring variables without initialisation is considered
to be bad practise. Also intend to use appropriate types -
in this case size_t.
> while (pattern < pattern_end) {
> - c = Utf8Read(pattern, pattern_end);
> + c = step_utf8_char(&pattern, pattern_end, &pat_char_ptr, &pat_char_len);
Broken indentation.
> if (c == SQL_INVALID_UTF8_SYMBOL)
> return INVALID_PATTERN;
> if (c == MATCH_ALL_WILDCARD) {
>
> @@ -847,9 +867,10 @@ sql_utf8_pattern_compare(const char *pattern,
> int
> sql_strlike_cs(const char *zPattern, const char *zStr, unsigned int esc)
> {
I guess you can remove both strike_cs and strlike_ci functions.
> + struct coll_id *p = coll_by_name("unicode", strlen("unicode"));
> return sql_utf8_pattern_compare(zPattern, zStr,
> zPattern + strlen(zPattern),
> - zStr + strlen(zStr), 0, esc);
> + zStr + strlen(zStr), p->coll, esc);
> }
>
> @@ -228,13 +228,11 @@ operatorMask(int op)
> * In order for the operator to be optimizible, the RHS must be a
> * string literal that does not begin with a wildcard. The LHS
> * must be a column that may only be NULL, a string, or a BLOB,
> - * never a number. The collating sequence for the column on the
> - * LHS must be appropriate for the operator.
> + * never a number.
> *
> * @param pParse Parsing and code generating context.
> * @param pExpr Test this expression.
> - * @param ppPrefix Pointer to TK_STRING expression with
> - * pattern prefix.
> + * @param ppPrefix Pointer to expression with pattern prefix.
If smth has changed, please underline this in comment.
> * @param pisComplete True if the only wildcard is '%' in the
> * last character.
> * @retval True if the given expr is a LIKE operator & is
> @@ -276,9 +274,20 @@ like_optimization_is_valid(Parse *pParse, Expr *pExpr, Expr **ppPrefix,
> */
> return 0;
> }
> +
> + /* Only for "binary" and "unicode_ci" collations. */
Why only two collations? Why not “unicode” for instance?
What if collation is a part of expression, not field:
… a COLLATE “unicode” LIKE … ?
> + struct field_def *field_def = pLeft->space_def->fields + pLeft->iColumn;
> + if (field_def->coll_id != 0 && field_def->coll_id != 2 &&
> + field_def->coll_id != 3)
It’s definitively bad practice to rely on collation’s id.
> + return 0;
> assert(pLeft->iColumn != (-1)); /* Because IPK never has AFF_TEXT */
>
> pRight = sqlExprSkipCollate(pList->a[0].pExpr);
> + const char *coll_name = NULL;
> + if (pRight != pList->a[0].pExpr) {
> + assert(pList->a[0].pExpr->op == TK_COLLATE);
> + coll_name = pList->a[0].pExpr->u.zToken;
Please, accompany this snippet with comment.
> + }
> op = pRight->op;
> if (op == TK_VARIABLE) {
> Vdbe *pReprepare = pParse->pReprepare;
> @@ -308,6 +317,10 @@ like_optimization_is_valid(Parse *pParse, Expr *pExpr, Expr **ppPrefix,
> pParse->is_aborted = true;
> else
> pPrefix->u.zToken[cnt] = 0;
> + if (coll_name != NULL)
> + pPrefix = sqlExprAddCollateString(pParse,
> + pPrefix,
> + coll_name);
The same: I don’t understand what’s going on here.
Please, add explanation to the code.
> @@ -977,8 +990,6 @@ exprAnalyze(SrcList * pSrc, /* the FROM clause */
> Expr *pStr1 = 0;
> /* RHS of LIKE ends with wildcard. */
> int isComplete = 0;
> - /* uppercase equivalent to lowercase. */
> - int noCase = 0;
> /* Top-level operator. pExpr->op. */
> int op;
> /* Parsing context. */
> @@ -1165,59 +1176,22 @@ exprAnalyze(SrcList * pSrc, /* the FROM clause */
> const u16 wtFlags = TERM_LIKEOPT | TERM_VIRTUAL | TERM_DYNAMIC;
>
> pLeft = pExpr->x.pList->a[1].pExpr;
> - pStr2 = sqlExprDup(db, pStr1, 0);
> -
> - /*
> - * Convert the lower bound to upper-case and the
> - * upper bound to lower-case (upper-case is less
> - * than lower-case in ASCII) so that the range
> - * constraints also work for BLOBs.
> - */
> - if (noCase && !pParse->db->mallocFailed) {
> - int i;
> - char c;
> - pTerm->wtFlags |= TERM_LIKE;
> - for (i = 0; (c = pStr1->u.zToken[i]) != 0; i++) {
> - pStr1->u.zToken[i] = sqlToupper(c);
> - pStr2->u.zToken[i] = sqlTolower(c);
> - }
> - }
> + pStr2 = sqlExprDup(db, sqlExprSkipCollate(pStr1), 0);
>
> if (!db->mallocFailed) {
> u8 c, *pC; /* Last character before the first wildcard */
> pC = (u8 *) & pStr2->u.
> zToken[sqlStrlen30(pStr2->u.zToken) - 1];
> c = *pC;
> - if (noCase) {
> - /* The point is to increment the last character before the first
> - * wildcard. But if we increment '@', that will push it into the
> - * alphabetic range where case conversions will mess up the
> - * inequality. To avoid this, make sure to also run the full
> - * LIKE on all candidate expressions by clearing the isComplete flag
> - */
> - if (c == 'A' - 1)
> - isComplete = 0;
> - c = sqlUpperToLower[c];
> - }
Are you sure that this code removal wouldn’t broke smth?
> *pC = c + 1;
> }
> pNewExpr1 = sqlExprDup(db, pLeft, 0);
> - if (noCase) {
> - pNewExpr1 =
> - sqlExprAddCollateString(pParse, pNewExpr1,
> - "unicode_ci");
> - }
> pNewExpr1 = sqlPExpr(pParse, TK_GE, pNewExpr1, pStr1);
> transferJoinMarkings(pNewExpr1, pExpr);
> idxNew1 = whereClauseInsert(pWC, pNewExpr1, wtFlags);
> testcase(idxNew1 == 0);
> exprAnalyze(pSrc, pWC, idxNew1);
> pNewExpr2 = sqlExprDup(db, pLeft, 0);
> - if (noCase) {
> - pNewExpr2 =
> - sqlExprAddCollateString(pParse, pNewExpr2,
> - "unicode_ci");
> - }
> pNewExpr2 = sqlPExpr(pParse, TK_LT, pNewExpr2, pStr2);
> transferJoinMarkings(pNewExpr2, pExpr);
> idxNew2 = whereClauseInsert(pWC, pNewExpr2, wtFlags);
> diff --git a/test/sql-tap/collation.test.lua b/test/sql-tap/collation.test.lua
> index 79547361c..9e404282a 100755
> --- a/test/sql-tap/collation.test.lua
> +++ b/test/sql-tap/collation.test.lua
> @@ -1,6 +1,6 @@
> #!/usr/bin/env tarantool
> test = require("sqltester")
> -test:plan(174)
> +test:plan(173)
Please add test verifying that arguments of incompatible
collations can’t participate in LIKE predicate.
>
> local prefix = "collation-"
>
> @@ -477,13 +477,13 @@ for _, data_collation in ipairs(data_collations) do
> end
> end
>
> --- Like uses collation (only for unicode_ci and binary)
> +-- <LIKE> uses collation. If <LIKE> has explicit <COLLATE>, use it
> +-- instead of implicit.
> local like_testcases =
> {
> {"2.0",
> [[
> - CREATE TABLE tx1 (s1 VARCHAR(5) PRIMARY KEY);
> - CREATE INDEX I1 on tx1(s1 collate "unicode_ci”);
Why did you drop index?
> + CREATE TABLE tx1 (s1 VARCHAR(5) PRIMARY KEY COLLATE "unicode_ci");
> INSERT INTO tx1 VALUES('aaa');
> INSERT INTO tx1 VALUES('Aab');
> INSERT INTO tx1 VALUES('İac');
> @@ -491,35 +491,32 @@ local like_testcases =
> ]], {0}},
> {"2.1.1",
> "SELECT * FROM tx1 WHERE s1 LIKE 'A%' order by s1;",
> - {0, {"Aab", "aaa"}} },
> + {0, {"aaa","Aab"}} },
Why order of results has changed?
> {"2.1.2",
> "EXPLAIN QUERY PLAN SELECT * FROM tx1 WHERE s1 LIKE 'A%';",
> - {0, {0, 0, 0, "SEARCH TABLE TX1 USING COVERING INDEX I1 (S1>? AND S1<?) (~16384 rows)"}}},
> + {0, {0, 0, 0, "SEARCH TABLE TX1 USING PRIMARY KEY (S1>? AND S1<?) (~16384 rows)"}}},
I guess this is due to dropped index.
> {"2.2.0",
> - "PRAGMA case_sensitive_like = true;",
> - {0}},
> - {"2.2.1",
> - "SELECT * FROM tx1 WHERE s1 LIKE 'A%' order by s1;",
> + "SELECT * FROM tx1 WHERE s1 LIKE 'A%' COLLATE \"unicode\" order by s1;",
> {0, {"Aab"}} },
> - {"2.2.2",
> + {"2.2.1",
> "EXPLAIN QUERY PLAN SELECT * FROM tx1 WHERE s1 LIKE 'A%';",
> {0, {0, 0, 0, "/USING PRIMARY KEY/"}} },
> {"2.3.0",
> - "PRAGMA case_sensitive_like = false;",
> - {0}},
> - {"2.3.1",
> "SELECT * FROM tx1 WHERE s1 LIKE 'i%' order by s1;",
> - {0, {"iad", "İac"}}},
> - {"2.3.2",
> - "SELECT * FROM tx1 WHERE s1 LIKE 'İ%'order by s1;",
> - {0, {"iad", "İac"}} },
> + {0, {"İac", "iad"}}},
> + {"2.3.1",
> + "SELECT * FROM tx1 WHERE s1 LIKE 'İ%' COLLATE \"unicode\" order by s1;",
> + {0, {"İac"}} },
> {"2.4.0",
> [[
> INSERT INTO tx1 VALUES('ЯЁЮ');
> ]], {0} },
> {"2.4.1",
> + "SELECT * FROM tx1 WHERE s1 LIKE 'яёю' COLLATE \"unicode\";",
> + {0, {}} },
> + {"2.4.2",
> "SELECT * FROM tx1 WHERE s1 LIKE 'яёю';",
> - {0, {"ЯЁЮ"}} },
> + {0, {"ЯЁЮ"}} }
Stray diff.
> }
>
> test:do_catchsql_set_test(like_testcases, prefix)
> diff --git a/test/sql-tap/gh-3251-string-pattern-comparison.test.lua b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua
> index 95e8c98d0..eb9d7c3b9 100755
> --- a/test/sql-tap/gh-3251-string-pattern-comparison.test.lua
> +++ b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua
> @@ -1,6 +1,6 @@
> #!/usr/bin/env tarantool
> test = require("sqltester")
> -test:plan(128)
> +test:plan(156)
>
> local prefix = "like-test-"
>
> @@ -80,10 +80,10 @@ local like_test_cases =
> {0, {false}} },
> {"1.25",
> "SELECT 'ёфÅŒش' LIKE '%œش';",
> - {0, {true}} },
> + {0, {false}} },
Could you add an appropriate collation to avoid fixing
test results?
> {"1.26",
> "SELECT 'ÅŒش' LIKE '%œش';",
> - {0, {true}} },
> + {0, {false}} },
> {"1.27",
> "SELECT 'ёф' LIKE 'ё_';",
> {0, {true}} },
> @@ -120,6 +120,90 @@ local like_test_cases =
> {"1.38",
> "SELECT 'ÅŒش' LIKE 'ё_%';",
> {0, {false}} },
> + {"1.39",
> + "SELECT 'A' LIKE 'A' COLLATE \"unicode\";",
> + {0, {true}} },
Please, add not only tests involving constant literals, but
also columns with implicit/explicit collations.
More information about the Tarantool-patches
mailing list