From: "n.pettik" <korablev@tarantool.org> To: tarantool-patches@freelists.org Cc: Roman Khabibov <roman.habibov@tarantool.org> Subject: [tarantool-patches] Re: [PATCH 2/2] sql: make LIKE predicate dependent on collation Date: Wed, 17 Jul 2019 19:19:18 +0300 [thread overview] Message-ID: <C4250A78-E57F-4DCF-B67B-364DED1E9562@tarantool.org> (raw) In-Reply-To: <67ec7b5425d16078e45571c99ba9b58859b3c7b8.1563057282.git.roman.habibov@tarantool.org> > On 14 Jul 2019, at 01:51, Roman Khabibov <roman.habibov@tarantool.org> wrote: > > According to ANSI, LIKE should match characters taking into > account passed collation. > > ISO/IEC JTC 1/SC 32 2011, Part 2: Foundation, page 445 Nit: it’s likely to be meaningless to indicate exact page; I’d better outline number of section. Moreover, I guess this ticket deserves doc request explaining user-visible changes. I’ve pushed code-style fixes at the top of your branch. > 7 files changed, 187 insertions(+), 111 deletions(-) > > diff --git a/src/box/sql/func.c b/src/box/sql/func.c > index 4e4d14cf7..b660d05c5 100644 > --- a/src/box/sql/func.c > +++ b/src/box/sql/func.c > > @@ -699,7 +723,7 @@ enum pattern_match_status { > * @param string String being compared. > * @param pattern_end Ptr to pattern last symbol. > * @param string_end Ptr to string last symbol. > - * @param is_like_ci true if LIKE is case insensitive. > + * @param coll Pointer to collation. > * @param match_other The escape char for LIKE. > * > * @retval One of pattern_match_status values. > @@ -709,7 +733,7 @@ sql_utf8_pattern_compare(const char *pattern, > const char *string, > const char *pattern_end, > const char *string_end, > - const int is_like_ci, > + struct coll *coll, > UChar32 match_other) > { > /* Next pattern and input string chars. */ > @@ -717,9 +741,13 @@ sql_utf8_pattern_compare(const char *pattern, > /* One past the last escaped input char. */ > const char *zEscaped = 0; > UErrorCode status = U_ZERO_ERROR; > + const char *pat_char_ptr; > + const char *str_char_ptr; > + int pat_char_len; > + int str_char_len; Declaring variables without initialisation is considered to be bad practise. Also intend to use appropriate types - in this case size_t. > while (pattern < pattern_end) { > - c = Utf8Read(pattern, pattern_end); > + c = step_utf8_char(&pattern, pattern_end, &pat_char_ptr, &pat_char_len); Broken indentation. > if (c == SQL_INVALID_UTF8_SYMBOL) > return INVALID_PATTERN; > if (c == MATCH_ALL_WILDCARD) { > > @@ -847,9 +867,10 @@ sql_utf8_pattern_compare(const char *pattern, > int > sql_strlike_cs(const char *zPattern, const char *zStr, unsigned int esc) > { I guess you can remove both strike_cs and strlike_ci functions. > + struct coll_id *p = coll_by_name("unicode", strlen("unicode")); > return sql_utf8_pattern_compare(zPattern, zStr, > zPattern + strlen(zPattern), > - zStr + strlen(zStr), 0, esc); > + zStr + strlen(zStr), p->coll, esc); > } > > @@ -228,13 +228,11 @@ operatorMask(int op) > * In order for the operator to be optimizible, the RHS must be a > * string literal that does not begin with a wildcard. The LHS > * must be a column that may only be NULL, a string, or a BLOB, > - * never a number. The collating sequence for the column on the > - * LHS must be appropriate for the operator. > + * never a number. > * > * @param pParse Parsing and code generating context. > * @param pExpr Test this expression. > - * @param ppPrefix Pointer to TK_STRING expression with > - * pattern prefix. > + * @param ppPrefix Pointer to expression with pattern prefix. If smth has changed, please underline this in comment. > * @param pisComplete True if the only wildcard is '%' in the > * last character. > * @retval True if the given expr is a LIKE operator & is > @@ -276,9 +274,20 @@ like_optimization_is_valid(Parse *pParse, Expr *pExpr, Expr **ppPrefix, > */ > return 0; > } > + > + /* Only for "binary" and "unicode_ci" collations. */ Why only two collations? Why not “unicode” for instance? What if collation is a part of expression, not field: … a COLLATE “unicode” LIKE … ? > + struct field_def *field_def = pLeft->space_def->fields + pLeft->iColumn; > + if (field_def->coll_id != 0 && field_def->coll_id != 2 && > + field_def->coll_id != 3) It’s definitively bad practice to rely on collation’s id. > + return 0; > assert(pLeft->iColumn != (-1)); /* Because IPK never has AFF_TEXT */ > > pRight = sqlExprSkipCollate(pList->a[0].pExpr); > + const char *coll_name = NULL; > + if (pRight != pList->a[0].pExpr) { > + assert(pList->a[0].pExpr->op == TK_COLLATE); > + coll_name = pList->a[0].pExpr->u.zToken; Please, accompany this snippet with comment. > + } > op = pRight->op; > if (op == TK_VARIABLE) { > Vdbe *pReprepare = pParse->pReprepare; > @@ -308,6 +317,10 @@ like_optimization_is_valid(Parse *pParse, Expr *pExpr, Expr **ppPrefix, > pParse->is_aborted = true; > else > pPrefix->u.zToken[cnt] = 0; > + if (coll_name != NULL) > + pPrefix = sqlExprAddCollateString(pParse, > + pPrefix, > + coll_name); The same: I don’t understand what’s going on here. Please, add explanation to the code. > @@ -977,8 +990,6 @@ exprAnalyze(SrcList * pSrc, /* the FROM clause */ > Expr *pStr1 = 0; > /* RHS of LIKE ends with wildcard. */ > int isComplete = 0; > - /* uppercase equivalent to lowercase. */ > - int noCase = 0; > /* Top-level operator. pExpr->op. */ > int op; > /* Parsing context. */ > @@ -1165,59 +1176,22 @@ exprAnalyze(SrcList * pSrc, /* the FROM clause */ > const u16 wtFlags = TERM_LIKEOPT | TERM_VIRTUAL | TERM_DYNAMIC; > > pLeft = pExpr->x.pList->a[1].pExpr; > - pStr2 = sqlExprDup(db, pStr1, 0); > - > - /* > - * Convert the lower bound to upper-case and the > - * upper bound to lower-case (upper-case is less > - * than lower-case in ASCII) so that the range > - * constraints also work for BLOBs. > - */ > - if (noCase && !pParse->db->mallocFailed) { > - int i; > - char c; > - pTerm->wtFlags |= TERM_LIKE; > - for (i = 0; (c = pStr1->u.zToken[i]) != 0; i++) { > - pStr1->u.zToken[i] = sqlToupper(c); > - pStr2->u.zToken[i] = sqlTolower(c); > - } > - } > + pStr2 = sqlExprDup(db, sqlExprSkipCollate(pStr1), 0); > > if (!db->mallocFailed) { > u8 c, *pC; /* Last character before the first wildcard */ > pC = (u8 *) & pStr2->u. > zToken[sqlStrlen30(pStr2->u.zToken) - 1]; > c = *pC; > - if (noCase) { > - /* The point is to increment the last character before the first > - * wildcard. But if we increment '@', that will push it into the > - * alphabetic range where case conversions will mess up the > - * inequality. To avoid this, make sure to also run the full > - * LIKE on all candidate expressions by clearing the isComplete flag > - */ > - if (c == 'A' - 1) > - isComplete = 0; > - c = sqlUpperToLower[c]; > - } Are you sure that this code removal wouldn’t broke smth? > *pC = c + 1; > } > pNewExpr1 = sqlExprDup(db, pLeft, 0); > - if (noCase) { > - pNewExpr1 = > - sqlExprAddCollateString(pParse, pNewExpr1, > - "unicode_ci"); > - } > pNewExpr1 = sqlPExpr(pParse, TK_GE, pNewExpr1, pStr1); > transferJoinMarkings(pNewExpr1, pExpr); > idxNew1 = whereClauseInsert(pWC, pNewExpr1, wtFlags); > testcase(idxNew1 == 0); > exprAnalyze(pSrc, pWC, idxNew1); > pNewExpr2 = sqlExprDup(db, pLeft, 0); > - if (noCase) { > - pNewExpr2 = > - sqlExprAddCollateString(pParse, pNewExpr2, > - "unicode_ci"); > - } > pNewExpr2 = sqlPExpr(pParse, TK_LT, pNewExpr2, pStr2); > transferJoinMarkings(pNewExpr2, pExpr); > idxNew2 = whereClauseInsert(pWC, pNewExpr2, wtFlags); > diff --git a/test/sql-tap/collation.test.lua b/test/sql-tap/collation.test.lua > index 79547361c..9e404282a 100755 > --- a/test/sql-tap/collation.test.lua > +++ b/test/sql-tap/collation.test.lua > @@ -1,6 +1,6 @@ > #!/usr/bin/env tarantool > test = require("sqltester") > -test:plan(174) > +test:plan(173) Please add test verifying that arguments of incompatible collations can’t participate in LIKE predicate. > > local prefix = "collation-" > > @@ -477,13 +477,13 @@ for _, data_collation in ipairs(data_collations) do > end > end > > --- Like uses collation (only for unicode_ci and binary) > +-- <LIKE> uses collation. If <LIKE> has explicit <COLLATE>, use it > +-- instead of implicit. > local like_testcases = > { > {"2.0", > [[ > - CREATE TABLE tx1 (s1 VARCHAR(5) PRIMARY KEY); > - CREATE INDEX I1 on tx1(s1 collate "unicode_ci”); Why did you drop index? > + CREATE TABLE tx1 (s1 VARCHAR(5) PRIMARY KEY COLLATE "unicode_ci"); > INSERT INTO tx1 VALUES('aaa'); > INSERT INTO tx1 VALUES('Aab'); > INSERT INTO tx1 VALUES('İac'); > @@ -491,35 +491,32 @@ local like_testcases = > ]], {0}}, > {"2.1.1", > "SELECT * FROM tx1 WHERE s1 LIKE 'A%' order by s1;", > - {0, {"Aab", "aaa"}} }, > + {0, {"aaa","Aab"}} }, Why order of results has changed? > {"2.1.2", > "EXPLAIN QUERY PLAN SELECT * FROM tx1 WHERE s1 LIKE 'A%';", > - {0, {0, 0, 0, "SEARCH TABLE TX1 USING COVERING INDEX I1 (S1>? AND S1<?) (~16384 rows)"}}}, > + {0, {0, 0, 0, "SEARCH TABLE TX1 USING PRIMARY KEY (S1>? AND S1<?) (~16384 rows)"}}}, I guess this is due to dropped index. > {"2.2.0", > - "PRAGMA case_sensitive_like = true;", > - {0}}, > - {"2.2.1", > - "SELECT * FROM tx1 WHERE s1 LIKE 'A%' order by s1;", > + "SELECT * FROM tx1 WHERE s1 LIKE 'A%' COLLATE \"unicode\" order by s1;", > {0, {"Aab"}} }, > - {"2.2.2", > + {"2.2.1", > "EXPLAIN QUERY PLAN SELECT * FROM tx1 WHERE s1 LIKE 'A%';", > {0, {0, 0, 0, "/USING PRIMARY KEY/"}} }, > {"2.3.0", > - "PRAGMA case_sensitive_like = false;", > - {0}}, > - {"2.3.1", > "SELECT * FROM tx1 WHERE s1 LIKE 'i%' order by s1;", > - {0, {"iad", "İac"}}}, > - {"2.3.2", > - "SELECT * FROM tx1 WHERE s1 LIKE 'İ%'order by s1;", > - {0, {"iad", "İac"}} }, > + {0, {"İac", "iad"}}}, > + {"2.3.1", > + "SELECT * FROM tx1 WHERE s1 LIKE 'İ%' COLLATE \"unicode\" order by s1;", > + {0, {"İac"}} }, > {"2.4.0", > [[ > INSERT INTO tx1 VALUES('ЯЁЮ'); > ]], {0} }, > {"2.4.1", > + "SELECT * FROM tx1 WHERE s1 LIKE 'яёю' COLLATE \"unicode\";", > + {0, {}} }, > + {"2.4.2", > "SELECT * FROM tx1 WHERE s1 LIKE 'яёю';", > - {0, {"ЯЁЮ"}} }, > + {0, {"ЯЁЮ"}} } Stray diff. > } > > test:do_catchsql_set_test(like_testcases, prefix) > diff --git a/test/sql-tap/gh-3251-string-pattern-comparison.test.lua b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua > index 95e8c98d0..eb9d7c3b9 100755 > --- a/test/sql-tap/gh-3251-string-pattern-comparison.test.lua > +++ b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua > @@ -1,6 +1,6 @@ > #!/usr/bin/env tarantool > test = require("sqltester") > -test:plan(128) > +test:plan(156) > > local prefix = "like-test-" > > @@ -80,10 +80,10 @@ local like_test_cases = > {0, {false}} }, > {"1.25", > "SELECT 'ёфÅŒش' LIKE '%œش';", > - {0, {true}} }, > + {0, {false}} }, Could you add an appropriate collation to avoid fixing test results? > {"1.26", > "SELECT 'ÅŒش' LIKE '%œش';", > - {0, {true}} }, > + {0, {false}} }, > {"1.27", > "SELECT 'ёф' LIKE 'ё_';", > {0, {true}} }, > @@ -120,6 +120,90 @@ local like_test_cases = > {"1.38", > "SELECT 'ÅŒش' LIKE 'ё_%';", > {0, {false}} }, > + {"1.39", > + "SELECT 'A' LIKE 'A' COLLATE \"unicode\";", > + {0, {true}} }, Please, add not only tests involving constant literals, but also columns with implicit/explicit collations.
next prev parent reply other threads:[~2019-07-17 16:19 UTC|newest] Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top 2019-07-13 22:51 [tarantool-patches] [PATCH 0/2] Make " Roman Khabibov 2019-07-13 22:51 ` [tarantool-patches] [PATCH 1/2] sql: remove "PRAGMA case_sensitive_like" Roman Khabibov 2019-07-17 16:18 ` [tarantool-patches] " n.pettik 2019-07-24 21:02 ` Roman Khabibov 2019-07-13 22:51 ` [tarantool-patches] [PATCH 2/2] sql: make LIKE predicate dependent on collation Roman Khabibov 2019-07-17 16:19 ` n.pettik [this message] 2019-07-24 21:02 ` [tarantool-patches] " Roman Khabibov 2019-08-01 10:27 ` n.pettik 2019-08-01 12:26 ` [tarantool-patches] Re: [PATCH 0/2] Make " Kirill Yukhin
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=C4250A78-E57F-4DCF-B67B-364DED1E9562@tarantool.org \ --to=korablev@tarantool.org \ --cc=roman.habibov@tarantool.org \ --cc=tarantool-patches@freelists.org \ --subject='[tarantool-patches] Re: [PATCH 2/2] sql: make LIKE predicate dependent on collation' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox