[tarantool-patches] Re: [PATCH 2/2] sql: make LIKE predicate dependent on collation

Wed Jul 17 19:19:18 MSK 2019

> On 14 Jul 2019, at 01:51, Roman Khabibov <roman.habibov at tarantool.org> wrote:
> 
> According to ANSI, LIKE should match characters taking into
> account passed collation.
> 
> ISO/IEC JTC 1/SC 32 2011, Part 2: Foundation, page 445

Nit: it’s likely to be meaningless to indicate exact page;
I’d better outline number of section.

Moreover, I guess this ticket deserves doc request
explaining user-visible changes.

I’ve pushed code-style fixes at the top of your branch.

> 7 files changed, 187 insertions(+), 111 deletions(-)
> 
> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> index 4e4d14cf7..b660d05c5 100644
> --- a/src/box/sql/func.c
> +++ b/src/box/sql/func.c
> 
> @@ -699,7 +723,7 @@ enum pattern_match_status {
>  * @param string String being compared.
>  * @param pattern_end Ptr to pattern last symbol.
>  * @param string_end Ptr to string last symbol.
> - * @param is_like_ci true if LIKE is case insensitive.
> + * @param coll Pointer to collation.
>  * @param match_other The escape char for LIKE.
>  *
>  * @retval One of pattern_match_status values.
> @@ -709,7 +733,7 @@ sql_utf8_pattern_compare(const char *pattern,
> 			 const char *string,
> 			 const char *pattern_end,
> 			 const char *string_end,
> -			 const int is_like_ci,
> +			 struct coll *coll,
> 			 UChar32 match_other)
> {
> 	/* Next pattern and input string chars. */
> @@ -717,9 +741,13 @@ sql_utf8_pattern_compare(const char *pattern,
> 	/* One past the last escaped input char. */
> 	const char *zEscaped = 0;
> 	UErrorCode status = U_ZERO_ERROR;
> +	const char *pat_char_ptr;
> +	const char *str_char_ptr;
> +	int pat_char_len;
> +	int str_char_len;

Declaring variables without initialisation is considered
to be bad practise. Also intend to use appropriate types -
in this case size_t.

> 	while (pattern < pattern_end) {
> -		c = Utf8Read(pattern, pattern_end);
> +	c = step_utf8_char(&pattern, pattern_end, &pat_char_ptr, &pat_char_len);

Broken indentation.

> 		if (c == SQL_INVALID_UTF8_SYMBOL)
> 			return INVALID_PATTERN;
> 		if (c == MATCH_ALL_WILDCARD) {
> 
> @@ -847,9 +867,10 @@ sql_utf8_pattern_compare(const char *pattern,
> int
> sql_strlike_cs(const char *zPattern, const char *zStr, unsigned int esc)
> {

I guess you can remove both strike_cs and strlike_ci functions.

> +	struct coll_id *p = coll_by_name("unicode", strlen("unicode"));
> 	return sql_utf8_pattern_compare(zPattern, zStr,
> 		                        zPattern + strlen(zPattern),
> -		                        zStr + strlen(zStr), 0, esc);
> +		                        zStr + strlen(zStr), p->coll, esc);
> }
> 
> @@ -228,13 +228,11 @@ operatorMask(int op)
>  * In order for the operator to be optimizible, the RHS must be a
>  * string literal that does not begin with a wildcard. The LHS
>  * must be a column that may only be NULL, a string, or a BLOB,
> - * never a number. The collating sequence for the column on the
> - * LHS must be appropriate for the operator.
> + * never a number.
>  *
>  * @param pParse      Parsing and code generating context.
>  * @param pExpr       Test this expression.
> - * @param ppPrefix    Pointer to TK_STRING expression with
> - *                    pattern prefix.
> + * @param ppPrefix    Pointer to expression with pattern prefix.

If smth has changed, please underline this in comment.

>  * @param pisComplete True if the only wildcard is '%' in the
>  *                    last character.
>  * @retval True if the given expr is a LIKE operator & is
> @@ -276,9 +274,20 @@ like_optimization_is_valid(Parse *pParse, Expr *pExpr, Expr **ppPrefix,
> 		 */
> 		return 0;
> 	}
> +
> +	/* Only for "binary" and "unicode_ci" collations. */

Why only two collations? Why not “unicode” for instance?
What if collation is a part of expression, not field:
… a COLLATE “unicode” LIKE … ?

> +	struct field_def *field_def = pLeft->space_def->fields + pLeft->iColumn;
> +	if (field_def->coll_id != 0 && field_def->coll_id != 2 &&
> +	    field_def->coll_id != 3)

It’s definitively bad practice to rely on collation’s id.

> +		return 0;
> 	assert(pLeft->iColumn != (-1));	/* Because IPK never has AFF_TEXT */
> 
> 	pRight = sqlExprSkipCollate(pList->a[0].pExpr);
> +	const char *coll_name = NULL;
> +	if (pRight != pList->a[0].pExpr) {
> +		assert(pList->a[0].pExpr->op == TK_COLLATE);
> +		coll_name = pList->a[0].pExpr->u.zToken;

Please, accompany this snippet with comment.

> +	}
> 	op = pRight->op;
> 	if (op == TK_VARIABLE) {
> 		Vdbe *pReprepare = pParse->pReprepare;
> @@ -308,6 +317,10 @@ like_optimization_is_valid(Parse *pParse, Expr *pExpr, Expr **ppPrefix,
> 				pParse->is_aborted = true;
> 			else
> 				pPrefix->u.zToken[cnt] = 0;
> +			if (coll_name != NULL)
> +				pPrefix = sqlExprAddCollateString(pParse,
> +								  pPrefix,
> +								  coll_name);

The same: I don’t understand what’s going on here.
Please, add explanation to the code.

> @@ -977,8 +990,6 @@ exprAnalyze(SrcList * pSrc,	/* the FROM clause */
> 	Expr *pStr1 = 0;
> 	/* RHS of LIKE ends with wildcard. */
> 	int isComplete = 0;
> -	/* uppercase equivalent to lowercase. */
> -	int noCase = 0;
> 	/* Top-level operator. pExpr->op. */
> 	int op;
> 	/* Parsing context. */
> @@ -1165,59 +1176,22 @@ exprAnalyze(SrcList * pSrc,	/* the FROM clause */
> 		const u16 wtFlags = TERM_LIKEOPT | TERM_VIRTUAL | TERM_DYNAMIC;
> 
> 		pLeft = pExpr->x.pList->a[1].pExpr;
> -		pStr2 = sqlExprDup(db, pStr1, 0);
> -
> -		/*
> -		 * Convert the lower bound to upper-case and the
> -		 * upper bound to lower-case (upper-case is less
> -		 * than lower-case in ASCII) so that the range
> -		 * constraints also work for BLOBs.
> -		 */
> -		if (noCase && !pParse->db->mallocFailed) {
> -			int i;
> -			char c;
> -			pTerm->wtFlags |= TERM_LIKE;
> -			for (i = 0; (c = pStr1->u.zToken[i]) != 0; i++) {
> -				pStr1->u.zToken[i] = sqlToupper(c);
> -				pStr2->u.zToken[i] = sqlTolower(c);
> -			}
> -		}
> +		pStr2 = sqlExprDup(db, sqlExprSkipCollate(pStr1), 0);
> 
> 		if (!db->mallocFailed) {
> 			u8 c, *pC;	/* Last character before the first wildcard */
> 			pC = (u8 *) & pStr2->u.
> 			    zToken[sqlStrlen30(pStr2->u.zToken) - 1];
> 			c = *pC;
> -			if (noCase) {
> -				/* The point is to increment the last character before the first
> -				 * wildcard.  But if we increment '@', that will push it into the
> -				 * alphabetic range where case conversions will mess up the
> -				 * inequality.  To avoid this, make sure to also run the full
> -				 * LIKE on all candidate expressions by clearing the isComplete flag
> -				 */
> -				if (c == 'A' - 1)
> -					isComplete = 0;
> -				c = sqlUpperToLower[c];
> -			}

Are you sure that this code removal wouldn’t broke smth?

> 			*pC = c + 1;
> 		}
> 		pNewExpr1 = sqlExprDup(db, pLeft, 0);
> -		if (noCase) {
> -			pNewExpr1 =
> -				sqlExprAddCollateString(pParse, pNewExpr1,
> -							    "unicode_ci");
> -		}
> 		pNewExpr1 = sqlPExpr(pParse, TK_GE, pNewExpr1, pStr1);
> 		transferJoinMarkings(pNewExpr1, pExpr);
> 		idxNew1 = whereClauseInsert(pWC, pNewExpr1, wtFlags);
> 		testcase(idxNew1 == 0);
> 		exprAnalyze(pSrc, pWC, idxNew1);
> 		pNewExpr2 = sqlExprDup(db, pLeft, 0);
> -		if (noCase) {
> -			pNewExpr2 =
> -				sqlExprAddCollateString(pParse, pNewExpr2,
> -							    "unicode_ci");
> -		}
> 		pNewExpr2 = sqlPExpr(pParse, TK_LT, pNewExpr2, pStr2);
> 		transferJoinMarkings(pNewExpr2, pExpr);
> 		idxNew2 = whereClauseInsert(pWC, pNewExpr2, wtFlags);
> diff --git a/test/sql-tap/collation.test.lua b/test/sql-tap/collation.test.lua
> index 79547361c..9e404282a 100755
> --- a/test/sql-tap/collation.test.lua
> +++ b/test/sql-tap/collation.test.lua
> @@ -1,6 +1,6 @@
> #!/usr/bin/env tarantool
> test = require("sqltester")
> -test:plan(174)
> +test:plan(173)

Please add test verifying that arguments of incompatible
collations can’t participate in LIKE predicate.

> 
> local prefix = "collation-"
> 
> @@ -477,13 +477,13 @@ for _, data_collation in ipairs(data_collations) do
>     end
> end
> 
> --- Like uses collation (only for unicode_ci and binary)
> +-- <LIKE> uses collation. If <LIKE> has explicit <COLLATE>, use it
> +-- instead of implicit.
> local like_testcases =
> {
>     {"2.0",
>     [[
> -        CREATE TABLE tx1 (s1 VARCHAR(5) PRIMARY KEY);
> -        CREATE INDEX I1 on tx1(s1 collate "unicode_ci”);

Why did you drop index?

> +        CREATE TABLE tx1 (s1 VARCHAR(5) PRIMARY KEY COLLATE "unicode_ci");
>         INSERT INTO tx1 VALUES('aaa');
>         INSERT INTO tx1 VALUES('Aab');
>         INSERT INTO tx1 VALUES('İac');
> @@ -491,35 +491,32 @@ local like_testcases =
>     ]], {0}},
>     {"2.1.1",
>         "SELECT * FROM tx1 WHERE s1 LIKE 'A%' order by s1;",
> -        {0, {"Aab", "aaa"}} },
> +        {0, {"aaa","Aab"}} },

Why order of results has changed?

>     {"2.1.2",
>         "EXPLAIN QUERY PLAN SELECT * FROM tx1 WHERE s1 LIKE 'A%';",
> -        {0, {0, 0, 0, "SEARCH TABLE TX1 USING COVERING INDEX I1 (S1>? AND S1<?) (~16384 rows)"}}},
> +        {0, {0, 0, 0, "SEARCH TABLE TX1 USING PRIMARY KEY (S1>? AND S1<?) (~16384 rows)"}}},

I guess this is due to dropped index.

>     {"2.2.0",
> -        "PRAGMA case_sensitive_like = true;",
> -        {0}},
> -    {"2.2.1",
> -        "SELECT * FROM tx1 WHERE s1 LIKE 'A%' order by s1;",
> +        "SELECT * FROM tx1 WHERE s1 LIKE 'A%' COLLATE \"unicode\" order by s1;",
>         {0, {"Aab"}} },
> -    {"2.2.2",
> +    {"2.2.1",
>         "EXPLAIN QUERY PLAN SELECT * FROM tx1 WHERE s1 LIKE 'A%';",
>         {0, {0, 0, 0, "/USING PRIMARY KEY/"}} },
>     {"2.3.0",
> -        "PRAGMA case_sensitive_like = false;",
> -        {0}},
> -    {"2.3.1",
>         "SELECT * FROM tx1 WHERE s1 LIKE 'i%' order by s1;",
> -        {0, {"iad", "İac"}}},
> -    {"2.3.2",
> -        "SELECT * FROM tx1 WHERE s1 LIKE 'İ%'order by s1;",
> -        {0, {"iad", "İac"}} },
> +        {0, {"İac", "iad"}}},
> +    {"2.3.1",
> +        "SELECT * FROM tx1 WHERE s1 LIKE 'İ%' COLLATE \"unicode\" order by s1;",
> +        {0, {"İac"}} },
>     {"2.4.0",
>     [[
>         INSERT INTO tx1 VALUES('ЯЁЮ');
>     ]], {0} },
>     {"2.4.1",
> +        "SELECT * FROM tx1 WHERE s1 LIKE 'яёю' COLLATE \"unicode\";",
> +        {0, {}} },
> +    {"2.4.2",
>         "SELECT * FROM tx1 WHERE s1 LIKE 'яёю';",
> -        {0, {"ЯЁЮ"}} },
> +        {0, {"ЯЁЮ"}} }

Stray diff.

> }
> 
> test:do_catchsql_set_test(like_testcases, prefix)
> diff --git a/test/sql-tap/gh-3251-string-pattern-comparison.test.lua b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua
> index 95e8c98d0..eb9d7c3b9 100755
> --- a/test/sql-tap/gh-3251-string-pattern-comparison.test.lua
> +++ b/test/sql-tap/gh-3251-string-pattern-comparison.test.lua
> @@ -1,6 +1,6 @@
> #!/usr/bin/env tarantool
> test = require("sqltester")
> -test:plan(128)
> +test:plan(156)
> 
> local prefix = "like-test-"
> 
> @@ -80,10 +80,10 @@ local like_test_cases =
>         {0, {false}} },
>     {"1.25",
>         "SELECT 'ёфÅŒش' LIKE '%œش';",
> -        {0, {true}} },
> +        {0, {false}} },

Could you add an appropriate collation to avoid fixing
test results?

>     {"1.26",
>         "SELECT 'ÅŒش' LIKE '%œش';",
> -        {0, {true}} },
> +        {0, {false}} },
>     {"1.27",
>         "SELECT 'ёф' LIKE 'ё_';",
>         {0, {true}} },
> @@ -120,6 +120,90 @@ local like_test_cases =
>     {"1.38",
>         "SELECT 'ÅŒش' LIKE 'ё_%';",
>         {0, {false}} },
> +    {"1.39",
> +        "SELECT 'A' LIKE 'A' COLLATE \"unicode\";",
> +        {0, {true}} },

Please, add not only tests involving constant literals, but
also columns with implicit/explicit collations.