[tarantool-patches] Re: [PATCH] sql: replace instr() with position()

i.koptelov ivan.koptelov at tarantool.org
Wed Mar 13 14:56:20 MSK 2019


Thank you for the review. Firstly, I would like to discuss some
design issues and then send codestyle & functionality changes.
> On 12 Mar 2019, at 16:31, n.pettik <korablev at tarantool.org> wrote:
> 
>> On 10 Mar 2019, at 10:41, Ivan Koptelov <ivan.koptelov at tarantool.org> wrote:
>> 
>> Before this patch we had instr() SQL function.
>> After the patch it is renamed to position().
>> Also a few things have been changed: arguments
>> order, allowed arguments types and usage of
>> collations.
>> 
>> The order of the arguments has changed:
>> Before: instr(haystack, needle).
>> After: position(needle, haystack).
> 
> What are reasons to change arguments' order?
> I guess to make it closer to ANSI syntax <POSITION a IN b>.
> 
>> Type checking became more strict: before it was
>> possible to call the function with INTEGER arguments
>> or with arguments of different types. Now both arguments
>> must have the same type and be either text or binary
>> strings.
> 
> Are there any reasons to support BLOBs as arguments?
> For instance, we’ve banned this opportunity for LIKE func.
Standard says:

<position expression> ::= <character position expression>
			| <binary position expression>

By the way, is LIKE a function? Isn’t it an expression which
can be placed only after WHERE ?
> 
>> Before the patch collations were not taken into
>> consideration during the search. Now it is fixed, and
>> both implicit (column) collations and explicit
>> (using COLLATE expression) are used. There are
>> several possible cases:
>> 
>> - One of collations is mentioned alongside with explicit
>> COLLATE clause, which forces this collation over another
>> one. It is allowed to have the same forced collations;
>> - Both collations are derived from table columns and they
>> are the same;
>> - One collation is derived from table column and another
>> one is not specified (i.e. COLL_NONE);
> 
> This is regulated by ANSI “Type combination” rules.
> Skip this paragraph and simply mention this fact.
> 
>> In all other cases they are not accounted to be compatible
>> and error is raised.
>> 
>> Related to #3933
> 
> I would use “Workaround for” label.
> Also, please underline the fact that syntax is still
> different from ANSI one and describe reasons for that.
> 
>> 
>> @TarantoolBot document
>> Title: instr() is replaced with position()
>> Changes are described in the associated commit
>> message.
> 
> AFAIK we do this vice versa: move vital parts of commit
> message under doc bot tag request.
> 
>> diff --git a/src/box/sql/expr.c b/src/box/sql/expr.c
>> index a75f23756..59b43ec41 100644
>> --- a/src/box/sql/expr.c
>> +++ b/src/box/sql/expr.c
>> @@ -4054,7 +4054,8 @@ sqlExprCodeTarget(Parse * pParse, Expr * pExpr, int target)
>> 					constMask |= MASKBIT32(i);
>> 				}
>> 				if ((pDef->funcFlags & SQL_FUNC_NEEDCOLL) !=
>> -				    0 && coll == NULL) {
>> +				    0 && coll == NULL &&
>> +				    !(pDef->funcFlags & SQL_FUNC_ARG_COLL)) {
> 
> Why these conditions can’t be compatible?
> What is the difference between SQL_FUNC_NEEDCOLL
> and new flag? Does NEEDCOLL refer to returning value?
> Please, comment this part.
These conditions can’t be compatible because
SQL_FUNC_NEEDCOLL and SQL_FUNC_ARG_COLL ask for
different actions. SQL_FUNC_NEEDCOLL asks
to find first argument with collation in
the arguments expression list (from left
to right) and use it inside function
as a collation of the whole expression.
SQL_FUNC_ARG_COLL asks to check if
arguments collations are compatible
and use the right one if they are.
I would add comment to the code.

If we want to establish proper collation
handling in all functions, we should
probably use only one flag to do check
and set the right collation.

I would merge these flags into one, OK?

> 
>> 					bool unused;
>> 					uint32_t id;
>> 					if (sql_expr_coll(pParse,
>> @@ -4064,6 +4065,37 @@ sqlExprCodeTarget(Parse * pParse, Expr * pExpr, int target)
>> 						return 0;
>> 				}
>> 			}
>> +			if (pDef->funcFlags & SQL_FUNC_ARG_COLL) {
>> +				assert(nFarg == 2);
> 
> Add comment to this assertion or better to the whole branch.
> It is not obvious at first sight. Why only functions with two args…
> For instance, there’s MAX function, which may take any number
> of arguments. Collations are also vital for this function, since
> they are used to compare strings.
> 
> What is more, I suggest to move this part in a separate patch,
> since position is not the only function that must apply collations
> on its operands. I don’t insist on dozens of tests on each built-in
> function, just a few.
So do you propose to make two patches: first with adding proper
collation checking/setting in sqlExprCodeTarget() and another
with adding proper collation usage in functions in func.c?
Is it OK if I put them in one patch set?
> 
>> +				struct coll *colls[2] = {NULL};
> 
> I’d rather use explicit naming:
> struct coll *lhs_coll = NULL;
> struct coll *rhs_coll = NULL;
> 
> It makes code a bit longer, but on the other hand IMHO
> makes it less prone to typos.
But this function sqlExprCodeTarget() is called for expression
list (of arguments). If we would use it for functions like MAX
(as we want to) then I think an array should be used. Or just
two loops to check every pair of arguments collations for
compatibility.  
> 
>> +				uint32_t colls_ids[2] = {0};
>> +				bool is_forced[2] = {false};
>> +				if (sql_expr_coll(pParse, pFarg->a[0].pExpr,
>> +						  &is_forced[0], &colls_ids[0],
>> +						  &colls[0]) != 0)
>> +					return 0;
>> +				if (sql_expr_coll(pParse, pFarg->a[1].pExpr,
>> +						  &is_forced[1], &colls_ids[1],
>> +						  &colls[1]) != 0)
>> +					return 0;
>> +
>> +				uint32_t coll_id;
>> +				if (collations_check_compatibility(colls_ids[0],
>> +								   is_forced[0],
>> +								   colls_ids[1],
>> +								   is_forced[1],
>> +								   &coll_id)
>> +								   != 0) {
>> +					pParse->rc = SQL_TARANTOOL_ERROR;
>> +					pParse->nErr++;
>> +					return 0;
>> +				}
>> +
>> +				coll = (coll_id == colls_ids[0]) ? colls[0] :
>> +								   colls[1];
>> +				if (coll == NULL)
>> +					coll = coll_by_id(COLL_NONE)->coll;
> 
> Why do we need this?
Because if there is no collation at all among
arguments, then the context would not have any
collation as well. And sqlGetFuncCollSeq()
would fail with assertion in this case, because
it always expects some collations being set in context.
> 
>> +			}
>> 			if (pFarg) {
>> 				if (constMask) {
>> 					r1 = pParse->nMem + 1;
>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>> index 2de6ef5ce..f44b7ce78 100644
>> --- a/src/box/sql/func.c
>> +++ b/src/box/sql/func.c
>> @@ -38,6 +38,7 @@
>> #include "vdbeInt.h"
>> #include "version.h"
>> #include "coll/coll.h"
>> +#include "tarantoolInt.h"
>> #include <unicode/ustring.h>
>> #include <unicode/ucasemap.h>
>> #include <unicode/ucnv.h>
>> @@ -212,59 +213,173 @@ absFunc(sql_context * context, int argc, sql_value ** argv)
>> }
>> 
>> /*
>> - * Implementation of the instr() function.
>> + * Returns name of SQL type, which is given by the int code.
>> *
>> - * instr(haystack,needle) finds the first occurrence of needle
>> + * @param sql_type One of enum sql_type codes.
>> + * @retval string representing name of the given type.
>> + */
>> +static inline char *
>> +sql_value_type_to_str(enum sql_type sql_type) {
> 
> I don’t recommend using this function: mem_type_to_str()
> already exists, just make it non-static. sql_type is extracted
> as type of VDBE memory cell, so in fact it is the same thing. 
Sorry, I’ve missed mem_type_to_str(). I would use it and remove
sql_value_type_to_str()
> 
>> +	switch(sql_type) {
>> +		case SQL_INTEGER:
>> +			return "INTEGER";
>> +		case SQL_TEXT:
>> +			return "TEXT";
>> +		case SQL_FLOAT:
>> +			return "REAL";
>> +		case SQL_BLOB:
>> +			return "BLOB";
>> +		default:
>> +			return "NULL";
>> +	}
>> +}
>> +
>> +/*
>> + * Implementation of the position() function.
>> + *
>> + * position(haystack,needle) finds the first occurrence of needle
>> * in haystack and returns the number of previous characters plus 1,
>> * or 0 if needle does not occur within haystack.
>> *
>> - * If both haystack and needle are BLOBs, then the result is one more than
>> - * the number of bytes in haystack prior to the first occurrence of needle,
>> - * or 0 if needle never occurs in haystack.
>> + * Haystack and needle must both have the same type, either
>> + * TEXT or BLOB.
>> + *
>> + * If haystack and needle are BLOBs, then the result is one more
>> + * than the number of bytes in haystack prior to the first
>> + * occurrence of needle, or 0 if needle never occurs in haystack.
>> + *
>> + * If haystack and needle are TEXTs, then their collations (if
>> + * any) are taken into consideration. If only one param has a
>> + * collation, then it is used. If both params have collations,
>> + * then the right one is chosen by
>> + * box/sql/sqlInt.h/collations_check_compatibility() function
>> + * (If collations are not compatible then the error is raised).
>> + *
>> + * Note that the "collation-selection" logic is implemented in
>> + * box/sql/expr.c/sqlExprCodeTarget() function.
>> */
>> static void
>> -instrFunc(sql_context * context, int argc, sql_value ** argv)
>> +positionFunc(sql_context *context, int argc, sql_value **argv)
> 
> Please, refactor this function in two steps:
> firstly provide code style fixes, then functional ones.
> Otherwise it complicates review process.
> 
>> {
>> -	const unsigned char *zHaystack;
>> -	const unsigned char *zNeedle;
>> -	int nHaystack;
>> -	int nNeedle;
>> -	int typeHaystack, typeNeedle;
>> +	const unsigned char *haystack;
>> +	const unsigned char *needle;
>> +	int n_haystack_bytes;
>> +	int n_needle_bytes;
> 
> Ok, if you started refactoring, please finish it:
> move args declaration to their initialisations,
> fix name of function, != 0 -> != NULL etc.
> 
>> +	int type_haystack, type_needle;
>> 	int N = 1;
>> -	int isText;
>> 
>> 	UNUSED_PARAMETER(argc);
>> -	typeHaystack = sql_value_type(argv[0]);
>> -	typeNeedle = sql_value_type(argv[1]);
>> -	if (typeHaystack == SQL_NULL || typeNeedle == SQL_NULL)
>> +	type_haystack = sql_value_type(argv[1]);
>> +	type_needle = sql_value_type(argv[0]);
>> +	if (type_haystack == SQL_NULL || type_needle == SQL_NULL)
>> +		return;
>> +	/*
>> + 	 * Position function can be called only with string
>> + 	 * or blob params.
>> + 	 */
>> +	int inconsistent_type = 0;
>> +	if (type_needle != SQL_TEXT && type_needle != SQL_BLOB)
>> +		inconsistent_type = type_needle;
>> +	if (type_haystack != SQL_TEXT && type_haystack != SQL_BLOB)
>> +		inconsistent_type = type_haystack;
>> +	if (inconsistent_type > 0) {
>> +		diag_set(ClientError, ER_INCONSISTENT_TYPES, "TEXT or BLOB",
>> +			 sql_value_type_to_str(inconsistent_type));
>> +		context->pVdbe->pParse->rc = SQL_TARANTOOL_ERROR;
>> +		context->pVdbe->pParse->nErr++;
>> +		sql_result_error(context, tarantoolErrorMessage(), -1);
>> +		return;
>> +	}
>> +	/*
>> +	 * Both params of Position function must be of the same
>> +	 * type.
>> +	 */
>> +	if (type_haystack != type_needle) {
>> +		diag_set(ClientError, ER_INCONSISTENT_TYPES,
>> +			 sql_value_type_to_str(type_needle),
>> +			 sql_value_type_to_str(type_haystack));
>> +		context->pVdbe->pParse->rc = SQL_TARANTOOL_ERROR;
>> +		context->pVdbe->pParse->nErr++;
>> +		sql_result_error(context, tarantoolErrorMessage(), -1);
>> 		return;
>> -	nHaystack = sql_value_bytes(argv[0]);
>> -	nNeedle = sql_value_bytes(argv[1]);
>> -	if (nNeedle > 0) {
>> -		if (typeHaystack == SQL_BLOB && typeNeedle == SQL_BLOB) {
>> -			zHaystack = sql_value_blob(argv[0]);
>> -			zNeedle = sql_value_blob(argv[1]);
>> -			assert(zNeedle != 0);
>> -			assert(zHaystack != 0 || nHaystack == 0);
>> -			isText = 0;
>> +	}
>> +	n_haystack_bytes = sql_value_bytes(argv[1]);
>> +	n_needle_bytes = sql_value_bytes(argv[0]);
>> +	if (n_needle_bytes > 0) {
>> +		if (type_haystack == SQL_BLOB) {
>> +			haystack = sql_value_blob(argv[1]);
>> +			needle = sql_value_blob(argv[0]);
>> +			assert(needle != 0);
>> +			assert(haystack != 0 || n_haystack_bytes == 0);
>> +
>> +			while (n_needle_bytes <= n_haystack_bytes
>> +			       && memcmp(haystack, needle, n_needle_bytes) != 0) {
>> +				N++;
>> +				n_haystack_bytes--;
>> +				haystack++;
>> +			}
>> +			if (n_needle_bytes > n_haystack_bytes)
>> +				N = 0;
>> 		} else {
>> -			zHaystack = sql_value_text(argv[0]);
>> -			zNeedle = sql_value_text(argv[1]);
>> -			isText = 1;
>> -			if (zHaystack == 0 || zNeedle == 0)
>> +			/*
>> +			 * Code below handles not only simple
>> +			 * cases like position('a', 'bca'), but
>> +			 * also more complex ones:
>> +			 * position('a', 'bcá' COLLATE "unicode_ci")
>> +			 * To do so, we need to use comparison
>> +			 * window, which has constant character
>> +			 * size, but variable byte size.
>> +			 * Character size is equal to
>> +			 * needle char size.
>> +			 */
>> +			haystack = sql_value_text(argv[1]);
>> +			needle = sql_value_text(argv[0]);
>> +			if (haystack == 0 || needle == 0)
>> 				return;
> 
> How they can be NULL, if this case is filtered in the beginning:
> 
> if (type_haystack == SQL_NULL || type_needle == SQL_NULL)
>       return;
> 
> 
>> -		}
>> -		while (nNeedle <= nHaystack
>> -		       && memcmp(zHaystack, zNeedle, nNeedle) != 0) {
>> -			N++;
>> -			do {
>> -				nHaystack--;
>> -				zHaystack++;
>> -			} while (isText && (zHaystack[0] & 0xc0) == 0x80);
>> -		}
>> -		if (nNeedle > nHaystack)
>> +			struct coll *coll = sqlGetFuncCollSeq(context);
>> +
>> + 			int n_needle_chars =
>> + 				sql_utf8_char_count(needle, n_needle_bytes);
>> +			int n_haystack_chars =
>> +				sql_utf8_char_count(haystack, n_haystack_bytes);
>> +
>> +			if (n_haystack_chars < n_needle_chars) {
>> +				N = 0;
> 
> What’s N?
Position of needle beginning in haystack. I would change the naming.
> 
>> diff --git a/src/box/sql/sqlInt.h b/src/box/sql/sqlInt.h
>> index 1d8fae5b0..4d059dd7e 100644
>> --- a/src/box/sql/sqlInt.h
>> +++ b/src/box/sql/sqlInt.h
>> @@ -1666,6 +1666,13 @@ struct FuncDestructor {
>> #define SQL_FUNC_SLOCHNG  0x2000	/* "Slow Change". Value constant during a
>> 					 * single query - might change over time
>> 					 */
>> +#define SQL_FUNC_ARG_COLL 0x4000
> 
> ARG_NEED_COLL?





More information about the Tarantool-patches mailing list