[tarantool-patches] Re: [PATCH] sql: replace instr() with position()

Tue Mar 12 16:31:30 MSK 2019

> On 10 Mar 2019, at 10:41, Ivan Koptelov <ivan.koptelov at tarantool.org> wrote:
> 
> Before this patch we had instr() SQL function.
> After the patch it is renamed to position().
> Also a few things have been changed: arguments
> order, allowed arguments types and usage of
> collations.
> 
> The order of the arguments has changed:
> Before: instr(haystack, needle).
> After: position(needle, haystack).

What are reasons to change arguments' order?
I guess to make it closer to ANSI syntax <POSITION a IN b>.

> Type checking became more strict: before it was
> possible to call the function with INTEGER arguments
> or with arguments of different types. Now both arguments
> must have the same type and be either text or binary
> strings.

Are there any reasons to support BLOBs as arguments?
For instance, we’ve banned this opportunity for LIKE func.

> Before the patch collations were not taken into
> consideration during the search. Now it is fixed, and
> both implicit (column) collations and explicit
> (using COLLATE expression) are used. There are
> several possible cases:
> 
> - One of collations is mentioned alongside with explicit
> COLLATE clause, which forces this collation over another
> one. It is allowed to have the same forced collations;
> - Both collations are derived from table columns and they
> are the same;
> - One collation is derived from table column and another
> one is not specified (i.e. COLL_NONE);

This is regulated by ANSI “Type combination” rules.
Skip this paragraph and simply mention this fact.

> In all other cases they are not accounted to be compatible
> and error is raised.
> 
> Related to #3933

I would use “Workaround for” label.
Also, please underline the fact that syntax is still
different from ANSI one and describe reasons for that.

> 
> @TarantoolBot document
> Title: instr() is replaced with position()
> Changes are described in the associated commit
> message.

AFAIK we do this vice versa: move vital parts of commit
message under doc bot tag request.

> diff --git a/src/box/sql/expr.c b/src/box/sql/expr.c
> index a75f23756..59b43ec41 100644
> --- a/src/box/sql/expr.c
> +++ b/src/box/sql/expr.c
> @@ -4054,7 +4054,8 @@ sqlExprCodeTarget(Parse * pParse, Expr * pExpr, int target)
> 					constMask |= MASKBIT32(i);
> 				}
> 				if ((pDef->funcFlags & SQL_FUNC_NEEDCOLL) !=
> -				    0 && coll == NULL) {
> +				    0 && coll == NULL &&
> +				    !(pDef->funcFlags & SQL_FUNC_ARG_COLL)) {

Why these conditions can’t be compatible?
What is the difference between SQL_FUNC_NEEDCOLL
and new flag? Does NEEDCOLL refer to returning value?
Please, comment this part.

> 					bool unused;
> 					uint32_t id;
> 					if (sql_expr_coll(pParse,
> @@ -4064,6 +4065,37 @@ sqlExprCodeTarget(Parse * pParse, Expr * pExpr, int target)
> 						return 0;
> 				}
> 			}
> +			if (pDef->funcFlags & SQL_FUNC_ARG_COLL) {
> +				assert(nFarg == 2);

Add comment to this assertion or better to the whole branch.
It is not obvious at first sight. Why only functions with two args…
For instance, there’s MAX function, which may take any number
of arguments. Collations are also vital for this function, since
they are used to compare strings.

What is more, I suggest to move this part in a separate patch,
since position is not the only function that must apply collations
on its operands. I don’t insist on dozens of tests on each built-in
function, just a few.

> +				struct coll *colls[2] = {NULL};

I’d rather use explicit naming:
struct coll *lhs_coll = NULL;
struct coll *rhs_coll = NULL;

It makes code a bit longer, but on the other hand IMHO
makes it less prone to typos.

> +				uint32_t colls_ids[2] = {0};
> +				bool is_forced[2] = {false};
> +				if (sql_expr_coll(pParse, pFarg->a[0].pExpr,
> +						  &is_forced[0], &colls_ids[0],
> +						  &colls[0]) != 0)
> +					return 0;
> +				if (sql_expr_coll(pParse, pFarg->a[1].pExpr,
> +						  &is_forced[1], &colls_ids[1],
> +						  &colls[1]) != 0)
> +					return 0;
> +
> +				uint32_t coll_id;
> +				if (collations_check_compatibility(colls_ids[0],
> +								   is_forced[0],
> +								   colls_ids[1],
> +								   is_forced[1],
> +								   &coll_id)
> +								   != 0) {
> +					pParse->rc = SQL_TARANTOOL_ERROR;
> +					pParse->nErr++;
> +					return 0;
> +				}
> +
> +				coll = (coll_id == colls_ids[0]) ? colls[0] :
> +								   colls[1];
> +				if (coll == NULL)
> +					coll = coll_by_id(COLL_NONE)->coll;

Why do we need this?

> +			}
> 			if (pFarg) {
> 				if (constMask) {
> 					r1 = pParse->nMem + 1;
> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> index 2de6ef5ce..f44b7ce78 100644
> --- a/src/box/sql/func.c
> +++ b/src/box/sql/func.c
> @@ -38,6 +38,7 @@
> #include "vdbeInt.h"
> #include "version.h"
> #include "coll/coll.h"
> +#include "tarantoolInt.h"
> #include <unicode/ustring.h>
> #include <unicode/ucasemap.h>
> #include <unicode/ucnv.h>
> @@ -212,59 +213,173 @@ absFunc(sql_context * context, int argc, sql_value ** argv)
> }
> 
> /*
> - * Implementation of the instr() function.
> + * Returns name of SQL type, which is given by the int code.
>  *
> - * instr(haystack,needle) finds the first occurrence of needle
> + * @param sql_type One of enum sql_type codes.
> + * @retval string representing name of the given type.
> + */
> +static inline char *
> +sql_value_type_to_str(enum sql_type sql_type) {

I don’t recommend using this function: mem_type_to_str()
already exists, just make it non-static. sql_type is extracted
as type of VDBE memory cell, so in fact it is the same thing. 

> +	switch(sql_type) {
> +		case SQL_INTEGER:
> +			return "INTEGER";
> +		case SQL_TEXT:
> +			return "TEXT";
> +		case SQL_FLOAT:
> +			return "REAL";
> +		case SQL_BLOB:
> +			return "BLOB";
> +		default:
> +			return "NULL";
> +	}
> +}
> +
> +/*
> + * Implementation of the position() function.
> + *
> + * position(haystack,needle) finds the first occurrence of needle
>  * in haystack and returns the number of previous characters plus 1,
>  * or 0 if needle does not occur within haystack.
>  *
> - * If both haystack and needle are BLOBs, then the result is one more than
> - * the number of bytes in haystack prior to the first occurrence of needle,
> - * or 0 if needle never occurs in haystack.
> + * Haystack and needle must both have the same type, either
> + * TEXT or BLOB.
> + *
> + * If haystack and needle are BLOBs, then the result is one more
> + * than the number of bytes in haystack prior to the first
> + * occurrence of needle, or 0 if needle never occurs in haystack.
> + *
> + * If haystack and needle are TEXTs, then their collations (if
> + * any) are taken into consideration. If only one param has a
> + * collation, then it is used. If both params have collations,
> + * then the right one is chosen by
> + * box/sql/sqlInt.h/collations_check_compatibility() function
> + * (If collations are not compatible then the error is raised).
> + *
> + * Note that the "collation-selection" logic is implemented in
> + * box/sql/expr.c/sqlExprCodeTarget() function.
>  */
> static void
> -instrFunc(sql_context * context, int argc, sql_value ** argv)
> +positionFunc(sql_context *context, int argc, sql_value **argv)

Please, refactor this function in two steps:
firstly provide code style fixes, then functional ones.
Otherwise it complicates review process.

> {
> -	const unsigned char *zHaystack;
> -	const unsigned char *zNeedle;
> -	int nHaystack;
> -	int nNeedle;
> -	int typeHaystack, typeNeedle;
> +	const unsigned char *haystack;
> +	const unsigned char *needle;
> +	int n_haystack_bytes;
> +	int n_needle_bytes;

Ok, if you started refactoring, please finish it:
move args declaration to their initialisations,
fix name of function, != 0 -> != NULL etc.

> +	int type_haystack, type_needle;
> 	int N = 1;
> -	int isText;
> 
> 	UNUSED_PARAMETER(argc);
> -	typeHaystack = sql_value_type(argv[0]);
> -	typeNeedle = sql_value_type(argv[1]);
> -	if (typeHaystack == SQL_NULL || typeNeedle == SQL_NULL)
> +	type_haystack = sql_value_type(argv[1]);
> +	type_needle = sql_value_type(argv[0]);
> +	if (type_haystack == SQL_NULL || type_needle == SQL_NULL)
> +		return;
> +	/*
> + 	 * Position function can be called only with string
> + 	 * or blob params.
> + 	 */
> +	int inconsistent_type = 0;
> +	if (type_needle != SQL_TEXT && type_needle != SQL_BLOB)
> +		inconsistent_type = type_needle;
> +	if (type_haystack != SQL_TEXT && type_haystack != SQL_BLOB)
> +		inconsistent_type = type_haystack;
> +	if (inconsistent_type > 0) {
> +		diag_set(ClientError, ER_INCONSISTENT_TYPES, "TEXT or BLOB",
> +			 sql_value_type_to_str(inconsistent_type));
> +		context->pVdbe->pParse->rc = SQL_TARANTOOL_ERROR;
> +		context->pVdbe->pParse->nErr++;
> +		sql_result_error(context, tarantoolErrorMessage(), -1);
> +		return;
> +	}
> +	/*
> +	 * Both params of Position function must be of the same
> +	 * type.
> +	 */
> +	if (type_haystack != type_needle) {
> +		diag_set(ClientError, ER_INCONSISTENT_TYPES,
> +			 sql_value_type_to_str(type_needle),
> +			 sql_value_type_to_str(type_haystack));
> +		context->pVdbe->pParse->rc = SQL_TARANTOOL_ERROR;
> +		context->pVdbe->pParse->nErr++;
> +		sql_result_error(context, tarantoolErrorMessage(), -1);
> 		return;
> -	nHaystack = sql_value_bytes(argv[0]);
> -	nNeedle = sql_value_bytes(argv[1]);
> -	if (nNeedle > 0) {
> -		if (typeHaystack == SQL_BLOB && typeNeedle == SQL_BLOB) {
> -			zHaystack = sql_value_blob(argv[0]);
> -			zNeedle = sql_value_blob(argv[1]);
> -			assert(zNeedle != 0);
> -			assert(zHaystack != 0 || nHaystack == 0);
> -			isText = 0;
> +	}
> +	n_haystack_bytes = sql_value_bytes(argv[1]);
> +	n_needle_bytes = sql_value_bytes(argv[0]);
> +	if (n_needle_bytes > 0) {
> +		if (type_haystack == SQL_BLOB) {
> +			haystack = sql_value_blob(argv[1]);
> +			needle = sql_value_blob(argv[0]);
> +			assert(needle != 0);
> +			assert(haystack != 0 || n_haystack_bytes == 0);
> +
> +			while (n_needle_bytes <= n_haystack_bytes
> +			       && memcmp(haystack, needle, n_needle_bytes) != 0) {
> +				N++;
> +				n_haystack_bytes--;
> +				haystack++;
> +			}
> +			if (n_needle_bytes > n_haystack_bytes)
> +				N = 0;
> 		} else {
> -			zHaystack = sql_value_text(argv[0]);
> -			zNeedle = sql_value_text(argv[1]);
> -			isText = 1;
> -			if (zHaystack == 0 || zNeedle == 0)
> +			/*
> +			 * Code below handles not only simple
> +			 * cases like position('a', 'bca'), but
> +			 * also more complex ones:
> +			 * position('a', 'bcá' COLLATE "unicode_ci")
> +			 * To do so, we need to use comparison
> +			 * window, which has constant character
> +			 * size, but variable byte size.
> +			 * Character size is equal to
> +			 * needle char size.
> +			 */
> +			haystack = sql_value_text(argv[1]);
> +			needle = sql_value_text(argv[0]);
> +			if (haystack == 0 || needle == 0)
> 				return;

How they can be NULL, if this case is filtered in the beginning:

if (type_haystack == SQL_NULL || type_needle == SQL_NULL)
       return;

> -		}
> -		while (nNeedle <= nHaystack
> -		       && memcmp(zHaystack, zNeedle, nNeedle) != 0) {
> -			N++;
> -			do {
> -				nHaystack--;
> -				zHaystack++;
> -			} while (isText && (zHaystack[0] & 0xc0) == 0x80);
> -		}
> -		if (nNeedle > nHaystack)
> +			struct coll *coll = sqlGetFuncCollSeq(context);
> +
> + 			int n_needle_chars =
> + 				sql_utf8_char_count(needle, n_needle_bytes);
> +			int n_haystack_chars =
> +				sql_utf8_char_count(haystack, n_haystack_bytes);
> +
> +			if (n_haystack_chars < n_needle_chars) {
> +				N = 0;

What’s N?

> diff --git a/src/box/sql/sqlInt.h b/src/box/sql/sqlInt.h
> index 1d8fae5b0..4d059dd7e 100644
> --- a/src/box/sql/sqlInt.h
> +++ b/src/box/sql/sqlInt.h
> @@ -1666,6 +1666,13 @@ struct FuncDestructor {
> #define SQL_FUNC_SLOCHNG  0x2000	/* "Slow Change". Value constant during a
> 					 * single query - might change over time
> 					 */
> +#define SQL_FUNC_ARG_COLL 0x4000

ARG_NEED_COLL?