[tarantool-patches] Re: [PATCH] sql: modify TRIM() function signature

Sun Apr 21 22:36:43 MSK 2019

Hi! Thanks for the patch! See 8 comments below.

>> 4. You again ignored my comment about NULL. Please, find all other
>> places and fix it finally. I said it already 1000 times in 1000
>> reviews - we do not use 0 for pointers. It is a simple rule. Just
>> follow it. Write it down somewhere in a list of code style rules
>> and check them all before sending a patch.
>>
>> Seeing how many my comments you repeatedly ignore, I think that
>> probably you should reconsider the way how you do self-reviews. If
>> you do it via just looking a couple of seconds at the code in the
>> text editor, then it is definitely a bad way.
>>
>> First of all, use 'git diff/show' in console to look only at the
>> patch changes, not at the entire files and functions. If you do not
>> like console, and it is ok, then you can use Sublime Merge
>> desktop program or Sublime Git package for the editor. When you
>> look at the diff only, it is much simpler to notice such violations
>> and even bugs.
> +	int char_cnt = sql_utf8_char_count(z, trim_set_sz);
> +	if (char_cnt == 0)
> 
> +	unsigned char **ind_chars =
> +		contextMalloc(context,
> +			      char_cnt * (sizeof(unsigned char *) + 1));
> +	if (ind_chars == NULL)
> 
> +	int i = 0;
> +	char_cnt = 0;
> 
> +	if ((input_str = sql_value_text(argv[0])) == NULL)
> +		return;
> 
> +	if ((input_str = sql_value_text(argv[1])) == NULL)
> +		return;
> 
> +	} else if ((trim_set = sql_value_text(argv[0])) != NULL) {
> +		int trim_set_sz = sql_value_bytes(argv[0]);
> 
> +	if ((input_str = sql_value_text(argv[2])) == NULL ||
> +	    (trim_set = sql_value_text(argv[1])) == NULL)
> +		return;

1. Sorry, but I do not understand this bunch of '+' and random lines
above. What is it? It is not a diff - I do not see function names,
'-', line numbers. Please, next time provide *full* normal diff
obtained by 'git --no-pager diff' command.

The same about all other 'diff's below.

> 
>> 5. In our code style we do not use 'char' to represent numbers, we
>> use 'uint8_t' or 'int8_t' when we want to use one-byte numbers. It
>> is the same as 'char'/'unsigned char', but looks shorter and it
>> becomes obvious that these values are used as numbers, not text.
>> Firstly I thought that char_len was an array of characters, but
>> it emerged being an array of symbol sizes. In the summary, I
>> suggest to use 'uint8_t *' for char_len array.
> +	uint8_t *char_len = (uint8_t *)&ind_chars[char_cnt];
> 
>>> +	/* Individual characters in the character set. */
>>> +	char unsigned **ind_chars = 0;
>>
>> 6. If you declare it as 'const char unsigned **', then you
>> can remove unnecessary type cast from line 1330.

> If you meant the following line "ind_chars[char_cnt] = (unsigned char *)(z + i);”
> then it isn’t compiled without the cast, because of assigning to "'unsigned char *’
> from 'const unsigned char *’”.

2. No, I meant literally the same what I said - if you declare
'ind_chars' as 'const unsigned char **' instead of 'unsigned char **',
then you do not need this cast. It is a standard. Doubtfully it
depends on compiler, and the diff below must work everywhere.
Please, check it out.

==========================================================================

diff --git a/src/box/sql/func.c b/src/box/sql/func.c
index 6f2a5e3f6..9bd41335e 100644
--- a/src/box/sql/func.c
+++ b/src/box/sql/func.c
@@ -1313,7 +1313,7 @@ trim_procedure(struct sql_context *context, enum trim_side_mask flags,
 	if (char_cnt == 0)
 		goto result;
 	/* Individual characters in the character set. */
-	unsigned char **ind_chars =
+	const unsigned char **ind_chars =
 		contextMalloc(context,
 			      char_cnt * (sizeof(unsigned char *) + 1));
 	if (ind_chars == NULL)
@@ -1325,7 +1325,7 @@ trim_procedure(struct sql_context *context, enum trim_side_mask flags,
 	char_cnt = 0;
 	int handled_bytes_cnt = trim_set_sz;
 	while(handled_bytes_cnt > 0) {
-		ind_chars[char_cnt] = (unsigned char *)(z + i);
+		ind_chars[char_cnt] = z + i;
 		SQL_UTF8_FWD_1(z, i, trim_set_sz);
 		char_len[char_cnt] = z + i - ind_chars[char_cnt];
 		handled_bytes_cnt -= char_len[char_cnt];
==========================================================================

>>> +		if ((flags & TRIM_LEADING) != 0) {
>>> +			while (input_str_sz > 0) {
>>> 				int len = 0;
>>> -				for (i = 0; i < nChar; i++) {
>>> -					len = aLen[i];
>>> -					if (len <= nIn
>>> -					    && memcmp(zIn, azChar[i], len) == 0)
>>> +				for (i = 0; i < char_cnt; i++) {
>>> +					len = char_len[i];
>>> +					if (len <= input_str_sz
>>> +					    && memcmp(input_str,
>>> +						      ind_chars[i], len) == 0)
>>> 						break;
>>> 				}
>>> -				if (i >= nChar)
>>> +				if (i >= char_cnt)
>>> 					break;
>>> -				zIn += len;
>>> -				nIn -= len;
>>> +				input_str += len;
>>> +				input_str_sz -= len;
>>> 			}
>>> 		}
>>> -		if (flags & 2) {
>>> -			while (nIn > 0) {
>>> +		if ((flags & TRIM_TRAILING) != 0) {
>>> +			while (input_str_sz > 0) {
>>> 				int len = 0;
>>> -				for (i = 0; i < nChar; i++) {
>>> -					len = aLen[i];
>>> -					if (len <= nIn
>>> -					    && memcmp(&zIn[nIn - len],
>>> -						      azChar[i], len) == 0)
>>> +				for (i = 0; i < char_cnt; i++) {
>>> +					len = char_len[i];
>>> +					if (len <= input_str_sz
>>> +					    && memcmp(&input_str[input_str_sz - len],
>>> +						      ind_chars[i], len) == 0)
>>
>> 11. Out of 80. And you saw that in your editor, even without
>> 'git diff' and console, because you have 80-rulers. So why did
>> you decide not to fix it?
> I often saw same instances in the Tarantool’s code, when few characters is out of 80.

3. Where? If there are such places, then they violate our code style,
and usually are pushed by those who does not care about it, but
on the other hand sadly has rights to push without reviews directly
into the master.

> In my case, I just didn’t know how to fix that.

4. There are basically 3 ways to fix, always:

1) split into smaller functions;
2) add 'goto's;
3) reorganize the code so as to reduce number of checks.
   They are repetitive or unnecessary quite often.

In this concrete case I think 'goto's are ok.

>>> + */
>>> +static void
>>> +trim_func_one_arg(sql_context * context, int argc, sql_value **argv)
>>
>> 13. In new code we use explicit 'struct' keyword for struct
>> types - sql_context and sql_value. Also, we do not put whitepaces

5. Note, I said 'sql_value' too, and look at your code below. What
have you missed?

>> after '*' when declare a pointer type value. The same for other places.
> +trim_procedure(struct sql_context *context, enum trim_side_mask flags,
> +	       const unsigned char *trim_set, int trim_set_sz,
> +	       const unsigned char *input_str, int input_str_sz)
> 
> +static void
> +trim_func_one_arg(struct sql_context *context, int argc, sql_value **argv)
> 
> +static void
> +trim_func_two_args(struct sql_context *context, int argc, sql_value **argv)
> 
> +static void
> +trim_func_three_args(struct sql_context *context, int argc, sql_value **argv)
> > diff --git a/src/box/sql/func.c b/src/box/sql/func.c
> index abeecefa1..6f2a5e3f6 100644
> --- a/src/box/sql/func.c
> +++ b/src/box/sql/func.c
> @@ -1286,108 +1286,173 @@ replaceFunc(sql_context * context, int argc, sql_value ** argv)
>  	sql_result_text(context, (char *)zOut, j, sql_free);
>  }
>  
> -/*
> - * Implementation of the TRIM(), LTRIM(), and RTRIM() functions.
> - * The userdata is 0x1 for left trim, 0x2 for right trim, 0x3 for both.
> +/**
> + * Remove characters included in @a trim_set from @a input_str
> + * until encounter a character that doesn't belong to @a trim_set.
> + * Remove from the side specified by @a flags.
> + * @param context SQL context.
> + * @param flags Trim specification: left, right or both.
> + * @param trim_set The set of characters for trimming.
> + * @param trim_set_sz Character set size in bytes.
> + * @param input_str Input string for trimming.
> + * @param input_str_sz Input string size in bytes.
>   */
>  static void
> -trimFunc(sql_context * context, int argc, sql_value ** argv)
> +trim_procedure(struct sql_context *context, enum trim_side_mask flags,
> +	       const unsigned char *trim_set, int trim_set_sz,
> +	       const unsigned char *input_str, int input_str_sz)
>  {
> -	const unsigned char *zIn;	/* Input string */
> -	const unsigned char *zCharSet;	/* Set of characters to trim */
> -	int nIn;		/* Number of bytes in input */
> -	int flags;		/* 1: trimleft  2: trimright  3: trim */
> -	int i;			/* Loop counter */
> -	unsigned char *aLen = 0;	/* Length of each character in zCharSet */
> -	unsigned char **azChar = 0;	/* Individual characters in zCharSet */
> -	int nChar;		/* Number of characters in zCharSet */
> -
> -	if (sql_value_type(argv[0]) == SQL_NULL) {
> +	const unsigned char *z = trim_set;
> +	/*
> +	 * Count the number of UTF-8 characters passing through
> +	 * the entire char set, but not up to the '\0' or X'00'
> +	 * character. This allows to handle trimming set
> +	 * containing such characters.
> +	 */
> +	int char_cnt = sql_utf8_char_count(z, trim_set_sz);
> +	if (char_cnt == 0)
> +		goto result;
> +	/* Individual characters in the character set. */
> +	unsigned char **ind_chars =
> +		contextMalloc(context,
> +			      char_cnt * (sizeof(unsigned char *) + 1));
> +	if (ind_chars == NULL)
>  		return;
> +	/* Length of each character in the character set. */
> +	uint8_t *char_len = (uint8_t *)&ind_chars[char_cnt];
> +	z = trim_set;
> +	int i = 0;
> +	char_cnt = 0;
> +	int handled_bytes_cnt = trim_set_sz;
> +	while(handled_bytes_cnt > 0) {
> +		ind_chars[char_cnt] = (unsigned char *)(z + i);
> +		SQL_UTF8_FWD_1(z, i, trim_set_sz);
> +		char_len[char_cnt] = z + i - ind_chars[char_cnt];
> +		handled_bytes_cnt -= char_len[char_cnt];
> +		char_cnt++;
>  	}
> -	zIn = sql_value_text(argv[0]);
> -	if (zIn == 0)
> -		return;
> -	nIn = sql_value_bytes(argv[0]);
> -	assert(zIn == sql_value_text(argv[0]));
> -	if (argc == 1) {
> -		static const unsigned char lenOne[] = { 1 };
> -		static unsigned char *const azOne[] = { (u8 *) " " };
> -		nChar = 1;
> -		aLen = (u8 *) lenOne;
> -		azChar = (unsigned char **)azOne;
> -		zCharSet = 0;
> -	} else if ((zCharSet = sql_value_text(argv[1])) == 0) {
> -		return;
> -	} else {
> -		const unsigned char *z = zCharSet;
> -		int trim_set_sz = sql_value_bytes(argv[1]);
> -		/*
> -		* Count the number of UTF-8 characters passing
> -		* through the entire char set, but not up
> -		* to the '\0' or X'00' character. This allows
> -		* to handle trimming set containing such
> -		* characters.
> -		*/
> -		nChar = sql_utf8_char_count(z, trim_set_sz);
> -		if (nChar > 0) {
> -			azChar =
> -			    contextMalloc(context,
> -					  ((i64) nChar) * (sizeof(char *) + 1));
> -			if (azChar == 0) {
> -				return;
> -			}
> -			aLen = (unsigned char *)&azChar[nChar];
> -			z = zCharSet;
> -			i = 0;
> -			nChar = 0;
> -			int handled_bytes_cnt = trim_set_sz;
> -			while(handled_bytes_cnt > 0) {
> -				azChar[nChar] = (unsigned char *)(z + i);
> -				SQL_UTF8_FWD_1(z, i, trim_set_sz);
> -				aLen[nChar] = (u8) (z + i - azChar[nChar]);
> -				handled_bytes_cnt -= aLen[nChar];
> -				nChar++;
> -			}
> -		}
> -	}
> -	if (nChar > 0) {
> -		flags = SQL_PTR_TO_INT(sql_user_data(context));
> -		if (flags & 1) {
> -			while (nIn > 0) {
> -				int len = 0;
> -				for (i = 0; i < nChar; i++) {
> -					len = aLen[i];
> -					if (len <= nIn
> -					    && memcmp(zIn, azChar[i], len) == 0)
> -						break;
> -				}
> -				if (i >= nChar)
> +
> +	if (char_cnt == 0)
> +		goto result;

6. How is it possible, that first time sql_utf8_char_count()
returned not 0, but now we got 0 on the same string using the
same methods (SQL_UTF8_FWD_1)?

> +	if ((flags & TRIM_LEADING) != 0) {
> +		while (input_str_sz > 0) {
> +			int len = 0;
> +			for (i = 0; i < char_cnt; i++) {
> +				len = char_len[i];
> +				if (len <= input_str_sz
> +				    && memcmp(input_str,
> +					      ind_chars[i], len) == 0)
>  					break;
> -				zIn += len;
> -				nIn -= len;
>  			}
> +			if (i >= char_cnt)
> +				break;
> +			input_str += len;
> +			input_str_sz -= len;
>  		}
> -		if (flags & 2) {
> -			while (nIn > 0) {
> -				int len = 0;
> -				for (i = 0; i < nChar; i++) {
> -					len = aLen[i];
> -					if (len <= nIn
> -					    && memcmp(&zIn[nIn - len],
> -						      azChar[i], len) == 0)
> -						break;
> -				}
> -				if (i >= nChar)
> +	}
> +	if ((flags & TRIM_TRAILING) != 0) {
> +		while (input_str_sz > 0) {
> +			int len = 0;
> +			for (i = 0; i < char_cnt; i++) {
> +				len = char_len[i];
> +				if (len <= input_str_sz
> +				    && memcmp(&input_str[input_str_sz - len],
> +					      ind_chars[i], len) == 0)
>  					break;
> -				nIn -= len;
>  			}
> -		}
> -		if (zCharSet) {
> -			sql_free(azChar);
> +			if (i >= char_cnt)
> +				break;
> +			input_str_sz -= len;
>  		}
>  	}
> -	sql_result_text(context, (char *)zIn, nIn, SQL_TRANSIENT);
> +
> +	if (trim_set_sz != 0)
> +		sql_free(ind_chars);
> +
> +	result: sql_result_text(context, (char *)input_str, input_str_sz,
> +				SQL_TRANSIENT);

7. We never declare labels like that. Please, look at other places
where and how we declare labels. Also it is stated that we use
Linux Kernel code style, as I remember, and it means, that you
can lookup doubtful places on sites like this:
https://www.kernel.org/doc/html/v4.10/process/coding-style.html#centralized-exiting-of-functions


8. Now the function looks much more clean, and it is easy to see that
its performance for the most common case is strongly down. Before
your patch when there was no 'trim character set' argument, whitespaces
where used by default, but they were not allocated on new memory via
contextMalloc() - they were declared on stack as an optimization. Please,
refactor that function so as it does not allocate anything and takes
already somehow obtained trim char set and trim char lengths.

These arguments should only be allocated by trim_func_two_args() and
trim_func_three_args(), and only on demand, when not-default trim set
is passed. And obviously, you should not do it twice in both
places. Create a special function for that which takes
'const unsigned char *trim_set', and returns the prepared arguments.

In addition, I doubt if we really need 'char **' of individual characters.
In fact, it is

  1) bad for performance because of double pointer dereference;

  2) its elements ('char *' is element of 'char **') are bigger
     than the biggest possible character - 8 bytes;

  2) it is not needed at all - having 'char *trim_set' and
     'uint8_t *char_lens' is enough to iterate char by char through
     'trim_set'.

I took the liberty of implementing this and got this diff:

==========================================================================
diff --git a/src/box/sql/func.c b/src/box/sql/func.c
index 6f2a5e3f6..5a4417c73 100644
--- a/src/box/sql/func.c
+++ b/src/box/sql/func.c
@@ -1302,46 +1302,35 @@ trim_procedure(struct sql_context *context, enum trim_side_mask flags,
 	       const unsigned char *trim_set, int trim_set_sz,
 	       const unsigned char *input_str, int input_str_sz)
 {
-	const unsigned char *z = trim_set;
 	/*
 	 * Count the number of UTF-8 characters passing through
 	 * the entire char set, but not up to the '\0' or X'00'
 	 * character. This allows to handle trimming set
 	 * containing such characters.
 	 */
-	int char_cnt = sql_utf8_char_count(z, trim_set_sz);
+	int char_cnt = sql_utf8_char_count(trim_set, trim_set_sz);
 	if (char_cnt == 0)
 		goto result;
-	/* Individual characters in the character set. */
-	unsigned char **ind_chars =
-		contextMalloc(context,
-			      char_cnt * (sizeof(unsigned char *) + 1));
-	if (ind_chars == NULL)
-		return;
 	/* Length of each character in the character set. */
-	uint8_t *char_len = (uint8_t *)&ind_chars[char_cnt];
-	z = trim_set;
-	int i = 0;
-	char_cnt = 0;
-	int handled_bytes_cnt = trim_set_sz;
-	while(handled_bytes_cnt > 0) {
-		ind_chars[char_cnt] = (unsigned char *)(z + i);
+	uint8_t *char_len = (uint8_t *)contextMalloc(context, char_cnt);
+	if (char_len == NULL)
+		return;
+	const unsigned char *z = trim_set;
+	int i = 0, j = 0;
+	while(j < char_cnt) {
+		int old_i = i;
 		SQL_UTF8_FWD_1(z, i, trim_set_sz);
-		char_len[char_cnt] = z + i - ind_chars[char_cnt];
-		handled_bytes_cnt -= char_len[char_cnt];
-		char_cnt++;
+		char_len[j++] = i - old_i;
 	}
 
-	if (char_cnt == 0)
-		goto result;
 	if ((flags & TRIM_LEADING) != 0) {
 		while (input_str_sz > 0) {
 			int len = 0;
-			for (i = 0; i < char_cnt; i++) {
+			z = trim_set;
+			for (i = 0; i < char_cnt; ++i, z += len) {
 				len = char_len[i];
 				if (len <= input_str_sz
-				    && memcmp(input_str,
-					      ind_chars[i], len) == 0)
+				    && memcmp(input_str, z, len) == 0)
 					break;
 			}
 			if (i >= char_cnt)
@@ -1353,11 +1342,12 @@ trim_procedure(struct sql_context *context, enum trim_side_mask flags,
 	if ((flags & TRIM_TRAILING) != 0) {
 		while (input_str_sz > 0) {
 			int len = 0;
-			for (i = 0; i < char_cnt; i++) {
+			z = trim_set;
+			for (i = 0; i < char_cnt; ++i, z += len) {
 				len = char_len[i];
 				if (len <= input_str_sz
 				    && memcmp(&input_str[input_str_sz - len],
-					      ind_chars[i], len) == 0)
+					      z, len) == 0)
 					break;
 			}
 			if (i >= char_cnt)
@@ -1365,9 +1355,7 @@ trim_procedure(struct sql_context *context, enum trim_side_mask flags,
 			input_str_sz -= len;
 		}
 	}
-
-	if (trim_set_sz != 0)
-		sql_free(ind_chars);
+	sql_free(char_len);
 
 	result: sql_result_text(context, (char *)input_str, input_str_sz,
 				SQL_TRANSIENT);
==========================================================================

It passes the tests, what proves unnecessity of 'ind_chars' array.