From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tarantool-patches-bounce@freelists.org>
Received: from localhost (localhost [127.0.0.1])
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 3BF2E26C3D
	for <tarantool-patches@freelists.org>; Tue, 26 Feb 2019 08:33:06 -0500 (EST)
Received: from turing.freelists.org ([127.0.0.1])
	by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id SN2jlbpIvyeJ for <tarantool-patches@freelists.org>;
	Tue, 26 Feb 2019 08:33:06 -0500 (EST)
Received: from smtp34.i.mail.ru (smtp34.i.mail.ru [94.100.177.94])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id C837626A2D
	for <tarantool-patches@freelists.org>; Tue, 26 Feb 2019 08:33:05 -0500 (EST)
Content-Type: text/plain;
	charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 12.0 \(3445.100.39\))
Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
From: "i.koptelov" <ivan.koptelov@tarantool.org>
In-Reply-To: <D8CAFB0C-E9A3-40B6-95AC-6751E9892D87@tarantool.org>
Date: Tue, 26 Feb 2019 16:33:01 +0300
Content-Transfer-Encoding: quoted-printable
Message-Id: <C345EB6B-C67C-4CCE-AD76-210C3A18EE8D@tarantool.org>
References: <15e143f4-3ea7-c7d6-d8ac-8a0e20b76449@tarantool.org>
 <1560FF96-FECD-4368-8AF8-F8F2AE7696E3@tarantool.org>
 <fd70561d-6a0e-70dc-4c20-bdcac764040a@tarantool.org>
 <07DBA796-6DD4-41DD-8438-104FE3AE05BB@tarantool.org>
 <4F4E0A7E-199C-4647-A49C-DD0E8A216527@tarantool.org>
 <8EF5CE57-C6B5-493C-94CC-AA3C88639485@tarantool.org>
 <7E6CE8AA-512D-4472-9DBD-8159073386C5@tarantool.org>
 <FC2CA838-42C3-4494-BC6C-37C864D4BF79@tarantool.org>
 <25649276-74CD-46E7-A1EB-F4CE299E637C@tarantool.org>
 <427EE913-3E58-413F-A645-DBF83C809334@tarantool.org>
 <DD522CAF-BD70-4E66-B8A4-C1837370B81D@tarantool.org>
 <583EC402-D1FF-45C4-B18B-8A06D4362200@tarantool.org>
 <3377BC01-D943-4D24-A9A2-BA9B9C67EA92@tarantool.org>
 <DD611E97-01AF-4E10-A81B-E27AA6B8A681@tarantool.org>
 <D8CAFB0C-E9A3-40B6-95AC-6751E9892D87@tarantool.org>
Sender: tarantool-patches-bounce@freelists.org
Errors-to: tarantool-patches-bounce@freelists.org
Reply-To: tarantool-patches@freelists.org
List-help: <mailto:ecartis@freelists.org?Subject=help>
List-unsubscribe: <tarantool-patches-request@freelists.org?Subject=unsubscribe>
List-software: Ecartis version 1.0.0
List-Id: tarantool-patches <tarantool-patches.freelists.org>
List-subscribe: <tarantool-patches-request@freelists.org?Subject=subscribe>
List-owner: <mailto:>
List-post: <mailto:tarantool-patches@freelists.org>
List-archive: <http://www.freelists.org/archives/tarantool-patches>
To: tarantool-patches@freelists.org
Cc: "n.pettik" <korablev@tarantool.org>


> On 25 Feb 2019, at 18:10, n.pettik <korablev@tarantool.org> wrote:
>=20
> That=E2=80=99s I was talking about. But using the macros with the same
> name as in utf library doesn=E2=80=99t look like a good pattern. Yep, =
you
> can use define guards like:
>=20
> #ifdef  U8_COUNT_TRAIL_BYTES_UNSAFE
> #undef U8_COUNT_TRAIL_BYTES_UNSAFE
> #endif
> #define U8_COUNT_TRAIL_BYTES_UNSAFE
>=20
> But I=E2=80=99d rather just give it another name.
> Hence, taking into account comment below,
> we are going to substitute SQL_SKIP_UTF8() with
> implementation borrowed from icu library.
I changed the names to SQL_UTF8_FWD_1_UNSAFE and
SQL_UTF8_COUNT_TRAIL_BYTES_UNSAFE.
>=20
>>>>> Furthermore, description says that it =E2=80=9Cassumes well-formed =
UTF-8=E2=80=9D,
>>>>> which in our case is not true. So who knows what may happen if we =
pass
>>>>> malformed byte sequence. I am not even saying that behaviour of
>>>>> this function on invalid inputs may change later.
>>>>=20
>>>> In it's current implementation U8_FWD_1_UNSAFE satisfy our needs =
safely. Returned
>>>> symbol length would never exceed byte_len.
>>>>=20
>>>> static int
>>>> utf8_char_count(const unsigned char *str, int byte_len)
>>>> {
>>>> 	int symbol_count =3D 0;
>>>> 	for (int i =3D 0; i < byte_len;) {
>>>> 		U8_FWD_1_UNSAFE(str, i);
>>>> 		symbol_count++;
>>>> 	}
>>>> 	return symbol_count;
>>>> }
>>>>=20
>>>> I agree that it is a bad idea to relay on lib behaviour which may
>>>> change lately. So maybe I would just inline these one line macros?
>>>> Or use my own implementation, since it=E2=80=99s more efficient =
(but less beautiful)
>>>=20
>>> Nevermind, let's keep it as is.
>>> I really worry only about the fact that in other places =
SQL_SKIP_UTF8
>>> is used instead. It handles only two-bytes utf8 symbols, meanwhile
>>> U8_FWD_1_UNSAFE() accounts three and four bytes length symbols.
>>> Can we use everywhere the same pattern?
>> Yes, I think, we can.
>=20
> Ok, then will be waiting for updates.
So here are the updates:

sqlUtf8CharLen is removed with utf8_char_count (which is
just moved to utf.c). sqlUtf8CharLen used SQL_SKIP_UTF8
and itself was used only in one place.

diff --git a/src/box/sql/utf.c b/src/box/sql/utf.c
index 22e0de809..5a1862a52 100644
--- a/src/box/sql/utf.c
+++ b/src/box/sql/utf.c
@@ -157,30 +157,31 @@ sqlUtf8Read(const unsigned char **pz	/* =
Pointer to string from which to read cha
  */
 /* #define TRANSLATE_TRACE 1 */
=20
-/*
- * pZ is a UTF-8 encoded unicode string. If nByte is less than zero,
- * return the number of unicode characters in pZ up to (but not =
including)
- * the first 0x00 byte. If nByte is not less than zero, return the
- * number of unicode characters in the first nByte of pZ (or up to
- * the first 0x00, whichever comes first).
+/**
+ * Return number of symbols in the given string.
+ *
+ * Number of symbols !=3D byte size of string because some symbols
+ * are encoded with more than one byte. Also note that all
+ * symbols from 'str' to 'str + byte_len' would be counted,
+ * even if there is a '\0' somewhere between them.
+ *
+ * This function is implemented to be fast and indifferent to
+ * correctness of string being processed. If input string has
+ * even one invalid utf-8 sequence, then the resulting length
+ * could be arbitary in these boundaries (0 < len < byte_len).
+ * @param str String to be counted.
+ * @param byte_len Byte length of given string.
+ * @return number of symbols in the given string.
  */
 int
-sqlUtf8CharLen(const char *zIn, int nByte)
+utf8_char_count(const unsigned char *str, int byte_len)
 {
-	int r =3D 0;
-	const u8 *z =3D (const u8 *)zIn;
-	const u8 *zTerm;
-	if (nByte >=3D 0) {
-		zTerm =3D &z[nByte];
-	} else {
-		zTerm =3D (const u8 *)(-1);
-	}
-	assert(z <=3D zTerm);
-	while (*z !=3D 0 && z < zTerm) {
-		SQL_SKIP_UTF8(z);
-		r++;
+	int symbol_count =3D 0;
+	for (int i =3D 0; i < byte_len;) {
+		SQL_UTF8_FWD_1_UNSAFE(str, i, byte_len);
+		symbol_count++;
 	}
-	return r;
+	return symbol_count;
 }


I inlined SQL_UTF8_FWD_1_UNSAFE and
SQL_UTF8_COUNT_TRAIL_BYTES_UNSAFE from icu
library, because different versions of
the library provide slightly different
behavior of UTF8_FWD_1_UNSAFE. It caused
sql-tap/badutf test to fail on some platforms,
while on the others it was working.

diff --git a/src/box/sql/sqlInt.h b/src/box/sql/sqlInt.h
index 2830ab639..859450e88 100644
--- a/src/box/sql/sqlInt.h
+++ b/src/box/sql/sqlInt.h
@@ -3071,16 +3071,6 @@ struct TreeView {
 };
 #endif				/* SQL_DEBUG */
=20
-/*
- * Assuming zIn points to the first byte of a UTF-8 character,
- * advance zIn to point to the first byte of the next UTF-8 character.
- */
-#define SQL_SKIP_UTF8(zIn) {                        \
-  if( (*(zIn++))>=3D0xc0 ){                              \
-    while( (*zIn & 0xc0)=3D=3D0x80 ){ zIn++; }             \
-  }                                                    \
-}
-
 /*
  * The sql_*_BKPT macros are substitutes for the error codes with
  * the same name but without the _BKPT suffix.  These macros invoke
@@ -4142,11 +4132,51 @@ void
 sql_drop_foreign_key(struct Parse *parse_context, struct SrcList =
*table,
 		     struct Token *constraint);
=20
+/**
+ * Counts the trail bytes for a UTF-8 lead byte of a valid UTF-8
+ * sequence. Returns 0 for 0..0xc1. Undefined for 0xf5..0xff.
+ * leadByte might be evaluated multiple times.
+ *
+ * @param leadByte The first byte of a UTF-8 sequence.
+ *                 Must be 0..0xff.
+ */
+#define SQL_UTF8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \
+    (((uint8_t)(leadByte) >=3D 0xc2) + ((uint8_t)(leadByte) >=3D 0xe0) =
+ \
+    ((uint8_t)(leadByte) >=3D 0xf0))
+
+/**
+ * Advance the string offset from one code point boundary to the
+ * next. (Post-incrementing iteration.)
+ * "Unsafe" macro, assumes well-formed UTF-8.
+ *
+ * After the whole string is traversed, (str + i) points to the
+ * position right after the last element of the string (*).
+ *
+ * If resulting offset > byte_size then resulting offset is set
+ * to byte_size. This is to provide (*) in cases where it might
+ * be violated.
+ *
+ * SQL_UTF8_FWD_1_UNSAFE sometimes is used get the size of utf-8
+ * character subsequence and we don't want to get summary size
+ * which exceeds total string size (in bytes). Consider example:
+ *
+ * '0xE0' - this is invalid utf-8 string because it consists only
+ * of first byte of 3 byte sequence. After traverse, the
+ * offset =3D=3D 2 and we set it to 1, to keep (*).
+ *
+ * @param s const uint8_t * string.
+ * @param i string offset.
+ * @param byte_size byte size of the string.
+ */
+#define SQL_UTF8_FWD_1_UNSAFE(str, i, byte_size) \
+    (i) +=3D 1 + SQL_UTF8_COUNT_TRAIL_BYTES_UNSAFE((str)[i]); \
+    (i) =3D (i) <=3D (byte_size) ? (i) : (byte_size);
+
 void sqlDetach(Parse *, Expr *);
 int sqlAtoF(const char *z, double *, int);
 int sqlGetInt32(const char *, int *);
 int sqlAtoi(const char *);
-int sqlUtf8CharLen(const char *pData, int nByte);
+int utf8_char_count(const unsigned char *str, int byte_len);
 u32 sqlUtf8Read(const u8 **);
 LogEst sqlLogEst(u64);
 LogEst sqlLogEstAdd(LogEst, LogEst);


Just removed SQL_SKIP_UTF8 with SQL_UTF8_FWD_1_UNSAFE.

diff --git a/src/box/sql/func.c b/src/box/sql/func.c
index 8ddb9780f..5352ca6b8 100644
--- a/src/box/sql/func.c
+++ b/src/box/sql/func.c
@@ -128,33 +128,6 @@ typeofFunc(sql_context * context, int NotUsed, =
sql_value ** argv)
 	sql_result_text(context, z, -1, SQL_STATIC);
 }
=20
-/**
- * Return number of symbols in the given string.
- *
- * Number of symbols !=3D byte size of string because some symbols
- * are encoded with more than one byte. Also note that all
- * symbols from 'str' to 'str + byte_len' would be counted,
- * even if there is a '\0' somewhere between them.
- *
- * This function is implemented to be fast and indifferent to
- * correctness of string being processed. If input string has
- * even one invalid utf-8 sequence, then the resulting length
- * could be arbitary in these boundaries (0 < len < byte_len).
- * @param str String to be counted.
- * @param byte_len Byte length of given string.
- * @return number of symbols in the given string.
- */
-static int
-utf8_char_count(const unsigned char *str, int byte_len)
-{
-	int symbol_count =3D 0;
-	for (int i =3D 0; i < byte_len;) {
-		U8_FWD_1_UNSAFE(str, i);
-		symbol_count++;
-	}
-	return symbol_count;
-}
-
 /*
  * Implementation of the length() function
  */
@@ -413,17 +386,22 @@ substrFunc(sql_context * context, int argc, =
sql_value ** argv)
 		 * used because '\0' is not supposed to be
 		 * end-of-string symbol.
 		 */
-		int n_chars =3D utf8_char_count(z, =
sql_value_bytes(argv[0]));
+		int byte_size =3D sql_value_bytes(argv[0]);
+		int n_chars =3D utf8_char_count(z, byte_size);
 		int cnt =3D 0;
+		int i =3D 0;
 		while (cnt < n_chars && p1) {
-			SQL_SKIP_UTF8(z);
+			SQL_UTF8_FWD_1_UNSAFE(z, i, byte_size);
 			cnt++;
 			p1--;
 		}
+		z +=3D i;
+		i =3D 0;
 		for (z2 =3D z; cnt < n_chars && p2; p2--) {
-			SQL_SKIP_UTF8(z2);
+			SQL_UTF8_FWD_1_UNSAFE(z2, i, byte_size);
 			cnt++;
 		}
+		z2 +=3D i;
 		sql_result_text64(context, (char *)z, z2 - z,
 				      SQL_TRANSIENT);
 	} else {
@@ -890,7 +868,7 @@ likeFunc(sql_context *context, int argc, sql_value =
**argv)
 			return;
 		const char *const err_msg =3D
 			"ESCAPE expression must be a single character";
-		if (sqlUtf8CharLen((char *)zEsc, -1) !=3D 1) {
+		if (utf8_char_count(zEsc, sql_value_bytes(argv[2])) !=3D =
1) {
 			sql_result_error(context, err_msg, -1);
 			return;
 		}
@@ -1268,8 +1246,6 @@ trimFunc(sql_context * context, int argc, =
sql_value ** argv)
 	} else {
 		const unsigned char *z =3D zCharSet;
 		int trim_set_sz =3D sql_value_bytes(argv[1]);
-		int handled_bytes_cnt =3D trim_set_sz;
-		nChar =3D 0;
 		/*
 		* Count the number of UTF-8 characters passing
 		* through the entire char set, but not up
@@ -1277,12 +1253,7 @@ trimFunc(sql_context * context, int argc, =
sql_value ** argv)
 		* to handle trimming set containing such
 		* characters.
 		*/
-		while(handled_bytes_cnt > 0) {
-		const unsigned char *prev_byte =3D z;
-			SQL_SKIP_UTF8(z);
-			handled_bytes_cnt -=3D (z - prev_byte);
-			nChar++;
-		}
+		nChar =3D utf8_char_count(z, trim_set_sz);
 		if (nChar > 0) {
 			azChar =3D
 			    contextMalloc(context,
@@ -1292,12 +1263,13 @@ trimFunc(sql_context * context, int argc, =
sql_value ** argv)
 			}
 			aLen =3D (unsigned char *)&azChar[nChar];
 			z =3D zCharSet;
+			i =3D 0;
 			nChar =3D 0;
-			handled_bytes_cnt =3D trim_set_sz;
+			int handled_bytes_cnt =3D trim_set_sz;
 			while(handled_bytes_cnt > 0) {
-				azChar[nChar] =3D (unsigned char *)z;
-				SQL_SKIP_UTF8(z);
-				aLen[nChar] =3D (u8) (z - =
azChar[nChar]);
+				azChar[nChar] =3D (unsigned char *)(z + =
i);
+				SQL_UTF8_FWD_1_UNSAFE(z, i, =
trim_set_sz);
+				aLen[nChar] =3D (u8) (z + i - =
azChar[nChar]);
 				handled_bytes_cnt -=3D aLen[nChar];
 				nChar++;
 			}=