From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 3BF2E26C3D for ; Tue, 26 Feb 2019 08:33:06 -0500 (EST) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id SN2jlbpIvyeJ for ; Tue, 26 Feb 2019 08:33:06 -0500 (EST) Received: from smtp34.i.mail.ru (smtp34.i.mail.ru [94.100.177.94]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id C837626A2D for ; Tue, 26 Feb 2019 08:33:05 -0500 (EST) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 12.0 \(3445.100.39\)) Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0' From: "i.koptelov" In-Reply-To: Date: Tue, 26 Feb 2019 16:33:01 +0300 Content-Transfer-Encoding: quoted-printable Message-Id: References: <15e143f4-3ea7-c7d6-d8ac-8a0e20b76449@tarantool.org> <1560FF96-FECD-4368-8AF8-F8F2AE7696E3@tarantool.org> <07DBA796-6DD4-41DD-8438-104FE3AE05BB@tarantool.org> <4F4E0A7E-199C-4647-A49C-DD0E8A216527@tarantool.org> <8EF5CE57-C6B5-493C-94CC-AA3C88639485@tarantool.org> <7E6CE8AA-512D-4472-9DBD-8159073386C5@tarantool.org> <25649276-74CD-46E7-A1EB-F4CE299E637C@tarantool.org> <427EE913-3E58-413F-A645-DBF83C809334@tarantool.org> <583EC402-D1FF-45C4-B18B-8A06D4362200@tarantool.org> <3377BC01-D943-4D24-A9A2-BA9B9C67EA92@tarantool.org> Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: tarantool-patches@freelists.org Cc: "n.pettik" > On 25 Feb 2019, at 18:10, n.pettik wrote: >=20 > That=E2=80=99s I was talking about. But using the macros with the same > name as in utf library doesn=E2=80=99t look like a good pattern. Yep, = you > can use define guards like: >=20 > #ifdef U8_COUNT_TRAIL_BYTES_UNSAFE > #undef U8_COUNT_TRAIL_BYTES_UNSAFE > #endif > #define U8_COUNT_TRAIL_BYTES_UNSAFE >=20 > But I=E2=80=99d rather just give it another name. > Hence, taking into account comment below, > we are going to substitute SQL_SKIP_UTF8() with > implementation borrowed from icu library. I changed the names to SQL_UTF8_FWD_1_UNSAFE and SQL_UTF8_COUNT_TRAIL_BYTES_UNSAFE. >=20 >>>>> Furthermore, description says that it =E2=80=9Cassumes well-formed = UTF-8=E2=80=9D, >>>>> which in our case is not true. So who knows what may happen if we = pass >>>>> malformed byte sequence. I am not even saying that behaviour of >>>>> this function on invalid inputs may change later. >>>>=20 >>>> In it's current implementation U8_FWD_1_UNSAFE satisfy our needs = safely. Returned >>>> symbol length would never exceed byte_len. >>>>=20 >>>> static int >>>> utf8_char_count(const unsigned char *str, int byte_len) >>>> { >>>> int symbol_count =3D 0; >>>> for (int i =3D 0; i < byte_len;) { >>>> U8_FWD_1_UNSAFE(str, i); >>>> symbol_count++; >>>> } >>>> return symbol_count; >>>> } >>>>=20 >>>> I agree that it is a bad idea to relay on lib behaviour which may >>>> change lately. So maybe I would just inline these one line macros? >>>> Or use my own implementation, since it=E2=80=99s more efficient = (but less beautiful) >>>=20 >>> Nevermind, let's keep it as is. >>> I really worry only about the fact that in other places = SQL_SKIP_UTF8 >>> is used instead. It handles only two-bytes utf8 symbols, meanwhile >>> U8_FWD_1_UNSAFE() accounts three and four bytes length symbols. >>> Can we use everywhere the same pattern? >> Yes, I think, we can. >=20 > Ok, then will be waiting for updates. So here are the updates: sqlUtf8CharLen is removed with utf8_char_count (which is just moved to utf.c). sqlUtf8CharLen used SQL_SKIP_UTF8 and itself was used only in one place. diff --git a/src/box/sql/utf.c b/src/box/sql/utf.c index 22e0de809..5a1862a52 100644 --- a/src/box/sql/utf.c +++ b/src/box/sql/utf.c @@ -157,30 +157,31 @@ sqlUtf8Read(const unsigned char **pz /* = Pointer to string from which to read cha */ /* #define TRANSLATE_TRACE 1 */ =20 -/* - * pZ is a UTF-8 encoded unicode string. If nByte is less than zero, - * return the number of unicode characters in pZ up to (but not = including) - * the first 0x00 byte. If nByte is not less than zero, return the - * number of unicode characters in the first nByte of pZ (or up to - * the first 0x00, whichever comes first). +/** + * Return number of symbols in the given string. + * + * Number of symbols !=3D byte size of string because some symbols + * are encoded with more than one byte. Also note that all + * symbols from 'str' to 'str + byte_len' would be counted, + * even if there is a '\0' somewhere between them. + * + * This function is implemented to be fast and indifferent to + * correctness of string being processed. If input string has + * even one invalid utf-8 sequence, then the resulting length + * could be arbitary in these boundaries (0 < len < byte_len). + * @param str String to be counted. + * @param byte_len Byte length of given string. + * @return number of symbols in the given string. */ int -sqlUtf8CharLen(const char *zIn, int nByte) +utf8_char_count(const unsigned char *str, int byte_len) { - int r =3D 0; - const u8 *z =3D (const u8 *)zIn; - const u8 *zTerm; - if (nByte >=3D 0) { - zTerm =3D &z[nByte]; - } else { - zTerm =3D (const u8 *)(-1); - } - assert(z <=3D zTerm); - while (*z !=3D 0 && z < zTerm) { - SQL_SKIP_UTF8(z); - r++; + int symbol_count =3D 0; + for (int i =3D 0; i < byte_len;) { + SQL_UTF8_FWD_1_UNSAFE(str, i, byte_len); + symbol_count++; } - return r; + return symbol_count; } I inlined SQL_UTF8_FWD_1_UNSAFE and SQL_UTF8_COUNT_TRAIL_BYTES_UNSAFE from icu library, because different versions of the library provide slightly different behavior of UTF8_FWD_1_UNSAFE. It caused sql-tap/badutf test to fail on some platforms, while on the others it was working. diff --git a/src/box/sql/sqlInt.h b/src/box/sql/sqlInt.h index 2830ab639..859450e88 100644 --- a/src/box/sql/sqlInt.h +++ b/src/box/sql/sqlInt.h @@ -3071,16 +3071,6 @@ struct TreeView { }; #endif /* SQL_DEBUG */ =20 -/* - * Assuming zIn points to the first byte of a UTF-8 character, - * advance zIn to point to the first byte of the next UTF-8 character. - */ -#define SQL_SKIP_UTF8(zIn) { \ - if( (*(zIn++))>=3D0xc0 ){ \ - while( (*zIn & 0xc0)=3D=3D0x80 ){ zIn++; } \ - } \ -} - /* * The sql_*_BKPT macros are substitutes for the error codes with * the same name but without the _BKPT suffix. These macros invoke @@ -4142,11 +4132,51 @@ void sql_drop_foreign_key(struct Parse *parse_context, struct SrcList = *table, struct Token *constraint); =20 +/** + * Counts the trail bytes for a UTF-8 lead byte of a valid UTF-8 + * sequence. Returns 0 for 0..0xc1. Undefined for 0xf5..0xff. + * leadByte might be evaluated multiple times. + * + * @param leadByte The first byte of a UTF-8 sequence. + * Must be 0..0xff. + */ +#define SQL_UTF8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \ + (((uint8_t)(leadByte) >=3D 0xc2) + ((uint8_t)(leadByte) >=3D 0xe0) = + \ + ((uint8_t)(leadByte) >=3D 0xf0)) + +/** + * Advance the string offset from one code point boundary to the + * next. (Post-incrementing iteration.) + * "Unsafe" macro, assumes well-formed UTF-8. + * + * After the whole string is traversed, (str + i) points to the + * position right after the last element of the string (*). + * + * If resulting offset > byte_size then resulting offset is set + * to byte_size. This is to provide (*) in cases where it might + * be violated. + * + * SQL_UTF8_FWD_1_UNSAFE sometimes is used get the size of utf-8 + * character subsequence and we don't want to get summary size + * which exceeds total string size (in bytes). Consider example: + * + * '0xE0' - this is invalid utf-8 string because it consists only + * of first byte of 3 byte sequence. After traverse, the + * offset =3D=3D 2 and we set it to 1, to keep (*). + * + * @param s const uint8_t * string. + * @param i string offset. + * @param byte_size byte size of the string. + */ +#define SQL_UTF8_FWD_1_UNSAFE(str, i, byte_size) \ + (i) +=3D 1 + SQL_UTF8_COUNT_TRAIL_BYTES_UNSAFE((str)[i]); \ + (i) =3D (i) <=3D (byte_size) ? (i) : (byte_size); + void sqlDetach(Parse *, Expr *); int sqlAtoF(const char *z, double *, int); int sqlGetInt32(const char *, int *); int sqlAtoi(const char *); -int sqlUtf8CharLen(const char *pData, int nByte); +int utf8_char_count(const unsigned char *str, int byte_len); u32 sqlUtf8Read(const u8 **); LogEst sqlLogEst(u64); LogEst sqlLogEstAdd(LogEst, LogEst); Just removed SQL_SKIP_UTF8 with SQL_UTF8_FWD_1_UNSAFE. diff --git a/src/box/sql/func.c b/src/box/sql/func.c index 8ddb9780f..5352ca6b8 100644 --- a/src/box/sql/func.c +++ b/src/box/sql/func.c @@ -128,33 +128,6 @@ typeofFunc(sql_context * context, int NotUsed, = sql_value ** argv) sql_result_text(context, z, -1, SQL_STATIC); } =20 -/** - * Return number of symbols in the given string. - * - * Number of symbols !=3D byte size of string because some symbols - * are encoded with more than one byte. Also note that all - * symbols from 'str' to 'str + byte_len' would be counted, - * even if there is a '\0' somewhere between them. - * - * This function is implemented to be fast and indifferent to - * correctness of string being processed. If input string has - * even one invalid utf-8 sequence, then the resulting length - * could be arbitary in these boundaries (0 < len < byte_len). - * @param str String to be counted. - * @param byte_len Byte length of given string. - * @return number of symbols in the given string. - */ -static int -utf8_char_count(const unsigned char *str, int byte_len) -{ - int symbol_count =3D 0; - for (int i =3D 0; i < byte_len;) { - U8_FWD_1_UNSAFE(str, i); - symbol_count++; - } - return symbol_count; -} - /* * Implementation of the length() function */ @@ -413,17 +386,22 @@ substrFunc(sql_context * context, int argc, = sql_value ** argv) * used because '\0' is not supposed to be * end-of-string symbol. */ - int n_chars =3D utf8_char_count(z, = sql_value_bytes(argv[0])); + int byte_size =3D sql_value_bytes(argv[0]); + int n_chars =3D utf8_char_count(z, byte_size); int cnt =3D 0; + int i =3D 0; while (cnt < n_chars && p1) { - SQL_SKIP_UTF8(z); + SQL_UTF8_FWD_1_UNSAFE(z, i, byte_size); cnt++; p1--; } + z +=3D i; + i =3D 0; for (z2 =3D z; cnt < n_chars && p2; p2--) { - SQL_SKIP_UTF8(z2); + SQL_UTF8_FWD_1_UNSAFE(z2, i, byte_size); cnt++; } + z2 +=3D i; sql_result_text64(context, (char *)z, z2 - z, SQL_TRANSIENT); } else { @@ -890,7 +868,7 @@ likeFunc(sql_context *context, int argc, sql_value = **argv) return; const char *const err_msg =3D "ESCAPE expression must be a single character"; - if (sqlUtf8CharLen((char *)zEsc, -1) !=3D 1) { + if (utf8_char_count(zEsc, sql_value_bytes(argv[2])) !=3D = 1) { sql_result_error(context, err_msg, -1); return; } @@ -1268,8 +1246,6 @@ trimFunc(sql_context * context, int argc, = sql_value ** argv) } else { const unsigned char *z =3D zCharSet; int trim_set_sz =3D sql_value_bytes(argv[1]); - int handled_bytes_cnt =3D trim_set_sz; - nChar =3D 0; /* * Count the number of UTF-8 characters passing * through the entire char set, but not up @@ -1277,12 +1253,7 @@ trimFunc(sql_context * context, int argc, = sql_value ** argv) * to handle trimming set containing such * characters. */ - while(handled_bytes_cnt > 0) { - const unsigned char *prev_byte =3D z; - SQL_SKIP_UTF8(z); - handled_bytes_cnt -=3D (z - prev_byte); - nChar++; - } + nChar =3D utf8_char_count(z, trim_set_sz); if (nChar > 0) { azChar =3D contextMalloc(context, @@ -1292,12 +1263,13 @@ trimFunc(sql_context * context, int argc, = sql_value ** argv) } aLen =3D (unsigned char *)&azChar[nChar]; z =3D zCharSet; + i =3D 0; nChar =3D 0; - handled_bytes_cnt =3D trim_set_sz; + int handled_bytes_cnt =3D trim_set_sz; while(handled_bytes_cnt > 0) { - azChar[nChar] =3D (unsigned char *)z; - SQL_SKIP_UTF8(z); - aLen[nChar] =3D (u8) (z - = azChar[nChar]); + azChar[nChar] =3D (unsigned char *)(z + = i); + SQL_UTF8_FWD_1_UNSAFE(z, i, = trim_set_sz); + aLen[nChar] =3D (u8) (z + i - = azChar[nChar]); handled_bytes_cnt -=3D aLen[nChar]; nChar++; }=