From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 597BF23E9E for ; Fri, 4 May 2018 18:33:14 -0400 (EDT) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id iNirwTHGbrvK for ; Fri, 4 May 2018 18:33:14 -0400 (EDT) Received: from smtp58.i.mail.ru (smtp58.i.mail.ru [217.69.128.38]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 7B1B423E9B for ; Fri, 4 May 2018 18:33:13 -0400 (EDT) Date: Sat, 5 May 2018 01:33:23 +0300 From: Alexander Turenko Subject: [tarantool-patches] Re: [PATCH v2 5/5] lua: introduce utf8 built-in globaly visible module Message-ID: <20180504223322.y7i5ymo46hcd7usu@tkn_work_nb> References: <562a24d9c42df6701e85d50b06a47d57e6d884bf.1524955403.git.v.shpilevoy@tarantool.org> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <562a24d9c42df6701e85d50b06a47d57e6d884bf.1524955403.git.v.shpilevoy@tarantool.org> Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: Vladislav Shpilevoy Cc: tarantool-patches@freelists.org Vlad, Are you try to run tests from utf8.lua from [1]? [1]: https://www.lua.org/tests/lua-5.3.4-tests.tar.gz Debug build: ``` /home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c: In function ‘utf8_next’: /home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c:214:2: error: passing argument 2 of ‘utf8_nextCharSafeBody’ from incompatible pointer type [-Werror=incompatible-pointer-types] U8_NEXT(str, pos, len, c); ^~~~~~~ In file included from /usr/include/unicode/utf.h:217:0, from /usr/include/unicode/utypes.h:44, from /usr/include/unicode/ucasemap.h:24, from /home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c:31: /usr/include/unicode/utf8.h:120:1: note: expected ‘int32_t * {aka int *}’ but argument is of type ‘size_t * {aka long unsigned int *}’ utf8_nextCharSafeBody(const uint8_t *s, int32_t *pi, int32_t length, UChar32 c, UBool strict); ^~~~~~~~~~~~~~~~~~~~~ ``` The following cases works differently in Lua 5.3 and in tarantool's implementation: print(utf8.len('12İ☢勺34', 3, 3)) print(utf8.len('\xF4\x9F\xBF\xBF')) Lua 5.3: 1 nil 1 Tarantool: nil 3 nil 4 See also comments below. Thanks! WBR, Alexander Turenko. On Sun, Apr 29, 2018 at 01:45:13AM +0300, Vladislav Shpilevoy wrote: > utf8 is a module partially compatible with Lua 5.3 utf8 and > lua-utf8 third party module. > Partially means, that not all functions are implemented. > > The patch introduces these ones: > upper, lower, len, char, sub. And next. > > All of them works exactly like in Lua 5.3, or if Lua 5.3 has no > a function, then works like in lua-utf8. > Only len and char are from lua 5.3, we can state it explicitly. > Tarantool utf8 has extensions: > > * isupper/lower/alpha/digit, that check some property by a symbol > or by its code; > > * cmp/casecmp, that compare two UTF8 strings. > > Closes #3290 > Closes #3385 > Closes #3081 > --- > src/CMakeLists.txt | 3 +- > src/lua/init.c | 3 + > src/lua/utf8.c | 451 +++++++++++++++++++++++++++++++++++++++++++ > src/lua/utf8.h | 42 ++++ > test/app-tap/string.test.lua | 160 ++++++++++++++- > test/box/ddl.result | 15 ++ > test/box/ddl.test.lua | 8 + > 7 files changed, 680 insertions(+), 2 deletions(-) > create mode 100644 src/lua/utf8.c > create mode 100644 src/lua/utf8.h > <...> > diff --git a/src/lua/utf8.c b/src/lua/utf8.c > new file mode 100644 > index 000000000..d1d3d72c2 > --- /dev/null > +++ b/src/lua/utf8.c > <...> > +/** > + * Calculate a 0-based positive byte offset in a string by any > + * 0-based offset (possibly negative). But you pass 1-based offsets from utf8_len function to that one. > + * @param offset Original 0-based offset with any sign. > + * @param len A string byte length. > + * @retval 0-based positive offset. > + */ > +static inline int > +utf8_convert_offset(int offset, size_t len) > +{ > + if (offset >= 0) > + return offset; > + else if ((size_t)-offset > len) > + return 0; > + return len + offset + 1; It seems that the function is about 1-based offsets (at some cases at least). > +} > + > +/** > + * Calculate length of a UTF8 string. Length here is symbol count. > + * Works like utf8.len in Lua 5.3. > + * @param String to get length. > + * @param Start byte offset. Must point to the start of symbol. On > + * invalid symbol an error is returned. Can be negative. Can be 1 <= |start| <= #str + 1, right? Is it worth to document? Such offset equilibristics is not very intuitive (at least for me). > + * @param End byte offset, can be negative. Can point to the > + * middle of symbol. We need to clarify that a symbol under the end offset is subject to include into the resulting count (inclusive range). I would also explicitly stated that -1 is the end byte. Worth to document allowed range (0 <= |end| <= #str, right?)? > + * @retval not nil Symbol count. > + * @retval nil, error Error. Byte position of the error is > + * returned in the second value. > + */ > +static int > +utf8_len(struct lua_State *L) > +{ > + if (lua_gettop(L) > 3 || !lua_isstring(L, 1)) > + return luaL_error(L, "Usage: utf8.len(, [i, [j]])"); > + size_t slen; > + const char *str = lua_tolstring(L, 1, &slen); > + int len = (int)slen; > + int start_pos = utf8_convert_offset(luaL_optinteger(L, 2, 1), len); > + int end_pos = utf8_convert_offset(luaL_optinteger(L, 3, -1), len); > + if (start_pos < 1 || --start_pos > len || end_pos > len) { > + lua_pushnil(L); > + lua_pushstring(L, "position is out of string"); > + return 2; > + } > + int result = 0; > + if (end_pos > start_pos) { > + UChar32 c; > + ++result; > + /* > + * In Lua 5.3 it is not ok to start from bad > + * position. But next symbols can be any. > + */ I would be not so sure. Consider: print(utf8.len('a\xF4')) Lua 5.3: nil 2 Tarantool: 2 > + U8_NEXT(str, start_pos, end_pos, c); > + if (c == U_SENTINEL) { > + lua_pushnil(L); > + lua_pushinteger(L, start_pos); > + return 2; > + } > + while (start_pos < end_pos) { > + U8_NEXT_UNSAFE(str, start_pos, c); It is controversial decision. Why you are sure Lua 5.3 works in the same manner? > + ++result; > + } > + } > + lua_pushinteger(L, result); > + return 1; > +} > + > +/** > + * Get next symbol code by @an offset. > + * @param String to get symbol code. > + * @param Byte offset from which get. > + * > + * @retval - No more symbols. Do you mean nil? > + * @retval not nil, not nil Byte offset and symbol code. > + */ > +static int > +utf8_next(struct lua_State *L) > <...> > +UCHAR32_CHECKER(islower) > +UCHAR32_CHECKER(isupper) > +UCHAR32_CHECKER(isdigit) > +UCHAR32_CHECKER(isalpha) It would be good to have doxygen-style comment in the source code likewise you posted in #3081.