From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id A5EDF23EF9 for ; Fri, 4 May 2018 19:32:33 -0400 (EDT) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id v4kF9GTL2IjX for ; Fri, 4 May 2018 19:32:33 -0400 (EDT) Received: from smtp1.mail.ru (smtp1.mail.ru [94.100.179.111]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 37AC523EF6 for ; Fri, 4 May 2018 19:32:32 -0400 (EDT) Subject: [tarantool-patches] Re: [PATCH v2 5/5] lua: introduce utf8 built-in globaly visible module References: <562a24d9c42df6701e85d50b06a47d57e6d884bf.1524955403.git.v.shpilevoy@tarantool.org> <20180504223322.y7i5ymo46hcd7usu@tkn_work_nb> From: Vladislav Shpilevoy Message-ID: Date: Sat, 5 May 2018 02:32:27 +0300 MIME-Version: 1.0 In-Reply-To: <20180504223322.y7i5ymo46hcd7usu@tkn_work_nb> Content-Type: text/plain; charset="utf-8"; format="flowed" Content-Language: en-US Content-Transfer-Encoding: 8bit Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: tarantool-patches@freelists.org, Alexander Turenko Hello. Thanks for review. On 05/05/2018 01:33, Alexander Turenko wrote: > Vlad, > > Are you try to run tests from utf8.lua from [1]? > > [1]: https://www.lua.org/tests/lua-5.3.4-tests.tar.gz > > Debug build: > > ``` > /home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c: In function ‘utf8_next’: > /home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c:214:2: error: passing argument 2 of ‘utf8_nextCharSafeBody’ from incompatible pointer type [-Werror=incompatible-pointer-types] > U8_NEXT(str, pos, len, c); Looks like you did not fetch the newest version. Travis is ok: https://travis-ci.org/tarantool/tarantool/builds/374952902?utm_source=github_status&utm_medium=notification > ^~~~~~~ > In file included from /usr/include/unicode/utf.h:217:0, > from /usr/include/unicode/utypes.h:44, > from /usr/include/unicode/ucasemap.h:24, > from /home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c:31: > /usr/include/unicode/utf8.h:120:1: note: expected ‘int32_t * {aka int *}’ but argument is of type ‘size_t * {aka long unsigned int *}’ > utf8_nextCharSafeBody(const uint8_t *s, int32_t *pi, int32_t length, UChar32 c, UBool strict); > ^~~~~~~~~~~~~~~~~~~~~ > ``` > > The following cases works differently in Lua 5.3 and in tarantool's > implementation: > > print(utf8.len('12İ☢勺34', 3, 3)) > print(utf8.len('\xF4\x9F\xBF\xBF')) > > Lua 5.3: > > 1 > nil 1 > > Tarantool: > > nil 3 > nil 4 Second case works ok already: tarantool> utf8.len('\xF4\x9F\xBF\xBF') --- - null - 1 ... It matches Lua 5.3. First case fixed: diff --git a/src/lua/utf8.c b/src/lua/utf8.c index 1d0776ab1..02ecdede4 100644 --- a/src/lua/utf8.c +++ b/src/lua/utf8.c @@ -176,7 +176,7 @@ utf8_len(struct lua_State *L) * In Lua 5.3 it is not ok to start from bad * position. But next symbols can be any. */ - U8_NEXT(str, start_pos, end_pos, c); + U8_NEXT(str, start_pos, len, c); if (c == U_SENTINEL) { lua_pushnil(L); lua_pushinteger(L, start_pos); diff --git a/test/app-tap/string.test.lua b/test/app-tap/string.test.lua index dc28b87e1..fb96104ed 100755 --- a/test/app-tap/string.test.lua +++ b/test/app-tap/string.test.lua @@ -129,7 +129,7 @@ test:test("strip", function(test) end ) test:test("unicode", function(test) - test:plan(100) + test:plan(101) local str = 'хеЛлоу вОрЛд ё Ё я Я э Э ъ Ъ hElLo WorLd 1234 i I İ 勺#☢༺' local upper_res = 'ХЕЛЛОУ ВОРЛД Ё Ё Я Я Э Э Ъ Ъ HELLO WORLD 1234 I I İ 勺#☢༺' local lower_res = 'хеллоу ворлд ё ё я я э э ъ ъ hello world 1234 i i i̇ 勺#☢༺' @@ -191,6 +191,7 @@ test:test("unicode", function(test) test:is(utf8.len(s, 10, -10), 0, 'it is ok to swap end and start pos') test:is(utf8.len(''), 0, 'empty len') test:is(utf8.len(s, -6, -1), 3, 'pass both negative offsets') + test:is(utf8.len(s, 3, 3), 1, "end in the middle on the same symbol as start") local chars = {} local codes = {} > > See also comments below. > > Thanks! > > WBR, Alexander Turenko. > > On Sun, Apr 29, 2018 at 01:45:13AM +0300, Vladislav Shpilevoy wrote: >> utf8 is a module partially compatible with Lua 5.3 utf8 and >> lua-utf8 third party module. >> Partially means, that not all functions are implemented. >> >> The patch introduces these ones: >> upper, lower, len, char, sub. > > And next. > >> >> All of them works exactly like in Lua 5.3, or if Lua 5.3 has no >> a function, then works like in lua-utf8. >> > > Only len and char are from lua 5.3, we can state it explicitly. Fixed. lua: introduce utf8 built-in globaly visible module utf8 is a module partially compatible with Lua 5.3 utf8 and lua-utf8 third party module. Partially means, that not all functions are implemented. The patch introduces these ones: upper, lower, len, char, sub, next. Len and char works exactly like in Lua 5.3. Other functions work like in lua-utf8. Tarantool utf8 has extensions: * isupper/lower/alpha/digit, that check some property by a symbol or by its code; * cmp/casecmp, that compare two UTF8 strings. Closes #3290 Closes #3385 Closes #3081 >> diff --git a/src/lua/utf8.c b/src/lua/utf8.c >> new file mode 100644 >> index 000000000..d1d3d72c2 >> --- /dev/null >> +++ b/src/lua/utf8.c >> <...> >> +/** >> + * Calculate a 0-based positive byte offset in a string by any >> + * 0-based offset (possibly negative). > > But you pass 1-based offsets from utf8_len function to that one. diff --git a/src/lua/utf8.c b/src/lua/utf8.c index 0f4a84c98..d4f57a5f4 100644 --- a/src/lua/utf8.c +++ b/src/lua/utf8.c @@ -124,11 +124,11 @@ utf8_lower(struct lua_State *L) } /** - * Calculate a 0-based positive byte offset in a string by any - * 0-based offset (possibly negative). - * @param offset Original 0-based offset with any sign. + * Calculate a 1-based positive byte offset in a string by any + * 1-based offset (possibly negative). + * @param offset Original 1-based offset with any sign. * @param len A string byte length. - * @retval 0-based positive offset. + * @retval 1-based positive offset. */ static inline int utf8_convert_offset(int offset, size_t len) @@ -207,14 +207,16 @@ utf8_next(struct lua_State *L) size_t slen; const char *str = lua_tolstring(L, 1, &slen); int len = (int) slen; - int pos = utf8_convert_offset(luaL_optinteger(L, 2, 0), len); + int pos = utf8_convert_offset(luaL_optinteger(L, 2, 1), len); + if (pos > 0) + --pos; if (pos >= len) return 0; UChar32 c; U8_NEXT(str, pos, len, c); if (c == U_SENTINEL) return 0; - lua_pushinteger(L, pos); + lua_pushinteger(L, pos + 1); lua_pushinteger(L, c); return 2; } >> + >> +/** >> + * Calculate length of a UTF8 string. Length here is symbol count. >> + * Works like utf8.len in Lua 5.3. >> + * @param String to get length. >> + * @param Start byte offset. Must point to the start of symbol. On >> + * invalid symbol an error is returned. Can be negative. > > Can be 1 <= |start| <= #str + 1, right? Is it worth to document? Such > offset equilibristics is not very intuitive (at least for me). No, start can be any, as well as end. > >> + * @param End byte offset, can be negative. Can point to the >> + * middle of symbol. > > We need to clarify that a symbol under the end offset is subject to > include into the resulting count (inclusive range). > > I would also explicitly stated that -1 is the end byte. > > Worth to document allowed range (0 <= |end| <= #str, right?)? diff --git a/src/lua/utf8.c b/src/lua/utf8.c index d4f57a5f4..0507da134 100644 --- a/src/lua/utf8.c +++ b/src/lua/utf8.c @@ -142,12 +142,13 @@ utf8_convert_offset(int offset, size_t len) /** * Calculate length of a UTF8 string. Length here is symbol count. - * Works like utf8.len in Lua 5.3. + * Works like utf8.len in Lua 5.3. Can take negative offsets. A + * negative offset is an offset from the end of string. * @param String to get length. * @param Start byte offset. Must point to the start of symbol. On - * invalid symbol an error is returned. Can be negative. - * @param End byte offset, can be negative. Can point to the - * middle of symbol. + * invalid symbol an error is returned. + * @param End byte offset. Can point to the middle of symbol. + * Partial symbol is counted too. * @retval not nil Symbol count. * @retval nil, error Error. Byte position of the error is * returned in the second value. > >> + * @retval not nil Symbol count. >> + * @retval nil, error Error. Byte position of the error is >> + * returned in the second value. >> + */ >> +static int >> +utf8_len(struct lua_State *L) >> +{ >> + if (lua_gettop(L) > 3 || !lua_isstring(L, 1)) >> + return luaL_error(L, "Usage: utf8.len(, [i, [j]])"); >> + size_t slen; >> + const char *str = lua_tolstring(L, 1, &slen); >> + int len = (int)slen; >> + int start_pos = utf8_convert_offset(luaL_optinteger(L, 2, 1), len); >> + int end_pos = utf8_convert_offset(luaL_optinteger(L, 3, -1), len); >> + if (start_pos < 1 || --start_pos > len || end_pos > len) { >> + lua_pushnil(L); >> + lua_pushstring(L, "position is out of string"); >> + return 2; >> + } >> + int result = 0; >> + if (end_pos > start_pos) { >> + UChar32 c; >> + ++result; >> + /* >> + * In Lua 5.3 it is not ok to start from bad >> + * position. But next symbols can be any. >> + */ > > I would be not so sure. Consider: > > print(utf8.len('a\xF4')) > > Lua 5.3: > > nil 2 > > Tarantool: > > 2 diff --git a/src/lua/utf8.c b/src/lua/utf8.c index 1d0776ab1..c31171b2c 100644 --- a/src/lua/utf8.c +++ b/src/lua/utf8.c @@ -171,20 +171,14 @@ utf8_len(struct lua_State *L) int result = 0; if (end_pos > start_pos) { UChar32 c; - ++result; - /* - * In Lua 5.3 it is not ok to start from bad - * position. But next symbols can be any. - */ - U8_NEXT(str, start_pos, end_pos, c); - if (c == U_SENTINEL) { - lua_pushnil(L); - lua_pushinteger(L, start_pos); - return 2; - } while (start_pos < end_pos) { - U8_NEXT_UNSAFE(str, start_pos, c); ++result; + U8_NEXT(str, start_pos, len, c); + if (c == U_SENTINEL) { + lua_pushnil(L); + lua_pushinteger(L, start_pos); + return 2; + } } } lua_pushinteger(L, result); >> +/** >> + * Get next symbol code by @an offset. >> + * @param String to get symbol code. >> + * @param Byte offset from which get. >> + * >> + * @retval - No more symbols. > > Do you mean nil? No. It returns literally nothing. > >> + * @retval not nil, not nil Byte offset and symbol code. >> + */ >> +static int >> +utf8_next(struct lua_State *L) >> <...> >> +UCHAR32_CHECKER(islower) >> +UCHAR32_CHECKER(isupper) >> +UCHAR32_CHECKER(isdigit) >> +UCHAR32_CHECKER(isalpha) > > It would be good to have doxygen-style comment in the source code > likewise you posted in #3081. > -/** Macro to easy create lua wrappers for ICU symbol checkers. */ +/** + * Macro to easy create lua wrappers for ICU symbol checkers. + * @param One stmbol code or string. + * @retval True, if the symbol has a requested property. Else + * false. + */ #define UCHAR32_CHECKER(name) \