From: Alexander Turenko <alexander.turenko@tarantool.org> To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org> Cc: tarantool-patches@freelists.org Subject: [tarantool-patches] Re: [PATCH v2 5/5] lua: introduce utf8 built-in globaly visible module Date: Sat, 5 May 2018 01:33:23 +0300 [thread overview] Message-ID: <20180504223322.y7i5ymo46hcd7usu@tkn_work_nb> (raw) In-Reply-To: <562a24d9c42df6701e85d50b06a47d57e6d884bf.1524955403.git.v.shpilevoy@tarantool.org> Vlad, Are you try to run tests from utf8.lua from [1]? [1]: https://www.lua.org/tests/lua-5.3.4-tests.tar.gz Debug build: ``` /home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c: In function ‘utf8_next’: /home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c:214:2: error: passing argument 2 of ‘utf8_nextCharSafeBody’ from incompatible pointer type [-Werror=incompatible-pointer-types] U8_NEXT(str, pos, len, c); ^~~~~~~ In file included from /usr/include/unicode/utf.h:217:0, from /usr/include/unicode/utypes.h:44, from /usr/include/unicode/ucasemap.h:24, from /home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c:31: /usr/include/unicode/utf8.h:120:1: note: expected ‘int32_t * {aka int *}’ but argument is of type ‘size_t * {aka long unsigned int *}’ utf8_nextCharSafeBody(const uint8_t *s, int32_t *pi, int32_t length, UChar32 c, UBool strict); ^~~~~~~~~~~~~~~~~~~~~ ``` The following cases works differently in Lua 5.3 and in tarantool's implementation: print(utf8.len('12İ☢勺34', 3, 3)) print(utf8.len('\xF4\x9F\xBF\xBF')) Lua 5.3: 1 nil 1 Tarantool: nil 3 nil 4 See also comments below. Thanks! WBR, Alexander Turenko. On Sun, Apr 29, 2018 at 01:45:13AM +0300, Vladislav Shpilevoy wrote: > utf8 is a module partially compatible with Lua 5.3 utf8 and > lua-utf8 third party module. > Partially means, that not all functions are implemented. > > The patch introduces these ones: > upper, lower, len, char, sub. And next. > > All of them works exactly like in Lua 5.3, or if Lua 5.3 has no > a function, then works like in lua-utf8. > Only len and char are from lua 5.3, we can state it explicitly. > Tarantool utf8 has extensions: > > * isupper/lower/alpha/digit, that check some property by a symbol > or by its code; > > * cmp/casecmp, that compare two UTF8 strings. > > Closes #3290 > Closes #3385 > Closes #3081 > --- > src/CMakeLists.txt | 3 +- > src/lua/init.c | 3 + > src/lua/utf8.c | 451 +++++++++++++++++++++++++++++++++++++++++++ > src/lua/utf8.h | 42 ++++ > test/app-tap/string.test.lua | 160 ++++++++++++++- > test/box/ddl.result | 15 ++ > test/box/ddl.test.lua | 8 + > 7 files changed, 680 insertions(+), 2 deletions(-) > create mode 100644 src/lua/utf8.c > create mode 100644 src/lua/utf8.h > <...> > diff --git a/src/lua/utf8.c b/src/lua/utf8.c > new file mode 100644 > index 000000000..d1d3d72c2 > --- /dev/null > +++ b/src/lua/utf8.c > <...> > +/** > + * Calculate a 0-based positive byte offset in a string by any > + * 0-based offset (possibly negative). But you pass 1-based offsets from utf8_len function to that one. > + * @param offset Original 0-based offset with any sign. > + * @param len A string byte length. > + * @retval 0-based positive offset. > + */ > +static inline int > +utf8_convert_offset(int offset, size_t len) > +{ > + if (offset >= 0) > + return offset; > + else if ((size_t)-offset > len) > + return 0; > + return len + offset + 1; It seems that the function is about 1-based offsets (at some cases at least). > +} > + > +/** > + * Calculate length of a UTF8 string. Length here is symbol count. > + * Works like utf8.len in Lua 5.3. > + * @param String to get length. > + * @param Start byte offset. Must point to the start of symbol. On > + * invalid symbol an error is returned. Can be negative. Can be 1 <= |start| <= #str + 1, right? Is it worth to document? Such offset equilibristics is not very intuitive (at least for me). > + * @param End byte offset, can be negative. Can point to the > + * middle of symbol. We need to clarify that a symbol under the end offset is subject to include into the resulting count (inclusive range). I would also explicitly stated that -1 is the end byte. Worth to document allowed range (0 <= |end| <= #str, right?)? > + * @retval not nil Symbol count. > + * @retval nil, error Error. Byte position of the error is > + * returned in the second value. > + */ > +static int > +utf8_len(struct lua_State *L) > +{ > + if (lua_gettop(L) > 3 || !lua_isstring(L, 1)) > + return luaL_error(L, "Usage: utf8.len(<string>, [i, [j]])"); > + size_t slen; > + const char *str = lua_tolstring(L, 1, &slen); > + int len = (int)slen; > + int start_pos = utf8_convert_offset(luaL_optinteger(L, 2, 1), len); > + int end_pos = utf8_convert_offset(luaL_optinteger(L, 3, -1), len); > + if (start_pos < 1 || --start_pos > len || end_pos > len) { > + lua_pushnil(L); > + lua_pushstring(L, "position is out of string"); > + return 2; > + } > + int result = 0; > + if (end_pos > start_pos) { > + UChar32 c; > + ++result; > + /* > + * In Lua 5.3 it is not ok to start from bad > + * position. But next symbols can be any. > + */ I would be not so sure. Consider: print(utf8.len('a\xF4')) Lua 5.3: nil 2 Tarantool: 2 > + U8_NEXT(str, start_pos, end_pos, c); > + if (c == U_SENTINEL) { > + lua_pushnil(L); > + lua_pushinteger(L, start_pos); > + return 2; > + } > + while (start_pos < end_pos) { > + U8_NEXT_UNSAFE(str, start_pos, c); It is controversial decision. Why you are sure Lua 5.3 works in the same manner? > + ++result; > + } > + } > + lua_pushinteger(L, result); > + return 1; > +} > + > +/** > + * Get next symbol code by @an offset. > + * @param String to get symbol code. > + * @param Byte offset from which get. > + * > + * @retval - No more symbols. Do you mean nil? > + * @retval not nil, not nil Byte offset and symbol code. > + */ > +static int > +utf8_next(struct lua_State *L) > <...> > +UCHAR32_CHECKER(islower) > +UCHAR32_CHECKER(isupper) > +UCHAR32_CHECKER(isdigit) > +UCHAR32_CHECKER(isalpha) It would be good to have doxygen-style comment in the source code likewise you posted in #3081.
next prev parent reply other threads:[~2018-05-04 22:33 UTC|newest] Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top 2018-04-28 22:45 [tarantool-patches] [PATCH v2 0/5] Expose ICU into Lua Vladislav Shpilevoy 2018-04-28 22:45 ` [tarantool-patches] [PATCH v2 1/5] alter: fix assertion in collations alter Vladislav Shpilevoy 2018-04-28 22:45 ` [tarantool-patches] [PATCH v2 2/5] Move struct on_access_denied_ctx into error.h Vladislav Shpilevoy 2018-05-04 11:06 ` [tarantool-patches] " Alexander Turenko 2018-05-04 12:05 ` Vladislav Shpilevoy 2018-04-28 22:45 ` [tarantool-patches] [PATCH v2 3/5] Merge box_error, stat and collations into core library Vladislav Shpilevoy 2018-05-04 11:36 ` [tarantool-patches] " Alexander Turenko 2018-05-04 12:05 ` Vladislav Shpilevoy 2018-05-08 13:18 ` Konstantin Osipov 2018-05-10 21:06 ` Vladislav Shpilevoy 2018-04-28 22:45 ` [tarantool-patches] [PATCH v2 4/5] Always store built-in collations in the cache Vladislav Shpilevoy 2018-05-08 13:20 ` [tarantool-patches] " Konstantin Osipov 2018-04-28 22:45 ` [tarantool-patches] [PATCH v2 5/5] lua: introduce utf8 built-in globaly visible module Vladislav Shpilevoy 2018-05-04 22:33 ` Alexander Turenko [this message] 2018-05-04 23:32 ` [tarantool-patches] " Vladislav Shpilevoy 2018-05-05 0:18 ` Alexander Turenko 2018-05-05 0:24 ` Vladislav Shpilevoy 2018-05-05 0:38 ` Alexander Turenko 2018-05-05 0:45 ` Vladislav Shpilevoy 2018-05-08 13:21 ` Konstantin Osipov 2018-05-05 0:19 ` [tarantool-patches] Re: [PATCH v2 0/5] Expose ICU into Lua Alexander Turenko
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20180504223322.y7i5ymo46hcd7usu@tkn_work_nb \ --to=alexander.turenko@tarantool.org \ --cc=tarantool-patches@freelists.org \ --cc=v.shpilevoy@tarantool.org \ --subject='[tarantool-patches] Re: [PATCH v2 5/5] lua: introduce utf8 built-in globaly visible module' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox