[tarantool-patches] Re: [PATCH v2 5/5] lua: introduce utf8 built-in globaly visible module
Alexander Turenko
alexander.turenko at tarantool.org
Sat May 5 01:33:23 MSK 2018
Vlad,
Are you try to run tests from utf8.lua from [1]?
[1]: https://www.lua.org/tests/lua-5.3.4-tests.tar.gz
Debug build:
```
/home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c: In function ‘utf8_next’:
/home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c:214:2: error: passing argument 2 of ‘utf8_nextCharSafeBody’ from incompatible pointer type [-Werror=incompatible-pointer-types]
U8_NEXT(str, pos, len, c);
^~~~~~~
In file included from /usr/include/unicode/utf.h:217:0,
from /usr/include/unicode/utypes.h:44,
from /usr/include/unicode/ucasemap.h:24,
from /home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c:31:
/usr/include/unicode/utf8.h:120:1: note: expected ‘int32_t * {aka int *}’ but argument is of type ‘size_t * {aka long unsigned int *}’
utf8_nextCharSafeBody(const uint8_t *s, int32_t *pi, int32_t length, UChar32 c, UBool strict);
^~~~~~~~~~~~~~~~~~~~~
```
The following cases works differently in Lua 5.3 and in tarantool's
implementation:
print(utf8.len('12İ☢勺34', 3, 3))
print(utf8.len('\xF4\x9F\xBF\xBF'))
Lua 5.3:
1
nil 1
Tarantool:
nil 3
nil 4
See also comments below.
Thanks!
WBR, Alexander Turenko.
On Sun, Apr 29, 2018 at 01:45:13AM +0300, Vladislav Shpilevoy wrote:
> utf8 is a module partially compatible with Lua 5.3 utf8 and
> lua-utf8 third party module.
> Partially means, that not all functions are implemented.
>
> The patch introduces these ones:
> upper, lower, len, char, sub.
And next.
>
> All of them works exactly like in Lua 5.3, or if Lua 5.3 has no
> a function, then works like in lua-utf8.
>
Only len and char are from lua 5.3, we can state it explicitly.
> Tarantool utf8 has extensions:
>
> * isupper/lower/alpha/digit, that check some property by a symbol
> or by its code;
>
> * cmp/casecmp, that compare two UTF8 strings.
>
> Closes #3290
> Closes #3385
> Closes #3081
> ---
> src/CMakeLists.txt | 3 +-
> src/lua/init.c | 3 +
> src/lua/utf8.c | 451 +++++++++++++++++++++++++++++++++++++++++++
> src/lua/utf8.h | 42 ++++
> test/app-tap/string.test.lua | 160 ++++++++++++++-
> test/box/ddl.result | 15 ++
> test/box/ddl.test.lua | 8 +
> 7 files changed, 680 insertions(+), 2 deletions(-)
> create mode 100644 src/lua/utf8.c
> create mode 100644 src/lua/utf8.h
> <...>
> diff --git a/src/lua/utf8.c b/src/lua/utf8.c
> new file mode 100644
> index 000000000..d1d3d72c2
> --- /dev/null
> +++ b/src/lua/utf8.c
> <...>
> +/**
> + * Calculate a 0-based positive byte offset in a string by any
> + * 0-based offset (possibly negative).
But you pass 1-based offsets from utf8_len function to that one.
> + * @param offset Original 0-based offset with any sign.
> + * @param len A string byte length.
> + * @retval 0-based positive offset.
> + */
> +static inline int
> +utf8_convert_offset(int offset, size_t len)
> +{
> + if (offset >= 0)
> + return offset;
> + else if ((size_t)-offset > len)
> + return 0;
> + return len + offset + 1;
It seems that the function is about 1-based offsets (at some cases at
least).
> +}
> +
> +/**
> + * Calculate length of a UTF8 string. Length here is symbol count.
> + * Works like utf8.len in Lua 5.3.
> + * @param String to get length.
> + * @param Start byte offset. Must point to the start of symbol. On
> + * invalid symbol an error is returned. Can be negative.
Can be 1 <= |start| <= #str + 1, right? Is it worth to document? Such
offset equilibristics is not very intuitive (at least for me).
> + * @param End byte offset, can be negative. Can point to the
> + * middle of symbol.
We need to clarify that a symbol under the end offset is subject to
include into the resulting count (inclusive range).
I would also explicitly stated that -1 is the end byte.
Worth to document allowed range (0 <= |end| <= #str, right?)?
> + * @retval not nil Symbol count.
> + * @retval nil, error Error. Byte position of the error is
> + * returned in the second value.
> + */
> +static int
> +utf8_len(struct lua_State *L)
> +{
> + if (lua_gettop(L) > 3 || !lua_isstring(L, 1))
> + return luaL_error(L, "Usage: utf8.len(<string>, [i, [j]])");
> + size_t slen;
> + const char *str = lua_tolstring(L, 1, &slen);
> + int len = (int)slen;
> + int start_pos = utf8_convert_offset(luaL_optinteger(L, 2, 1), len);
> + int end_pos = utf8_convert_offset(luaL_optinteger(L, 3, -1), len);
> + if (start_pos < 1 || --start_pos > len || end_pos > len) {
> + lua_pushnil(L);
> + lua_pushstring(L, "position is out of string");
> + return 2;
> + }
> + int result = 0;
> + if (end_pos > start_pos) {
> + UChar32 c;
> + ++result;
> + /*
> + * In Lua 5.3 it is not ok to start from bad
> + * position. But next symbols can be any.
> + */
I would be not so sure. Consider:
print(utf8.len('a\xF4'))
Lua 5.3:
nil 2
Tarantool:
2
> + U8_NEXT(str, start_pos, end_pos, c);
> + if (c == U_SENTINEL) {
> + lua_pushnil(L);
> + lua_pushinteger(L, start_pos);
> + return 2;
> + }
> + while (start_pos < end_pos) {
> + U8_NEXT_UNSAFE(str, start_pos, c);
It is controversial decision. Why you are sure Lua 5.3 works in the same
manner?
> + ++result;
> + }
> + }
> + lua_pushinteger(L, result);
> + return 1;
> +}
> +
> +/**
> + * Get next symbol code by @an offset.
> + * @param String to get symbol code.
> + * @param Byte offset from which get.
> + *
> + * @retval - No more symbols.
Do you mean nil?
> + * @retval not nil, not nil Byte offset and symbol code.
> + */
> +static int
> +utf8_next(struct lua_State *L)
> <...>
> +UCHAR32_CHECKER(islower)
> +UCHAR32_CHECKER(isupper)
> +UCHAR32_CHECKER(isdigit)
> +UCHAR32_CHECKER(isalpha)
It would be good to have doxygen-style comment in the source code
likewise you posted in #3081.
More information about the Tarantool-patches
mailing list