[tarantool-patches] Re: [PATCH v2 5/5] lua: introduce utf8 built-in globaly visible module

Sat May 5 01:33:23 MSK 2018

Vlad,

Are you try to run tests from utf8.lua from [1]?

[1]: https://www.lua.org/tests/lua-5.3.4-tests.tar.gz

Debug build:

```
/home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c: In function ‘utf8_next’:
/home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c:214:2: error: passing argument 2 of ‘utf8_nextCharSafeBody’ from incompatible pointer type [-Werror=incompatible-pointer-types]
  U8_NEXT(str, pos, len, c);
  ^~~~~~~
In file included from /usr/include/unicode/utf.h:217:0,
                 from /usr/include/unicode/utypes.h:44,
                 from /usr/include/unicode/ucasemap.h:24,
                 from /home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c:31:
/usr/include/unicode/utf8.h:120:1: note: expected ‘int32_t * {aka int *}’ but argument is of type ‘size_t * {aka long unsigned int *}’
 utf8_nextCharSafeBody(const uint8_t *s, int32_t *pi, int32_t length, UChar32 c, UBool strict);
 ^~~~~~~~~~~~~~~~~~~~~
```

The following cases works differently in Lua 5.3 and in tarantool's
implementation:

print(utf8.len('12İ☢勺34', 3, 3))
print(utf8.len('\xF4\x9F\xBF\xBF'))

Lua 5.3:

1
nil	1

Tarantool:

nil     3
nil     4

See also comments below.

Thanks!

WBR, Alexander Turenko.

On Sun, Apr 29, 2018 at 01:45:13AM +0300, Vladislav Shpilevoy wrote:
> utf8 is a module partially compatible with Lua 5.3 utf8 and
> lua-utf8 third party module.
> Partially means, that not all functions are implemented.
> 
> The patch introduces these ones:
> upper, lower, len, char, sub.

And next.

> 
> All of them works exactly like in Lua 5.3, or if Lua 5.3 has no
> a function, then works like in lua-utf8.
> 

Only len and char are from lua 5.3, we can state it explicitly.

> Tarantool utf8 has extensions:
> 
> * isupper/lower/alpha/digit, that check some property by a symbol
>   or by its code;
> 
> * cmp/casecmp, that compare two UTF8 strings.
> 
> Closes #3290
> Closes #3385
> Closes #3081
> ---
>  src/CMakeLists.txt           |   3 +-
>  src/lua/init.c               |   3 +
>  src/lua/utf8.c               | 451 +++++++++++++++++++++++++++++++++++++++++++
>  src/lua/utf8.h               |  42 ++++
>  test/app-tap/string.test.lua | 160 ++++++++++++++-
>  test/box/ddl.result          |  15 ++
>  test/box/ddl.test.lua        |   8 +
>  7 files changed, 680 insertions(+), 2 deletions(-)
>  create mode 100644 src/lua/utf8.c
>  create mode 100644 src/lua/utf8.h
> <...>
> diff --git a/src/lua/utf8.c b/src/lua/utf8.c
> new file mode 100644
> index 000000000..d1d3d72c2
> --- /dev/null
> +++ b/src/lua/utf8.c
> <...>
> +/**
> + * Calculate a 0-based positive byte offset in a string by any
> + * 0-based offset (possibly negative).

But you pass 1-based offsets from utf8_len function to that one.

> + * @param offset Original 0-based offset with any sign.
> + * @param len A string byte length.
> + * @retval 0-based positive offset.
> + */
> +static inline int
> +utf8_convert_offset(int offset, size_t len)
> +{
> +	if (offset >= 0)
> +		return offset;
> +	else if ((size_t)-offset > len)
> +		return 0;
> +	return len + offset + 1;

It seems that the function is about 1-based offsets (at some cases at
least).

> +}
> +
> +/**
> + * Calculate length of a UTF8 string. Length here is symbol count.
> + * Works like utf8.len in Lua 5.3.
> + * @param String to get length.
> + * @param Start byte offset. Must point to the start of symbol. On
> + *        invalid symbol an error is returned. Can be negative.

Can be 1 <= |start| <= #str + 1, right? Is it worth to document? Such
offset equilibristics is not very intuitive (at least for me).

> + * @param End byte offset, can be negative. Can point to the
> + *        middle of symbol.

We need to clarify that a symbol under the end offset is subject to
include into the resulting count (inclusive range).

I would also explicitly stated that -1 is the end byte.

Worth to document allowed range (0 <= |end| <= #str, right?)?

> + * @retval not nil Symbol count.
> + * @retval nil, error Error. Byte position of the error is
> + *         returned in the second value.
> + */
> +static int
> +utf8_len(struct lua_State *L)
> +{
> +	if (lua_gettop(L) > 3 || !lua_isstring(L, 1))
> +		return luaL_error(L, "Usage: utf8.len(<string>, [i, [j]])");
> +	size_t slen;
> +	const char *str = lua_tolstring(L, 1, &slen);
> +	int len = (int)slen;
> +	int start_pos = utf8_convert_offset(luaL_optinteger(L, 2, 1), len);
> +	int end_pos = utf8_convert_offset(luaL_optinteger(L, 3, -1), len);
> +	if (start_pos < 1 || --start_pos > len || end_pos > len) {
> +		lua_pushnil(L);
> +		lua_pushstring(L, "position is out of string");
> +		return 2;
> +	}
> +	int result = 0;
> +	if (end_pos > start_pos) {
> +		UChar32 c;
> +		++result;
> +		/*
> +		 * In Lua 5.3 it is not ok to start from bad
> +		 * position. But next symbols can be any.
> +		 */

I would be not so sure. Consider:

print(utf8.len('a\xF4'))

Lua 5.3:

nil	2

Tarantool:

2

> +		U8_NEXT(str, start_pos, end_pos, c);
> +		if (c == U_SENTINEL) {
> +			lua_pushnil(L);
> +			lua_pushinteger(L, start_pos);
> +			return 2;
> +		}
> +		while (start_pos < end_pos) {
> +			U8_NEXT_UNSAFE(str, start_pos, c);

It is controversial decision. Why you are sure Lua 5.3 works in the same
manner?

> +			++result;
> +		}
> +	}
> +	lua_pushinteger(L, result);
> +	return 1;
> +}
> +
> +/**
> + * Get next symbol code by @an offset.
> + * @param String to get symbol code.
> + * @param Byte offset from which get.
> + *
> + * @retval - No more symbols.

Do you mean nil?

> + * @retval not nil, not nil Byte offset and symbol code.
> + */
> +static int
> +utf8_next(struct lua_State *L)
> <...>
> +UCHAR32_CHECKER(islower)
> +UCHAR32_CHECKER(isupper)
> +UCHAR32_CHECKER(isdigit)
> +UCHAR32_CHECKER(isalpha)

It would be good to have doxygen-style comment in the source code
likewise you posted in #3081.