Tarantool development patches archive
 help / color / mirror / Atom feed
From: Alexander Turenko <alexander.turenko@tarantool.org>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Cc: tarantool-patches@freelists.org
Subject: [tarantool-patches] Re: [PATCH v2 5/5] lua: introduce utf8 built-in globaly visible module
Date: Sat, 5 May 2018 01:33:23 +0300	[thread overview]
Message-ID: <20180504223322.y7i5ymo46hcd7usu@tkn_work_nb> (raw)
In-Reply-To: <562a24d9c42df6701e85d50b06a47d57e6d884bf.1524955403.git.v.shpilevoy@tarantool.org>

Vlad,

Are you try to run tests from utf8.lua from [1]?

[1]: https://www.lua.org/tests/lua-5.3.4-tests.tar.gz

Debug build:

```
/home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c: In function ‘utf8_next’:
/home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c:214:2: error: passing argument 2 of ‘utf8_nextCharSafeBody’ from incompatible pointer type [-Werror=incompatible-pointer-types]
  U8_NEXT(str, pos, len, c);
  ^~~~~~~
In file included from /usr/include/unicode/utf.h:217:0,
                 from /usr/include/unicode/utypes.h:44,
                 from /usr/include/unicode/ucasemap.h:24,
                 from /home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c:31:
/usr/include/unicode/utf8.h:120:1: note: expected ‘int32_t * {aka int *}’ but argument is of type ‘size_t * {aka long unsigned int *}’
 utf8_nextCharSafeBody(const uint8_t *s, int32_t *pi, int32_t length, UChar32 c, UBool strict);
 ^~~~~~~~~~~~~~~~~~~~~
```

The following cases works differently in Lua 5.3 and in tarantool's
implementation:

print(utf8.len('12İ☢勺34', 3, 3))
print(utf8.len('\xF4\x9F\xBF\xBF'))

Lua 5.3:

1
nil	1

Tarantool:

nil     3
nil     4

See also comments below.

Thanks!

WBR, Alexander Turenko.

On Sun, Apr 29, 2018 at 01:45:13AM +0300, Vladislav Shpilevoy wrote:
> utf8 is a module partially compatible with Lua 5.3 utf8 and
> lua-utf8 third party module.
> Partially means, that not all functions are implemented.
> 
> The patch introduces these ones:
> upper, lower, len, char, sub.

And next.

> 
> All of them works exactly like in Lua 5.3, or if Lua 5.3 has no
> a function, then works like in lua-utf8.
> 

Only len and char are from lua 5.3, we can state it explicitly.

> Tarantool utf8 has extensions:
> 
> * isupper/lower/alpha/digit, that check some property by a symbol
>   or by its code;
> 
> * cmp/casecmp, that compare two UTF8 strings.
> 
> Closes #3290
> Closes #3385
> Closes #3081
> ---
>  src/CMakeLists.txt           |   3 +-
>  src/lua/init.c               |   3 +
>  src/lua/utf8.c               | 451 +++++++++++++++++++++++++++++++++++++++++++
>  src/lua/utf8.h               |  42 ++++
>  test/app-tap/string.test.lua | 160 ++++++++++++++-
>  test/box/ddl.result          |  15 ++
>  test/box/ddl.test.lua        |   8 +
>  7 files changed, 680 insertions(+), 2 deletions(-)
>  create mode 100644 src/lua/utf8.c
>  create mode 100644 src/lua/utf8.h
> <...>
> diff --git a/src/lua/utf8.c b/src/lua/utf8.c
> new file mode 100644
> index 000000000..d1d3d72c2
> --- /dev/null
> +++ b/src/lua/utf8.c
> <...>
> +/**
> + * Calculate a 0-based positive byte offset in a string by any
> + * 0-based offset (possibly negative).

But you pass 1-based offsets from utf8_len function to that one.

> + * @param offset Original 0-based offset with any sign.
> + * @param len A string byte length.
> + * @retval 0-based positive offset.
> + */
> +static inline int
> +utf8_convert_offset(int offset, size_t len)
> +{
> +	if (offset >= 0)
> +		return offset;
> +	else if ((size_t)-offset > len)
> +		return 0;
> +	return len + offset + 1;

It seems that the function is about 1-based offsets (at some cases at
least).

> +}
> +
> +/**
> + * Calculate length of a UTF8 string. Length here is symbol count.
> + * Works like utf8.len in Lua 5.3.
> + * @param String to get length.
> + * @param Start byte offset. Must point to the start of symbol. On
> + *        invalid symbol an error is returned. Can be negative.

Can be 1 <= |start| <= #str + 1, right? Is it worth to document? Such
offset equilibristics is not very intuitive (at least for me).

> + * @param End byte offset, can be negative. Can point to the
> + *        middle of symbol.

We need to clarify that a symbol under the end offset is subject to
include into the resulting count (inclusive range).

I would also explicitly stated that -1 is the end byte.

Worth to document allowed range (0 <= |end| <= #str, right?)?

> + * @retval not nil Symbol count.
> + * @retval nil, error Error. Byte position of the error is
> + *         returned in the second value.
> + */
> +static int
> +utf8_len(struct lua_State *L)
> +{
> +	if (lua_gettop(L) > 3 || !lua_isstring(L, 1))
> +		return luaL_error(L, "Usage: utf8.len(<string>, [i, [j]])");
> +	size_t slen;
> +	const char *str = lua_tolstring(L, 1, &slen);
> +	int len = (int)slen;
> +	int start_pos = utf8_convert_offset(luaL_optinteger(L, 2, 1), len);
> +	int end_pos = utf8_convert_offset(luaL_optinteger(L, 3, -1), len);
> +	if (start_pos < 1 || --start_pos > len || end_pos > len) {
> +		lua_pushnil(L);
> +		lua_pushstring(L, "position is out of string");
> +		return 2;
> +	}
> +	int result = 0;
> +	if (end_pos > start_pos) {
> +		UChar32 c;
> +		++result;
> +		/*
> +		 * In Lua 5.3 it is not ok to start from bad
> +		 * position. But next symbols can be any.
> +		 */

I would be not so sure. Consider:

print(utf8.len('a\xF4'))

Lua 5.3:

nil	2

Tarantool:

2

> +		U8_NEXT(str, start_pos, end_pos, c);
> +		if (c == U_SENTINEL) {
> +			lua_pushnil(L);
> +			lua_pushinteger(L, start_pos);
> +			return 2;
> +		}
> +		while (start_pos < end_pos) {
> +			U8_NEXT_UNSAFE(str, start_pos, c);

It is controversial decision. Why you are sure Lua 5.3 works in the same
manner?

> +			++result;
> +		}
> +	}
> +	lua_pushinteger(L, result);
> +	return 1;
> +}
> +
> +/**
> + * Get next symbol code by @an offset.
> + * @param String to get symbol code.
> + * @param Byte offset from which get.
> + *
> + * @retval - No more symbols.

Do you mean nil?

> + * @retval not nil, not nil Byte offset and symbol code.
> + */
> +static int
> +utf8_next(struct lua_State *L)
> <...>
> +UCHAR32_CHECKER(islower)
> +UCHAR32_CHECKER(isupper)
> +UCHAR32_CHECKER(isdigit)
> +UCHAR32_CHECKER(isalpha)

It would be good to have doxygen-style comment in the source code
likewise you posted in #3081.

  reply	other threads:[~2018-05-04 22:33 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-04-28 22:45 [tarantool-patches] [PATCH v2 0/5] Expose ICU into Lua Vladislav Shpilevoy
2018-04-28 22:45 ` [tarantool-patches] [PATCH v2 1/5] alter: fix assertion in collations alter Vladislav Shpilevoy
2018-04-28 22:45 ` [tarantool-patches] [PATCH v2 2/5] Move struct on_access_denied_ctx into error.h Vladislav Shpilevoy
2018-05-04 11:06   ` [tarantool-patches] " Alexander Turenko
2018-05-04 12:05     ` Vladislav Shpilevoy
2018-04-28 22:45 ` [tarantool-patches] [PATCH v2 3/5] Merge box_error, stat and collations into core library Vladislav Shpilevoy
2018-05-04 11:36   ` [tarantool-patches] " Alexander Turenko
2018-05-04 12:05     ` Vladislav Shpilevoy
2018-05-08 13:18   ` Konstantin Osipov
2018-05-10 21:06     ` Vladislav Shpilevoy
2018-04-28 22:45 ` [tarantool-patches] [PATCH v2 4/5] Always store built-in collations in the cache Vladislav Shpilevoy
2018-05-08 13:20   ` [tarantool-patches] " Konstantin Osipov
2018-04-28 22:45 ` [tarantool-patches] [PATCH v2 5/5] lua: introduce utf8 built-in globaly visible module Vladislav Shpilevoy
2018-05-04 22:33   ` Alexander Turenko [this message]
2018-05-04 23:32     ` [tarantool-patches] " Vladislav Shpilevoy
2018-05-05  0:18       ` Alexander Turenko
2018-05-05  0:24         ` Vladislav Shpilevoy
2018-05-05  0:38           ` Alexander Turenko
2018-05-05  0:45             ` Vladislav Shpilevoy
2018-05-08 13:21   ` Konstantin Osipov
2018-05-05  0:19 ` [tarantool-patches] Re: [PATCH v2 0/5] Expose ICU into Lua Alexander Turenko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180504223322.y7i5ymo46hcd7usu@tkn_work_nb \
    --to=alexander.turenko@tarantool.org \
    --cc=tarantool-patches@freelists.org \
    --cc=v.shpilevoy@tarantool.org \
    --subject='[tarantool-patches] Re: [PATCH v2 5/5] lua: introduce utf8 built-in globaly visible module' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox