[tarantool-patches] Re: [PATCH v2 5/5] lua: introduce utf8 built-in globaly visible module
Vladislav Shpilevoy
v.shpilevoy at tarantool.org
Sat May 5 02:32:27 MSK 2018
Hello. Thanks for review.
On 05/05/2018 01:33, Alexander Turenko wrote:
> Vlad,
>
> Are you try to run tests from utf8.lua from [1]?
>
> [1]: https://www.lua.org/tests/lua-5.3.4-tests.tar.gz
>
> Debug build:
>
> ```
> /home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c: In function ‘utf8_next’:
> /home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c:214:2: error: passing argument 2 of ‘utf8_nextCharSafeBody’ from incompatible pointer type [-Werror=incompatible-pointer-types]
> U8_NEXT(str, pos, len, c);
Looks like you did not fetch the newest version. Travis is ok:
https://travis-ci.org/tarantool/tarantool/builds/374952902?utm_source=github_status&utm_medium=notification
> ^~~~~~~
> In file included from /usr/include/unicode/utf.h:217:0,
> from /usr/include/unicode/utypes.h:44,
> from /usr/include/unicode/ucasemap.h:24,
> from /home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c:31:
> /usr/include/unicode/utf8.h:120:1: note: expected ‘int32_t * {aka int *}’ but argument is of type ‘size_t * {aka long unsigned int *}’
> utf8_nextCharSafeBody(const uint8_t *s, int32_t *pi, int32_t length, UChar32 c, UBool strict);
> ^~~~~~~~~~~~~~~~~~~~~
> ```
>
> The following cases works differently in Lua 5.3 and in tarantool's
> implementation:
>
> print(utf8.len('12İ☢勺34', 3, 3))
> print(utf8.len('\xF4\x9F\xBF\xBF'))
>
> Lua 5.3:
>
> 1
> nil 1
>
> Tarantool:
>
> nil 3
> nil 4
Second case works ok already:
tarantool> utf8.len('\xF4\x9F\xBF\xBF')
---
- null
- 1
...
It matches Lua 5.3.
First case fixed:
diff --git a/src/lua/utf8.c b/src/lua/utf8.c
index 1d0776ab1..02ecdede4 100644
--- a/src/lua/utf8.c
+++ b/src/lua/utf8.c
@@ -176,7 +176,7 @@ utf8_len(struct lua_State *L)
* In Lua 5.3 it is not ok to start from bad
* position. But next symbols can be any.
*/
- U8_NEXT(str, start_pos, end_pos, c);
+ U8_NEXT(str, start_pos, len, c);
if (c == U_SENTINEL) {
lua_pushnil(L);
lua_pushinteger(L, start_pos);
diff --git a/test/app-tap/string.test.lua b/test/app-tap/string.test.lua
index dc28b87e1..fb96104ed 100755
--- a/test/app-tap/string.test.lua
+++ b/test/app-tap/string.test.lua
@@ -129,7 +129,7 @@ test:test("strip", function(test)
end )
test:test("unicode", function(test)
- test:plan(100)
+ test:plan(101)
local str = 'хеЛлоу вОрЛд ё Ё я Я э Э ъ Ъ hElLo WorLd 1234 i I İ 勺#☢༺'
local upper_res = 'ХЕЛЛОУ ВОРЛД Ё Ё Я Я Э Э Ъ Ъ HELLO WORLD 1234 I I İ 勺#☢༺'
local lower_res = 'хеллоу ворлд ё ё я я э э ъ ъ hello world 1234 i i i̇ 勺#☢༺'
@@ -191,6 +191,7 @@ test:test("unicode", function(test)
test:is(utf8.len(s, 10, -10), 0, 'it is ok to swap end and start pos')
test:is(utf8.len(''), 0, 'empty len')
test:is(utf8.len(s, -6, -1), 3, 'pass both negative offsets')
+ test:is(utf8.len(s, 3, 3), 1, "end in the middle on the same symbol as start")
local chars = {}
local codes = {}
>
> See also comments below.
>
> Thanks!
>
> WBR, Alexander Turenko.
>
> On Sun, Apr 29, 2018 at 01:45:13AM +0300, Vladislav Shpilevoy wrote:
>> utf8 is a module partially compatible with Lua 5.3 utf8 and
>> lua-utf8 third party module.
>> Partially means, that not all functions are implemented.
>>
>> The patch introduces these ones:
>> upper, lower, len, char, sub.
>
> And next.
>
>>
>> All of them works exactly like in Lua 5.3, or if Lua 5.3 has no
>> a function, then works like in lua-utf8.
>>
>
> Only len and char are from lua 5.3, we can state it explicitly.
Fixed.
lua: introduce utf8 built-in globaly visible module
utf8 is a module partially compatible with Lua 5.3 utf8 and
lua-utf8 third party module.
Partially means, that not all functions are implemented.
The patch introduces these ones:
upper, lower, len, char, sub, next.
Len and char works exactly like in Lua 5.3. Other functions work
like in lua-utf8.
Tarantool utf8 has extensions:
* isupper/lower/alpha/digit, that check some property by a symbol
or by its code;
* cmp/casecmp, that compare two UTF8 strings.
Closes #3290
Closes #3385
Closes #3081
>> diff --git a/src/lua/utf8.c b/src/lua/utf8.c
>> new file mode 100644
>> index 000000000..d1d3d72c2
>> --- /dev/null
>> +++ b/src/lua/utf8.c
>> <...>
>> +/**
>> + * Calculate a 0-based positive byte offset in a string by any
>> + * 0-based offset (possibly negative).
>
> But you pass 1-based offsets from utf8_len function to that one.
diff --git a/src/lua/utf8.c b/src/lua/utf8.c
index 0f4a84c98..d4f57a5f4 100644
--- a/src/lua/utf8.c
+++ b/src/lua/utf8.c
@@ -124,11 +124,11 @@ utf8_lower(struct lua_State *L)
}
/**
- * Calculate a 0-based positive byte offset in a string by any
- * 0-based offset (possibly negative).
- * @param offset Original 0-based offset with any sign.
+ * Calculate a 1-based positive byte offset in a string by any
+ * 1-based offset (possibly negative).
+ * @param offset Original 1-based offset with any sign.
* @param len A string byte length.
- * @retval 0-based positive offset.
+ * @retval 1-based positive offset.
*/
static inline int
utf8_convert_offset(int offset, size_t len)
@@ -207,14 +207,16 @@ utf8_next(struct lua_State *L)
size_t slen;
const char *str = lua_tolstring(L, 1, &slen);
int len = (int) slen;
- int pos = utf8_convert_offset(luaL_optinteger(L, 2, 0), len);
+ int pos = utf8_convert_offset(luaL_optinteger(L, 2, 1), len);
+ if (pos > 0)
+ --pos;
if (pos >= len)
return 0;
UChar32 c;
U8_NEXT(str, pos, len, c);
if (c == U_SENTINEL)
return 0;
- lua_pushinteger(L, pos);
+ lua_pushinteger(L, pos + 1);
lua_pushinteger(L, c);
return 2;
}
>> +
>> +/**
>> + * Calculate length of a UTF8 string. Length here is symbol count.
>> + * Works like utf8.len in Lua 5.3.
>> + * @param String to get length.
>> + * @param Start byte offset. Must point to the start of symbol. On
>> + * invalid symbol an error is returned. Can be negative.
>
> Can be 1 <= |start| <= #str + 1, right? Is it worth to document? Such
> offset equilibristics is not very intuitive (at least for me).
No, start can be any, as well as end.
>
>> + * @param End byte offset, can be negative. Can point to the
>> + * middle of symbol.
>
> We need to clarify that a symbol under the end offset is subject to
> include into the resulting count (inclusive range).
>
> I would also explicitly stated that -1 is the end byte.
>
> Worth to document allowed range (0 <= |end| <= #str, right?)?
diff --git a/src/lua/utf8.c b/src/lua/utf8.c
index d4f57a5f4..0507da134 100644
--- a/src/lua/utf8.c
+++ b/src/lua/utf8.c
@@ -142,12 +142,13 @@ utf8_convert_offset(int offset, size_t len)
/**
* Calculate length of a UTF8 string. Length here is symbol count.
- * Works like utf8.len in Lua 5.3.
+ * Works like utf8.len in Lua 5.3. Can take negative offsets. A
+ * negative offset is an offset from the end of string.
* @param String to get length.
* @param Start byte offset. Must point to the start of symbol. On
- * invalid symbol an error is returned. Can be negative.
- * @param End byte offset, can be negative. Can point to the
- * middle of symbol.
+ * invalid symbol an error is returned.
+ * @param End byte offset. Can point to the middle of symbol.
+ * Partial symbol is counted too.
* @retval not nil Symbol count.
* @retval nil, error Error. Byte position of the error is
* returned in the second value.
>
>> + * @retval not nil Symbol count.
>> + * @retval nil, error Error. Byte position of the error is
>> + * returned in the second value.
>> + */
>> +static int
>> +utf8_len(struct lua_State *L)
>> +{
>> + if (lua_gettop(L) > 3 || !lua_isstring(L, 1))
>> + return luaL_error(L, "Usage: utf8.len(<string>, [i, [j]])");
>> + size_t slen;
>> + const char *str = lua_tolstring(L, 1, &slen);
>> + int len = (int)slen;
>> + int start_pos = utf8_convert_offset(luaL_optinteger(L, 2, 1), len);
>> + int end_pos = utf8_convert_offset(luaL_optinteger(L, 3, -1), len);
>> + if (start_pos < 1 || --start_pos > len || end_pos > len) {
>> + lua_pushnil(L);
>> + lua_pushstring(L, "position is out of string");
>> + return 2;
>> + }
>> + int result = 0;
>> + if (end_pos > start_pos) {
>> + UChar32 c;
>> + ++result;
>> + /*
>> + * In Lua 5.3 it is not ok to start from bad
>> + * position. But next symbols can be any.
>> + */
>
> I would be not so sure. Consider:
>
> print(utf8.len('a\xF4'))
>
> Lua 5.3:
>
> nil 2
>
> Tarantool:
>
> 2
diff --git a/src/lua/utf8.c b/src/lua/utf8.c
index 1d0776ab1..c31171b2c 100644
--- a/src/lua/utf8.c
+++ b/src/lua/utf8.c
@@ -171,20 +171,14 @@ utf8_len(struct lua_State *L)
int result = 0;
if (end_pos > start_pos) {
UChar32 c;
- ++result;
- /*
- * In Lua 5.3 it is not ok to start from bad
- * position. But next symbols can be any.
- */
- U8_NEXT(str, start_pos, end_pos, c);
- if (c == U_SENTINEL) {
- lua_pushnil(L);
- lua_pushinteger(L, start_pos);
- return 2;
- }
while (start_pos < end_pos) {
- U8_NEXT_UNSAFE(str, start_pos, c);
++result;
+ U8_NEXT(str, start_pos, len, c);
+ if (c == U_SENTINEL) {
+ lua_pushnil(L);
+ lua_pushinteger(L, start_pos);
+ return 2;
+ }
}
}
lua_pushinteger(L, result);
>> +/**
>> + * Get next symbol code by @an offset.
>> + * @param String to get symbol code.
>> + * @param Byte offset from which get.
>> + *
>> + * @retval - No more symbols.
>
> Do you mean nil?
No. It returns literally nothing.
>
>> + * @retval not nil, not nil Byte offset and symbol code.
>> + */
>> +static int
>> +utf8_next(struct lua_State *L)
>> <...>
>> +UCHAR32_CHECKER(islower)
>> +UCHAR32_CHECKER(isupper)
>> +UCHAR32_CHECKER(isdigit)
>> +UCHAR32_CHECKER(isalpha)
>
> It would be good to have doxygen-style comment in the source code
> likewise you posted in #3081.
>
-/** Macro to easy create lua wrappers for ICU symbol checkers. */
+/**
+ * Macro to easy create lua wrappers for ICU symbol checkers.
+ * @param One stmbol code or string.
+ * @retval True, if the symbol has a requested property. Else
+ * false.
+ */
#define UCHAR32_CHECKER(name) \
More information about the Tarantool-patches
mailing list