[tarantool-patches] Re: [PATCH v2 5/5] lua: introduce utf8 built-in globaly visible module

Sat May 5 02:32:27 MSK 2018

Hello. Thanks for review.

On 05/05/2018 01:33, Alexander Turenko wrote:
> Vlad,
> 
> Are you try to run tests from utf8.lua from [1]?
> 
> [1]: https://www.lua.org/tests/lua-5.3.4-tests.tar.gz
> 
> Debug build:
> 
> ```
> /home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c: In function ‘utf8_next’:
> /home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c:214:2: error: passing argument 2 of ‘utf8_nextCharSafeBody’ from incompatible pointer type [-Werror=incompatible-pointer-types]
>    U8_NEXT(str, pos, len, c);

Looks like you did not fetch the newest version. Travis is ok:
https://travis-ci.org/tarantool/tarantool/builds/374952902?utm_source=github_status&utm_medium=notification

>    ^~~~~~~
> In file included from /usr/include/unicode/utf.h:217:0,
>                   from /usr/include/unicode/utypes.h:44,
>                   from /usr/include/unicode/ucasemap.h:24,
>                   from /home/alex/p/tarantool-meta/review/tarantool/src/lua/utf8.c:31:
> /usr/include/unicode/utf8.h:120:1: note: expected ‘int32_t * {aka int *}’ but argument is of type ‘size_t * {aka long unsigned int *}’
>   utf8_nextCharSafeBody(const uint8_t *s, int32_t *pi, int32_t length, UChar32 c, UBool strict);
>   ^~~~~~~~~~~~~~~~~~~~~
> ```
> 
> The following cases works differently in Lua 5.3 and in tarantool's
> implementation:
> 
> print(utf8.len('12İ☢勺34', 3, 3))
> print(utf8.len('\xF4\x9F\xBF\xBF'))
> 
> Lua 5.3:
> 
> 1
> nil	1
> 
> Tarantool:
> 
> nil     3
> nil     4

Second case works ok already:

tarantool> utf8.len('\xF4\x9F\xBF\xBF')
---
- null
- 1
...

It matches Lua 5.3.

First case fixed:

diff --git a/src/lua/utf8.c b/src/lua/utf8.c
index 1d0776ab1..02ecdede4 100644
--- a/src/lua/utf8.c
+++ b/src/lua/utf8.c
@@ -176,7 +176,7 @@ utf8_len(struct lua_State *L)
                  * In Lua 5.3 it is not ok to start from bad
                  * position. But next symbols can be any.
                  */
-               U8_NEXT(str, start_pos, end_pos, c);
+               U8_NEXT(str, start_pos, len, c);
                 if (c == U_SENTINEL) {
                         lua_pushnil(L);
                         lua_pushinteger(L, start_pos);
diff --git a/test/app-tap/string.test.lua b/test/app-tap/string.test.lua
index dc28b87e1..fb96104ed 100755
--- a/test/app-tap/string.test.lua
+++ b/test/app-tap/string.test.lua
@@ -129,7 +129,7 @@ test:test("strip", function(test)
  end )
  
  test:test("unicode", function(test)
-    test:plan(100)
+    test:plan(101)
      local str = 'хеЛлоу вОрЛд ё Ё я Я э Э ъ Ъ hElLo WorLd 1234 i I İ 勺#☢༺'
      local upper_res = 'ХЕЛЛОУ ВОРЛД Ё Ё Я Я Э Э Ъ Ъ HELLO WORLD 1234 I I İ 勺#☢༺'
      local lower_res = 'хеллоу ворлд ё ё я я э э ъ ъ hello world 1234 i i i̇ 勺#☢༺'
@@ -191,6 +191,7 @@ test:test("unicode", function(test)
      test:is(utf8.len(s, 10, -10), 0, 'it is ok to swap end and start pos')
      test:is(utf8.len(''), 0, 'empty len')
      test:is(utf8.len(s, -6, -1), 3, 'pass both negative offsets')
+    test:is(utf8.len(s, 3, 3), 1, "end in the middle on the same symbol as start")
  
      local chars = {}
      local codes = {}

> 
> See also comments below.
> 
> Thanks!
> 
> WBR, Alexander Turenko.
> 
> On Sun, Apr 29, 2018 at 01:45:13AM +0300, Vladislav Shpilevoy wrote:
>> utf8 is a module partially compatible with Lua 5.3 utf8 and
>> lua-utf8 third party module.
>> Partially means, that not all functions are implemented.
>>
>> The patch introduces these ones:
>> upper, lower, len, char, sub.
> 
> And next.
> 
>>
>> All of them works exactly like in Lua 5.3, or if Lua 5.3 has no
>> a function, then works like in lua-utf8.
>>
> 
> Only len and char are from lua 5.3, we can state it explicitly.

Fixed.

     lua: introduce utf8 built-in globaly visible module
     
     utf8 is a module partially compatible with Lua 5.3 utf8 and
     lua-utf8 third party module.
     Partially means, that not all functions are implemented.
     
     The patch introduces these ones:
     upper, lower, len, char, sub, next.
     
     Len and char works exactly like in Lua 5.3. Other functions work
     like in lua-utf8.
     
     Tarantool utf8 has extensions:
     
     * isupper/lower/alpha/digit, that check some property by a symbol
       or by its code;
     
     * cmp/casecmp, that compare two UTF8 strings.
     
     Closes #3290
     Closes #3385
     Closes #3081
>> diff --git a/src/lua/utf8.c b/src/lua/utf8.c
>> new file mode 100644
>> index 000000000..d1d3d72c2
>> --- /dev/null
>> +++ b/src/lua/utf8.c
>> <...>
>> +/**
>> + * Calculate a 0-based positive byte offset in a string by any
>> + * 0-based offset (possibly negative).
> 
> But you pass 1-based offsets from utf8_len function to that one.

diff --git a/src/lua/utf8.c b/src/lua/utf8.c
index 0f4a84c98..d4f57a5f4 100644
--- a/src/lua/utf8.c
+++ b/src/lua/utf8.c
@@ -124,11 +124,11 @@ utf8_lower(struct lua_State *L)
  }
  
  /**
- * Calculate a 0-based positive byte offset in a string by any
- * 0-based offset (possibly negative).
- * @param offset Original 0-based offset with any sign.
+ * Calculate a 1-based positive byte offset in a string by any
+ * 1-based offset (possibly negative).
+ * @param offset Original 1-based offset with any sign.
   * @param len A string byte length.
- * @retval 0-based positive offset.
+ * @retval 1-based positive offset.
   */
  static inline int
  utf8_convert_offset(int offset, size_t len)
@@ -207,14 +207,16 @@ utf8_next(struct lua_State *L)
         size_t slen;
         const char *str = lua_tolstring(L, 1, &slen);
         int len = (int) slen;
-       int pos = utf8_convert_offset(luaL_optinteger(L, 2, 0), len);
+       int pos = utf8_convert_offset(luaL_optinteger(L, 2, 1), len);
+       if (pos > 0)
+               --pos;
         if (pos >= len)
                 return 0;
         UChar32 c;
         U8_NEXT(str, pos, len, c);
         if (c == U_SENTINEL)
                 return 0;
-       lua_pushinteger(L, pos);
+       lua_pushinteger(L, pos + 1);
         lua_pushinteger(L, c);
         return 2;
  }
>> +
>> +/**
>> + * Calculate length of a UTF8 string. Length here is symbol count.
>> + * Works like utf8.len in Lua 5.3.
>> + * @param String to get length.
>> + * @param Start byte offset. Must point to the start of symbol. On
>> + *        invalid symbol an error is returned. Can be negative.
> 
> Can be 1 <= |start| <= #str + 1, right? Is it worth to document? Such
> offset equilibristics is not very intuitive (at least for me).

No, start can be any, as well as end.

> 
>> + * @param End byte offset, can be negative. Can point to the
>> + *        middle of symbol.
> 
> We need to clarify that a symbol under the end offset is subject to
> include into the resulting count (inclusive range).
> 
> I would also explicitly stated that -1 is the end byte.
> 
> Worth to document allowed range (0 <= |end| <= #str, right?)?

diff --git a/src/lua/utf8.c b/src/lua/utf8.c
index d4f57a5f4..0507da134 100644
--- a/src/lua/utf8.c
+++ b/src/lua/utf8.c
@@ -142,12 +142,13 @@ utf8_convert_offset(int offset, size_t len)
  
  /**
   * Calculate length of a UTF8 string. Length here is symbol count.
- * Works like utf8.len in Lua 5.3.
+ * Works like utf8.len in Lua 5.3. Can take negative offsets. A
+ * negative offset is an offset from the end of string.
   * @param String to get length.
   * @param Start byte offset. Must point to the start of symbol. On
- *        invalid symbol an error is returned. Can be negative.
- * @param End byte offset, can be negative. Can point to the
- *        middle of symbol.
+ *        invalid symbol an error is returned.
+ * @param End byte offset. Can point to the middle of symbol.
+ *        Partial symbol is counted too.
   * @retval not nil Symbol count.
   * @retval nil, error Error. Byte position of the error is
   *         returned in the second value.

> 
>> + * @retval not nil Symbol count.
>> + * @retval nil, error Error. Byte position of the error is
>> + *         returned in the second value.
>> + */
>> +static int
>> +utf8_len(struct lua_State *L)
>> +{
>> +	if (lua_gettop(L) > 3 || !lua_isstring(L, 1))
>> +		return luaL_error(L, "Usage: utf8.len(<string>, [i, [j]])");
>> +	size_t slen;
>> +	const char *str = lua_tolstring(L, 1, &slen);
>> +	int len = (int)slen;
>> +	int start_pos = utf8_convert_offset(luaL_optinteger(L, 2, 1), len);
>> +	int end_pos = utf8_convert_offset(luaL_optinteger(L, 3, -1), len);
>> +	if (start_pos < 1 || --start_pos > len || end_pos > len) {
>> +		lua_pushnil(L);
>> +		lua_pushstring(L, "position is out of string");
>> +		return 2;
>> +	}
>> +	int result = 0;
>> +	if (end_pos > start_pos) {
>> +		UChar32 c;
>> +		++result;
>> +		/*
>> +		 * In Lua 5.3 it is not ok to start from bad
>> +		 * position. But next symbols can be any.
>> +		 */
> 
> I would be not so sure. Consider:
> 
> print(utf8.len('a\xF4'))
> 
> Lua 5.3:
> 
> nil	2
> 
> Tarantool:
> 
> 2

diff --git a/src/lua/utf8.c b/src/lua/utf8.c
index 1d0776ab1..c31171b2c 100644
--- a/src/lua/utf8.c
+++ b/src/lua/utf8.c
@@ -171,20 +171,14 @@ utf8_len(struct lua_State *L)
         int result = 0;
         if (end_pos > start_pos) {
                 UChar32 c;
-               ++result;
-               /*
-                * In Lua 5.3 it is not ok to start from bad
-                * position. But next symbols can be any.
-                */
-               U8_NEXT(str, start_pos, end_pos, c);
-               if (c == U_SENTINEL) {
-                       lua_pushnil(L);
-                       lua_pushinteger(L, start_pos);
-                       return 2;
-               }
                 while (start_pos < end_pos) {
-                       U8_NEXT_UNSAFE(str, start_pos, c);
                         ++result;
+                       U8_NEXT(str, start_pos, len, c);
+                       if (c == U_SENTINEL) {
+                               lua_pushnil(L);
+                               lua_pushinteger(L, start_pos);
+                               return 2;
+                       }
                 }
         }
         lua_pushinteger(L, result);
>> +/**
>> + * Get next symbol code by @an offset.
>> + * @param String to get symbol code.
>> + * @param Byte offset from which get.
>> + *
>> + * @retval - No more symbols.
> 
> Do you mean nil?

No. It returns literally nothing.

> 
>> + * @retval not nil, not nil Byte offset and symbol code.
>> + */
>> +static int
>> +utf8_next(struct lua_State *L)
>> <...>
>> +UCHAR32_CHECKER(islower)
>> +UCHAR32_CHECKER(isupper)
>> +UCHAR32_CHECKER(isdigit)
>> +UCHAR32_CHECKER(isalpha)
> 
> It would be good to have doxygen-style comment in the source code
> likewise you posted in #3081.
> 

-/** Macro to easy create lua wrappers for ICU symbol checkers. */
+/**
+ * Macro to easy create lua wrappers for ICU symbol checkers.
+ * @param One stmbol code or string.
+ * @retval True, if the symbol has a requested property. Else
+ *         false.
+ */
  #define UCHAR32_CHECKER(name) \