[Tarantool-patches] [PATCH v1 1/1] misc: introduce misc.tonumber64()

Wed Nov 25 22:11:51 MSK 2020

Mergen,

Thanks for the patch! Please consider my comments below.

Please use "luajit" prefix in the subject for LuaJIT-related patches in the
future (see that patch series [1] for example). It's not obligatory, just looks
more convenient.

On 16.11.20, imeevma at tarantool.org wrote:
> This patch introduces function misc.tonumber64() that accepts number,
> cdata number or string and returns 64-bit integer value.
> 
> Part of tarantool/tarantool#4738
> ---

It is totally unclear to me what is the behaviour of types different from
`CTID_INT64` and `CTID_UINT64`. I suppose this fogged zone should be
well described in commit message and documentation.

> https://github.com/tarantool/tarantool/issues/4738
> https://github.com/tarantool/luajit/tree/imeevma/gh-4738-introduce-misc-tonumber64
> 
>  src/lib_misc.c                                | 173 ++++++++++++++++++
>  ...-4738-incorrect-result-tonumber64.test.lua |  38 ++++
>  2 files changed, 211 insertions(+)
>  create mode 100755 test/gh-4738-incorrect-result-tonumber64.test.lua
> 
> diff --git a/src/lib_misc.c b/src/lib_misc.c

I observe these warnings at compile, please fix them in the next version:

| lib_misc.c: In function 'lj_cf_misc_tonumber64':
| lib_misc.c:203:38: warning: 'val_u' may be used uninitialized in this function [-Wmaybe-uninitialized]
|   203 |           *(uint64_t *) cdataptr(cd) = val_u;
|       |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
| lib_misc.c:205:37: warning: 'val_i' may be used uninitialized in this function [-Wmaybe-uninitialized]
|   205 |           *(int64_t *) cdataptr(cd) = val_i;
|       |           ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
| lib_misc.c: In function 'lj_cf_misc_tonumber64':
| lib_misc.c:203:38: warning: 'val_u' may be used uninitialized in this function [-Wmaybe-uninitialized]
|   203 |           *(uint64_t *) cdataptr(cd) = val_u;
|       |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
| lib_misc.c:205:37: warning: 'val_i' may be used uninitialized in this function [-Wmaybe-uninitialized]
|   205 |           *(int64_t *) cdataptr(cd) = val_i;
|       |           ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~

I've compiled with the following command:
| make -j CCDEBUG=" -g -ggdb3" XCFLAGS=" -DLUA_USE_APICHECK -DLUA_USE_ASSERT"

Also, with non-debug build I see the following errors:

| lib_misc.c: In function 'lj_cf_misc_tonumber64':
| lib_misc.c:100:5: warning: implicit declaration of function 'assert' [-Wimplicit-function-declaration]
|   100 |     assert(size != CTSIZE_INVALID);
|       |     ^~~~~~
| lib_misc.c:25:1: note: 'assert' is defined in header '<assert.h>'; did you forget to '#include <assert.h>'?
|    24 | #include "lj_state.h"
|   +++ |+#include <assert.h>
|    25 | 
| lib_misc.c: In function 'lj_cf_misc_tonumber64':
| lib_misc.c:100:5: warning: implicit declaration of function 'assert' [-Wimplicit-function-declaration]
|   100 |     assert(size != CTSIZE_INVALID);
|       |     ^~~~~~
| lib_misc.c:25:1: note: 'assert' is defined in header '<assert.h>'; did you forget to '#include <assert.h>'?
|    24 | #include "lj_state.h"
|   +++ |+#include <assert.h>
|    25 |

I've compiled with the following command:
| make -j

Please use `lua_assert()` instead of `assert()` inside LuaJIT code.

I suppose that should partially fix red CI [2].

> index 6f7b9a9..4228836 100644
> --- a/src/lib_misc.c
> +++ b/src/lib_misc.c
> @@ -10,11 +10,18 @@
>  
>  #include "lua.h"
>  #include "lmisclib.h"
> +#include "lauxlib.h"
> +#include <errno.h>
> +#include <ctype.h>

Please, include system-wide headers before local ones.

>  
>  #include "lj_obj.h"
>  #include "lj_str.h"
>  #include "lj_tab.h"
>  #include "lj_lib.h"
> +#include "lj_err.h"
> +#include "lj_ctype.h"
> +#include "lj_cdata.h"

Usually these two headers includes only if LuaJIT build with FFI. Please add
| #if LJ_HASFFI
directive.

> +#include "lj_state.h"

Please, update <Makefile.dep> with these headers and their dependencies
recursively.

>  
>  /* ------------------------------------------------------------------------ */
>  
> @@ -63,6 +70,172 @@ LJLIB_CF(misc_getmetrics)
>    return 1;
>  }
>  
> +LJLIB_CF(misc_tonumber64)

What is the function's behaviour when LuaJIT build without FFI?

> +{
> +  TValue *o = L->base;
> +  if (o >= L->top)
> +    lj_err_arg(L, 1, LJ_ERR_NOVAL);

It seems like `lj_lib_checkany()` does what you want here, doesn't it?

> +  int base = luaL_optint(L, 2, -1);

Please declare variables at the beginning of a block here and below,
considering LuaJIT code style (you can use these flags
| -Wall -Wextra -Wdeclaration-after-statement -Wredundant-decls -Wshadow -Wpointer-arith
to check yourself, see <src/Makefile>).

Why do you use the public interface instead `lj_lib_optint()`?

> +  luaL_argcheck(L, (2 <= base && base <= 36) || base == -1, 2,
> +                "base out of range");

Let's use quarter-fake tab here. See <lib_base.c> indentation style, for
example.

Also, you may use `lj_err_arg()` here.

> +  uint32_t ctypeid = 0;
> +  GCcdata *cd;
> +  CTSize size;
> +  CTState *cts;
> +  double val_d;

AFAICS, `val_d` value uses only inside `LUA_TNUMBER` case-block and can
be declared inside it. Also, I see no reason to use `double` instead
`lua_Number` here.

> +  switch (lua_type(L, 1)) {
> +  case LUA_TNUMBER:
> +    base = (base == -1 ? 10 : base);
> +    if (base != 10)
> +      return luaL_argerror(L, 1, "string expected");

Nit: AFAICS here and below we expect string if `base != 10`.
IMHO, code structure like this a little bit clearer:

| const int32_t base = lj_lib_optint(L, 2, 10);
| if (base == 10) {
|   TValue *o = lj_lib_checkany(L, 1);
|   /* ... */
| } else {
|   const char *arg = strdata(lj_lib_checkstr(L, 1));
|   /* ... */
| 

Feel free to ignore.

> +    val_d = numV(L->base);
> +    if (val_d < (double)INT64_MIN || val_d >= (double)UINT64_MAX)
> +      return luaL_argerror(L, 1, "cannot convert to 64-bit integer");

It's a little bit confusing -- we throw an error here, but for the "same"
string returns nil. Both ways break backward compatibility (as far as
`tonumber64` just yield `lua_Number` before, although this is not described in
the documentation), but at least we should be consistent here, shouldn't we?

| $ src/luajit -e 'print(misc.tonumber64(18446744073709551616))'
| src/luajit: (command line):1: bad argument #1 to 'tonumber64' (cannot convert to 64-bit integer)
| stack traceback:
|         [C]: in function 'tonumber64'
|         (command line):1: in main chunk
|         [C]: at 0x55f47d5541d0
|
| ################################################################
|
| $ src/luajit -e 'print(misc.tonumber64("18446744073709551616"))'
| nil

Yields nil in both ways looks more consistent to me. Thoughts?

> +    cts = ctype_cts(L);

Unfortunately, ctype state may be not initialized here. It's initialized at
opening of ffi library. This will lead to crush if we call `misc.tonumber64()`
before load of ffi.

| $ src/luajit -e 'print(misc.tonumber64(1))'
| Segmentation fault (core dumped)

> +    if (val_d < 0.0)
> +      ctypeid = CTID_INT64;
> +    else
> +      ctypeid = CTID_UINT64;

IMHO, the ternary operator is prettier here, isn't it?
| ctypeid = (val_d < 0) ? CTID_INT64 : CTID_UINT64;

> +    lj_ctype_info(cts, ctypeid, &size);

This call looks redundant -- we already know that size == 8.

> +    assert(size != CTSIZE_INVALID);
> +    cd = lj_cdata_new(cts, ctypeid, size);

Is the usage of `lj_cdata_new_()` better here?

> +    o = L->top;
> +    setcdataV(L, o, cd);

I suppose you forgot to use `lj_gc_check(L)` here and below when return cdata.
We need to check used memory to start GC cycle if necessary.

> +    incr_top(L);

We shouldn't check for growing stack here and below. We have 1 or 2 argument
and return 1 value.

> +    if (val_d < 0.0)

This check and the following cast look redundant, don't they?

> +      *(int64_t *) cdataptr(cd) = (int64_t)val_d;
> +    else
> +      *(uint64_t *) cdataptr(cd) = (uint64_t)val_d;
> +    return 1;
> +  case LUA_TSTRING: {
> +    GCstr *s = strV(o);
> +    const char *arg = strdata(s);
> +    size_t argl = s->len;
> +    while (argl > 0 && isspace(arg[argl - 1]))

Is there any reason why do you prefer `isspace()` over `lj_char_isspace()`?

> +      argl--;
> +    while (isspace(*arg)) {

Ditto.

> +      arg++;
> +      argl--;
> +    }
> +
> +    /*
> +     * Check if we're parsing custom format:

Please use LuaJIT comment format here and below (for example, see any copyright
header).

> +     * 1) '0x' or '0X' trim in case of base == 16 or base == -1
> +     * 2) '0b' or '0B' trim in case of base == 2  or base == -1
> +     * 3) '-' for negative numbers
> +     * 4) LL, ULL, LLU - trim, but only for base == 2 or
> +     *    base == 16 or base == -1. For consistency do not bother
> +     *    with any non-common bases, since user may have specified
> +     *    base >= 22, in which case 'L' will be a digit.
> +     */

Side note: AFAIK <lj_strscan.c> has a lot of functions to parse different formats.
May be it is more consistent to use LuaJIT's parsers here.
Thoughts?

> +    char negative = 0;
> +    if (arg[0] == '-') {
> +      arg++;
> +      argl--;
> +      negative = 1;
> +    }
> +    if (argl > 2 && arg[0] == '0') {
> +      if ((arg[1] == 'x' || arg[1] == 'X') &&

I suppose, that you can use `casecmp()` macro like in <src/lj_strscan.c> here
and below.

> +          (base == 16 || base == -1)) {
> +        base = 16; arg += 2; argl -= 2;
> +      } else if ((arg[1] == 'b' || arg[1] == 'B') &&
> +                 (base == 2 || base == -1)) {
> +        base = 2;  arg += 2; argl -= 2;
> +      }
> +    }
> +    int ull = 0;
> +    if (argl > 2 && (base == 2 || base == 16 || base == -1)) {
> +      if (arg[argl - 1] == 'u' || arg[argl - 1] == 'U') {
> +        ull = 1;
> +        --argl;

It will be nice to see an explanation about behaviour of working with integers
less that `INT64_MIN` for example. Should `tonumber64` returns nil here or same
value as LuaJIT parsers for corresponding input string?
IMHO, the value should overlap (behaviour must be same as for LuaJIT's parser).

> +      }
> +      if ((arg[argl - 1] == 'l' || arg[argl - 1] == 'L') &&
> +          (arg[argl - 2] == 'l' || arg[argl - 2] == 'L')) {
> +        argl -= 2;
> +        if (ull == 0 && (arg[argl - 1] == 'u' || arg[argl - 1] == 'U')) {
> +          ull = 1;
> +          --argl;
> +        }
> +      } else {
> +        ull = 0;
> +      }
> +    }

I see no check for strings like "1.5" or "1e6" and so on.
Why are these formats not supported?

> +    base = (base == -1 ? 10 : base);
> +    errno = 0;
> +    char *arge;
> +    unsigned long long result = strtoull(arg, &arge, base);
> +    if (errno == 0 && arge == arg + argl) {
> +      if (argl == 0) {
> +        lua_pushnil(L);
> +      } else if (negative) {
> +        if (ull == 0 && result != 0 && result - 1 > INT64_MAX) {

Will we save this behaviour? I suppose it should be the same as
for LuaJIT parser.
Thoughts?

> +          lua_pushnil(L);
> +          return 1;
> +        }
> +        cts = ctype_cts(L);
> +        /*

This comment should be highter, shouldn't it?

> +         * To test overflow, consider
> +         *  result > -INT64_MIN;
> +         *  result - 1 > -INT64_MIN - 1;
> +         * Assumption:
> +         *  INT64_MAX == -(INT64_MIN + 1);
> +         * Finally,
> +         *  result - 1 > INT64_MAX;
> +         */

Does this cast differ from?
| (uint64_t)-(int64_t)result

> +        int64_t val_i;
> +        uint64_t val_u;
> +        if (ull != 0) {
> +          val_u = (UINT64_MAX - result) + 1;
> +          ctypeid = CTID_UINT64;
> +        } else {
> +          val_i = -result;
> +          ctypeid = CTID_INT64;
> +        }
> +
> +        lj_ctype_info(cts, ctypeid, &size);

This call looks redundant -- we already know that size == 8.

> +        assert(size != CTSIZE_INVALID);
> +
> +        cd = lj_cdata_new(cts, ctypeid, size);
> +        TValue *o = L->top;
> +        setcdataV(L, o, cd);
> +        incr_top(L);
> +        if (ull != 0)
> +          *(uint64_t *) cdataptr(cd) = val_u;
> +        else
> +          *(int64_t *) cdataptr(cd) = val_i;
> +      } else {

The code here duplicates the code above. I suppose that it can be done much
simpler.
Something like that:
| *(uint64_t *)cdataptr(cd) = neg ? (uint64_t)-(int64_t)result : result;

All these comments return us to question about <lj_strscan.*> parsers.
At least it will be nice to see an explanation in commit message, why
they are not suitable here (if they are not).

> +        cts = ctype_cts(L);
> +        uint64_t val_u = result;
> +        ctypeid = CTID_UINT64;
> +        lj_ctype_info(cts, ctypeid, &size);
> +        assert(size != CTSIZE_INVALID);
> +
> +        cd = lj_cdata_new(cts, ctypeid, size);
> +        TValue *o = L->top;
> +        setcdataV(L, o, cd);
> +        incr_top(L);
> +        *(uint64_t *) cdataptr(cd) = val_u;
> +      }
> +      return 1;
> +    }
> +    break;
> +  }
> +  case LUA_TCDATA:
> +    base = (base == -1 ? 10 : base);
> +    if (base != 10)
> +      return luaL_argerror(L, 1, "string expected");
> +    cd = cdataV(L->base);
> +    ctypeid = cd->ctypeid;
> +    if (ctypeid >= CTID_INT8 && ctypeid <= CTID_DOUBLE) {
> +      lua_pushvalue(L, 1);
> +      return 1;
> +    }
> +    break;
> +  }
> +  lua_pushnil(L);
> +  return 1;
> +}
> +

What's about LuaC API?

>  /* ------------------------------------------------------------------------ */
>  
>  #include "lj_libdef.h"
> diff --git a/test/gh-4738-incorrect-result-tonumber64.test.lua b/test/gh-4738-incorrect-result-tonumber64.test.lua

Nit: It is new functionality for LuaJIT maybe it will be better to use the same
naming for the test like `getmetrics()` and leave a comment with issue reference
nearby corresponding test.
Feel free to ignore.

> new file mode 100755
> index 0000000..1d6da4a
> --- /dev/null
> +++ b/test/gh-4738-incorrect-result-tonumber64.test.lua
> @@ -0,0 +1,38 @@
> +#!/usr/bin/env tarantool
> +
> +-- Miscellaneous test for LuaJIT bugs

I don't understand this comment. This test checks `misc.tonumber64()` only,
doesn't it?

> +local tap = require('tap')
> +local misc = require('misc')

It is not necessary to require this builtin library (ffi library is the only
one exception).

> +
> +local test = tap.test("gh-4738-incorrect-result-tonumber64")
> +test:plan(24)
> +--
> +-- gh-4738: Make sure that tonumber64 always returns cdata.
> +--
> +test:ok(misc.tonumber64(1) == 1)
> +test:ok(misc.tonumber64(-1) == -1)
> +test:ok(misc.tonumber64(1.5) == 1)
> +test:ok(misc.tonumber64(-1.5) == -1)
> +test:ok(misc.tonumber64(1LL) == 1)
> +test:ok(misc.tonumber64(1ULL) == 1)
> +test:ok(misc.tonumber64(1LLU) == 1)
> +test:ok(misc.tonumber64(-1ULL) == 18446744073709551615ULL)
> +test:ok(misc.tonumber64('1') == 1)
> +test:ok(misc.tonumber64('1LL') == 1)
> +test:ok(misc.tonumber64('1ULL') == 1)
> +test:ok(misc.tonumber64('-1ULL') == 18446744073709551615ULL)
> +
> +test:is(type(misc.tonumber64(1)), 'cdata')

Minor: we check each number/cdata/string twice: to value and its type.
May be it well be better to use 12 subtests that test value and type.
Feel free to ignore.

> +test:is(type(misc.tonumber64(-1)), 'cdata')
> +test:is(type(misc.tonumber64(1.5)), 'cdata')
> +test:is(type(misc.tonumber64(-1.5)), 'cdata')
> +test:is(type(misc.tonumber64(1LL)), 'cdata')
> +test:is(type(misc.tonumber64(1ULL)), 'cdata')
> +test:is(type(misc.tonumber64(1LLU)), 'cdata')
> +test:is(type(misc.tonumber64(-1ULL)), 'cdata')
> +test:is(type(misc.tonumber64('1')), 'cdata')
> +test:is(type(misc.tonumber64('1LL')), 'cdata')
> +test:is(type(misc.tonumber64('1ULL')), 'cdata')
> +test:is(type(misc.tonumber64('-1ULL')), 'cdata')

It would be great to add several tests for `ffi.typeof()` since there are
complaints [3] on return value type.

> +
> +os.exit(test:check() and 0 or 1)
> -- 
> 2.25.1
> 

[1]: https://lists.tarantool.org/pipermail/tarantool-patches/2020-January/013801.html
[2]: https://gitlab.com/tarantool/tarantool/-/pipelines/216530551/failures
[3]: https://github.com/tarantool/tarantool/issues/4738#issuecomment-687241952

-- 
Best regards,
Sergey Kaplun