[Tarantool-patches] [PATCH v4 2/2] misc: add C and Lua API for platform metrics

Wed Oct 7 01:17:21 MSK 2020

Sergey,

Thanks for the patch! Please consider my comments below.

On 05.10.20, Sergey Kaplun wrote:
> This patch introduces both C and Lua API for LuaJIT platform
> metrics implemented in scope of the previous patch. New <lmisclib.h>
> header provides <luaM_metrics> C interface that fills the given
> <luam_metrics> structure with the platform metrics. Additionally
> <misc> module is loaded to Lua space and provides <getmetrics>
> method that yields the corresponding metrics via table.
> 
> Part of tarantool/tarantool#5187
> ---
>  Makefile                                      |   2 +-
>  src/Makefile                                  |   5 +-
>  src/Makefile.dep                              |  14 +-
>  src/lib_init.c                                |   2 +
>  src/lib_misc.c                                |  72 +++
>  src/lj_mapi.c                                 |  61 +++
>  src/ljamalg.c                                 |   2 +
>  src/lmisclib.h                                |  64 +++
>  src/luaconf.h                                 |   1 +
>  test/clib-misclib-getmetrics.test.lua         | 174 ++++++++
>  test/clib-misclib-getmetrics/CMakeLists.txt   |   1 +
>  test/clib-misclib-getmetrics/testgetmetrics.c | 242 ++++++++++
>  test/lib-misc-getmetrics.test.lua             | 418 ++++++++++++++++++
>  13 files changed, 1050 insertions(+), 8 deletions(-)
>  create mode 100644 src/lib_misc.c
>  create mode 100644 src/lj_mapi.c
>  create mode 100644 src/lmisclib.h
>  create mode 100755 test/clib-misclib-getmetrics.test.lua
>  create mode 100644 test/clib-misclib-getmetrics/CMakeLists.txt
>  create mode 100644 test/clib-misclib-getmetrics/testgetmetrics.c
>  create mode 100755 test/lib-misc-getmetrics.test.lua
> 

<snipped>

> diff --git a/src/lib_misc.c b/src/lib_misc.c
> new file mode 100644
> index 0000000..ef11237
> --- /dev/null
> +++ b/src/lib_misc.c
> @@ -0,0 +1,72 @@

<snipped>

> +#define LJLIB_MODULE_misc
> +
> +LJLIB_CF(misc_getmetrics)
> +{
> +  struct luam_Metrics metrics;
> +  lua_createtable(L, 0, 22);

19 slots are definitely enough.

> +  GCtab *m = tabV(L->top - 1);

Once more: please, declare the variables in the beginning of the block.

> +
> +  luaM_metrics(L, &metrics);
> +

<snipped>

> diff --git a/src/lj_mapi.c b/src/lj_mapi.c
> new file mode 100644
> index 0000000..7645a44
> --- /dev/null
> +++ b/src/lj_mapi.c
> @@ -0,0 +1,61 @@

<snipped>

> +
> +LUAMISC_API void luaM_metrics(lua_State *L, struct luam_Metrics *metrics)
> +{
> +  lua_assert(metrics != NULL);

Ditto.

> +  global_State *g = G(L);
> +  GCState *gc = &g->gc;
> +#if LJ_HASJIT
> +  jit_State *J = G2J(g);
> +#endif
> +

<snipped>

> diff --git a/src/lmisclib.h b/src/lmisclib.h
> new file mode 100644
> index 0000000..e2d1909
> --- /dev/null
> +++ b/src/lmisclib.h

<snipped>

> +struct luam_Metrics {
> +  /* Strings amount found in string hash instead of allocation of new one. */

If I don't know how this metric is collected, I'll never guess what this
comment mean. Let's reword it the following way:
| /*
| ** Number of strings being interned (i.e. the string with the same
| ** payload is found, so a new one is not created/allocated)
| */

> +  size_t strhash_hit;
> +  /* Strings amount allocated and put into string hash. */

Well, does it mean this metric has the same value as gc_strnum?
Not indeed. This is a number of the strings

> +  size_t strhash_miss;
> +
> +  /* Amount of allocated string objects. */
> +  size_t gc_strnum;
> +  /* Amount of allocated table objects. */
> +  size_t gc_tabnum;
> +  /* Amount of allocated udata objects. */
> +  size_t gc_udatanum;
> +  /* Amount of allocated cdata objects. */
> +  size_t gc_cdatanum;
> +
> +  /* Memory currently allocated. */
> +  size_t gc_total;
> +  /* Total amount of freed memory. */
> +  size_t gc_freed;
> +  /* Total amount of allocated memory. */
> +  size_t gc_allocated;
> +
> +  /* Count of incremental GC steps per state. */
> +  size_t gc_steps_pause;
> +  size_t gc_steps_propagate;
> +  size_t gc_steps_atomic;
> +  size_t gc_steps_sweepstring;
> +  size_t gc_steps_sweep;
> +  size_t gc_steps_finalize;
> +
> +  /*
> +  ** Overall number of snap restores (amount of guard assertions
> +  ** leading to stopping trace executions and trace exits,
> +  ** that are not stitching with other traces).

Minor: Strictly saying, it's not a stitching in LuaJIT terms.

> +  */
> +  size_t jit_snap_restore;
> +  /* Overall number of abort traces. */
> +  size_t jit_trace_abort;
> +  /* Total size of all allocated machine code areas. */
> +  size_t jit_mcode_size;
> +  /* Amount of JIT traces. */
> +  unsigned int jit_trace_num;
> +};

<snipped>

> diff --git a/test/clib-misclib-getmetrics.test.lua b/test/clib-misclib-getmetrics.test.lua

Add Github issue to the file name.

> new file mode 100755
> index 0000000..34adaba
> --- /dev/null
> +++ b/test/clib-misclib-getmetrics.test.lua
> @@ -0,0 +1,174 @@
> +#!/usr/bin/env tarantool
> +
> +local file = debug.getinfo(1, "S").source:sub(2)
> +local filepath = file:match("(.*/)")
> +local soext = jit.os == "OSX" and "dylib" or "so"
> +package.cpath = filepath..'clib-misclib-getmetrics/?.'..soext..";"
> +                ..package.cpath

I've already solved this issue, so this part can be easily borrowed from
utils.lua[1]. Consider the following:
| local path = arg[0]:gsub('%.test%.lua', '')
| local suffix = package.cpath:match('?.(%a+);')
| package.cpath = ('%s/?.%s;%s'):format(path, suffix, package.cpath)

> +
> +local jit_opt_default_hotloop = 56
> +local jit_opt_default_hotexit = 10
> +local jit_opt_default_level = 3
> +local jit_opt_default_minstitch = 0

Well, this can be done in a much more elegant way:
| local jit_opt_default = {
|   3, -- level
|   "hotloop=56",
|   "hotexit=10",
|   "minstitch=0",
| }
|
| <snipped>
|
| jit.opt.start(unpack(jit_opt_default))

> +

<snipped>

> +test:ok(testgetmetrics.objcount(function()

<snipped>

> +    for i = 1, 1000 do
> +        table.insert(placeholder.tab, table_new(i, 0))

This can be done via creating { i }, so table.new is excess here.

> +    end

<snipped>

> +-- Compiled loop with a direct exit to the interpreter.
> +test:ok(testgetmetrics.snap_restores(function()
> +    jit.opt.start(0, "hotloop=2")

Why hotloop is 2?

> +

<snipped>

> +-- Compiled loop with a side exit which does not get compiled.
> +test:ok(testgetmetrics.snap_restores(function()
> +    jit.opt.start(0, "hotloop=2", "hotexit=2", "minstitch=15")

Why hotloop is 2?

> +

<snipped>

> +-- Compiled scalar trace with a direct exit to the interpreter.
> +test:ok(testgetmetrics.snap_restores(function()
> +    -- For calls it will be 2 * hotloop (see lj_dispatch.{c,h}
> +    -- and hotcall at vm_*.dasc).
> +    jit.opt.start(3, "hotloop=2", "hotexit=3")

Why hotloop is 2?

> +

<snipped>

> diff --git a/test/clib-misclib-getmetrics/testgetmetrics.c b/test/clib-misclib-getmetrics/testgetmetrics.c

Add Github issue to the directory name.

> new file mode 100644
> index 0000000..32802d2
> --- /dev/null
> +++ b/test/clib-misclib-getmetrics/testgetmetrics.c
> @@ -0,0 +1,242 @@

<snipped>

> +static int base(lua_State *L)
> +{
> +	struct luam_Metrics metrics;
> +	luaM_metrics(L, &metrics);
> +
> +	/* Just check API. */

Minor: it worths to drop a comment why there is a <ssize_t> here. It's
quite obvious, but has come to my mind only in a few seconds.

> +	assert((ssize_t)metrics.strhash_hit >= 0);

<snipped>

> +}
> +
> +static int gc_allocated_freed(lua_State *L)
> +{
> +	struct luam_Metrics oldm, newm;
> +	/* Force up garbage collect all dead objects. */
> +	lua_gc(L, LUA_GCCOLLECT, 0);
> +
> +	luaM_metrics(L, &oldm);
> +	/* Simple garbage generation. */
> +	if (luaL_dostring(L, "local i = 0 for j = 1, 10 do i = i + j end"))
> +		luaL_error(L, "failed to translate Lua code snippet");

Side note: OK, this is a funny example: the code to be executed doesn't
generate the garbage per se, but the machinery running it does. This is
nice, since AFAICS you can calculate the exact amount to allocated and
freed, can't you? Feel free to ignore this comment or consider it as
an extracurricular activity.

> +	lua_gc(L, LUA_GCCOLLECT, 0);
> +	luaM_metrics(L, &newm);
> +	assert(newm.gc_allocated - oldm.gc_allocated > 0);
> +	assert(newm.gc_freed - oldm.gc_freed > 0);
> +
> +	lua_pushboolean(L, 1);
> +	return 1;
> +}
> +
> +static int gc_steps(lua_State *L)
> +{
> +	struct luam_Metrics oldm, newm;
> +	/*
> +	 * Some garbage has already happened before the next line,
> +	 * i.e. during fronted processing lua test chunk.

Typo: s/fronted/frontend/.
Typo: s/lua/Lua/.

> +	 * Let's put a full garbage collection cycle on top
> +	 * of that, and confirm that non-null values are reported
> +	 * (we are not yet interested in actual numbers):
> +	 */
> +	luaM_metrics(L, &oldm);
> +	lua_gc(L, LUA_GCCOLLECT, 0);

OK, this is a misguiding part. At first I was confused with the comment
below that *new values are the same*. Why the difference doesn't equals
to zero if they are the same? Later I found the <lua_gc> call after
<oldm> initialization *but* prior to these asserts. I guess the next
reordering definitely make the code clearer:
| luaM_metrics(L, &oldm);
| assert(oldm.gc_steps_pause > 0);
| <other asserts>
| lua_gc(L, LUA_GCCOLLECT, 0);

> +	assert(oldm.gc_steps_pause > 0);

<snipped>

> +	/*
> +	 * As long as we don't create new Lua objects
> +	 * consequent call should return the same values:

How come? They *are not* the same considering the asserts below (except
<gc_steps_finalize>). The comment is misguiding.

> +	 */
> +	luaM_metrics(L, &newm);
> +	assert(newm.gc_steps_pause - oldm.gc_steps_pause > 0);

<snipped>

> +}
> +
> +static int objcount(lua_State *L)
> +{
> +	struct luam_Metrics oldm, newm;
> +	int n = lua_gettop(L);
> +	if (n != 1 || !lua_isfunction(L, 1))
> +		luaL_error(L, "incorrect argument: 1 function is required");
> +
> +	/* Force up garbage collect all dead objects. */
> +	lua_gc(L, LUA_GCCOLLECT, 0);
> +
> +	luaM_metrics(L, &oldm);
> +	/* Generate garbage. */
> +	lua_call(L, 0, 0);

If loop optimization is disabled, then you can pass the number of
iterations as an argument to the function and check the metrics values
against it. Thoughts?

> +	lua_gc(L, LUA_GCCOLLECT, 0);
> +	luaM_metrics(L, &newm);

<snipped>

> +}

> +static int tracenum_base(lua_State *L)
> +{
> +	struct luam_Metrics metrics;
> +	int n = lua_gettop(L);
> +	if (n != 1 || !lua_isfunction(L, 1))
> +		luaL_error(L, "incorrect arguments: 1 function is required");
> +
> +	luaJIT_setmode(L, 0, LUAJIT_MODE_OFF);

This line looks excess.

> +	luaJIT_setmode(L, 0, LUAJIT_MODE_FLUSH);
> +	/* Force up garbage collect all dead objects. */
> +	lua_gc(L, LUA_GCCOLLECT, 0);
> +
> +	luaM_metrics(L, &metrics);
> +	assert(metrics.jit_trace_num == 0);
> +
> +	luaJIT_setmode(L, 0, LUAJIT_MODE_ON);

<snipped>

> +
> +	luaJIT_setmode(L, 0, LUAJIT_MODE_FLUSH);
> +	/* Force up garbage collect all dead objects. */
> +	lua_gc(L, LUA_GCCOLLECT, 0);
> +	luaM_metrics(L, &metrics);
> +	assert(metrics.jit_trace_num == 0);
> +
> +	luaJIT_setmode(L, 0, LUAJIT_MODE_ON);

This line looks excess.

> +	lua_pushboolean(L, 1);
> +	return 1;
> +}

<snipped>

> diff --git a/test/lib-misc-getmetrics.test.lua b/test/lib-misc-getmetrics.test.lua

Add Github issue to the file name.

> new file mode 100755
> index 0000000..2859e26
> --- /dev/null
> +++ b/test/lib-misc-getmetrics.test.lua
> @@ -0,0 +1,418 @@
> +#!/usr/bin/env tarantool
> +
> +-- This is a part of tarantool/luajit testing suite.
> +-- Major portions taken verbatim or adapted from the LuaVela testing suit.

Typo: s/suit/suite/.

> +-- Copyright (C) 2015-2019 IPONWEB Ltd.

<snipped>

> +test:test("gc-allocated-freed", function(subtest)
> +    subtest:plan(1)
> +
> +    -- Force up garbage collect all dead objects.
> +    collectgarbage("collect")
> +
> +    -- Bump getmetrincs table and string keys allocation.

Typo: s/getmetrincs/getmetrics/.

> +    local old_metrics = misc.getmetrics()
> +
> +    -- Remember allocated size for getmetrics table.
> +    old_metrics = misc.getmetrics()
> +
> +    collectgarbage("collect")
> +
> +    local new_metrics = misc.getmetrics()
> +
> +    local diff_alloc = new_metrics.gc_allocated - old_metrics.gc_allocated
> +    local getmetrics_alloc = diff_alloc

Why the difference can't be assigned right to <getmetrics_alloc>?

> +

<snipped>

> +test:test("gc-steps", function(subtest)
> +    subtest:plan(24)
> +
> +    -- Some garbage has already created before the next line,
> +    -- i.e. during fronted processing this chunk.

Typo: s/fronted/frontend/.

> +    -- Let's put a full garbage collection cycle on top of that,
> +    -- and confirm that non-null values are reported (we are not
> +    -- yet interested in actual numbers):

<snipped>

> +test:test("objcount", function(subtest)

<snipped>

> +    -- Bump strings and table creation.
> +    local old_metrics = misc.getmetrics()

OK, I finally got the math you explained to me offline, but it's too
complex for such simple case. Let's consider the newly allocated table
right in the test assertion.

> +    old_metrics = misc.getmetrics()
> +

<snipped>

> +    for i = 1, 1000 do
> +        table.insert(placeholder.tab, table_new(i, 0))

This can be done via creating { i }, so table.new is excess here.

> +    end
> +

<snipped>

> +test:test("snap-restores-direct-loop", function(subtest)
> +    -- Compiled loop with a direct exit to the interpreter.
> +    subtest:plan(1)
> +
> +    jit.opt.start(0, "hotloop=2")

Why hotloop is 2?

> +

<snipped>

> +test:test("snap-restores-loop-side-exit-non-compiled", function(subtest)
> +    -- Compiled loop with a side exit which does not get compiled.
> +    subtest:plan(1)
> +
> +    jit.opt.start(0, "hotloop=2", "hotexit=2", "minstitch=15")

Why hotloop is 2?

> +

<snipped>

> +test:test("snap-restores-scalar", function(subtest)
> +    -- Compiled scalar trace with a direct exit to the interpreter.
> +    subtest:plan(2)
> +
> +    -- For calls it will be 2 * hotloop (see lj_dispatch.{c,h}
> +    -- and hotcall at vm_*.dasc).
> +    jit.opt.start(3, "hotloop=2", "hotexit=3")

Why hotloop is 2?

> +

<snipped>

> +test:test("tracenum-base", function(subtest)
> +    subtest:plan(3)
> +
> +    jit.off()

This line looks excess.

> +    jit.flush()
> +    collectgarbage("collect")
> +    local metrics = misc.getmetrics()
> +    subtest:is(metrics.jit_trace_num, 0)
> +
> +    jit.on()
> +    local sum = 0
> +    for i = 1, 100 do
> +        sum = sum + i
> +    end
> +
> +    metrics = misc.getmetrics()
> +    subtest:is(metrics.jit_trace_num, 1)
> +
> +    jit.off()

This line looks excess.

> +    jit.flush()
> +    collectgarbage("collect")
> +
> +    metrics = misc.getmetrics()
> +    subtest:is(metrics.jit_trace_num, 0)
> +
> +    jit.on()
> +end)
> +
> +os.exit(test:check() and 0 or 1)
> -- 
> 2.28.0
> 

[1]: https://github.com/tarantool/luajit/blob/tarantool/test/utils.lua

-- 
Best regards,
IM