[Tarantool-patches] [RFC] rfc: luajit metrics

Fri Jul 24 14:18:29 MSK 2020

Hi! Thanks for the review!

Mons makes no bones of mailing list so I cite his comments from
[1] here.

On 23.07.20, Sergey Ostanevich wrote:
> Hi!
> 
> Thanks for the patch! See my comments below.
> Otherwise LGTM.
> 
> Sergos
> 
> On 21 Jul 14:37, Sergey Kaplun wrote:
> > Part of #5187
> > ---
> > 
> > This patch adds RFC to LuaJIT metrics interfaces. Nevertheless name
> > `misc` for builtin library is not good and should be discussed, because
> > tons of user modules can use that name for their own libraies.
> > 
> > Branch: https://github.com/tarantool/tarantool/tree/skaplun/5187-luajit-metrics
> > Issue: https://github.com/tarantool/tarantool/issues/5187
> > 
> >  doc/rfc/5187-luajit-metrics.md | 134 +++++++++++++++++++++++++++++++++
> >  1 file changed, 134 insertions(+)
> >  create mode 100644 doc/rfc/5187-luajit-metrics.md
> > 
> > diff --git a/doc/rfc/5187-luajit-metrics.md b/doc/rfc/5187-luajit-metrics.md
> > new file mode 100644
> > index 000000000..33a8bf8e0
> > --- /dev/null
> > +++ b/doc/rfc/5187-luajit-metrics.md
> > @@ -0,0 +1,134 @@
> > +# LuaJIT metrics
> > +
> > +* **Status**: In progress
> > +* **Start date**: 17-07-2020
> > +* **Authors**: Sergey Kaplun @Buristan skaplun at tarantool.org,
> > +               Igor Munkin @igormunkin imun at tarantool.org,
> > +               Sergey Ostanevich @sergos sergos at tarantool.org
> > +* **Issues**: [#5187](https://github.com/tarantool/tarantool/issues/5187)
> > +
> > +## Summary
> > +
> > +LuaJIT metrics provide extra information about the Lua state. They consists of
> > +GC metrics (overall amount of objects and memory usage), JIT stats (both
> > +related to the compiled traces and the engine itself), string hash hits/misses.
> > +
> > +## Background and motivation
> > +
> > +One can be curious about their application performance. We are going to provide
> > +various metrics about the several platform subsystems behaviour. GC pressure
> > +produced by user code can weight down all application performance. Irrelevant
> > +traces compiled by the JIT engine can just burn CPU time with no benefits as a
> > +result. String hash collisions can lead to DoS caused by a single request. All
> > +these metrics should be well monitored by users wanting to improve the
> > +performance of their application.
> > +
> > +## Detailed design
> > +
> > +For C API we introduce additional extension header <lmisclib.h> that provides
> > +interfaces for new LuaJIT C API extensions. The first interface in this header
> > +will be the following:
> > +
> > +```
> > +/* API for obtaining various metrics from the platform. */
> > +
> > +LUAM_API struct luam_Metrics *luaM_metrics(lua_State *L,
> > +					   struct luam_Metrics *dst);
> > +```
> > +
> > +This function fills the structure pointed by `dst` with the corresponding
> > +metrics related to Lua state anchored to the given coroutine `L`. The result of
> > +the function is a pointer to the filled structure (the same `dst` points to).
> > +
> > +`struct luam_Metrics` has the following definition:
> 
>    ^ The

Thanks!

> > +
> > +```
> > +struct luam_Metrics {
> > +	/*
> > +	 * New string has been found in the storage since last
> 
> The 'new string found' is misleading IMO. 
> Number of strings found in string hash instead of allocation of new
> ones since ...

Ok, thanks!

> > +	 * luaM_metrics() call.
> > +	 */
> > +	size_t strhash_hit;
> > +	/*
> > +	 * New string has been added to the storage since last
> 
> Number of new strings allocated and put into string hash since ...

Sure!

> > +	 * luaM_metrics() call.
> > +	 */
> > +	size_t strhash_miss;
> > +
> > +	size_t strnum;   /* Current amount of string objects. */
> > +	size_t tabnum;   /* Current amount of table objects. */
> > +	size_t udatanum; /* Current amount of udata objects. */
> > +	size_t cdatanum; /* Current amount of cdata objects. */
> > +
> > +	/* Memory currently allocated. */
> > +	size_t gc_total;
> > +	/* Memory freed since last luaM_metrics() call. */
> > +	size_t gc_freed;
> > +	/* Memory allocated since last luaM_metrics() call. */
> > +	size_t gc_allocated;
> > +
> > +	/*
> > +	 * Count of GC invocations with different states
> > +	 * since previous call of luaM_metrics().
> > +	 */
> > +	size_t gc_steps_pause;
> > +	size_t gc_steps_propagate;
> > +	size_t gc_steps_atomic;
> > +	size_t gc_steps_sweepstring;
> > +	size_t gc_steps_sweep;
> > +	size_t gc_steps_finalize;
> > +
> > +	/*
> > +	 * Overall number of snap restores for all traces
> 
> Snap restores needs some explanation.

Ok, sounds reasonable. AFAIK number of snap restores equals to amount of
trace exits. May be this will be more informative for users.

> > +	 * "belonging" to the given jit_State since the last call
> > +	 * to luaM_metrics().
> > +	 */
> > +	size_t jit_snap_restore;
> > +	/*
> > +	 * Overall number of abort traces "belonging" to the given
> > +	 * jit_State since the last call to luaM_metrics().
> > +	 */
> > +	size_t jit_trace_abort;
> > +	/* Total size of all allocated machine code areas. */
> > +	size_t jit_mcode_size;
> > +	/* Current amount of JIT traces. */
> > +	unsigned int jit_trace_num;
> > +};
> > +```
> > +
> > +One can see metrics are divided by the two types: global and incremental.
> > +Global metrics are collected throughout the platform uptime. Incremental
> > +metrics are reset after each `luaM_metrics()` call.

As Mons said:
| Incremental metrics must not reset after each call (see practice from
| Linux kernel, for ex.). Rationale is simple: 2 different modules, both
| using this function, will ruin metrics to each other

It's very important comment. With two modules that use metrics our
counters become just garbage. I suppose that we should use monotonic
increase counters for all "incremental" metrics. Users can count delta
between calls of `getmetrics` and analise a result.

We can use uint64_t for storage of these metrics. If we astimate amount
of allocated objects as 40Gb per second it will be necessary 14 years of
instance uptime to overflow this counter. Also all logic corresponding
to overflow of the counter can be realized inside user code.

> > +
> > +There is also a complement introduced for Lua space -- `misc.getmetrics()`.
> > +This function is just a wrapper for `luaM_metrics()` returning a Lua table with
> > +the similar metrics. Its usage is quite simple:
> > +```
> > +$ ./src/tarantool
> > +Tarantool 2.5.0-267-gbf047ad44
> > +type 'help' for interactive help
> > +tarantool> misc.getmetrics()
> > +---
> > +- global:
> > +    tabnum: 1812
> > +    gc_total: 1369927
> > +    strnum: 5767
> > +    jit_trace_num: 0
> > +    cdatanum: 89
> > +    jit_mcode_size: 0
> > +    udatanum: 17
> > +  incremental:
> 
> incremental_from_last_call will be more descriptive?

Mons:
| I'd consider rename global to absolute. All metrics are global (not
| scoped). But some are incremental counters and others are absolute
| current values.

And also as far as we mustn't use updatable by call metrics, should we join all
metrics inside one table, shouldn't we?

> > +    jit_snap_restore: 0
> > +    gc_freed: 2239391
> > +    strhash_hit: 53759
> > +    gc_steps_finalize: 0
> > +    gc_allocated: 3609318
> > +    gc_steps_atomic: 6
> > +    gc_steps_sweep: 296
> > +    gc_steps_sweepstring: 17920
> > +    jit_trace_abort: 0
> > +    strhash_miss: 6874
> 
> I wonder if those keys can be sorted? Consider, I want to find
> something by the name. Sorting helps greatly. 

It can be implement by user as simple as:

| local metrics = xjit.getmetrics()
| local sorted = {}
| for k, v in pairs(metrics.incremental) do table.insert(sorted, k) end
| table.sort(sorted)
| for _, k in ipairs(sorted) do print(k, metrics.incremental[k]) end
|
| --[[
| gc_allocated    8920
| gc_freed        3528
| gc_steps_atomic 0
| gc_steps_finalize       0
| gc_steps_pause  0
| gc_steps_propagate      62
| gc_steps_sweep  0
| gc_steps_sweepstring    0
| jit_snap_restore        0
| jit_trace_abort 0
| strhash_hit     118
| strhash_miss    7
| ]]

> > +    gc_steps_propagate: 10106
> > +    gc_steps_pause: 7
> > +...
> > +```
> > -- 
> > 2.24.1
> > 

Thoughts?

-- 
Best regards,
Sergey Kaplun