[Tarantool-discussions] [RFC] rfc: describe a LuaJIT memory profiler

Sergey Kaplun skaplun at tarantool.org
Mon Nov 16 08:52:44 MSK 2020


Igor, thanks for your comments and for the review!

On 03.11.20, Igor Munkin wrote:
> Sergey,
> 
> Thanks for the RFC! Please consider my comments related to the design,
> terminology and common sense.
> 
> On 27.10.20, Sergey Kaplun wrote:
> > Part of #5442
> > ---
> > It would be important for me if Vlad, Kirill and Sasha tell their
> > thoughts about integration of memory profile with Tarantool (see last
> > one part). Where and how should we write profiling data?
> > 
> > I add benchmarks later with basic PoC (in Tarantool) and valuable examples.
> > 
> > Branch: https://github.com/tarantool/tarantool/tree/skaplun/gh-5442-luajit-memory-profiler
> > 
> > Also you can see a draft of profiler implementation in this [1] branch.
> > 
> > I have several questions to ask:
> > Should we extend profiler options with these two field:
> >  * `buffer_size` -- to determine size of buffer for profiling data
> >    instead of spikes it.
> 
> As Sergos already proposed offline, it's better to provide buffer by
> host application (i.e. Tarantool) and use a default internal one if no
> buffer is passed. I suggest to try this at first.

I suggest two possible solutions:

1) Change the writer interface -- it should report the new buffer start
and its length after the call. The return value still represents how
much was written.

| typedef size_t (*ljp_writer)(const char **data, size_t *len, void *opt);
instead of
| typedef size_t (*ljp_writer)(const void *data, size_t len, void *opt);

2) Add an additional interface to allocate a new buffer each time:
| typedef void *(*ljp_alloc)(size_t len, void *opt);
It will be invoked after each corresponding write.
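
For example, a host-side writer for the first variant could look roughly
like this (a minimal sketch; `host_ctx`, the flush-to-file behaviour and
the buffer handover semantics are my assumptions, not part of the
proposal):

| #include <stdio.h>
|
| /* Hypothetical host-side context: a file to flush to and the buffer
| ** the host hands to the profiler. */
| struct host_ctx {
|   FILE *fp;
|   char *buf;
|   size_t buflen;
| };
|
| /* Writer for the first variant: flush the data, then report the
| ** buffer to be used for the next portion of events. */
| static size_t host_writer(const char **data, size_t *len, void *opt)
| {
|   struct host_ctx *ctx = opt;
|   size_t written = fwrite(*data, 1, *len, ctx->fp);
|   *data = ctx->buf;   /* Next buffer start... */
|   *len = ctx->buflen; /* ...and its length. */
|   return written;
| }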

I suppose we should provide benchmarks for the current solution first.
Thoughts?

> 
> >  * `sample_size` -- to control frequency of dump of profile events
> >    like Go memory profiler does.
> 
> Does this option control how often the collected data is flushed from
> the buffer? Otherwise I have no idea what this option does.

This option controls the frequency of data collection. In other words,
the profiler does not report every event, but only the events that
happen after each `sample_size` bytes of allocations/deallocations. The
only exception is that every new function object allocation event is
reported anyway: the parsing utility needs it to be able to attribute
later allocation events to that function. So not every allocation is
counted, only one per N KB of allocation traffic, which makes sampling
very cheap and suitable for production. On the other hand, the collected
data may be incomplete.
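
Roughly, the sampling check inside the allocator wrapper could look like
this (an illustrative sketch; the structure and function names are made
up):

| #include <stddef.h>
|
| /* Illustrative sampling state (not the actual profiler structures). */
| struct memprof_sample {
|   size_t acc;          /* Bytes accumulated since the last reported event. */
|   size_t sample_size;  /* Report one event per this many bytes. */
| };
|
| /* Decide whether the current allocation event is reported. */
| static int memprof_should_report(struct memprof_sample *s, size_t nsize,
|                                  int is_funcobj)
| {
|   if (is_funcobj)
|     return 1;  /* New function objects are always reported. */
|   s->acc += nsize;
|   if (s->acc < s->sample_size)
|     return 0;
|   s->acc -= s->sample_size;
|   return 1;
| }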

Obviously, this will affect the OVERRIDES section -- it becomes useless
and can be dropped.

> 
> > 
> > And also I have some questions about additional LuaJIT memory profiler
> > features (see below):
> > 
> > Should we give opportunity for users to use more frequent time interval
> > like 1 millisecond instead one second?
> > 
> > Information in dump of Lua universe is not so verbose as information
> > about allocations and etc. It is required to know information about
> > BCPos and function frame, where object was declared. Is it correct to say
> > that each alive object (that will not garbage collected soon) is upvalue
> > for some function or belongs to currently executed function (by himself
> > or by chain of references) or is a function by itself?
> 
> Strictly saying we don't know prior to atomic phase end whether the
> object is alive. Anyway, it looks like weak tables are not fit the
> criteria you wrote above, doesn't it?

Yes, indeed: weak table keys may still be reachable via pairs(), but
already be dead.

> 

<snipped>

> > +
> > +### Prerequisites
> > +
> > +This section describes additional changes in LuaJIT required to feature
> > +implementation.
> > +
> > +There are two different representation of functions in LuaJIT: the function's
> > +prototype (`GCproto`) and the function's closure (`GCfunc`). The closures are
> 
> IMHO, it's better call it "function object".

It really sounds better. I just used the terms from <lj_func.c>. Roberto
Ierusalimschy has also used this term since Lua 5.0 (see also [1]). I
suppose that we can introduce this term in the RFC as "function object,
also called closure".

> 
> > +represented as `GCfuncL` and `GCfuncC` for Lua and C closures correspondingly.
> > +Also LuaJIT has special function's type aka Fast Function. It is used for
> > +specially optimized standard Lua libraries functions.
> 
> Strictly saying this is wrong (or at least quite inaccurate). Fast
> functions are just LuaJIT builtins representing Lua standart library
> interfaces. Yes, most of them are implemented in an odd way, but there
> are interfaces implemented in pure Lua or implemented in plain C
> (omitting the buildvm preprocessing peculiarities). Yes, it's better to
> use more special term, but you need to introduce the term the right way
> prior to using it later.

OK, so "LuaJIT builtins" is more universal. I will omit the part about
Lua standard library interfaces as we have our own extensions.

> 
> > +
> > +For Lua functions for profile events we had to determine line number of the
> > +function definition and corresponding `GCproto` address. For C functions only
> > +address will be enough. If Fast function is called from Lua function we had to
> > +report the Lua function for more meaningful output. Otherwise report the C
> > +function.
> 
> Here we go. The last passage is too complex for understanding even for
> me. As we finally discussed offline user is not interested in profiling
> Lua builtins per se, so it's better to report the function calling these
> builtins instead. However, does it complicates the investigation for
> tail calls? What is reported for the following code?
> | local function append(str, rep) return string.rep(str, rep) end
> | for _ = 1, 1e5 do print(append('q', _)) end

Consider this chunk:

| 1  jit.off()
| 2  misc.memprof.start("memprof_new.bin")
| 3  local function append(str, rep)
| 4      return string.rep(str, rep)
| 5  end
| 6
| 7  local t = {}
| 8  for _ = 1, 1e5 do
| 9      table.insert(t,
| 10         append('q', _)
| 11     )
| 12 end
| 13 misc.memprof.stop()

The profiler's output looks as follows:
| @newtest.lua:0, line 10: 100007 5004638934      0
| @newtest.lua:0, line 5: 1       40      0
| @newtest.lua:0, line 7: 1       72      0
| @newtest.lua:0, line 9: 1       48      0
| 
| REALLOCATIONS
| @newtest.lua:0, line 9: 16      4194496 2097376
|         Overrides:
|                 @newtest.lua:0, line 9
| 
| @newtest.lua:0, line 10: 12     262080  131040
|         Overrides:
|                 @newtest.lua:0, line 10
| 
| 
| DEALLOCATIONS
| INTERNAL: 21    0       2463
| @newtest.lua:0, line 10: 8      0       1044480
|         Overrides:
|                 @newtest.lua:0, line 10

I'll add this example to the RFC.

> 
> Anyway, the case with TCO should be well-described.

Agreed. With TCO for a fast function, its allocation events always
belong to the previous frame.

> 
> > +
> > +So we need to know in what type of function CALL/RETURN virtual machine (VM)
> > +is. LuaJIT has already determined C function execution VM state but neither
> > +Fast functions nor Lua function. So corresponding VM states will be added.
> 
> Considering this change luajit-gdb.py should be adjusted.

Yes, sure!

> 
> > +
> > +To determine currently allocating coroutine (that may not be equal to currently
> > +executed) new field will be added to `global_State` structure named `mem_L`
> > +kept coroutine address.
> 
> What is the issue <mem_L> solves? Please mention it right here.

As a simple example:
| lua_State *L = luaL_newstate();
| lua_State *L1 = lua_newthread(L);
| lua_pushstring(L1, "OOPS");

If we use the Lua C API we can create objects not related to `cur_L`.
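
To handle this, the allocator wrapper would attribute the event to
`mem_L` rather than to the currently executed coroutine. Roughly (a
sketch only; `memprof_state`, `memprof_record_event` and storing `mem_L`
as a plain pointer are my simplifications):

| /* Assumes LuaJIT internal headers (lj_obj.h); all names below are
| ** illustrative, not the actual implementation. */
| struct memprof_state {
|   global_State *g;     /* VM being profiled; g->mem_L is the proposed field. */
|   lua_Alloc oldallocf; /* Original allocator to delegate to. */
|   void *oldud;
| };
|
| static void memprof_record_event(struct memprof_state *mp, lua_State *L,
|                                  void *ptr, size_t osize, size_t nsize);
|
| static void *memprof_allocf(void *ud, void *ptr, size_t osize, size_t nsize)
| {
|   struct memprof_state *mp = ud;
|   /* The coroutine owning the new object: with the Lua C API it may
|   ** differ from the currently executed one. */
|   lua_State *L = mp->g->mem_L;
|   memprof_record_event(mp, L, ptr, osize, nsize);
|   return mp->oldallocf(mp->oldud, ptr, osize, nsize);
| }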

> 
> > +
> > +There is the static function (`lj_debug_getframeline`) returned line number for
> > +current `BCPos` in `lj_debug.c` already. It will be added to debug module API
> > +to be used in memory profiler.
> > +
> > +### Information recording
> > +
> > +Each allocate/reallocate/free is considered as a type of event that are
> > +reported. Event stream has the following format:
> > +
> > +```c
> > +/*
> > +** Event stream format:
> > +**
> > +** stream         := symtab memprof
> > +** symtab         := see <ljp_symtab.h>
> > +** memprof        := prologue event* epilogue
> > +** prologue       := 'l' 'j' 'm' version reserved
> > +** version        := <BYTE>
> > +** reserved       := <BYTE> <BYTE> <BYTE>
> > +** prof-id        := <ULEB128>
> 
> This rule is excess, isn't it?

Yes, I think so.

> 
> > +** event          := event-alloc | event-realloc | event-free

<snipped>

> > +
> > +It iss enough to know the address of LUA/C function to determine it. Symbolic

Ouch, will fix it. (s/iss/is/)

> > +table (symtab) dumps at start of profiling to avoid determine and write line
> > +number of Lua code and corresponding chunk of code each time, when memory event
> > +happens. Each line contains the address, Lua chunk definition as filename and
> > +line number of the function's declaration. This symbols table has the following
> > +format described at
> > +<ljp_symtab.h>:
> > +
> > +```c
> > +/*
> > +** symtab format:
> > +**
> > +** symtab         := prologue sym*
> > +** prologue       := 'l' 'j' 's' version reserved vm-addr
> > +** version        := <BYTE>
> > +** reserved       := <BYTE> <BYTE> <BYTE>
> > +** vm-addr        := <ULEB128>
> 
> What does this rule represent?

The address of the current VM. I suppose it is excess for now.

> 
> > +** sym            := sym-lua | sym-final
> > +** sym-lua        := sym-header sym-addr sym-chunk sym-line
> > +** sym-header     := <BYTE>
> > +** sym-addr       := <ULEB128>
> > +** sym-chunk      := string
> > +** sym-line       := <ULEB128>
> > +** sym-final      := sym-header
> > +** string         := string-len string-payload
> > +** string-len     := <ULEB128>
> > +** string-payload := <BYTE> {string-len}
> > +**
> > +** <BYTE>   :  A single byte (no surprises here)
> > +** <ULEB128>:  Unsigned integer represented in ULEB128 encoding
> > +**
> > +** (Order of bits below is hi -> lo)
> > +**
> > +** version: [VVVVVVVV]
> > +**  * VVVVVVVV: Byte interpreted as a plain numeric version number
> > +**
> > +** sym-header: [FUUUUUTT]
> > +**  * TT    : 2 bits for representing symbol type
> > +**  * UUUUU : 5 unused bits
> > +**  * F     : 1 bit marking the end of the symtab (final symbol)
> > +*/
> > +```
> > +
> > +As you can see the most part of data is saved in
> > +[ULEB128](https://en.wikipedia.org/wiki/LEB128) format.
> 
> How the sentence below comes from the sentence above? AFAIR, ULEB is
> used for numeric values "compression". That's all.

OK, I'll reformulate it in a similar manner.

> 
> > +So when memory profiling starts default allocation function is replaced by the
> > +new allocation function as additional wrapper to write inspected profiling
> > +events. When profiler stops old allocation function is substituted back.
> > +
> > +Extended functions to control profiler are added to <lmisclib.h>.
> > +Profiler is configured by this options structure:
> > +
> > +```c
> > +/* Profiler options. */
> > +struct luam_Prof_options {
> > +  /* Options for the profile writer and final callback. */
> > +  void *arg;
> > +  /*
> > +  ** Writer function for profile events.
> > +  ** Should return amount of written bytes on success or zero in case of error.
> > +  */
> > +  size_t (*writer)(const void *data, size_t len, void *arg);
> > +  /*
> > +  ** Callback on profiler stopping. Required for correctly cleaning
> > +  ** at vm shoutdown when profiler still running.
> > +  ** Returns zero on success.
> > +  */
> > +  int (*on_stop)(void *arg);
> > +};
> > +```
> 
> Well, maybe it's better to introduce a special interface to fill this
> struct? Something similar to luaE_coveragestart_cb[1]. As a result the
> structure is encapsulated in LuaJIT, that looks more convenient for the
> further maintenance.

Yes, but on the other hand, for each profiler we would have to create N
additional interfaces for starting it.

> 
> > +
> > +This options are saved inside the profiler till it is running. Argument (`arg`)
> > +for them will be passed to all writers and callbacks to be used as a context of
> 
> So, let's call it <ctx> then.

OK.

> 
> > +profiler. `on_stop` callback determines destroyer for `arg` that is called when
> > +profiler shutdowns. `writer` function copies data from internal profiler buffer
> > +wherever you want. Data can be written to the file, sending to the pipe or
> > +socket, writing to memory to be parsed later. If the writer function returns
> > +zero, profiler takes it as a error, remembers it and will notice about that as
> > +it will stopped. Otherwise returned value described how many bytes was
> > +successfully written as intended.
> > +
> > +Profiler is started by with:
> > +```c
> > +/*
> > +** Starts profiling. Returns LUAM_PROFILE_SUCCESS on success and one of
> > +** LUAM_PROFILE_ERR* codes otherwise. Destroyer is not called in case of
> 
> I guess it's called destructor, *but* in terms you introduced above this
> is on stop handler.

Yes, thanks.

> 
> > +** LUAM_PROFILE_ERR*.
> > +*/
> > +LUAMISC_API int luaM_memprof_start(lua_State *L,
> > +				   const struct luam_Prof_options *opt);
> > +```
> > +
> > +Profiler may fail to start with one of the `LUAM_PROFILE_ERR*` statuses that
> > +declares as:
> > +```c
> > +/* Profiler public API. */
> > +#define LUAM_PROFILE_SUCCESS 0
> > +#define LUAM_PROFILE_ERR     1
> > +#define LUAM_PROFILE_ERRMEM  2
> > +#define LUAM_PROFILE_ERRIO   3
> > +```
> > +
> > +This may happen in several cases as when you call profiler with bad arguments
> > +(`writer` or `on_stop` callback is undefined) or there is no enough memory to
> > +allocate corresponding buffer or profiler is already running or it fails to
> > +write the symtab or the prologue at start.
> 
> What are the particular errors the platform yields when the cases
> mentioned above occur?

OK, I'll rewrite it more clearly.

> 
> > +As written above memory profiler does not call the destroyer for your argument.
> > +If it is failed to start -- it should be done by a caller.
> 
> This looks inconsistent. What are the cases when this callback is not
> invoked by LuaJIT itself?

Hmm, I've realized that it really can be done inside the start function
if we call the `on_stop` callback from the options there.

> 
> > +
> > +As soon as you want to finish profiling, call the corresponding function:
> > +
> > +```c
> > +/*
> > +** Stops profiling. Returns LUAM_PROFILE_SUCCESS on success and one of
> > +** LUAM_PROFILE_ERR* codes otherwise. If writer() function returns zero
> > +** on call at buffer flush, or on_stop() callback returns non-zero
> > +** value, returns LUAM_PROFILE_ERRIO.
> > +*/
> > +LUAMISC_API int luaM_memprof_stop(const lua_State *L);
> > +```
> > +
> > +This function flushes all data from the internal buffer to writer, sends final
> > +event header and returns back the substituted allocation function if profiling
> > +was started from the same Lua state. Otherwise does nothing and returns
> > +`LUAM_PROFILE_ERR`. Also this error returns if profiler is not running at the
> > +moment of call. `LUAM_PROFILE_ERRIO` returns if there were some problems at
> > +buffer write or `on_stop` callback has returned bad status.
> > +
> > +And the last one function that checks is profiler running or not:
> > +
> > +```c
> > +/* Check that profiler is running. */
> > +LUAMISC_API bool luaM_memprof_isrunning(void);
> 
> Hm, I doubt it is great idea to use bool types in plain C, at least
> without explicitly specifying the standard revision. Let's stay with
> a plain <int> then.

OK, no problem.

> 
> > +```
> > +
> > +If you want to build LuaJIT without memory profiler you should build it with
> > +`-DLUAJIT_DISABLE_MEMPROF`. If it is disabled `luaM_memprof_stop()` and
> > +`luaM_memprof_start` always returns `LUAM_PROFILE_ERR` and
> > +`luaM_memprof_isrunning()` always returns `false`.
> > +
> > +Memory profiler is expected to be thread safe, so it has corresponding
> 
> What's the issue that requires memprof to be thread safe?

As you can see, we have a global variable that keeps information about
the VM. You can use multiple VMs in multiple threads, but only profile
one at a time.
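
The guard itself could be as simple as this (an illustrative sketch; all
names are assumptions):

| #include <pthread.h>
|
| static pthread_mutex_t memprof_mutex = PTHREAD_MUTEX_INITIALIZER;
| static int memprof_running = 0;
|
| /* Called at the beginning of luaM_memprof_start(); returns 0 if
| ** another VM (possibly in another thread) is already being profiled. */
| static int memprof_try_acquire(void)
| {
|   int acquired = 0;
|   pthread_mutex_lock(&memprof_mutex);
|   if (!memprof_running) {
|     memprof_running = 1;
|     acquired = 1;
|   }
|   pthread_mutex_unlock(&memprof_mutex);
|   return acquired;
| }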

> 
> > +lock/unlock at internal mutex whenever you call `luaM_memprof_*`. If you want
> > +to build LuaJIT without thread safety use `-DLUAJIT_DISABLE_THREAD_SAFE`.
> > +
> > +There are also complements introduced for Lua space in builtin `misc` library.
> > +Starting profiler from Lua is quite simple:
> > +
> > +```lua
> > +local started = misc.memprof.start(fname)
> > +```
> > +Where `fname` is name of the file where profile events are written. This
> > +function is just a wrapper to `luaM_memprof_start()`. Writer for this function
> > +perform `fwrite()` for each call. Final callback calls `fclose()` at the end of
> > +profiling. If it is impossible to open a file for writing or
> > +`luaM_memprof_start()` returns error status the function returns `false` value.
> 
> What about the error itself? It would be nice to see the exact reason
> why memprof is failed to start.

We can additionally push the error message here (what `perror()` would
print). Thoughts?
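
For example, the C implementation of `misc.memprof.start()` could push
the message on failure roughly like this (a sketch; names and structure
are assumptions):

| #include <errno.h>
| #include <string.h>
| /* Assumes <lua.h> and <lmisclib.h>. */
|
| static int memprof_start_lua(lua_State *L)
| {
|   struct luam_Prof_options opt = {0};
|   /* ...fill opt with the file writer and the on_stop callback... */
|   if (luaM_memprof_start(L, &opt) != LUAM_PROFILE_SUCCESS) {
|     lua_pushboolean(L, 0);
|     lua_pushstring(L, strerror(errno)); /* E.g. the reason of a failed fopen(). */
|     return 2;
|   }
|   lua_pushboolean(L, 1);
|   return 1;
| }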

> 
> > +Otherwise returns `true`.
> > +
> > +For stopping or checking that profile is running from Lua space use
> > +`misc.memprof.stop()` and `misc.memprof.is_running()` correspondingly.
> > +
> > +If there is any error occurred at profiling stopping (bad returned status from
> > +`luaM_memprof_stop()` or an error when file descriptor was closed)
> > +`memprof.stop()` returns `false`. Returns `true` otherwise.
> 
> Ditto.

Ditto.

> 
> > +
> > +### Reading and displaying saved data
> > +

<snipped>

> > +ALLOCATIONS
> 
> It would be nice to paste this test_memprof.lua chunk alongside with
> report, so we can match the code with the memprof report.

OK, but I should rewrite it a little bit.

> 
> > + at test_memprof.lua:0, line 10: 5	1248	0
> > +...
> > +INTERNAL: 2511	203000	0
> > +...
> > + at test_memprof.lua:0, line 8: 455	25646897	0
> > + at test_memprof.lua:0, line 14: 260	64698	0
> > +...
> > + at test_memprof.lua:0, line 6: 1	72	0
> > +
> > +REALLOCATIONS
> > +INTERNAL: 43	3556160	7100608
> > +	Overrides:
> 
> What does this "Overrides" attribute mean?

It shows which allocation this reallocation overrides.

> 
> > +		...

<snipped>

> > +```
> > +
> > +Where `INTERNAL` means that this allocations are caused by internal LuaJIT
> > +structures. Note that events are sorted from the most often to the least.
> 
> I see not a word about allocations on traces. How are they reported?

I mentioned it above, but not here... I'll add it to the RFC.

> 
> > +
> > +### Additional features
> > +
> > +This part describes two features: approximately time of profiling execution and
> > +full dumping of Lua objects information.
> > +
> > +#### Duration of memory profiler running
> 
> Well, I can hardly imagine this feature usage now. At least, it can be
> implemented via a simple fiber starting the profiler, yielding the
> execution for some time and stopping the profiler. So, let's simply omit
> this part, if nobody minds.

OK.

> 
> > +

<snipped>

> > +
> > +#### Dump of Lua universe
> > +
> > +This feature allows you to dump information about all Lua objects inside Lua
> > +universe to analyze it later. It is possible to determine object dependencies,
> > +find top of the most/least huge objects, see upvalues for different functions
> > +and so on. This function uses format like symtab stream but with more objects
> 
> The format is the same or the similar to memprof?

Not the same. I described it below.

> 
> > +in dump. It allows not to run full GC cycle, but traverse all object like the
> > +full propagate phase does.
> 
> I don't get how "full GC cycle" relates to full dump.

We have discussed it offline. Sergos suggests reporting the
corresponding information from inside the GC. I'll remove this mention
to avoid confusion.

> 
> > +
> > +For C API dump function declared as:
> > +```c
> > +/*
> > +** Traverse and dump Lua universe. Returns LUAM_PROFILE_SUCCESS on success and
> > +** one of LUAM_PROFILE_ERR* codes otherwise. Destroyer is not called at the
> > +** end.
> > +*/
> > +LUAMISC_API int luaM_memprof_dump(lua_State *L,
> > +				   const struct luam_Prof_options *opt);
> > +```
> > +
> > +This function does not yield. It dumps full information and only then stops.
> > +It ignores `dursec` and `signo` fields in `struct luam_Prof_options`.
> > +`on_stop` callback is ignored too, resources should clear manually by a caller.
> 
> I don't get why on_stop is ignored for this case. If the panic occurs,
> the auxiliary resources are left unreleased?

All necessary resources are checked on start (buffer creation and so
on). A panic can't occur here, can it? Please correct me if I'm mistaken
about that.

> 
> > +
> > +For Lua space this function provided by
> > +```lua
> > +local res = misc.memprof.dump(filename)
> > +```
> > +
> > +It returns `true` on success, `false` if an error is occurred.
> > +
> > +Parsing utility is extended by additional option `--dump`:
> > +
> > +```
> > +  --dump                            Parse dump of whole Lua universe
> > +```
> > +
> > +It has the following output format:
> > +```
> > +===== MEMORY SNAPSHOT
> > +@<object type> <object addr> holds <amount of bytes> bytes and linked with
> > +	methatable => LJ_TTAB <metatable address> or NULL
> > +	<link format>
> > +```
> > +
> > +There are several object types with non-zero memory usage:
> > +- `LJ_TSTR`
> > +- `LJ_TUPVAL`
> > +- `LJ_TTHREAD`
> > +- `LJ_TPROTO`
> > +- `LJ_TFUNC`
> > +- `LJ_TTRACE`
> > +- `LJ_TCDATA`
> > +- `LJ_TTAB`
> > +- `LJ_TUDATA`
> > +
> > +As far as `LJ_TSTR`, `LJ_TUDATA` and `LJ_TCDATA` do not reference to any other
> > +objects they have empty link format.
> 
> What about functions, prototypes, coroutines and traces?

Functions reference their upvalues and, if the function is a Lua
function, its prototype.
Prototypes reference their own names and collectable constants.
Coroutines reference their stack slots and environment.
Traces reference other traces (for example, side traces).

> 
> > +
> > +Linked list describes dependencies for referenced objects. If referencer type
> > +is `LJ_TTAB` link format looks like:
> > +```
> > +===== ARRAY PART
> > +	[<array field number>] => <object type> <object addr> > +	...
> > +===== HASH PART
> > +	<object type> <object addr> => <object type> <object addr>
> > +```
> > +
> > +And finally all other values contain a list of reference to other objects:
> > +```
> > +	<object type> <object addr>
> > +```
> > +
> > +### Tarantool integration
> > +

<snipped>

> > +original functions.
> 
> I believe we also need benchmarks for both approaches: this one and
> another using the introduced pure Lua interfaces with synchronous writes
> to the file. Otherwise it is only a priori estimates.

I am working on that :)

> 
> > -- 
> > 2.28.0
> > 
> 
> [1]: https://github.com/luavela/luavela/blob/master/src/lextlib.h#L173L175
> 
> -- 
> Best regards,
> IM

[1]: https://www.lua.org/doc/jucs05.pdf


-- 
Best regards,
Sergey Kaplun

