[Tarantool-discussions] [RFC luajit v2] rfc: describe a LuaJIT memory profiler

Sergey Ostanevich s.ostanevich at corp.mail.ru
Tue Dec 22 15:02:10 MSK 2020


Hi!

Thanks for the patch!

See some comments below, after being applied is LGTM.

Sergos

> On 16 Dec 2020, at 22:09, Sergey Kaplun <skaplun at tarantool.org> wrote:
> 
> Part of #5442
> ---
> 
> Issues: https://github.com/tarantool/tarantool/issues/5442
> Branch: https://github.com/tarantool/tarantool/tree/skaplun/gh-5442-luajit-memory-profiler
> 
> Changes in v2:
> * Removed C API, Tarantool integration and description of additional
>  features -- they will be added in another RFC if necessary.
> * Removed checking profile is running from the public API.
> * Added benchmarks and more meaningful example.
> * Grammar fixes.
> 
> doc/rfc/5442-luajit-memory-profiler.md | 306 +++++++++++++++++++++++++
> 1 file changed, 306 insertions(+)
> create mode 100644 doc/rfc/5442-luajit-memory-profiler.md
> 
> diff --git a/doc/rfc/5442-luajit-memory-profiler.md b/doc/rfc/5442-luajit-memory-profiler.md
> new file mode 100644
> index 000000000..720105009
> --- /dev/null
> +++ b/doc/rfc/5442-luajit-memory-profiler.md
> @@ -0,0 +1,306 @@
> +# LuaJIT memory profiler
> +
> +* **Status**: In progress
> +* **Start date**: 24-10-2020
> +* **Authors**: Sergey Kaplun @Buristan skaplun at tarantool.org,
> +               Igor Munkin @igormunkin imun at tarantool.org,
> +               Sergey Ostanevich @sergos sergos at tarantool.org
> +* **Issues**: [#5442](https://github.com/tarantool/tarantool/issues/5442)
> +
> +## Summary
> +
> +LuaJIT memory profiler is a toolchain for analysis of memory usage by user's
> +application.
> +
> +## Background and motivation
> +
> +Garbage collector (GC) is a curse of performance for most of Lua applications.

This is something that is very questionable. Do you have any data on performance
of Lua code against the GC code in real life? Is there any about Tarantool
applications?

I’m not questioning the mem profiler itself, rather trying to weigh the GC problem.

> +Memory usage of Lua application should be profiled to find out various
							 ^^^^^^^ locate

> +memory-unoptimized code blocks. If the application has memory leaks they can be
> +found with the profiler.
			   ^ also.
> +
> +## Detailed design
> +
> +The whole toolchain of memory profiling will be divided by several parts:
							  ^^^ into
> +1) Prerequisites.
> +2) Recording information about memory usage and saving it.
> +3) Reading saved data and display it in human-readable format.
> +
> +### Prerequisites
> +
> +This section describes additional changes in LuaJIT required to feature
								^^ for the 
> +implementation. This version of LuaJIT memory profiler does not support
> +reporting allocations from traces. But trace code semantics should be totally
> +the same as for Lua interpreter. So profiling with `jit.off()` should be
> +enough.
> +
> +There are two different representations of functions in LuaJIT: the function's
> +prototype (`GCproto`) and the function object so called closure (`GCfunc`).
> +The closures are represented as `GCfuncL` and `GCfuncC` for Lua and C closures
> +correspondingly. Also LuaJIT has special function's type aka Fast Function. It
                                   ^ a
> +is used for LuaJIT builtins.
> +
> +Fast function allocation events always belong to the previous frame with
> +considering of tail call optimizations (TCO).
> +
> +Assume we have the following Lua chunk named <test.lua>:
> +
> +```
> +1  jit.off()
> +2  misc.memprof.start("memprof_new.bin")
> +3  local function append(str, rep)
> +4      return string.rep(str, rep)
> +5  end
> +6
> +7  local t = {}
> +8  for _ = 1, 1e5 do
> +9      table.insert(t,
> +10         append('q', _)
> +11     )
> +12 end
> +13 misc.memprof.stop()
> +```
> +
> +Profilers output is like the follows:
> +```
> +ALLOCATIONS
> + at test.lua:0, line 10: 100007 5004638934      0
> + at test.lua:0, line 5: 1       40      0
> + at test.lua:0, line 7: 1       72      0
> + at test.lua:0, line 9: 1       48      0
> +
> +REALLOCATIONS
> + at test.lua:0, line 9: 16      4194496 2097376
> +        Overrides:
> +                @test.lua:0, line 9
> +
> + at test.lua:0, line 10: 12     262080  131040
> +        Overrides:
> +                @test.lua:0, line 10
> +
> +
> +DEALLOCATIONS
> +INTERNAL: 21    0       2463
> + at test.lua:0, line 10: 8      0       1044480
> +        Overrides:
> +                @test.lua:0, line 10
> +```
> +
> +In Lua functions for profile events, we had to determine the line number of the
> +function definition and corresponding `GCproto` address. For C functions only
> +address will be enough. If Fast function is called from Lua function we had to
> +report the Lua function for more meaningful output. Otherwise report the C
> +function.
> +
> +So we need to know in what type of function CALL/RETURN virtual machine (VM)
                      ^^^^^^^ a                           ^ the 
> +is. LuaJIT has already determined C function execution VM state but neither
    ^ currently in. 

> +is. LuaJIT has already determined C function execution VM state but neither
      Currently VM state identifies C function execution only, so Fast and Lua
functions states will be added. 

> +Fast functions nor Lua function. So corresponding VM states will be added.
> +
> +To determine currently allocating coroutine (that may not be equal to currently
> +executed) new field will be added to `global_State` structure named `mem_L`
            ^ a

> +kept coroutine address. This field sets at each reallocation to corresponding
   ^^^ to keep the                    ^^^ is set

> +`L` with which it was called.
> +
> +There is the static function (`lj_debug_getframeline`) returned line number for
            ^^^ a                                         ^^^ that returns
> +current `BCPos` in `lj_debug.c` already. It will be added to the debug module
> +API to be used in memory profiler.
> +
> +### Information recording
> +
> +Each allocate/reallocate/free is considered as a type of event that are
> +reported. Event stream has the following format:
> +
> +```c
> +/*
> +** Event stream format:
> +**
> +** stream         := symtab memprof
> +** symtab         := see <ljp_symtab.h>
> +** memprof        := prologue event* epilogue
> +** prologue       := 'l' 'j' 'm' version reserved
> +** version        := <BYTE>
> +** reserved       := <BYTE> <BYTE> <BYTE>
> +** event          := event-alloc | event-realloc | event-free
> +** event-alloc    := event-header loc? naddr nsize
> +** event-realloc  := event-header loc? oaddr osize naddr nsize
> +** event-free     := event-header loc? oaddr osize
> +** event-header   := <BYTE>
> +** loc            := loc-lua | loc-c
> +** loc-lua        := sym-addr line-no
> +** loc-c          := sym-addr
> +** sym-addr       := <ULEB128>
> +** line-no        := <ULEB128>
> +** oaddr          := <ULEB128>
> +** naddr          := <ULEB128>
> +** osize          := <ULEB128>
> +** nsize          := <ULEB128>
> +** epilogue       := event-header
> +**
> +** <BYTE>   :  A single byte (no surprises here)
> +** <ULEB128>:  Unsigned integer represented in ULEB128 encoding
> +**
> +** (Order of bits below is hi -> lo)
> +**
> +** version: [VVVVVVVV]
> +**  * VVVVVVVV: Byte interpreted as a plain integer version number
> +**
> +** event-header: [FTUUSSEE]
> +**  * EE   : 2 bits for representing allocation event type (AEVENT_*)
> +**  * SS   : 2 bits for representing allocation source type (ASOURCE_*)
> +**  * UU   : 2 unused bits
> +**  * T    : Reserved. 0 for regular events, 1 for the events marked with
> +**           the timestamp mark. It is assumed that the time distance between

No representation of a timestamp so far? If so - maybe remove excessive description
and put ‘reserved for timestamp’.

> +**           two marked events is approximately the same and is equal
> +**           to 1 second. Always zero for now.
> +**  * F    : 0 for regular events, 1 for epilogue's *F*inal header
> +**           (if F is set to 1, all other bits are currently ignored)
> +*/
> +```
> +
> +It is enough to know the address of LUA/C function to determine it. Symbolic
> +table (symtab) dumps at start of profiling to avoid determine and write line
> +number of Lua code and corresponding chunk of code each time, when memory event
> +happens. Each line contains the address, Lua chunk definition as the filename
> +and line number of the function's declaration. This table of symbols has the
> +following format described at <ljp_symtab.h>:
> +
> +```c
> +/*
> +** symtab format:
> +**
> +** symtab         := prologue sym*
> +** prologue       := 'l' 'j' 's' version reserved
> +** version        := <BYTE>
> +** reserved       := <BYTE> <BYTE> <BYTE>
> +** sym            := sym-lua | sym-final
> +** sym-lua        := sym-header sym-addr sym-chunk sym-line
> +** sym-header     := <BYTE>
> +** sym-addr       := <ULEB128>
> +** sym-chunk      := string
> +** sym-line       := <ULEB128>
> +** sym-final      := sym-header
> +** string         := string-len string-payload
> +** string-len     := <ULEB128>
> +** string-payload := <BYTE> {string-len}
> +**
> +** <BYTE>   :  A single byte (no surprises here)
> +** <ULEB128>:  Unsigned integer represented in ULEB128 encoding
> +**
> +** (Order of bits below is hi -> lo)
> +**
> +** version: [VVVVVVVV]
> +**  * VVVVVVVV: Byte interpreted as a plain numeric version number
> +**
> +** sym-header: [FUUUUUTT]
> +**  * TT    : 2 bits for representing symbol type
> +**  * UUUUU : 5 unused bits
> +**  * F     : 1 bit marking the end of the symtab (final symbol)
> +*/
> +```
> +
> +So when memory profiling starts default allocation function is replaced by the
> +new allocation function as additional wrapper to write inspected profiling
> +events. When profiler stops old allocation function is substituted back.
> +
> +Starting profiler from Lua is quite simple:
> +```lua
> +local started, err = misc.memprof.start(fname)
> +```
> +Where `fname` is name of the file where profile events are written. Writer for
> +this function perform `fwrite()` for each call retrying in case of `EINTR`.
> +Final callback calls `fclose()` at the end of profiling. If it is impossible to
> +open a file for writing or profiler fails to start, returns `nil` on failure
> +(plus an error message as a second result and a system-dependent error code as
> +a third result). Otherwise returns some true value.
> +
> +Stopping profiler from Lua is simple too:
> +```lua
> +local stopped, err = misc.memprof.stop()
> +```
> +
> +If there is any error occurred at profiling stopping (an error when file
> +descriptor was closed) `memprof.stop()` returns `nil` (plus an error message as
> +a second result and a system-dependent error code as a third result). Returns

Why do you need to separate an ‘error’ from a ’system error’?

> +`true` otherwise.
> +
> +If you want to build LuaJIT without memory profiler, you should build it with
> +`-DLUAJIT_DISABLE_MEMPROF`. If it is disabled `misc.memprof.start()` and
> +`misc.memprof.stop()` always return `false`.
> +
> +Memory profiler is expected to be thread safe, so it has a corresponding
> +lock/unlock at internal mutex whenever you call `luaM_memprof_*`. If you want
> +to build LuaJIT without thread safety use `-DLUAJIT_DISABLE_THREAD_SAFE`.
> +
> +### Reading and displaying saved data
> +
> +Binary data can be read by `lj-parse-memprof` utility. It parses the binary
> +format provided from memory profiler and render it in human-readable format.
> +
> +The usage is very simple:
> +```
> +$ ./luajit-parse-memprof --help
> +luajit-parse-memprof - parser of the memory usage profile collected
> +                       with LuaJIT's memprof.
> +
> +SYNOPSIS
> +
> +luajit-parse-memprof [options] memprof.bin
> +
> +Supported options are:
> +
> +  --help                            Show this help and exit
> +```
> +
> +Plain text of profiled info has the following format:
> +```
> +@<filename>:<function_line>, line <line where event was detected>: <number of events>	<allocated>	<freed>
> +```
> +See example above.
> +
> +`INTERNAL` means that this allocations are caused by internal LuaJIT
> +structures. Note that events are sorted from the most often to the least.
> +
> +`Overrides` means what allocation this reallocation overrides.
> +
> +## Benchmarks
> +
> +Benchmarks were taken from repo:
> +[LuaJIT-test-cleanup](https://github.com/LuaJIT/LuaJIT-test-cleanup).
> +
> +Example of usage:
Either ‘usage’ or ‘example of use’


> +```bash
> +/usr/bin/time -f"array3d %U" ./luajit $BENCH_DIR/array3d.lua 300 >/dev/null
> +```
> +
> +Benchmark results before and after the patch (less is better):

Comparative to ‘BEFORE’ is more reafable, dispersion should also help with
reading of results.

> +
> +```
> +               | BEFORE | AFTER,memprof off | AFTER,memprof on
> +---------------+--------+-------------------+-----------------
> +array3d        |  0.22  |        0.20       |       0.21
> +binary-trees   |  3.32  |        3.33       |       3.94
> +chameneos      |  2.92  |        3.18       |       3.12
> +coroutine-ring |  0.99  |        1.00       |       0.99
> +euler14-bit    |  1.04  |        1.05       |       1.03
> +fannkuch       |  6.77  |        6.69       |       6.64
> +fasta          |  8.27  |        8.30       |       8.25
> +life           |  0.48  |        0.48       |       1.03
> +mandelbrot     |  2.69  |        2.70       |       2.75
> +mandelbrot-bit |  1.99  |        2.00       |       2.08
> +md5            |  1.57  |        1.61       |       1.56
> +nbody          |  1.35  |        1.38       |       1.33
> +nsieve         |  2.11  |        2.19       |       2.09
> +nsieve-bit     |  1.50  |        1.55       |       1.47
> +nsieve-bit-fp  |  4.40  |        4.63       |       4.44
> +partialsums    |  0.54  |        0.58       |       0.55
> +pidigits-nogmp |  3.48  |        3.50       |       3.47
> +ray            |  1.63  |        1.68       |       1.64
> +recursive-ack  |  0.19  |        0.22       |       0.20
> +recursive-fib  |  1.62  |        1.71       |       1.63
> +scimark-fft    |  5.78  |        5.94       |       5.69
> +scimark-lu     |  3.26  |        3.57       |       3.59
> +scimark-sor    |  2.34  |        2.35       |       2.33
> +scimark-sparse |  5.03  |        4.92       |       4.91
> +series         |  0.94  |        0.96       |       0.95
> +spectral-norm  |  0.96  |        0.96       |       0.95
> +```
> -- 
> 2.28.0
> 



More information about the Tarantool-discussions mailing list