[Tarantool-discussions] [RFC luajit v2] rfc: describe a LuaJIT memory profiler
Igor Munkin
imun at tarantool.org
Mon Dec 21 01:37:33 MSK 2020
Sergey,
Thanks for the new version! The design is fine, but please consider my
minor comments regarding the document itself.
On 16.12.20, Sergey Kaplun wrote:
> Part of #5442
> ---
>
> Issues: https://github.com/tarantool/tarantool/issues/5442
> Branch: https://github.com/tarantool/tarantool/tree/skaplun/gh-5442-luajit-memory-profiler
>
> Changes in v2:
> * Removed C API, Tarantool integration and description of additional
> features -- they will be added in another RFC if necessary.
> * Removed checking profile is running from the public API.
> * Added benchmarks and more meaningful example.
> * Grammar fixes.
>
> doc/rfc/5442-luajit-memory-profiler.md | 306 +++++++++++++++++++++++++
> 1 file changed, 306 insertions(+)
> create mode 100644 doc/rfc/5442-luajit-memory-profiler.md
>
> diff --git a/doc/rfc/5442-luajit-memory-profiler.md b/doc/rfc/5442-luajit-memory-profiler.md
> new file mode 100644
> index 000000000..720105009
> --- /dev/null
> +++ b/doc/rfc/5442-luajit-memory-profiler.md
> @@ -0,0 +1,306 @@
<snipped>
> +## Detailed design
> +
> +The whole toolchain of memory profiling will be divided by several parts:
> +1) Prerequisites.
> +2) Recording information about memory usage and saving it.
> +3) Reading saved data and display it in human-readable format.
I believe the links would also be great here.
> +
> +### Prerequisites
> +
> +This section describes additional changes in LuaJIT required to feature
> +implementation. This version of LuaJIT memory profiler does not support
> +reporting allocations from traces. But trace code semantics should be totally
> +the same as for Lua interpreter. So profiling with `jit.off()` should be
> +enough.
>From this part I guess nothing works with enabled JIT. As a result of
the offline discussion it occurs everything works but allocation on
traces are attributed as internal ones. I think this is not a big deal
since JIT is a black internal box. Please mention this explicitly above.
Minor: It's worth to mention that there might be less GC pressure while
trace execution (considering allocation sinking optimization impact),
but the traces itself are allocated in GC area.
> +
<snipped>
> +
> +Fast function allocation events always belong to the previous frame with
> +considering of tail call optimizations (TCO).
Still don't get it. In the previous version I asked to explicitly
mention that allocations in builtins are attributed with their caller.
And another case is a tail call optimization, when allocations are also
attributes with the caller.
So, the example is fine (but comments are desired, IMHO), but I suggest
to separate these cases in the passage above.
> +
> +Assume we have the following Lua chunk named <test.lua>:
> +
> +```
> +1 jit.off()
> +2 misc.memprof.start("memprof_new.bin")
> +3 local function append(str, rep)
> +4 return string.rep(str, rep)
> +5 end
> +6
> +7 local t = {}
> +8 for _ = 1, 1e5 do
> +9 table.insert(t,
> +10 append('q', _)
> +11 )
> +12 end
> +13 misc.memprof.stop()
> +```
> +
> +Profilers output is like the follows:
> +```
> +ALLOCATIONS
> + at test.lua:0, line 10: 100007 5004638934 0
> + at test.lua:0, line 5: 1 40 0
> + at test.lua:0, line 7: 1 72 0
> + at test.lua:0, line 9: 1 48 0
> +
> +REALLOCATIONS
> + at test.lua:0, line 9: 16 4194496 2097376
> + Overrides:
> + @test.lua:0, line 9
> +
> + at test.lua:0, line 10: 12 262080 131040
> + Overrides:
> + @test.lua:0, line 10
> +
> +
> +DEALLOCATIONS
> +INTERNAL: 21 0 2463
> + at test.lua:0, line 10: 8 0 1044480
> + Overrides:
> + @test.lua:0, line 10
> +```
I have no idea what is dumped here. There is a legend below (in reading
and displaying saved data section), so leave a link to it right here.
Otherwise this is some kind of elvish.
> +
> +In Lua functions for profile events, we had to determine the line number of the
> +function definition and corresponding `GCproto` address. For C functions only
> +address will be enough. If Fast function is called from Lua function we had to
> +report the Lua function for more meaningful output. Otherwise report the C
> +function.
Still don't understand two last sentences. In the previous reply you
told this relates to the part above (about the fast functions). You
reworded that part, but this is still left unclear.
> +
> +So we need to know in what type of function CALL/RETURN virtual machine (VM)
AFAICS the function type is enough, isn't it? As you mentioned above
allocation are attributed to the callers in case of tail call
optimization.
> +is. LuaJIT has already determined C function execution VM state but neither
> +Fast functions nor Lua function. So corresponding VM states will be added.
<snipped>
> +### Information recording
> +
> +Each allocate/reallocate/free is considered as a type of event that are
> +reported. Event stream has the following format:
> +
> +```c
> +/*
> +** Event stream format:
> +**
> +** stream := symtab memprof
> +** symtab := see <ljp_symtab.h>
> +** memprof := prologue event* epilogue
> +** prologue := 'l' 'j' 'm' version reserved
> +** version := <BYTE>
> +** reserved := <BYTE> <BYTE> <BYTE>
> +** event := event-alloc | event-realloc | event-free
> +** event-alloc := event-header loc? naddr nsize
> +** event-realloc := event-header loc? oaddr osize naddr nsize
> +** event-free := event-header loc? oaddr osize
> +** event-header := <BYTE>
> +** loc := loc-lua | loc-c
> +** loc-lua := sym-addr line-no
> +** loc-c := sym-addr
> +** sym-addr := <ULEB128>
> +** line-no := <ULEB128>
> +** oaddr := <ULEB128>
> +** naddr := <ULEB128>
> +** osize := <ULEB128>
> +** nsize := <ULEB128>
> +** epilogue := event-header
> +**
> +** <BYTE> : A single byte (no surprises here)
> +** <ULEB128>: Unsigned integer represented in ULEB128 encoding
> +**
> +** (Order of bits below is hi -> lo)
> +**
> +** version: [VVVVVVVV]
> +** * VVVVVVVV: Byte interpreted as a plain integer version number
> +**
> +** event-header: [FTUUSSEE]
> +** * EE : 2 bits for representing allocation event type (AEVENT_*)
> +** * SS : 2 bits for representing allocation source type (ASOURCE_*)
> +** * UU : 2 unused bits
> +** * T : Reserved. 0 for regular events, 1 for the events marked with
> +** the timestamp mark. It is assumed that the time distance between
> +** two marked events is approximately the same and is equal
> +** to 1 second. Always zero for now.
It looks this is zero always in our case, so we have 3 unused bits,
right? Then just drop this part.
> +** * F : 0 for regular events, 1 for epilogue's *F*inal header
> +** (if F is set to 1, all other bits are currently ignored)
> +*/
> +```
> +
> +It is enough to know the address of LUA/C function to determine it. Symbolic
> +table (symtab) dumps at start of profiling to avoid determine and write line
> +number of Lua code and corresponding chunk of code each time, when memory event
> +happens. Each line contains the address, Lua chunk definition as the filename
> +and line number of the function's declaration. This table of symbols has the
> +following format described at <ljp_symtab.h>:
> +
> +```c
> +/*
> +** symtab format:
> +**
> +** symtab := prologue sym*
> +** prologue := 'l' 'j' 's' version reserved
> +** version := <BYTE>
> +** reserved := <BYTE> <BYTE> <BYTE>
> +** sym := sym-lua | sym-final
> +** sym-lua := sym-header sym-addr sym-chunk sym-line
> +** sym-header := <BYTE>
> +** sym-addr := <ULEB128>
> +** sym-chunk := string
> +** sym-line := <ULEB128>
> +** sym-final := sym-header
> +** string := string-len string-payload
> +** string-len := <ULEB128>
> +** string-payload := <BYTE> {string-len}
> +**
> +** <BYTE> : A single byte (no surprises here)
> +** <ULEB128>: Unsigned integer represented in ULEB128 encoding
> +**
> +** (Order of bits below is hi -> lo)
> +**
> +** version: [VVVVVVVV]
> +** * VVVVVVVV: Byte interpreted as a plain numeric version number
> +**
> +** sym-header: [FUUUUUTT]
> +** * TT : 2 bits for representing symbol type
> +** * UUUUU : 5 unused bits
> +** * F : 1 bit marking the end of the symtab (final symbol)
> +*/
> +```
<snipped>
> +Starting profiler from Lua is quite simple:
> +```lua
> +local started, err = misc.memprof.start(fname)
> +```
> +Where `fname` is name of the file where profile events are written. Writer for
> +this function perform `fwrite()` for each call retrying in case of `EINTR`.
> +Final callback calls `fclose()` at the end of profiling. If it is impossible to
> +open a file for writing or profiler fails to start, returns `nil` on failure
> +(plus an error message as a second result and a system-dependent error code as
> +a third result). Otherwise returns some true value.
What the heck is the third result and why is it ignored in the example
above? I think this error should be incorporated in the second argument
kinda <perror> does.
> +
> +Stopping profiler from Lua is simple too:
> +```lua
> +local stopped, err = misc.memprof.stop()
> +```
> +
> +If there is any error occurred at profiling stopping (an error when file
> +descriptor was closed) `memprof.stop()` returns `nil` (plus an error message as
> +a second result and a system-dependent error code as a third result). Returns
> +`true` otherwise.
Ditto.
> +
<snipped>
> +Memory profiler is expected to be thread safe, so it has a corresponding
> +lock/unlock at internal mutex whenever you call `luaM_memprof_*`. If you want
> +to build LuaJIT without thread safety use `-DLUAJIT_DISABLE_THREAD_SAFE`.
I see no luaM_memprof_* interfaces. So what is this thread safety for?
> +
<snipped>
> +## Benchmarks
> +
> +Benchmarks were taken from repo:
> +[LuaJIT-test-cleanup](https://github.com/LuaJIT/LuaJIT-test-cleanup).
> +
> +Example of usage:
> +```bash
> +/usr/bin/time -f"array3d %U" ./luajit $BENCH_DIR/array3d.lua 300 >/dev/null
> +```
> +
> +Benchmark results before and after the patch (less is better):
There are some considerable differences in benchmark results, so I have
several questions:
* Mention whether you tested with enabled JIT or not
* How many iterations did you made? What is the dispersion/noise for
these runs?
* How come the values with the *enabled* memprof are less than values
with disabled memprof and even the vanilla LuaJIT? E.g. fasta or fast
fourier transform.
> +
> +```
> + | BEFORE | AFTER,memprof off | AFTER,memprof on
> +---------------+--------+-------------------+-----------------
> +array3d | 0.22 | 0.20 | 0.21
> +binary-trees | 3.32 | 3.33 | 3.94
> +chameneos | 2.92 | 3.18 | 3.12
> +coroutine-ring | 0.99 | 1.00 | 0.99
> +euler14-bit | 1.04 | 1.05 | 1.03
> +fannkuch | 6.77 | 6.69 | 6.64
> +fasta | 8.27 | 8.30 | 8.25
> +life | 0.48 | 0.48 | 1.03
> +mandelbrot | 2.69 | 2.70 | 2.75
> +mandelbrot-bit | 1.99 | 2.00 | 2.08
> +md5 | 1.57 | 1.61 | 1.56
> +nbody | 1.35 | 1.38 | 1.33
> +nsieve | 2.11 | 2.19 | 2.09
> +nsieve-bit | 1.50 | 1.55 | 1.47
> +nsieve-bit-fp | 4.40 | 4.63 | 4.44
> +partialsums | 0.54 | 0.58 | 0.55
> +pidigits-nogmp | 3.48 | 3.50 | 3.47
> +ray | 1.63 | 1.68 | 1.64
> +recursive-ack | 0.19 | 0.22 | 0.20
> +recursive-fib | 1.62 | 1.71 | 1.63
> +scimark-fft | 5.78 | 5.94 | 5.69
> +scimark-lu | 3.26 | 3.57 | 3.59
> +scimark-sor | 2.34 | 2.35 | 2.33
> +scimark-sparse | 5.03 | 4.92 | 4.91
> +series | 0.94 | 0.96 | 0.95
> +spectral-norm | 0.96 | 0.96 | 0.95
> +```
> --
> 2.28.0
>
--
Best regards,
IM
More information about the Tarantool-discussions
mailing list