From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp50.i.mail.ru (smtp50.i.mail.ru [94.100.177.110]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 20E634765E0 for ; Fri, 25 Dec 2020 14:35:27 +0300 (MSK) From: Sergey Kaplun Date: Fri, 25 Dec 2020 14:34:31 +0300 Message-Id: <20201225113431.9538-1-skaplun@tarantool.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Subject: [Tarantool-discussions] [RFC luajit v3] rfc: describe a LuaJIT memory profiler List-Id: Tarantool development process List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Igor Munkin , Sergey Ostanevich Cc: tarantool-discussions@dev.tarantool.org Part of #5442 --- RFC on branch: https://github.com/tarantool/tarantool/blob/skaplun/gh-5442-luajit-memory-profiler/doc/rfc/5442-luajit-memory-profiler.md Changes in v3: * More comments in example. * More verbose benchmark information. * Grammar and spelling fixes. Changes in v2: * Removed C API, Tarantool integration and description of additional features -- they will be added in another RFC if necessary. * Removed checking profile is running from the public API. * Added benchmarks and more meaningful example. * Grammar fixes. doc/rfc/5442-luajit-memory-profiler.md | 314 +++++++++++++++++++++++++ 1 file changed, 314 insertions(+) create mode 100644 doc/rfc/5442-luajit-memory-profiler.md diff --git a/doc/rfc/5442-luajit-memory-profiler.md b/doc/rfc/5442-luajit-memory-profiler.md new file mode 100644 index 000000000..85a61462a --- /dev/null +++ b/doc/rfc/5442-luajit-memory-profiler.md @@ -0,0 +1,314 @@ +# LuaJIT memory profiler + +* **Status**: In progress +* **Start date**: 24-10-2020 +* **Authors**: Sergey Kaplun @Buristan skaplun@tarantool.org, + Igor Munkin @igormunkin imun@tarantool.org, + Sergey Ostanevich @sergos sergos@tarantool.org +* **Issues**: [#5442](https://github.com/tarantool/tarantool/issues/5442) + +## Summary + +LuaJIT memory profiler is a toolchain for analysis of memory usage by user's +application. + +## Background and motivation + +Garbage collector (GC) is a curse of performance for most of Lua applications. +Memory usage of Lua application should be profiled to locate various +memory-unoptimized code blocks. If the application has memory leaks they can be +found with the profiler also. + +## Detailed design + +The whole toolchain of memory profiling will be divided into several parts: +1) [Prerequisites](#prerequisites). +2) [Recording information about memory usage and saving it](#information-recording). +3) [Reading saved data and display it in human-readable format](#reading-and-displaying-saved-data). + +### Prerequisites + +This section describes additional changes in LuaJIT required for the feature +implementation. This version of LuaJIT memory profiler does not support verbose +reporting allocations from traces. All allocation from traces are reported as +internal. But trace code semantics should be totally the same as for the Lua +interpreter (excluding sink optimizations). Also all deallocations reported as +internal too. + +There are two different representations of functions in LuaJIT: the function's +prototype (`GCproto`) and the function object so called closure (`GCfunc`). +The closures are represented as `GCfuncL` and `GCfuncC` for Lua and C closures +correspondingly. Also LuaJIT has a special function's type aka Fast Function. +It is used for LuaJIT builtins. + +Tail call optimization does not create a new call frame, so all allocations +inside the function called via `CALLT`/`CALLMT` are attributed to its caller. + +Usually developers are not interested in information about allocations inside +builtins. So if fast function was called from a Lua function all +allocations are attributed to this Lua function. Otherwise attribute this event +to a C function. + +Assume we have the following Lua chunk named : + +``` +1 jit.off() +2 misc.memprof.start("memprof_new.bin") +3 -- Lua does not create a new frame to call string.rep and all allocations are +4 -- attributed not to `append()` function but to the parent scope. +5 local function append(str, rep) +6 return string.rep(str, rep) +7 end +8 +9 local t = {} +10 for _ = 1, 1e5 do +11 -- table.insert is a builtin and all corresponding allocations +12 -- are reported in the scope of main chunk +13 table.insert(t, +14 append('q', _) +15 ) +16 end +17 misc.memprof.stop() +``` + +If one run the chunk above the profiler reports approximately the following +(see legend [here](#reading-and-displaying-saved-data)): +``` +ALLOCATIONS +@test.lua:0, line 14: 1002 531818 0 +@test.lua:0, line 13: 1 24 0 +@test.lua:0, line 9: 1 32 0 +@test.lua:0, line 7: 1 20 0 + +REALLOCATIONS +@test.lua:0, line 13: 9 16424 8248 + Overrides: + @test.lua:0, line 13 + +@test.lua:0, line 14: 5 1984 992 + Overrides: + @test.lua:0, line 14 + + +DEALLOCATIONS +INTERNAL: 20 0 1481 +@test.lua:0, line 14: 3 0 7168 + Overrides: + @test.lua:0, line 14 +``` + +So we need to know a type of function being executed by the virtual machine +(VM). Currently VM state identifies C function execution only, so Fast and Lua +functions states will be added. + +To determine currently allocating coroutine (that may not be equal to currently +executed one) a new field called `mem_L` is added to `global_State` structure +to keep the coroutine address. This field is set at each reallocation to +corresponding `L` with which it was called. + +There is a static function (`lj_debug_getframeline`) that returns line number +for current `BCPos` in `lj_debug.c` already. It will be added to the debug +module API to be used in memory profiler. + +### Information recording + +Each allocate/reallocate/free is considered as a type of event that are +reported. Event stream has the following format: + +```c +/* +** Event stream format: +** +** stream := symtab memprof +** symtab := see symtab description +** memprof := prologue event* epilogue +** prologue := 'l' 'j' 'm' version reserved +** version := +** reserved := +** event := event-alloc | event-realloc | event-free +** event-alloc := event-header loc? naddr nsize +** event-realloc := event-header loc? oaddr osize naddr nsize +** event-free := event-header loc? oaddr osize +** event-header := +** loc := loc-lua | loc-c +** loc-lua := sym-addr line-no +** loc-c := sym-addr +** sym-addr := +** line-no := +** oaddr := +** naddr := +** osize := +** nsize := +** epilogue := event-header +** +** : A single byte (no surprises here) +** : Unsigned integer represented in ULEB128 encoding +** +** (Order of bits below is hi -> lo) +** +** version: [VVVVVVVV] +** * VVVVVVVV: Byte interpreted as a plain integer version number +** +** event-header: [FUUUSSEE] +** * EE : 2 bits for representing allocation event type (AEVENT_*) +** * SS : 2 bits for representing allocation source type (ASOURCE_*) +** * UUU : 3 unused bits +** * F : 0 for regular events, 1 for epilogue's *F*inal header +** (if F is set to 1, all other bits are currently ignored) +*/ +``` + +It is enough to know the address of LUA/C function to determine it. Symbolic +table (symtab) is dumped at the start of profiling to avoid dumping function +location on each memory event for saving both CPU usage and binary profile +size. + +Each line contains the address, Lua chunk definition as the filename and line +number of the function's declaration. This table of symbols has the following +format described at : + +```c +/* +** symtab format: +** +** symtab := prologue sym* +** prologue := 'l' 'j' 's' version reserved +** version := +** reserved := +** sym := sym-lua | sym-final +** sym-lua := sym-header sym-addr sym-chunk sym-line +** sym-header := +** sym-addr := +** sym-chunk := string +** sym-line := +** sym-final := sym-header +** string := string-len string-payload +** string-len := +** string-payload := {string-len} +** +** : A single byte (no surprises here) +** : Unsigned integer represented in ULEB128 encoding +** +** (Order of bits below is hi -> lo) +** +** version: [VVVVVVVV] +** * VVVVVVVV: Byte interpreted as a plain numeric version number +** +** sym-header: [FUUUUUTT] +** * TT : 2 bits for representing symbol type +** * UUUUU : 5 unused bits +** * F : 1 bit marking the end of the symtab (final symbol) +*/ +``` + +So when memory profiling starts the current allocation function is replaced by +the new allocation function additionally wrapped to write the profiling events. +When profiler stops the previous allocation function is restored. + +Starting profiler from Lua is quite simple: +```lua +local started, err, errno = misc.memprof.start(fname) +``` +where `fname` is name of the file where profile events are written. Writer for +this function perform `fwrite()` for each call retrying in case of `EINTR`. +When the profiling is stopped the `fclose()` is called. If it is impossible to +open a file for writing or profiler fails to start, returns `nil` on failure +(plus an error message as a second result and a system-dependent error code as +a third result). Otherwise returns some true value. + +Stopping profiler from Lua is simple too: +```lua +local stopped, err, errno = misc.memprof.stop() +``` + +If there is any error occurred at profiling stopping (an error when file +descriptor was closed) `memprof.stop()` returns `nil` (plus an error message as +a second result and a system-dependent error code as a third result). Returns +`true` otherwise. + +If you want to build LuaJIT without memory profiler, you should build it with +`-DLUAJIT_DISABLE_MEMPROF`. If it is disabled `misc.memprof.start()` and +`misc.memprof.stop()` always return `false`. + +Memory profiler is expected to be thread safe, so it has a corresponding +lock/unlock at internal mutex whenever you call corresponding memprof +functions. If you want to build LuaJIT without thread safety use +`-DLUAJIT_DISABLE_THREAD_SAFE`. + +### Reading and displaying saved data + +Binary data can be read by `lj-parse-memprof` utility. It parses the binary +format provided by memory profiler and render it on human-readable format. + +The usage is very simple: +``` +$ ./luajit-parse-memprof --help +luajit-parse-memprof - parser of the memory usage profile collected + with LuaJIT's memprof. + +SYNOPSIS + +luajit-parse-memprof [options] memprof.bin + +Supported options are: + + --help Show this help and exit +``` + +Plain text of profiled info has the following format: +``` +@:, line : +``` +See the example [above](#prerequisites). + +`INTERNAL` means that these allocations are caused by internal LuaJIT +structures. Note that events are sorted from the most often to the least. + +`Overrides` means what allocation this reallocation overrides. + +## Benchmarks + +Benchmarks were taken from repo: +[LuaJIT-test-cleanup](https://github.com/LuaJIT/LuaJIT-test-cleanup). + +Example of measuring: +```bash +/usr/bin/time -f"array3d %U" ./luajit $BENCH_DIR/array3d.lua 300 >/dev/null +``` + +This table shows performance deviation in relation to REFerence value (before +commit) with stopped and running profiler. The table shows the average value +for 11 runs. The first field of the column indicates the change in the average +time in seconds (less is better). The second field is the standard deviation +for the found difference. + +``` + Name | REF | AFTER, memprof off | AFTER, memprof on +----------------+------+--------------------+------------------ +array3d | 0.21 | +0.00 (0.01) | +0.00 (0.01) +binary-trees | 3.25 | -0.01 (0.06) | +0.53 (0.10) +chameneos | 2.97 | +0.14 (0.04) | +0.13 (0.06) +coroutine-ring | 1.00 | +0.01 (0.04) | +0.01 (0.04) +euler14-bit | 1.03 | +0.01 (0.02) | +0.00 (0.02) +fannkuch | 6.81 | -0.21 (0.06) | -0.20 (0.06) +fasta | 8.20 | -0.07 (0.05) | -0.08 (0.03) +life | 0.46 | +0.00 (0.01) | +0.35 (0.01) +mandelbrot | 2.65 | +0.00 (0.01) | +0.01 (0.01) +mandelbrot-bit | 1.97 | +0.00 (0.01) | +0.01 (0.02) +md5 | 1.58 | -0.01 (0.04) | -0.04 (0.04) +nbody | 1.34 | +0.00 (0.01) | -0.02 (0.01) +nsieve | 2.07 | -0.03 (0.03) | -0.01 (0.04) +nsieve-bit | 1.50 | -0.02 (0.04) | +0.00 (0.04) +nsieve-bit-fp | 4.44 | -0.03 (0.07) | -0.01 (0.07) +partialsums | 0.54 | +0.00 (0.01) | +0.00 (0.01) +pidigits-nogmp | 3.47 | -0.01 (0.02) | -0.10 (0.02) +ray | 1.62 | -0.02 (0.03) | +0.00 (0.02) +recursive-ack | 0.20 | +0.00 (0.01) | +0.00 (0.01) +recursive-fib | 1.63 | +0.00 (0.01) | +0.01 (0.02) +scimark-fft | 5.72 | +0.06 (0.09) | -0.01 (0.10) +scimark-lu | 3.47 | +0.02 (0.27) | -0.03 (0.26) +scimark-sor | 2.34 | +0.00 (0.01) | -0.01 (0.01) +scimark-sparse | 4.95 | -0.02 (0.04) | -0.02 (0.04) +series | 0.95 | +0.00 (0.02) | +0.00 (0.01) +spectral-norm | 0.96 | +0.00 (0.02) | -0.01 (0.02) +``` -- 2.28.0