From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp60.i.mail.ru (smtp60.i.mail.ru [217.69.128.40]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 426B44765E0 for ; Fri, 25 Dec 2020 16:15:29 +0300 (MSK) Date: Fri, 25 Dec 2020 16:14:43 +0300 From: Sergey Kaplun Message-ID: <20201225131443.GT9101@root> References: <20201216190928.32410-1-skaplun@tarantool.org> <508BBB58-6932-41D3-9C40-DB27B5FC453D@corp.mail.ru> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <508BBB58-6932-41D3-9C40-DB27B5FC453D@corp.mail.ru> Subject: Re: [Tarantool-discussions] [RFC luajit v2] rfc: describe a LuaJIT memory profiler List-Id: Tarantool development process List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Sergey Ostanevich Cc: tarantool-discussions@dev.tarantool.org Hi! Thanks for the review! On 22.12.20, Sergey Ostanevich wrote: > Hi! > > Thanks for the patch! > > See some comments below, after being applied is LGTM. Please, see new verison here [1]. > > Sergos > > > On 16 Dec 2020, at 22:09, Sergey Kaplun wrote: > > > > Part of #5442 > > --- > > > > Issues: https://github.com/tarantool/tarantool/issues/5442 > > Branch: https://github.com/tarantool/tarantool/tree/skaplun/gh-5442-luajit-memory-profiler > > > > Changes in v2: > > * Removed C API, Tarantool integration and description of additional > > features -- they will be added in another RFC if necessary. > > * Removed checking profile is running from the public API. > > * Added benchmarks and more meaningful example. > > * Grammar fixes. > > > > doc/rfc/5442-luajit-memory-profiler.md | 306 +++++++++++++++++++++++++ > > 1 file changed, 306 insertions(+) > > create mode 100644 doc/rfc/5442-luajit-memory-profiler.md > > > > diff --git a/doc/rfc/5442-luajit-memory-profiler.md b/doc/rfc/5442-luajit-memory-profiler.md > > new file mode 100644 > > index 000000000..720105009 > > --- /dev/null > > +++ b/doc/rfc/5442-luajit-memory-profiler.md > > @@ -0,0 +1,306 @@ > > +# LuaJIT memory profiler > > + > > +* **Status**: In progress > > +* **Start date**: 24-10-2020 > > +* **Authors**: Sergey Kaplun @Buristan skaplun@tarantool.org, > > + Igor Munkin @igormunkin imun@tarantool.org, > > + Sergey Ostanevich @sergos sergos@tarantool.org > > +* **Issues**: [#5442](https://github.com/tarantool/tarantool/issues/5442) > > + > > +## Summary > > + > > +LuaJIT memory profiler is a toolchain for analysis of memory usage by user's > > +application. > > + > > +## Background and motivation > > + > > +Garbage collector (GC) is a curse of performance for most of Lua applications. > > This is something that is very questionable. Do you have any data on performance > of Lua code against the GC code in real life? Is there any about Tarantool > applications? > > I’m not questioning the mem profiler itself, rather trying to weigh the GC problem. I can give benchmarks for vshard [2] for now. There are approximately 20% of time spent in `lj_gc_steps()`, IINM. > > > +Memory usage of Lua application should be profiled to find out various > ^^^^^^^ locate Fixed. > > > +memory-unoptimized code blocks. If the application has memory leaks they can be > > +found with the profiler. > ^ also. Fixed. > > + > > +## Detailed design > > + > > +The whole toolchain of memory profiling will be divided by several parts: > ^^^ into Thank you! Fixed. > > +1) Prerequisites. > > +2) Recording information about memory usage and saving it. > > +3) Reading saved data and display it in human-readable format. > > + > > +### Prerequisites > > + > > +This section describes additional changes in LuaJIT required to feature > ^^ for the Fixed. > > +implementation. This version of LuaJIT memory profiler does not support > > +reporting allocations from traces. But trace code semantics should be totally > > +the same as for Lua interpreter. So profiling with `jit.off()` should be > > +enough. > > + > > +There are two different representations of functions in LuaJIT: the function's > > +prototype (`GCproto`) and the function object so called closure (`GCfunc`). > > +The closures are represented as `GCfuncL` and `GCfuncC` for Lua and C closures > > +correspondingly. Also LuaJIT has special function's type aka Fast Function. It > ^ a Fixed. > > +is used for LuaJIT builtins. > > + > > +Fast function allocation events always belong to the previous frame with > > +considering of tail call optimizations (TCO). > > + > > +Assume we have the following Lua chunk named : > > + > > +``` > > +1 jit.off() > > +2 misc.memprof.start("memprof_new.bin") > > +3 local function append(str, rep) > > +4 return string.rep(str, rep) > > +5 end > > +6 > > +7 local t = {} > > +8 for _ = 1, 1e5 do > > +9 table.insert(t, > > +10 append('q', _) > > +11 ) > > +12 end > > +13 misc.memprof.stop() > > +``` > > + > > +Profilers output is like the follows: > > +``` > > +ALLOCATIONS > > +@test.lua:0, line 10: 100007 5004638934 0 > > +@test.lua:0, line 5: 1 40 0 > > +@test.lua:0, line 7: 1 72 0 > > +@test.lua:0, line 9: 1 48 0 > > + > > +REALLOCATIONS > > +@test.lua:0, line 9: 16 4194496 2097376 > > + Overrides: > > + @test.lua:0, line 9 > > + > > +@test.lua:0, line 10: 12 262080 131040 > > + Overrides: > > + @test.lua:0, line 10 > > + > > + > > +DEALLOCATIONS > > +INTERNAL: 21 0 2463 > > +@test.lua:0, line 10: 8 0 1044480 > > + Overrides: > > + @test.lua:0, line 10 > > +``` > > + > > +In Lua functions for profile events, we had to determine the line number of the > > +function definition and corresponding `GCproto` address. For C functions only > > +address will be enough. If Fast function is called from Lua function we had to > > +report the Lua function for more meaningful output. Otherwise report the C > > +function. > > + > > +So we need to know in what type of function CALL/RETURN virtual machine (VM) > ^^^^^^^ a ^ the > > +is. LuaJIT has already determined C function execution VM state but neither > ^ currently in. > > > +is. LuaJIT has already determined C function execution VM state but neither > Currently VM state identifies C function execution only, so Fast and Lua > functions states will be added. Reformulated in the new version. > > > +Fast functions nor Lua function. So corresponding VM states will be added. > > + > > +To determine currently allocating coroutine (that may not be equal to currently > > +executed) new field will be added to `global_State` structure named `mem_L` > ^ a Fixed. > > > +kept coroutine address. This field sets at each reallocation to corresponding > ^^^ to keep the ^^^ is set Fixed. > > > +`L` with which it was called. > > + > > +There is the static function (`lj_debug_getframeline`) returned line number for > ^^^ a ^^^ that returns Fixed. > > +current `BCPos` in `lj_debug.c` already. It will be added to the debug module > > +API to be used in memory profiler. > > + > > +### Information recording > > + > > +Each allocate/reallocate/free is considered as a type of event that are > > +reported. Event stream has the following format: > > + > > +```c > > +/* > > +** Event stream format: > > +** > > +** stream := symtab memprof > > +** symtab := see > > +** memprof := prologue event* epilogue > > +** prologue := 'l' 'j' 'm' version reserved > > +** version := > > +** reserved := > > +** event := event-alloc | event-realloc | event-free > > +** event-alloc := event-header loc? naddr nsize > > +** event-realloc := event-header loc? oaddr osize naddr nsize > > +** event-free := event-header loc? oaddr osize > > +** event-header := > > +** loc := loc-lua | loc-c > > +** loc-lua := sym-addr line-no > > +** loc-c := sym-addr > > +** sym-addr := > > +** line-no := > > +** oaddr := > > +** naddr := > > +** osize := > > +** nsize := > > +** epilogue := event-header > > +** > > +** : A single byte (no surprises here) > > +** : Unsigned integer represented in ULEB128 encoding > > +** > > +** (Order of bits below is hi -> lo) > > +** > > +** version: [VVVVVVVV] > > +** * VVVVVVVV: Byte interpreted as a plain integer version number > > +** > > +** event-header: [FTUUSSEE] > > +** * EE : 2 bits for representing allocation event type (AEVENT_*) > > +** * SS : 2 bits for representing allocation source type (ASOURCE_*) > > +** * UU : 2 unused bits > > +** * T : Reserved. 0 for regular events, 1 for the events marked with > > +** the timestamp mark. It is assumed that the time distance between > > No representation of a timestamp so far? If so - maybe remove excessive description > and put ‘reserved for timestamp’. I've dropped it according your and Igors comments. > > > +** two marked events is approximately the same and is equal > > +** to 1 second. Always zero for now. > > +** * F : 0 for regular events, 1 for epilogue's *F*inal header > > +** (if F is set to 1, all other bits are currently ignored) > > +*/ > > +``` > > + > > +It is enough to know the address of LUA/C function to determine it. Symbolic > > +table (symtab) dumps at start of profiling to avoid determine and write line > > +number of Lua code and corresponding chunk of code each time, when memory event > > +happens. Each line contains the address, Lua chunk definition as the filename > > +and line number of the function's declaration. This table of symbols has the > > +following format described at : > > + > > +```c > > +/* > > +** symtab format: > > +** > > +** symtab := prologue sym* > > +** prologue := 'l' 'j' 's' version reserved > > +** version := > > +** reserved := > > +** sym := sym-lua | sym-final > > +** sym-lua := sym-header sym-addr sym-chunk sym-line > > +** sym-header := > > +** sym-addr := > > +** sym-chunk := string > > +** sym-line := > > +** sym-final := sym-header > > +** string := string-len string-payload > > +** string-len := > > +** string-payload := {string-len} > > +** > > +** : A single byte (no surprises here) > > +** : Unsigned integer represented in ULEB128 encoding > > +** > > +** (Order of bits below is hi -> lo) > > +** > > +** version: [VVVVVVVV] > > +** * VVVVVVVV: Byte interpreted as a plain numeric version number > > +** > > +** sym-header: [FUUUUUTT] > > +** * TT : 2 bits for representing symbol type > > +** * UUUUU : 5 unused bits > > +** * F : 1 bit marking the end of the symtab (final symbol) > > +*/ > > +``` > > + > > +So when memory profiling starts default allocation function is replaced by the > > +new allocation function as additional wrapper to write inspected profiling > > +events. When profiler stops old allocation function is substituted back. > > + > > +Starting profiler from Lua is quite simple: > > +```lua > > +local started, err = misc.memprof.start(fname) > > +``` > > +Where `fname` is name of the file where profile events are written. Writer for > > +this function perform `fwrite()` for each call retrying in case of `EINTR`. > > +Final callback calls `fclose()` at the end of profiling. If it is impossible to > > +open a file for writing or profiler fails to start, returns `nil` on failure > > +(plus an error message as a second result and a system-dependent error code as > > +a third result). Otherwise returns some true value. > > + > > +Stopping profiler from Lua is simple too: > > +```lua > > +local stopped, err = misc.memprof.stop() > > +``` > > + > > +If there is any error occurred at profiling stopping (an error when file > > +descriptor was closed) `memprof.stop()` returns `nil` (plus an error message as > > +a second result and a system-dependent error code as a third result). Returns > > Why do you need to separate an ‘error’ from a ’system error’? I've used the same formulation as in Lua Reference Manual [3]. It's errno, in fact, but IINM it is system dependent. > > > +`true` otherwise. > > + > > +If you want to build LuaJIT without memory profiler, you should build it with > > +`-DLUAJIT_DISABLE_MEMPROF`. If it is disabled `misc.memprof.start()` and > > +`misc.memprof.stop()` always return `false`. > > + > > +Memory profiler is expected to be thread safe, so it has a corresponding > > +lock/unlock at internal mutex whenever you call `luaM_memprof_*`. If you want > > +to build LuaJIT without thread safety use `-DLUAJIT_DISABLE_THREAD_SAFE`. > > + > > +### Reading and displaying saved data > > + > > +Binary data can be read by `lj-parse-memprof` utility. It parses the binary > > +format provided from memory profiler and render it in human-readable format. > > + > > +The usage is very simple: > > +``` > > +$ ./luajit-parse-memprof --help > > +luajit-parse-memprof - parser of the memory usage profile collected > > + with LuaJIT's memprof. > > + > > +SYNOPSIS > > + > > +luajit-parse-memprof [options] memprof.bin > > + > > +Supported options are: > > + > > + --help Show this help and exit > > +``` > > + > > +Plain text of profiled info has the following format: > > +``` > > +@:, line : > > +``` > > +See example above. > > + > > +`INTERNAL` means that this allocations are caused by internal LuaJIT > > +structures. Note that events are sorted from the most often to the least. > > + > > +`Overrides` means what allocation this reallocation overrides. > > + > > +## Benchmarks > > + > > +Benchmarks were taken from repo: > > +[LuaJIT-test-cleanup](https://github.com/LuaJIT/LuaJIT-test-cleanup). > > + > > +Example of usage: > Either ‘usage’ or ‘example of use’ Fixed, thank you! > > > > +```bash > > +/usr/bin/time -f"array3d %U" ./luajit $BENCH_DIR/array3d.lua 300 >/dev/null > > +``` > > + > > +Benchmark results before and after the patch (less is better): > > Comparative to ‘BEFORE’ is more reafable, dispersion should also help with > reading of results. Done in the new version. > > > + > > +``` > > + | BEFORE | AFTER,memprof off | AFTER,memprof on > > +---------------+--------+-------------------+----------------- > > +array3d | 0.22 | 0.20 | 0.21 > > +binary-trees | 3.32 | 3.33 | 3.94 > > +chameneos | 2.92 | 3.18 | 3.12 > > +coroutine-ring | 0.99 | 1.00 | 0.99 > > +euler14-bit | 1.04 | 1.05 | 1.03 > > +fannkuch | 6.77 | 6.69 | 6.64 > > +fasta | 8.27 | 8.30 | 8.25 > > +life | 0.48 | 0.48 | 1.03 > > +mandelbrot | 2.69 | 2.70 | 2.75 > > +mandelbrot-bit | 1.99 | 2.00 | 2.08 > > +md5 | 1.57 | 1.61 | 1.56 > > +nbody | 1.35 | 1.38 | 1.33 > > +nsieve | 2.11 | 2.19 | 2.09 > > +nsieve-bit | 1.50 | 1.55 | 1.47 > > +nsieve-bit-fp | 4.40 | 4.63 | 4.44 > > +partialsums | 0.54 | 0.58 | 0.55 > > +pidigits-nogmp | 3.48 | 3.50 | 3.47 > > +ray | 1.63 | 1.68 | 1.64 > > +recursive-ack | 0.19 | 0.22 | 0.20 > > +recursive-fib | 1.62 | 1.71 | 1.63 > > +scimark-fft | 5.78 | 5.94 | 5.69 > > +scimark-lu | 3.26 | 3.57 | 3.59 > > +scimark-sor | 2.34 | 2.35 | 2.33 > > +scimark-sparse | 5.03 | 4.92 | 4.91 > > +series | 0.94 | 0.96 | 0.95 > > +spectral-norm | 0.96 | 0.96 | 0.95 > > +``` > > -- > > 2.28.0 > > > [1]: https://lists.tarantool.org/pipermail/tarantool-discussions/2020-December/000147.html [2]: https://github.com/tarantool/vshard/issues/224#issuecomment-637632349 [3]: https://www.lua.org/manual/5.1/manual.html#5.7 -- Best regards, Sergey Kaplun