[Tarantool-discussions] [RFC luajit v2] rfc: describe a LuaJIT memory profiler
Sergey Kaplun
skaplun at tarantool.org
Fri Dec 25 16:36:26 MSK 2020
Hi, Igor!
Thanks for the review!
On 21.12.20, Igor Munkin wrote:
> Sergey,
>
> Thanks for the new version! The design is fine, but please consider my
> minor comments regarding the document itself.
Please see the new version here [1].
>
> On 16.12.20, Sergey Kaplun wrote:
> > Part of #5442
> > ---
> >
> > Issues: https://github.com/tarantool/tarantool/issues/5442
> > Branch: https://github.com/tarantool/tarantool/tree/skaplun/gh-5442-luajit-memory-profiler
> >
> > Changes in v2:
> > * Removed C API, Tarantool integration and description of additional
> > features -- they will be added in another RFC if necessary.
> > * Removed checking profile is running from the public API.
> > * Added benchmarks and more meaningful example.
> > * Grammar fixes.
> >
> > doc/rfc/5442-luajit-memory-profiler.md | 306 +++++++++++++++++++++++++
> > 1 file changed, 306 insertions(+)
> > create mode 100644 doc/rfc/5442-luajit-memory-profiler.md
> >
> > diff --git a/doc/rfc/5442-luajit-memory-profiler.md b/doc/rfc/5442-luajit-memory-profiler.md
> > new file mode 100644
> > index 000000000..720105009
> > --- /dev/null
> > +++ b/doc/rfc/5442-luajit-memory-profiler.md
> > @@ -0,0 +1,306 @@
>
> <snipped>
>
> > +## Detailed design
> > +
> > +The whole toolchain of memory profiling will be divided into several parts:
> > +1) Prerequisites.
> > +2) Recording information about memory usage and saving it.
> > +3) Reading saved data and displaying it in a human-readable format.
>
> I believe the links would also be great here.
Done in the new version.
>
> > +
> > +### Prerequisites
> > +
> > +This section describes additional changes in LuaJIT required for the feature
> > +implementation. This version of the LuaJIT memory profiler does not support
> > +reporting allocations from traces. But trace code semantics should be totally
> > +the same as for the Lua interpreter. So profiling with `jit.off()` should be
> > +enough.
>
> From this part I guess nothing works with the JIT enabled. As a result of
> the offline discussion it occurs that everything works, but allocations on
> traces are attributed as internal ones. I think this is not a big deal
> since the JIT is a black internal box. Please mention this explicitly above.
>
> Minor: It's worth mentioning that there might be less GC pressure during
> trace execution (considering the allocation sinking optimization impact),
> but the traces themselves are allocated in the GC area.
Added, thanks!
>
> > +
>
> <snipped>
>
> > +
> > +Fast function allocation events always belong to the previous frame,
> > +taking tail call optimizations (TCO) into account.
>
> Still don't get it. In the previous version I asked to explicitly
> mention that allocations in builtins are attributed to their caller.
> And another case is tail call optimization, when allocations are also
> attributed to the caller.
>
> So, the example is fine (but comments are desired, IMHO), but I suggest
> separating these cases in the passage above.
Reformulated in the new version as we discussed offline.
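To make the two cases a bit more concrete, here is a small sketch (not taken
from the RFC itself; the file name and comments are just illustrative), with
the expected attribution written in the comments:

```lua
-- A minimal sketch, assuming the attribution rules described above.
jit.off()
misc.memprof.start("/tmp/attr_example.bin") -- example file name

local function append(str, rep)
  -- string.rep() is a fast function (builtin), so the allocation it makes
  -- is not attributed to the builtin itself. Since this is also a tail
  -- call, the event is attributed to the line that called append() below.
  return string.rep(str, rep)
end

local s = append("q", 100) -- the allocation event is reported for this line
misc.memprof.stop()
```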
>
> > +
> > +Assume we have the following Lua chunk named <test.lua>:
> > +
> > +```
> > +1 jit.off()
> > +2 misc.memprof.start("memprof_new.bin")
> > +3 local function append(str, rep)
> > +4 return string.rep(str, rep)
> > +5 end
> > +6
> > +7 local t = {}
> > +8 for _ = 1, 1e5 do
> > +9 table.insert(t,
> > +10 append('q', _)
> > +11 )
> > +12 end
> > +13 misc.memprof.stop()
> > +```
> > +
> > +The profiler's output is as follows:
> > +```
> > +ALLOCATIONS
> > + at test.lua:0, line 10: 100007 5004638934 0
> > + at test.lua:0, line 5: 1 40 0
> > + at test.lua:0, line 7: 1 72 0
> > + at test.lua:0, line 9: 1 48 0
> > +
> > +REALLOCATIONS
> > + at test.lua:0, line 9: 16 4194496 2097376
> > + Overrides:
> > + @test.lua:0, line 9
> > +
> > + at test.lua:0, line 10: 12 262080 131040
> > + Overrides:
> > + @test.lua:0, line 10
> > +
> > +
> > +DEALLOCATIONS
> > +INTERNAL: 21 0 2463
> > + at test.lua:0, line 10: 8 0 1044480
> > + Overrides:
> > + @test.lua:0, line 10
> > +```
>
> I have no idea what is dumped here. There is a legend below (in reading
> and displaying saved data section), so leave a link to it right here.
> Otherwise this is some kind of elvish.
Added link.
>
> > +
> > +For Lua functions, in profile events we had to determine the line number of
> > +the function definition and the corresponding `GCproto` address. For C
> > +functions, only the address will be enough. If a fast function is called from
> > +a Lua function, we had to report the Lua function for more meaningful output.
> > +Otherwise, report the C function.
>
> Still don't understand the last two sentences. In the previous reply you
> said this relates to the part above (about the fast functions). You
> reworded that part, but this is still left unclear.
Reformulated in the new version as we discussed offline.
>
> > +
> > +So we need to know what type of function the virtual machine (VM) is in on
>
> AFAICS the function type is enough, isn't it? As you mentioned above,
> allocations are attributed to the callers in the case of tail call
> optimization.
Reformulated in the new version as we discussed offline.
>
> > +CALL/RETURN. LuaJIT already determines the VM state for C function execution,
> > +but not for fast functions or Lua functions. So corresponding VM states will be added.
>
> <snipped>
>
> > +### Information recording
> > +
> > +Each allocation/reallocation/free is considered as a type of event that is
> > +reported. The event stream has the following format:
> > +
> > +```c
> > +/*
> > +** Event stream format:
> > +**
> > +** stream := symtab memprof
> > +** symtab := see <ljp_symtab.h>
> > +** memprof := prologue event* epilogue
> > +** prologue := 'l' 'j' 'm' version reserved
> > +** version := <BYTE>
> > +** reserved := <BYTE> <BYTE> <BYTE>
> > +** event := event-alloc | event-realloc | event-free
> > +** event-alloc := event-header loc? naddr nsize
> > +** event-realloc := event-header loc? oaddr osize naddr nsize
> > +** event-free := event-header loc? oaddr osize
> > +** event-header := <BYTE>
> > +** loc := loc-lua | loc-c
> > +** loc-lua := sym-addr line-no
> > +** loc-c := sym-addr
> > +** sym-addr := <ULEB128>
> > +** line-no := <ULEB128>
> > +** oaddr := <ULEB128>
> > +** naddr := <ULEB128>
> > +** osize := <ULEB128>
> > +** nsize := <ULEB128>
> > +** epilogue := event-header
> > +**
> > +** <BYTE> : A single byte (no surprises here)
> > +** <ULEB128>: Unsigned integer represented in ULEB128 encoding
> > +**
> > +** (Order of bits below is hi -> lo)
> > +**
> > +** version: [VVVVVVVV]
> > +** * VVVVVVVV: Byte interpreted as a plain integer version number
> > +**
> > +** event-header: [FTUUSSEE]
> > +** * EE : 2 bits for representing allocation event type (AEVENT_*)
> > +** * SS : 2 bits for representing allocation source type (ASOURCE_*)
> > +** * UU : 2 unused bits
> > +** * T : Reserved. 0 for regular events, 1 for the events marked with
> > +** the timestamp mark. It is assumed that the time distance between
> > +** two marked events is approximately the same and is equal
> > +** to 1 second. Always zero for now.
>
> It looks like this is always zero in our case, so we have 3 unused bits,
> right? Then just drop this part.
I've dropped it according to your and Sergos' comments.
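Just to illustrate how the reader side unpacks the header byte with that bit
dropped (a hedged sketch using LuaJIT's `bit` library; the function name is
mine, not from the patch):

```lua
local bit = require("bit")

-- Sketch: extract the fields described above from an event-header byte.
local function parse_event_header(byte)
  local aevent  = bit.band(byte, 0x3)                -- EE: event type (AEVENT_*)
  local asource = bit.band(bit.rshift(byte, 2), 0x3) -- SS: source type (ASOURCE_*)
  return aevent, asource
end
```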
>
<snipped>
> > +Starting the profiler from Lua is quite simple:
> > +```lua
> > +local started, err = misc.memprof.start(fname)
> > +```
> > +Where `fname` is the name of the file where profile events are written. The
> > +writer for this function performs `fwrite()` for each call, retrying in case
> > +of `EINTR`. The final callback calls `fclose()` at the end of profiling. If it
> > +is impossible to open a file for writing, or the profiler fails to start,
> > +`nil` is returned on failure (plus an error message as a second result and a
> > +system-dependent error code as a third result). Otherwise returns some true value.
>
> What the heck is the third result and why is it ignored in the example
> above? I think this error should be incorporated in the second argument,
> kinda like <perror> does.
I've used the same formulation as in the Lua Reference Manual [2].
It's `errno`, in fact, but IINM it is system-dependent.
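So the caller can handle it the same way `io.open()` results are usually
handled. A small sketch (the file name is just an example):

```lua
-- Sketch of handling all three results of the io.open()-like convention.
local started, err, errno = misc.memprof.start("memprof.bin")
if not started then
  -- err is a human-readable message, errno is the system-dependent code.
  io.stderr:write(("memprof start failed: %s (%s)\n"):format(err, tostring(errno)))
  return
end

-- ... the workload to be profiled ...

local stopped, stop_err, stop_errno = misc.memprof.stop()
if not stopped then
  io.stderr:write(("memprof stop failed: %s (%s)\n"):format(stop_err, tostring(stop_errno)))
end
```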
>
> > +
> > +Stopping the profiler from Lua is simple too:
> > +```lua
> > +local stopped, err = misc.memprof.stop()
> > +```
> > +
> > +If any error occurs while stopping the profiler (e.g. an error when the file
> > +descriptor is closed), `memprof.stop()` returns `nil` (plus an error message as
> > +a second result and a system-dependent error code as a third result). Returns
> > +`true` otherwise.
>
> Ditto.
Same meaning as above.
>
> > +
>
> <snipped>
>
> > +The memory profiler is expected to be thread safe, so it locks/unlocks an
> > +internal mutex whenever you call `luaM_memprof_*`. If you want to build LuaJIT
> > +without thread safety, use `-DLUAJIT_DISABLE_THREAD_SAFE`.
>
> I see no luaM_memprof_* interfaces. So what is this thread safety for?
It is used if you've created different VMs and try to start profiling
from different threads. But unfortunately, for now we have one static
structure for all threads. It is necessary to avoid pollution
from other profiled VMs.
>
> > +
>
> <snipped>
>
> > +## Benchmarks
> > +
> > +Benchmarks were taken from repo:
> > +[LuaJIT-test-cleanup](https://github.com/LuaJIT/LuaJIT-test-cleanup).
> > +
> > +Example of usage:
> > +```bash
> > +/usr/bin/time -f"array3d %U" ./luajit $BENCH_DIR/array3d.lua 300 >/dev/null
> > +```
> > +
> > +Benchmark results before and after the patch (user CPU time in seconds; less is better):
>
> There are some considerable differences in benchmark results, so I have
> several questions:
> * Mention whether you tested with the JIT enabled or not
Enabled, of course.
For each test I've added this preamble:
| local start = misc.memprof.start("/tmp/memprof_tmp.bin")
| assert(start)
and this postamble:
| local stop = misc.memprof.stop()
| assert(stop)
> * How many iterations did you make? What is the dispersion/noise for
> these runs?
In the new version I've run 11 iterations for each test. And I've
added the standard deviation for each test.
> * How come the values with the *enabled* memprof are less than the values
> with disabled memprof and even vanilla LuaJIT? E.g. fasta or fast
> Fourier transform.
Good question. I have no answer for now.
My wild guess: the garbage collector starts earlier (when the number of
objects is smaller) and so runs faster.
>
> > +
> > +```
> > + | BEFORE | AFTER,memprof off | AFTER,memprof on
> > +---------------+--------+-------------------+-----------------
> > +array3d | 0.22 | 0.20 | 0.21
> > +binary-trees | 3.32 | 3.33 | 3.94
> > +chameneos | 2.92 | 3.18 | 3.12
> > +coroutine-ring | 0.99 | 1.00 | 0.99
> > +euler14-bit | 1.04 | 1.05 | 1.03
> > +fannkuch | 6.77 | 6.69 | 6.64
> > +fasta | 8.27 | 8.30 | 8.25
> > +life | 0.48 | 0.48 | 1.03
> > +mandelbrot | 2.69 | 2.70 | 2.75
> > +mandelbrot-bit | 1.99 | 2.00 | 2.08
> > +md5 | 1.57 | 1.61 | 1.56
> > +nbody | 1.35 | 1.38 | 1.33
> > +nsieve | 2.11 | 2.19 | 2.09
> > +nsieve-bit | 1.50 | 1.55 | 1.47
> > +nsieve-bit-fp | 4.40 | 4.63 | 4.44
> > +partialsums | 0.54 | 0.58 | 0.55
> > +pidigits-nogmp | 3.48 | 3.50 | 3.47
> > +ray | 1.63 | 1.68 | 1.64
> > +recursive-ack | 0.19 | 0.22 | 0.20
> > +recursive-fib | 1.62 | 1.71 | 1.63
> > +scimark-fft | 5.78 | 5.94 | 5.69
> > +scimark-lu | 3.26 | 3.57 | 3.59
> > +scimark-sor | 2.34 | 2.35 | 2.33
> > +scimark-sparse | 5.03 | 4.92 | 4.91
> > +series | 0.94 | 0.96 | 0.95
> > +spectral-norm | 0.96 | 0.96 | 0.95
> > +```
> > --
> > 2.28.0
> >
>
> --
> Best regards,
> IM
[1]: https://lists.tarantool.org/pipermail/tarantool-discussions/2020-December/000147.html
[2]: https://www.lua.org/manual/5.1/manual.html#5.7
--
Best regards,
Sergey Kaplun