Tarantool discussions archive
 help / color / mirror / Atom feed
* [Tarantool-discussions] [RFC luajit v2] rfc: describe a LuaJIT memory profiler
@ 2020-12-16 19:09 Sergey Kaplun
  2020-12-20 22:37 ` Igor Munkin
  2020-12-22 12:02 ` Sergey Ostanevich
  0 siblings, 2 replies; 5+ messages in thread
From: Sergey Kaplun @ 2020-12-16 19:09 UTC (permalink / raw)
  To: Igor Munkin, Sergey Ostanevich; +Cc: tarantool-discussions

Part of #5442
---

Issues: https://github.com/tarantool/tarantool/issues/5442
Branch: https://github.com/tarantool/tarantool/tree/skaplun/gh-5442-luajit-memory-profiler

Changes in v2:
* Removed C API, Tarantool integration and description of additional
  features -- they will be added in another RFC if necessary.
* Removed checking profile is running from the public API.
* Added benchmarks and more meaningful example.
* Grammar fixes.

 doc/rfc/5442-luajit-memory-profiler.md | 306 +++++++++++++++++++++++++
 1 file changed, 306 insertions(+)
 create mode 100644 doc/rfc/5442-luajit-memory-profiler.md

diff --git a/doc/rfc/5442-luajit-memory-profiler.md b/doc/rfc/5442-luajit-memory-profiler.md
new file mode 100644
index 000000000..720105009
--- /dev/null
+++ b/doc/rfc/5442-luajit-memory-profiler.md
@@ -0,0 +1,306 @@
+# LuaJIT memory profiler
+
+* **Status**: In progress
+* **Start date**: 24-10-2020
+* **Authors**: Sergey Kaplun @Buristan skaplun@tarantool.org,
+               Igor Munkin @igormunkin imun@tarantool.org,
+               Sergey Ostanevich @sergos sergos@tarantool.org
+* **Issues**: [#5442](https://github.com/tarantool/tarantool/issues/5442)
+
+## Summary
+
+LuaJIT memory profiler is a toolchain for analysis of memory usage by user's
+application.
+
+## Background and motivation
+
+Garbage collector (GC) is a curse of performance for most of Lua applications.
+Memory usage of Lua application should be profiled to find out various
+memory-unoptimized code blocks. If the application has memory leaks they can be
+found with the profiler.
+
+## Detailed design
+
+The whole toolchain of memory profiling will be divided by several parts:
+1) Prerequisites.
+2) Recording information about memory usage and saving it.
+3) Reading saved data and display it in human-readable format.
+
+### Prerequisites
+
+This section describes additional changes in LuaJIT required to feature
+implementation. This version of LuaJIT memory profiler does not support
+reporting allocations from traces. But trace code semantics should be totally
+the same as for Lua interpreter. So profiling with `jit.off()` should be
+enough.
+
+There are two different representations of functions in LuaJIT: the function's
+prototype (`GCproto`) and the function object so called closure (`GCfunc`).
+The closures are represented as `GCfuncL` and `GCfuncC` for Lua and C closures
+correspondingly. Also LuaJIT has special function's type aka Fast Function. It
+is used for LuaJIT builtins.
+
+Fast function allocation events always belong to the previous frame with
+considering of tail call optimizations (TCO).
+
+Assume we have the following Lua chunk named <test.lua>:
+
+```
+1  jit.off()
+2  misc.memprof.start("memprof_new.bin")
+3  local function append(str, rep)
+4      return string.rep(str, rep)
+5  end
+6
+7  local t = {}
+8  for _ = 1, 1e5 do
+9      table.insert(t,
+10         append('q', _)
+11     )
+12 end
+13 misc.memprof.stop()
+```
+
+Profilers output is like the follows:
+```
+ALLOCATIONS
+@test.lua:0, line 10: 100007 5004638934      0
+@test.lua:0, line 5: 1       40      0
+@test.lua:0, line 7: 1       72      0
+@test.lua:0, line 9: 1       48      0
+
+REALLOCATIONS
+@test.lua:0, line 9: 16      4194496 2097376
+        Overrides:
+                @test.lua:0, line 9
+
+@test.lua:0, line 10: 12     262080  131040
+        Overrides:
+                @test.lua:0, line 10
+
+
+DEALLOCATIONS
+INTERNAL: 21    0       2463
+@test.lua:0, line 10: 8      0       1044480
+        Overrides:
+                @test.lua:0, line 10
+```
+
+In Lua functions for profile events, we had to determine the line number of the
+function definition and corresponding `GCproto` address. For C functions only
+address will be enough. If Fast function is called from Lua function we had to
+report the Lua function for more meaningful output. Otherwise report the C
+function.
+
+So we need to know in what type of function CALL/RETURN virtual machine (VM)
+is. LuaJIT has already determined C function execution VM state but neither
+Fast functions nor Lua function. So corresponding VM states will be added.
+
+To determine currently allocating coroutine (that may not be equal to currently
+executed) new field will be added to `global_State` structure named `mem_L`
+kept coroutine address. This field sets at each reallocation to corresponding
+`L` with which it was called.
+
+There is the static function (`lj_debug_getframeline`) returned line number for
+current `BCPos` in `lj_debug.c` already. It will be added to the debug module
+API to be used in memory profiler.
+
+### Information recording
+
+Each allocate/reallocate/free is considered as a type of event that are
+reported. Event stream has the following format:
+
+```c
+/*
+** Event stream format:
+**
+** stream         := symtab memprof
+** symtab         := see <ljp_symtab.h>
+** memprof        := prologue event* epilogue
+** prologue       := 'l' 'j' 'm' version reserved
+** version        := <BYTE>
+** reserved       := <BYTE> <BYTE> <BYTE>
+** event          := event-alloc | event-realloc | event-free
+** event-alloc    := event-header loc? naddr nsize
+** event-realloc  := event-header loc? oaddr osize naddr nsize
+** event-free     := event-header loc? oaddr osize
+** event-header   := <BYTE>
+** loc            := loc-lua | loc-c
+** loc-lua        := sym-addr line-no
+** loc-c          := sym-addr
+** sym-addr       := <ULEB128>
+** line-no        := <ULEB128>
+** oaddr          := <ULEB128>
+** naddr          := <ULEB128>
+** osize          := <ULEB128>
+** nsize          := <ULEB128>
+** epilogue       := event-header
+**
+** <BYTE>   :  A single byte (no surprises here)
+** <ULEB128>:  Unsigned integer represented in ULEB128 encoding
+**
+** (Order of bits below is hi -> lo)
+**
+** version: [VVVVVVVV]
+**  * VVVVVVVV: Byte interpreted as a plain integer version number
+**
+** event-header: [FTUUSSEE]
+**  * EE   : 2 bits for representing allocation event type (AEVENT_*)
+**  * SS   : 2 bits for representing allocation source type (ASOURCE_*)
+**  * UU   : 2 unused bits
+**  * T    : Reserved. 0 for regular events, 1 for the events marked with
+**           the timestamp mark. It is assumed that the time distance between
+**           two marked events is approximately the same and is equal
+**           to 1 second. Always zero for now.
+**  * F    : 0 for regular events, 1 for epilogue's *F*inal header
+**           (if F is set to 1, all other bits are currently ignored)
+*/
+```
+
+It is enough to know the address of LUA/C function to determine it. Symbolic
+table (symtab) dumps at start of profiling to avoid determine and write line
+number of Lua code and corresponding chunk of code each time, when memory event
+happens. Each line contains the address, Lua chunk definition as the filename
+and line number of the function's declaration. This table of symbols has the
+following format described at <ljp_symtab.h>:
+
+```c
+/*
+** symtab format:
+**
+** symtab         := prologue sym*
+** prologue       := 'l' 'j' 's' version reserved
+** version        := <BYTE>
+** reserved       := <BYTE> <BYTE> <BYTE>
+** sym            := sym-lua | sym-final
+** sym-lua        := sym-header sym-addr sym-chunk sym-line
+** sym-header     := <BYTE>
+** sym-addr       := <ULEB128>
+** sym-chunk      := string
+** sym-line       := <ULEB128>
+** sym-final      := sym-header
+** string         := string-len string-payload
+** string-len     := <ULEB128>
+** string-payload := <BYTE> {string-len}
+**
+** <BYTE>   :  A single byte (no surprises here)
+** <ULEB128>:  Unsigned integer represented in ULEB128 encoding
+**
+** (Order of bits below is hi -> lo)
+**
+** version: [VVVVVVVV]
+**  * VVVVVVVV: Byte interpreted as a plain numeric version number
+**
+** sym-header: [FUUUUUTT]
+**  * TT    : 2 bits for representing symbol type
+**  * UUUUU : 5 unused bits
+**  * F     : 1 bit marking the end of the symtab (final symbol)
+*/
+```
+
+So when memory profiling starts default allocation function is replaced by the
+new allocation function as additional wrapper to write inspected profiling
+events. When profiler stops old allocation function is substituted back.
+
+Starting profiler from Lua is quite simple:
+```lua
+local started, err = misc.memprof.start(fname)
+```
+Where `fname` is name of the file where profile events are written. Writer for
+this function perform `fwrite()` for each call retrying in case of `EINTR`.
+Final callback calls `fclose()` at the end of profiling. If it is impossible to
+open a file for writing or profiler fails to start, returns `nil` on failure
+(plus an error message as a second result and a system-dependent error code as
+a third result). Otherwise returns some true value.
+
+Stopping profiler from Lua is simple too:
+```lua
+local stopped, err = misc.memprof.stop()
+```
+
+If there is any error occurred at profiling stopping (an error when file
+descriptor was closed) `memprof.stop()` returns `nil` (plus an error message as
+a second result and a system-dependent error code as a third result). Returns
+`true` otherwise.
+
+If you want to build LuaJIT without memory profiler, you should build it with
+`-DLUAJIT_DISABLE_MEMPROF`. If it is disabled `misc.memprof.start()` and
+`misc.memprof.stop()` always return `false`.
+
+Memory profiler is expected to be thread safe, so it has a corresponding
+lock/unlock at internal mutex whenever you call `luaM_memprof_*`. If you want
+to build LuaJIT without thread safety use `-DLUAJIT_DISABLE_THREAD_SAFE`.
+
+### Reading and displaying saved data
+
+Binary data can be read by `lj-parse-memprof` utility. It parses the binary
+format provided from memory profiler and render it in human-readable format.
+
+The usage is very simple:
+```
+$ ./luajit-parse-memprof --help
+luajit-parse-memprof - parser of the memory usage profile collected
+                       with LuaJIT's memprof.
+
+SYNOPSIS
+
+luajit-parse-memprof [options] memprof.bin
+
+Supported options are:
+
+  --help                            Show this help and exit
+```
+
+Plain text of profiled info has the following format:
+```
+@<filename>:<function_line>, line <line where event was detected>: <number of events>	<allocated>	<freed>
+```
+See example above.
+
+`INTERNAL` means that this allocations are caused by internal LuaJIT
+structures. Note that events are sorted from the most often to the least.
+
+`Overrides` means what allocation this reallocation overrides.
+
+## Benchmarks
+
+Benchmarks were taken from repo:
+[LuaJIT-test-cleanup](https://github.com/LuaJIT/LuaJIT-test-cleanup).
+
+Example of usage:
+```bash
+/usr/bin/time -f"array3d %U" ./luajit $BENCH_DIR/array3d.lua 300 >/dev/null
+```
+
+Benchmark results before and after the patch (less is better):
+
+```
+               | BEFORE | AFTER,memprof off | AFTER,memprof on
+---------------+--------+-------------------+-----------------
+array3d        |  0.22  |        0.20       |       0.21
+binary-trees   |  3.32  |        3.33       |       3.94
+chameneos      |  2.92  |        3.18       |       3.12
+coroutine-ring |  0.99  |        1.00       |       0.99
+euler14-bit    |  1.04  |        1.05       |       1.03
+fannkuch       |  6.77  |        6.69       |       6.64
+fasta          |  8.27  |        8.30       |       8.25
+life           |  0.48  |        0.48       |       1.03
+mandelbrot     |  2.69  |        2.70       |       2.75
+mandelbrot-bit |  1.99  |        2.00       |       2.08
+md5            |  1.57  |        1.61       |       1.56
+nbody          |  1.35  |        1.38       |       1.33
+nsieve         |  2.11  |        2.19       |       2.09
+nsieve-bit     |  1.50  |        1.55       |       1.47
+nsieve-bit-fp  |  4.40  |        4.63       |       4.44
+partialsums    |  0.54  |        0.58       |       0.55
+pidigits-nogmp |  3.48  |        3.50       |       3.47
+ray            |  1.63  |        1.68       |       1.64
+recursive-ack  |  0.19  |        0.22       |       0.20
+recursive-fib  |  1.62  |        1.71       |       1.63
+scimark-fft    |  5.78  |        5.94       |       5.69
+scimark-lu     |  3.26  |        3.57       |       3.59
+scimark-sor    |  2.34  |        2.35       |       2.33
+scimark-sparse |  5.03  |        4.92       |       4.91
+series         |  0.94  |        0.96       |       0.95
+spectral-norm  |  0.96  |        0.96       |       0.95
+```
-- 
2.28.0

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Tarantool-discussions] [RFC luajit v2] rfc: describe a LuaJIT memory profiler
  2020-12-16 19:09 [Tarantool-discussions] [RFC luajit v2] rfc: describe a LuaJIT memory profiler Sergey Kaplun
@ 2020-12-20 22:37 ` Igor Munkin
  2020-12-25 13:36   ` Sergey Kaplun
  2020-12-22 12:02 ` Sergey Ostanevich
  1 sibling, 1 reply; 5+ messages in thread
From: Igor Munkin @ 2020-12-20 22:37 UTC (permalink / raw)
  To: Sergey Kaplun; +Cc: tarantool-discussions

Sergey,

Thanks for the new version! The design is fine, but please consider my
minor comments regarding the document itself.

On 16.12.20, Sergey Kaplun wrote:
> Part of #5442
> ---
> 
> Issues: https://github.com/tarantool/tarantool/issues/5442
> Branch: https://github.com/tarantool/tarantool/tree/skaplun/gh-5442-luajit-memory-profiler
> 
> Changes in v2:
> * Removed C API, Tarantool integration and description of additional
>   features -- they will be added in another RFC if necessary.
> * Removed checking profile is running from the public API.
> * Added benchmarks and more meaningful example.
> * Grammar fixes.
> 
>  doc/rfc/5442-luajit-memory-profiler.md | 306 +++++++++++++++++++++++++
>  1 file changed, 306 insertions(+)
>  create mode 100644 doc/rfc/5442-luajit-memory-profiler.md
> 
> diff --git a/doc/rfc/5442-luajit-memory-profiler.md b/doc/rfc/5442-luajit-memory-profiler.md
> new file mode 100644
> index 000000000..720105009
> --- /dev/null
> +++ b/doc/rfc/5442-luajit-memory-profiler.md
> @@ -0,0 +1,306 @@

<snipped>

> +## Detailed design
> +
> +The whole toolchain of memory profiling will be divided by several parts:
> +1) Prerequisites.
> +2) Recording information about memory usage and saving it.
> +3) Reading saved data and display it in human-readable format.

I believe the links would also be great here.

> +
> +### Prerequisites
> +
> +This section describes additional changes in LuaJIT required to feature
> +implementation. This version of LuaJIT memory profiler does not support
> +reporting allocations from traces. But trace code semantics should be totally
> +the same as for Lua interpreter. So profiling with `jit.off()` should be
> +enough.

From this part I guess nothing works with enabled JIT. As a result of
the offline discussion it occurs everything works but allocation on
traces are attributed as internal ones. I think this is not a big deal
since JIT is a black internal box. Please mention this explicitly above.

Minor: It's worth to mention that there might be less GC pressure while
trace execution (considering allocation sinking optimization impact),
but the traces itself are allocated in GC area.

> +

<snipped>

> +
> +Fast function allocation events always belong to the previous frame with
> +considering of tail call optimizations (TCO).

Still don't get it. In the previous version I asked to explicitly
mention that allocations in builtins are attributed with their caller.
And another case is a tail call optimization, when allocations are also
attributes with the caller.

So, the example is fine (but comments are desired, IMHO), but I suggest
to separate these cases in the passage above.

> +
> +Assume we have the following Lua chunk named <test.lua>:
> +
> +```
> +1  jit.off()
> +2  misc.memprof.start("memprof_new.bin")
> +3  local function append(str, rep)
> +4      return string.rep(str, rep)
> +5  end
> +6
> +7  local t = {}
> +8  for _ = 1, 1e5 do
> +9      table.insert(t,
> +10         append('q', _)
> +11     )
> +12 end
> +13 misc.memprof.stop()
> +```
> +
> +Profilers output is like the follows:
> +```
> +ALLOCATIONS
> +@test.lua:0, line 10: 100007 5004638934      0
> +@test.lua:0, line 5: 1       40      0
> +@test.lua:0, line 7: 1       72      0
> +@test.lua:0, line 9: 1       48      0
> +
> +REALLOCATIONS
> +@test.lua:0, line 9: 16      4194496 2097376
> +        Overrides:
> +                @test.lua:0, line 9
> +
> +@test.lua:0, line 10: 12     262080  131040
> +        Overrides:
> +                @test.lua:0, line 10
> +
> +
> +DEALLOCATIONS
> +INTERNAL: 21    0       2463
> +@test.lua:0, line 10: 8      0       1044480
> +        Overrides:
> +                @test.lua:0, line 10
> +```

I have no idea what is dumped here. There is a legend below (in reading
and displaying saved data section), so leave a link to it right here.
Otherwise this is some kind of elvish.

> +
> +In Lua functions for profile events, we had to determine the line number of the
> +function definition and corresponding `GCproto` address. For C functions only
> +address will be enough. If Fast function is called from Lua function we had to
> +report the Lua function for more meaningful output. Otherwise report the C
> +function.

Still don't understand two last sentences. In the previous reply you
told this relates to the part above (about the fast functions). You
reworded that part, but this is still left unclear.

> +
> +So we need to know in what type of function CALL/RETURN virtual machine (VM)

AFAICS the function type is enough, isn't it? As you mentioned above
allocation are attributed to the callers in case of tail call
optimization.

> +is. LuaJIT has already determined C function execution VM state but neither
> +Fast functions nor Lua function. So corresponding VM states will be added.

<snipped>

> +### Information recording
> +
> +Each allocate/reallocate/free is considered as a type of event that are
> +reported. Event stream has the following format:
> +
> +```c
> +/*
> +** Event stream format:
> +**
> +** stream         := symtab memprof
> +** symtab         := see <ljp_symtab.h>
> +** memprof        := prologue event* epilogue
> +** prologue       := 'l' 'j' 'm' version reserved
> +** version        := <BYTE>
> +** reserved       := <BYTE> <BYTE> <BYTE>
> +** event          := event-alloc | event-realloc | event-free
> +** event-alloc    := event-header loc? naddr nsize
> +** event-realloc  := event-header loc? oaddr osize naddr nsize
> +** event-free     := event-header loc? oaddr osize
> +** event-header   := <BYTE>
> +** loc            := loc-lua | loc-c
> +** loc-lua        := sym-addr line-no
> +** loc-c          := sym-addr
> +** sym-addr       := <ULEB128>
> +** line-no        := <ULEB128>
> +** oaddr          := <ULEB128>
> +** naddr          := <ULEB128>
> +** osize          := <ULEB128>
> +** nsize          := <ULEB128>
> +** epilogue       := event-header
> +**
> +** <BYTE>   :  A single byte (no surprises here)
> +** <ULEB128>:  Unsigned integer represented in ULEB128 encoding
> +**
> +** (Order of bits below is hi -> lo)
> +**
> +** version: [VVVVVVVV]
> +**  * VVVVVVVV: Byte interpreted as a plain integer version number
> +**
> +** event-header: [FTUUSSEE]
> +**  * EE   : 2 bits for representing allocation event type (AEVENT_*)
> +**  * SS   : 2 bits for representing allocation source type (ASOURCE_*)
> +**  * UU   : 2 unused bits
> +**  * T    : Reserved. 0 for regular events, 1 for the events marked with
> +**           the timestamp mark. It is assumed that the time distance between
> +**           two marked events is approximately the same and is equal
> +**           to 1 second. Always zero for now.

It looks this is zero always in our case, so we have 3 unused bits,
right? Then just drop this part.

> +**  * F    : 0 for regular events, 1 for epilogue's *F*inal header
> +**           (if F is set to 1, all other bits are currently ignored)
> +*/
> +```
> +
> +It is enough to know the address of LUA/C function to determine it. Symbolic
> +table (symtab) dumps at start of profiling to avoid determine and write line
> +number of Lua code and corresponding chunk of code each time, when memory event
> +happens. Each line contains the address, Lua chunk definition as the filename
> +and line number of the function's declaration. This table of symbols has the
> +following format described at <ljp_symtab.h>:
> +
> +```c
> +/*
> +** symtab format:
> +**
> +** symtab         := prologue sym*
> +** prologue       := 'l' 'j' 's' version reserved
> +** version        := <BYTE>
> +** reserved       := <BYTE> <BYTE> <BYTE>
> +** sym            := sym-lua | sym-final
> +** sym-lua        := sym-header sym-addr sym-chunk sym-line
> +** sym-header     := <BYTE>
> +** sym-addr       := <ULEB128>
> +** sym-chunk      := string
> +** sym-line       := <ULEB128>
> +** sym-final      := sym-header
> +** string         := string-len string-payload
> +** string-len     := <ULEB128>
> +** string-payload := <BYTE> {string-len}
> +**
> +** <BYTE>   :  A single byte (no surprises here)
> +** <ULEB128>:  Unsigned integer represented in ULEB128 encoding
> +**
> +** (Order of bits below is hi -> lo)
> +**
> +** version: [VVVVVVVV]
> +**  * VVVVVVVV: Byte interpreted as a plain numeric version number
> +**
> +** sym-header: [FUUUUUTT]
> +**  * TT    : 2 bits for representing symbol type
> +**  * UUUUU : 5 unused bits
> +**  * F     : 1 bit marking the end of the symtab (final symbol)
> +*/
> +```

<snipped>

> +Starting profiler from Lua is quite simple:
> +```lua
> +local started, err = misc.memprof.start(fname)
> +```
> +Where `fname` is name of the file where profile events are written. Writer for
> +this function perform `fwrite()` for each call retrying in case of `EINTR`.
> +Final callback calls `fclose()` at the end of profiling. If it is impossible to
> +open a file for writing or profiler fails to start, returns `nil` on failure
> +(plus an error message as a second result and a system-dependent error code as
> +a third result). Otherwise returns some true value.

What the heck is the third result and why is it ignored in the example
above? I think this error should be incorporated in the second argument
kinda <perror> does.

> +
> +Stopping profiler from Lua is simple too:
> +```lua
> +local stopped, err = misc.memprof.stop()
> +```
> +
> +If there is any error occurred at profiling stopping (an error when file
> +descriptor was closed) `memprof.stop()` returns `nil` (plus an error message as
> +a second result and a system-dependent error code as a third result). Returns
> +`true` otherwise.

Ditto.

> +

<snipped>

> +Memory profiler is expected to be thread safe, so it has a corresponding
> +lock/unlock at internal mutex whenever you call `luaM_memprof_*`. If you want
> +to build LuaJIT without thread safety use `-DLUAJIT_DISABLE_THREAD_SAFE`.

I see no luaM_memprof_* interfaces. So what is this thread safety for?

> +

<snipped>

> +## Benchmarks
> +
> +Benchmarks were taken from repo:
> +[LuaJIT-test-cleanup](https://github.com/LuaJIT/LuaJIT-test-cleanup).
> +
> +Example of usage:
> +```bash
> +/usr/bin/time -f"array3d %U" ./luajit $BENCH_DIR/array3d.lua 300 >/dev/null
> +```
> +
> +Benchmark results before and after the patch (less is better):

There are some considerable differences in benchmark results, so I have
several questions:
* Mention whether you tested with enabled JIT or not
* How many iterations did you made? What is the dispersion/noise for
  these runs?
* How come the values with the *enabled* memprof are less than values
  with disabled memprof and even the vanilla LuaJIT? E.g. fasta or fast
  fourier transform.

> +
> +```
> +               | BEFORE | AFTER,memprof off | AFTER,memprof on
> +---------------+--------+-------------------+-----------------
> +array3d        |  0.22  |        0.20       |       0.21
> +binary-trees   |  3.32  |        3.33       |       3.94
> +chameneos      |  2.92  |        3.18       |       3.12
> +coroutine-ring |  0.99  |        1.00       |       0.99
> +euler14-bit    |  1.04  |        1.05       |       1.03
> +fannkuch       |  6.77  |        6.69       |       6.64
> +fasta          |  8.27  |        8.30       |       8.25
> +life           |  0.48  |        0.48       |       1.03
> +mandelbrot     |  2.69  |        2.70       |       2.75
> +mandelbrot-bit |  1.99  |        2.00       |       2.08
> +md5            |  1.57  |        1.61       |       1.56
> +nbody          |  1.35  |        1.38       |       1.33
> +nsieve         |  2.11  |        2.19       |       2.09
> +nsieve-bit     |  1.50  |        1.55       |       1.47
> +nsieve-bit-fp  |  4.40  |        4.63       |       4.44
> +partialsums    |  0.54  |        0.58       |       0.55
> +pidigits-nogmp |  3.48  |        3.50       |       3.47
> +ray            |  1.63  |        1.68       |       1.64
> +recursive-ack  |  0.19  |        0.22       |       0.20
> +recursive-fib  |  1.62  |        1.71       |       1.63
> +scimark-fft    |  5.78  |        5.94       |       5.69
> +scimark-lu     |  3.26  |        3.57       |       3.59
> +scimark-sor    |  2.34  |        2.35       |       2.33
> +scimark-sparse |  5.03  |        4.92       |       4.91
> +series         |  0.94  |        0.96       |       0.95
> +spectral-norm  |  0.96  |        0.96       |       0.95
> +```
> -- 
> 2.28.0
> 

-- 
Best regards,
IM

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Tarantool-discussions] [RFC luajit v2] rfc: describe a LuaJIT memory profiler
  2020-12-16 19:09 [Tarantool-discussions] [RFC luajit v2] rfc: describe a LuaJIT memory profiler Sergey Kaplun
  2020-12-20 22:37 ` Igor Munkin
@ 2020-12-22 12:02 ` Sergey Ostanevich
  2020-12-25 13:14   ` Sergey Kaplun
  1 sibling, 1 reply; 5+ messages in thread
From: Sergey Ostanevich @ 2020-12-22 12:02 UTC (permalink / raw)
  To: Sergey Kaplun; +Cc: tarantool-discussions

Hi!

Thanks for the patch!

See some comments below, after being applied is LGTM.

Sergos

> On 16 Dec 2020, at 22:09, Sergey Kaplun <skaplun@tarantool.org> wrote:
> 
> Part of #5442
> ---
> 
> Issues: https://github.com/tarantool/tarantool/issues/5442
> Branch: https://github.com/tarantool/tarantool/tree/skaplun/gh-5442-luajit-memory-profiler
> 
> Changes in v2:
> * Removed C API, Tarantool integration and description of additional
>  features -- they will be added in another RFC if necessary.
> * Removed checking profile is running from the public API.
> * Added benchmarks and more meaningful example.
> * Grammar fixes.
> 
> doc/rfc/5442-luajit-memory-profiler.md | 306 +++++++++++++++++++++++++
> 1 file changed, 306 insertions(+)
> create mode 100644 doc/rfc/5442-luajit-memory-profiler.md
> 
> diff --git a/doc/rfc/5442-luajit-memory-profiler.md b/doc/rfc/5442-luajit-memory-profiler.md
> new file mode 100644
> index 000000000..720105009
> --- /dev/null
> +++ b/doc/rfc/5442-luajit-memory-profiler.md
> @@ -0,0 +1,306 @@
> +# LuaJIT memory profiler
> +
> +* **Status**: In progress
> +* **Start date**: 24-10-2020
> +* **Authors**: Sergey Kaplun @Buristan skaplun@tarantool.org,
> +               Igor Munkin @igormunkin imun@tarantool.org,
> +               Sergey Ostanevich @sergos sergos@tarantool.org
> +* **Issues**: [#5442](https://github.com/tarantool/tarantool/issues/5442)
> +
> +## Summary
> +
> +LuaJIT memory profiler is a toolchain for analysis of memory usage by user's
> +application.
> +
> +## Background and motivation
> +
> +Garbage collector (GC) is a curse of performance for most of Lua applications.

This is something that is very questionable. Do you have any data on performance
of Lua code against the GC code in real life? Is there any about Tarantool
applications?

I’m not questioning the mem profiler itself, rather trying to weigh the GC problem.

> +Memory usage of Lua application should be profiled to find out various
							 ^^^^^^^ locate

> +memory-unoptimized code blocks. If the application has memory leaks they can be
> +found with the profiler.
			   ^ also.
> +
> +## Detailed design
> +
> +The whole toolchain of memory profiling will be divided by several parts:
							  ^^^ into
> +1) Prerequisites.
> +2) Recording information about memory usage and saving it.
> +3) Reading saved data and display it in human-readable format.
> +
> +### Prerequisites
> +
> +This section describes additional changes in LuaJIT required to feature
								^^ for the 
> +implementation. This version of LuaJIT memory profiler does not support
> +reporting allocations from traces. But trace code semantics should be totally
> +the same as for Lua interpreter. So profiling with `jit.off()` should be
> +enough.
> +
> +There are two different representations of functions in LuaJIT: the function's
> +prototype (`GCproto`) and the function object so called closure (`GCfunc`).
> +The closures are represented as `GCfuncL` and `GCfuncC` for Lua and C closures
> +correspondingly. Also LuaJIT has special function's type aka Fast Function. It
                                   ^ a
> +is used for LuaJIT builtins.
> +
> +Fast function allocation events always belong to the previous frame with
> +considering of tail call optimizations (TCO).
> +
> +Assume we have the following Lua chunk named <test.lua>:
> +
> +```
> +1  jit.off()
> +2  misc.memprof.start("memprof_new.bin")
> +3  local function append(str, rep)
> +4      return string.rep(str, rep)
> +5  end
> +6
> +7  local t = {}
> +8  for _ = 1, 1e5 do
> +9      table.insert(t,
> +10         append('q', _)
> +11     )
> +12 end
> +13 misc.memprof.stop()
> +```
> +
> +Profilers output is like the follows:
> +```
> +ALLOCATIONS
> +@test.lua:0, line 10: 100007 5004638934      0
> +@test.lua:0, line 5: 1       40      0
> +@test.lua:0, line 7: 1       72      0
> +@test.lua:0, line 9: 1       48      0
> +
> +REALLOCATIONS
> +@test.lua:0, line 9: 16      4194496 2097376
> +        Overrides:
> +                @test.lua:0, line 9
> +
> +@test.lua:0, line 10: 12     262080  131040
> +        Overrides:
> +                @test.lua:0, line 10
> +
> +
> +DEALLOCATIONS
> +INTERNAL: 21    0       2463
> +@test.lua:0, line 10: 8      0       1044480
> +        Overrides:
> +                @test.lua:0, line 10
> +```
> +
> +In Lua functions for profile events, we had to determine the line number of the
> +function definition and corresponding `GCproto` address. For C functions only
> +address will be enough. If Fast function is called from Lua function we had to
> +report the Lua function for more meaningful output. Otherwise report the C
> +function.
> +
> +So we need to know in what type of function CALL/RETURN virtual machine (VM)
                      ^^^^^^^ a                           ^ the 
> +is. LuaJIT has already determined C function execution VM state but neither
    ^ currently in. 

> +is. LuaJIT has already determined C function execution VM state but neither
      Currently VM state identifies C function execution only, so Fast and Lua
functions states will be added. 

> +Fast functions nor Lua function. So corresponding VM states will be added.
> +
> +To determine currently allocating coroutine (that may not be equal to currently
> +executed) new field will be added to `global_State` structure named `mem_L`
            ^ a

> +kept coroutine address. This field sets at each reallocation to corresponding
   ^^^ to keep the                    ^^^ is set

> +`L` with which it was called.
> +
> +There is the static function (`lj_debug_getframeline`) returned line number for
            ^^^ a                                         ^^^ that returns
> +current `BCPos` in `lj_debug.c` already. It will be added to the debug module
> +API to be used in memory profiler.
> +
> +### Information recording
> +
> +Each allocate/reallocate/free is considered as a type of event that are
> +reported. Event stream has the following format:
> +
> +```c
> +/*
> +** Event stream format:
> +**
> +** stream         := symtab memprof
> +** symtab         := see <ljp_symtab.h>
> +** memprof        := prologue event* epilogue
> +** prologue       := 'l' 'j' 'm' version reserved
> +** version        := <BYTE>
> +** reserved       := <BYTE> <BYTE> <BYTE>
> +** event          := event-alloc | event-realloc | event-free
> +** event-alloc    := event-header loc? naddr nsize
> +** event-realloc  := event-header loc? oaddr osize naddr nsize
> +** event-free     := event-header loc? oaddr osize
> +** event-header   := <BYTE>
> +** loc            := loc-lua | loc-c
> +** loc-lua        := sym-addr line-no
> +** loc-c          := sym-addr
> +** sym-addr       := <ULEB128>
> +** line-no        := <ULEB128>
> +** oaddr          := <ULEB128>
> +** naddr          := <ULEB128>
> +** osize          := <ULEB128>
> +** nsize          := <ULEB128>
> +** epilogue       := event-header
> +**
> +** <BYTE>   :  A single byte (no surprises here)
> +** <ULEB128>:  Unsigned integer represented in ULEB128 encoding
> +**
> +** (Order of bits below is hi -> lo)
> +**
> +** version: [VVVVVVVV]
> +**  * VVVVVVVV: Byte interpreted as a plain integer version number
> +**
> +** event-header: [FTUUSSEE]
> +**  * EE   : 2 bits for representing allocation event type (AEVENT_*)
> +**  * SS   : 2 bits for representing allocation source type (ASOURCE_*)
> +**  * UU   : 2 unused bits
> +**  * T    : Reserved. 0 for regular events, 1 for the events marked with
> +**           the timestamp mark. It is assumed that the time distance between

No representation of a timestamp so far? If so - maybe remove excessive description
and put ‘reserved for timestamp’.

> +**           two marked events is approximately the same and is equal
> +**           to 1 second. Always zero for now.
> +**  * F    : 0 for regular events, 1 for epilogue's *F*inal header
> +**           (if F is set to 1, all other bits are currently ignored)
> +*/
> +```
> +
> +It is enough to know the address of LUA/C function to determine it. Symbolic
> +table (symtab) dumps at start of profiling to avoid determine and write line
> +number of Lua code and corresponding chunk of code each time, when memory event
> +happens. Each line contains the address, Lua chunk definition as the filename
> +and line number of the function's declaration. This table of symbols has the
> +following format described at <ljp_symtab.h>:
> +
> +```c
> +/*
> +** symtab format:
> +**
> +** symtab         := prologue sym*
> +** prologue       := 'l' 'j' 's' version reserved
> +** version        := <BYTE>
> +** reserved       := <BYTE> <BYTE> <BYTE>
> +** sym            := sym-lua | sym-final
> +** sym-lua        := sym-header sym-addr sym-chunk sym-line
> +** sym-header     := <BYTE>
> +** sym-addr       := <ULEB128>
> +** sym-chunk      := string
> +** sym-line       := <ULEB128>
> +** sym-final      := sym-header
> +** string         := string-len string-payload
> +** string-len     := <ULEB128>
> +** string-payload := <BYTE> {string-len}
> +**
> +** <BYTE>   :  A single byte (no surprises here)
> +** <ULEB128>:  Unsigned integer represented in ULEB128 encoding
> +**
> +** (Order of bits below is hi -> lo)
> +**
> +** version: [VVVVVVVV]
> +**  * VVVVVVVV: Byte interpreted as a plain numeric version number
> +**
> +** sym-header: [FUUUUUTT]
> +**  * TT    : 2 bits for representing symbol type
> +**  * UUUUU : 5 unused bits
> +**  * F     : 1 bit marking the end of the symtab (final symbol)
> +*/
> +```
> +
> +So when memory profiling starts default allocation function is replaced by the
> +new allocation function as additional wrapper to write inspected profiling
> +events. When profiler stops old allocation function is substituted back.
> +
> +Starting profiler from Lua is quite simple:
> +```lua
> +local started, err = misc.memprof.start(fname)
> +```
> +Where `fname` is name of the file where profile events are written. Writer for
> +this function perform `fwrite()` for each call retrying in case of `EINTR`.
> +Final callback calls `fclose()` at the end of profiling. If it is impossible to
> +open a file for writing or profiler fails to start, returns `nil` on failure
> +(plus an error message as a second result and a system-dependent error code as
> +a third result). Otherwise returns some true value.
> +
> +Stopping profiler from Lua is simple too:
> +```lua
> +local stopped, err = misc.memprof.stop()
> +```
> +
> +If there is any error occurred at profiling stopping (an error when file
> +descriptor was closed) `memprof.stop()` returns `nil` (plus an error message as
> +a second result and a system-dependent error code as a third result). Returns

Why do you need to separate an ‘error’ from a ’system error’?

> +`true` otherwise.
> +
> +If you want to build LuaJIT without memory profiler, you should build it with
> +`-DLUAJIT_DISABLE_MEMPROF`. If it is disabled `misc.memprof.start()` and
> +`misc.memprof.stop()` always return `false`.
> +
> +Memory profiler is expected to be thread safe, so it has a corresponding
> +lock/unlock at internal mutex whenever you call `luaM_memprof_*`. If you want
> +to build LuaJIT without thread safety use `-DLUAJIT_DISABLE_THREAD_SAFE`.
> +
> +### Reading and displaying saved data
> +
> +Binary data can be read by `lj-parse-memprof` utility. It parses the binary
> +format provided from memory profiler and render it in human-readable format.
> +
> +The usage is very simple:
> +```
> +$ ./luajit-parse-memprof --help
> +luajit-parse-memprof - parser of the memory usage profile collected
> +                       with LuaJIT's memprof.
> +
> +SYNOPSIS
> +
> +luajit-parse-memprof [options] memprof.bin
> +
> +Supported options are:
> +
> +  --help                            Show this help and exit
> +```
> +
> +Plain text of profiled info has the following format:
> +```
> +@<filename>:<function_line>, line <line where event was detected>: <number of events>	<allocated>	<freed>
> +```
> +See example above.
> +
> +`INTERNAL` means that this allocations are caused by internal LuaJIT
> +structures. Note that events are sorted from the most often to the least.
> +
> +`Overrides` means what allocation this reallocation overrides.
> +
> +## Benchmarks
> +
> +Benchmarks were taken from repo:
> +[LuaJIT-test-cleanup](https://github.com/LuaJIT/LuaJIT-test-cleanup).
> +
> +Example of usage:
Either ‘usage’ or ‘example of use’


> +```bash
> +/usr/bin/time -f"array3d %U" ./luajit $BENCH_DIR/array3d.lua 300 >/dev/null
> +```
> +
> +Benchmark results before and after the patch (less is better):

Comparative to ‘BEFORE’ is more reafable, dispersion should also help with
reading of results.

> +
> +```
> +               | BEFORE | AFTER,memprof off | AFTER,memprof on
> +---------------+--------+-------------------+-----------------
> +array3d        |  0.22  |        0.20       |       0.21
> +binary-trees   |  3.32  |        3.33       |       3.94
> +chameneos      |  2.92  |        3.18       |       3.12
> +coroutine-ring |  0.99  |        1.00       |       0.99
> +euler14-bit    |  1.04  |        1.05       |       1.03
> +fannkuch       |  6.77  |        6.69       |       6.64
> +fasta          |  8.27  |        8.30       |       8.25
> +life           |  0.48  |        0.48       |       1.03
> +mandelbrot     |  2.69  |        2.70       |       2.75
> +mandelbrot-bit |  1.99  |        2.00       |       2.08
> +md5            |  1.57  |        1.61       |       1.56
> +nbody          |  1.35  |        1.38       |       1.33
> +nsieve         |  2.11  |        2.19       |       2.09
> +nsieve-bit     |  1.50  |        1.55       |       1.47
> +nsieve-bit-fp  |  4.40  |        4.63       |       4.44
> +partialsums    |  0.54  |        0.58       |       0.55
> +pidigits-nogmp |  3.48  |        3.50       |       3.47
> +ray            |  1.63  |        1.68       |       1.64
> +recursive-ack  |  0.19  |        0.22       |       0.20
> +recursive-fib  |  1.62  |        1.71       |       1.63
> +scimark-fft    |  5.78  |        5.94       |       5.69
> +scimark-lu     |  3.26  |        3.57       |       3.59
> +scimark-sor    |  2.34  |        2.35       |       2.33
> +scimark-sparse |  5.03  |        4.92       |       4.91
> +series         |  0.94  |        0.96       |       0.95
> +spectral-norm  |  0.96  |        0.96       |       0.95
> +```
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Tarantool-discussions] [RFC luajit v2] rfc: describe a LuaJIT memory profiler
  2020-12-22 12:02 ` Sergey Ostanevich
@ 2020-12-25 13:14   ` Sergey Kaplun
  0 siblings, 0 replies; 5+ messages in thread
From: Sergey Kaplun @ 2020-12-25 13:14 UTC (permalink / raw)
  To: Sergey Ostanevich; +Cc: tarantool-discussions

Hi!

Thanks for the review!

On 22.12.20, Sergey Ostanevich wrote:
> Hi!
> 
> Thanks for the patch!
> 
> See some comments below, after being applied is LGTM.

Please, see new verison here [1].

> 
> Sergos
> 
> > On 16 Dec 2020, at 22:09, Sergey Kaplun <skaplun@tarantool.org> wrote:
> > 
> > Part of #5442
> > ---
> > 
> > Issues: https://github.com/tarantool/tarantool/issues/5442
> > Branch: https://github.com/tarantool/tarantool/tree/skaplun/gh-5442-luajit-memory-profiler
> > 
> > Changes in v2:
> > * Removed C API, Tarantool integration and description of additional
> >  features -- they will be added in another RFC if necessary.
> > * Removed checking profile is running from the public API.
> > * Added benchmarks and more meaningful example.
> > * Grammar fixes.
> > 
> > doc/rfc/5442-luajit-memory-profiler.md | 306 +++++++++++++++++++++++++
> > 1 file changed, 306 insertions(+)
> > create mode 100644 doc/rfc/5442-luajit-memory-profiler.md
> > 
> > diff --git a/doc/rfc/5442-luajit-memory-profiler.md b/doc/rfc/5442-luajit-memory-profiler.md
> > new file mode 100644
> > index 000000000..720105009
> > --- /dev/null
> > +++ b/doc/rfc/5442-luajit-memory-profiler.md
> > @@ -0,0 +1,306 @@
> > +# LuaJIT memory profiler
> > +
> > +* **Status**: In progress
> > +* **Start date**: 24-10-2020
> > +* **Authors**: Sergey Kaplun @Buristan skaplun@tarantool.org,
> > +               Igor Munkin @igormunkin imun@tarantool.org,
> > +               Sergey Ostanevich @sergos sergos@tarantool.org
> > +* **Issues**: [#5442](https://github.com/tarantool/tarantool/issues/5442)
> > +
> > +## Summary
> > +
> > +LuaJIT memory profiler is a toolchain for analysis of memory usage by user's
> > +application.
> > +
> > +## Background and motivation
> > +
> > +Garbage collector (GC) is a curse of performance for most of Lua applications.
> 
> This is something that is very questionable. Do you have any data on performance
> of Lua code against the GC code in real life? Is there any about Tarantool
> applications?
> 
> I’m not questioning the mem profiler itself, rather trying to weigh the GC problem.

I can give benchmarks for vshard [2] for now. There are approximately
20% of time spent in `lj_gc_steps()`, IINM.

> 
> > +Memory usage of Lua application should be profiled to find out various
> 							 ^^^^^^^ locate

Fixed.

> 
> > +memory-unoptimized code blocks. If the application has memory leaks they can be
> > +found with the profiler.
> 			   ^ also.

Fixed.

> > +
> > +## Detailed design
> > +
> > +The whole toolchain of memory profiling will be divided by several parts:
> 							  ^^^ into

Thank you! Fixed.

> > +1) Prerequisites.
> > +2) Recording information about memory usage and saving it.
> > +3) Reading saved data and display it in human-readable format.
> > +
> > +### Prerequisites
> > +
> > +This section describes additional changes in LuaJIT required to feature
> 								^^ for the 

Fixed.

> > +implementation. This version of LuaJIT memory profiler does not support
> > +reporting allocations from traces. But trace code semantics should be totally
> > +the same as for Lua interpreter. So profiling with `jit.off()` should be
> > +enough.
> > +
> > +There are two different representations of functions in LuaJIT: the function's
> > +prototype (`GCproto`) and the function object so called closure (`GCfunc`).
> > +The closures are represented as `GCfuncL` and `GCfuncC` for Lua and C closures
> > +correspondingly. Also LuaJIT has special function's type aka Fast Function. It
>                                    ^ a

Fixed.

> > +is used for LuaJIT builtins.
> > +
> > +Fast function allocation events always belong to the previous frame with
> > +considering of tail call optimizations (TCO).
> > +
> > +Assume we have the following Lua chunk named <test.lua>:
> > +
> > +```
> > +1  jit.off()
> > +2  misc.memprof.start("memprof_new.bin")
> > +3  local function append(str, rep)
> > +4      return string.rep(str, rep)
> > +5  end
> > +6
> > +7  local t = {}
> > +8  for _ = 1, 1e5 do
> > +9      table.insert(t,
> > +10         append('q', _)
> > +11     )
> > +12 end
> > +13 misc.memprof.stop()
> > +```
> > +
> > +Profilers output is like the follows:
> > +```
> > +ALLOCATIONS
> > +@test.lua:0, line 10: 100007 5004638934      0
> > +@test.lua:0, line 5: 1       40      0
> > +@test.lua:0, line 7: 1       72      0
> > +@test.lua:0, line 9: 1       48      0
> > +
> > +REALLOCATIONS
> > +@test.lua:0, line 9: 16      4194496 2097376
> > +        Overrides:
> > +                @test.lua:0, line 9
> > +
> > +@test.lua:0, line 10: 12     262080  131040
> > +        Overrides:
> > +                @test.lua:0, line 10
> > +
> > +
> > +DEALLOCATIONS
> > +INTERNAL: 21    0       2463
> > +@test.lua:0, line 10: 8      0       1044480
> > +        Overrides:
> > +                @test.lua:0, line 10
> > +```
> > +
> > +In Lua functions for profile events, we had to determine the line number of the
> > +function definition and corresponding `GCproto` address. For C functions only
> > +address will be enough. If Fast function is called from Lua function we had to
> > +report the Lua function for more meaningful output. Otherwise report the C
> > +function.
> > +
> > +So we need to know in what type of function CALL/RETURN virtual machine (VM)
>                       ^^^^^^^ a                           ^ the 
> > +is. LuaJIT has already determined C function execution VM state but neither
>     ^ currently in. 
> 
> > +is. LuaJIT has already determined C function execution VM state but neither
>       Currently VM state identifies C function execution only, so Fast and Lua
> functions states will be added. 

Reformulated in the new version.

> 
> > +Fast functions nor Lua function. So corresponding VM states will be added.
> > +
> > +To determine currently allocating coroutine (that may not be equal to currently
> > +executed) new field will be added to `global_State` structure named `mem_L`
>             ^ a

Fixed.

> 
> > +kept coroutine address. This field sets at each reallocation to corresponding
>    ^^^ to keep the                    ^^^ is set

Fixed.

> 
> > +`L` with which it was called.
> > +
> > +There is the static function (`lj_debug_getframeline`) returned line number for
>             ^^^ a                                         ^^^ that returns

Fixed.

> > +current `BCPos` in `lj_debug.c` already. It will be added to the debug module
> > +API to be used in memory profiler.
> > +
> > +### Information recording
> > +
> > +Each allocate/reallocate/free is considered as a type of event that are
> > +reported. Event stream has the following format:
> > +
> > +```c
> > +/*
> > +** Event stream format:
> > +**
> > +** stream         := symtab memprof
> > +** symtab         := see <ljp_symtab.h>
> > +** memprof        := prologue event* epilogue
> > +** prologue       := 'l' 'j' 'm' version reserved
> > +** version        := <BYTE>
> > +** reserved       := <BYTE> <BYTE> <BYTE>
> > +** event          := event-alloc | event-realloc | event-free
> > +** event-alloc    := event-header loc? naddr nsize
> > +** event-realloc  := event-header loc? oaddr osize naddr nsize
> > +** event-free     := event-header loc? oaddr osize
> > +** event-header   := <BYTE>
> > +** loc            := loc-lua | loc-c
> > +** loc-lua        := sym-addr line-no
> > +** loc-c          := sym-addr
> > +** sym-addr       := <ULEB128>
> > +** line-no        := <ULEB128>
> > +** oaddr          := <ULEB128>
> > +** naddr          := <ULEB128>
> > +** osize          := <ULEB128>
> > +** nsize          := <ULEB128>
> > +** epilogue       := event-header
> > +**
> > +** <BYTE>   :  A single byte (no surprises here)
> > +** <ULEB128>:  Unsigned integer represented in ULEB128 encoding
> > +**
> > +** (Order of bits below is hi -> lo)
> > +**
> > +** version: [VVVVVVVV]
> > +**  * VVVVVVVV: Byte interpreted as a plain integer version number
> > +**
> > +** event-header: [FTUUSSEE]
> > +**  * EE   : 2 bits for representing allocation event type (AEVENT_*)
> > +**  * SS   : 2 bits for representing allocation source type (ASOURCE_*)
> > +**  * UU   : 2 unused bits
> > +**  * T    : Reserved. 0 for regular events, 1 for the events marked with
> > +**           the timestamp mark. It is assumed that the time distance between
> 
> No representation of a timestamp so far? If so - maybe remove excessive description
> and put ‘reserved for timestamp’.

I've dropped it according your and Igors comments.

> 
> > +**           two marked events is approximately the same and is equal
> > +**           to 1 second. Always zero for now.
> > +**  * F    : 0 for regular events, 1 for epilogue's *F*inal header
> > +**           (if F is set to 1, all other bits are currently ignored)
> > +*/
> > +```
> > +
> > +It is enough to know the address of LUA/C function to determine it. Symbolic
> > +table (symtab) dumps at start of profiling to avoid determine and write line
> > +number of Lua code and corresponding chunk of code each time, when memory event
> > +happens. Each line contains the address, Lua chunk definition as the filename
> > +and line number of the function's declaration. This table of symbols has the
> > +following format described at <ljp_symtab.h>:
> > +
> > +```c
> > +/*
> > +** symtab format:
> > +**
> > +** symtab         := prologue sym*
> > +** prologue       := 'l' 'j' 's' version reserved
> > +** version        := <BYTE>
> > +** reserved       := <BYTE> <BYTE> <BYTE>
> > +** sym            := sym-lua | sym-final
> > +** sym-lua        := sym-header sym-addr sym-chunk sym-line
> > +** sym-header     := <BYTE>
> > +** sym-addr       := <ULEB128>
> > +** sym-chunk      := string
> > +** sym-line       := <ULEB128>
> > +** sym-final      := sym-header
> > +** string         := string-len string-payload
> > +** string-len     := <ULEB128>
> > +** string-payload := <BYTE> {string-len}
> > +**
> > +** <BYTE>   :  A single byte (no surprises here)
> > +** <ULEB128>:  Unsigned integer represented in ULEB128 encoding
> > +**
> > +** (Order of bits below is hi -> lo)
> > +**
> > +** version: [VVVVVVVV]
> > +**  * VVVVVVVV: Byte interpreted as a plain numeric version number
> > +**
> > +** sym-header: [FUUUUUTT]
> > +**  * TT    : 2 bits for representing symbol type
> > +**  * UUUUU : 5 unused bits
> > +**  * F     : 1 bit marking the end of the symtab (final symbol)
> > +*/
> > +```
> > +
> > +So when memory profiling starts default allocation function is replaced by the
> > +new allocation function as additional wrapper to write inspected profiling
> > +events. When profiler stops old allocation function is substituted back.
> > +
> > +Starting profiler from Lua is quite simple:
> > +```lua
> > +local started, err = misc.memprof.start(fname)
> > +```
> > +Where `fname` is name of the file where profile events are written. Writer for
> > +this function perform `fwrite()` for each call retrying in case of `EINTR`.
> > +Final callback calls `fclose()` at the end of profiling. If it is impossible to
> > +open a file for writing or profiler fails to start, returns `nil` on failure
> > +(plus an error message as a second result and a system-dependent error code as
> > +a third result). Otherwise returns some true value.
> > +
> > +Stopping profiler from Lua is simple too:
> > +```lua
> > +local stopped, err = misc.memprof.stop()
> > +```
> > +
> > +If there is any error occurred at profiling stopping (an error when file
> > +descriptor was closed) `memprof.stop()` returns `nil` (plus an error message as
> > +a second result and a system-dependent error code as a third result). Returns
> 
> Why do you need to separate an ‘error’ from a ’system error’?

I've used the same formulation as in Lua Reference Manual [3].
It's errno, in fact, but IINM it is system dependent.

> 
> > +`true` otherwise.
> > +
> > +If you want to build LuaJIT without memory profiler, you should build it with
> > +`-DLUAJIT_DISABLE_MEMPROF`. If it is disabled `misc.memprof.start()` and
> > +`misc.memprof.stop()` always return `false`.
> > +
> > +Memory profiler is expected to be thread safe, so it has a corresponding
> > +lock/unlock at internal mutex whenever you call `luaM_memprof_*`. If you want
> > +to build LuaJIT without thread safety use `-DLUAJIT_DISABLE_THREAD_SAFE`.
> > +
> > +### Reading and displaying saved data
> > +
> > +Binary data can be read by `lj-parse-memprof` utility. It parses the binary
> > +format provided from memory profiler and render it in human-readable format.
> > +
> > +The usage is very simple:
> > +```
> > +$ ./luajit-parse-memprof --help
> > +luajit-parse-memprof - parser of the memory usage profile collected
> > +                       with LuaJIT's memprof.
> > +
> > +SYNOPSIS
> > +
> > +luajit-parse-memprof [options] memprof.bin
> > +
> > +Supported options are:
> > +
> > +  --help                            Show this help and exit
> > +```
> > +
> > +Plain text of profiled info has the following format:
> > +```
> > +@<filename>:<function_line>, line <line where event was detected>: <number of events>	<allocated>	<freed>
> > +```
> > +See example above.
> > +
> > +`INTERNAL` means that this allocations are caused by internal LuaJIT
> > +structures. Note that events are sorted from the most often to the least.
> > +
> > +`Overrides` means what allocation this reallocation overrides.
> > +
> > +## Benchmarks
> > +
> > +Benchmarks were taken from repo:
> > +[LuaJIT-test-cleanup](https://github.com/LuaJIT/LuaJIT-test-cleanup).
> > +
> > +Example of usage:
> Either ‘usage’ or ‘example of use’

Fixed, thank you!

> 
> 
> > +```bash
> > +/usr/bin/time -f"array3d %U" ./luajit $BENCH_DIR/array3d.lua 300 >/dev/null
> > +```
> > +
> > +Benchmark results before and after the patch (less is better):
> 
> Comparative to ‘BEFORE’ is more reafable, dispersion should also help with
> reading of results.

Done in the new version.

> 
> > +
> > +```
> > +               | BEFORE | AFTER,memprof off | AFTER,memprof on
> > +---------------+--------+-------------------+-----------------
> > +array3d        |  0.22  |        0.20       |       0.21
> > +binary-trees   |  3.32  |        3.33       |       3.94
> > +chameneos      |  2.92  |        3.18       |       3.12
> > +coroutine-ring |  0.99  |        1.00       |       0.99
> > +euler14-bit    |  1.04  |        1.05       |       1.03
> > +fannkuch       |  6.77  |        6.69       |       6.64
> > +fasta          |  8.27  |        8.30       |       8.25
> > +life           |  0.48  |        0.48       |       1.03
> > +mandelbrot     |  2.69  |        2.70       |       2.75
> > +mandelbrot-bit |  1.99  |        2.00       |       2.08
> > +md5            |  1.57  |        1.61       |       1.56
> > +nbody          |  1.35  |        1.38       |       1.33
> > +nsieve         |  2.11  |        2.19       |       2.09
> > +nsieve-bit     |  1.50  |        1.55       |       1.47
> > +nsieve-bit-fp  |  4.40  |        4.63       |       4.44
> > +partialsums    |  0.54  |        0.58       |       0.55
> > +pidigits-nogmp |  3.48  |        3.50       |       3.47
> > +ray            |  1.63  |        1.68       |       1.64
> > +recursive-ack  |  0.19  |        0.22       |       0.20
> > +recursive-fib  |  1.62  |        1.71       |       1.63
> > +scimark-fft    |  5.78  |        5.94       |       5.69
> > +scimark-lu     |  3.26  |        3.57       |       3.59
> > +scimark-sor    |  2.34  |        2.35       |       2.33
> > +scimark-sparse |  5.03  |        4.92       |       4.91
> > +series         |  0.94  |        0.96       |       0.95
> > +spectral-norm  |  0.96  |        0.96       |       0.95
> > +```
> > -- 
> > 2.28.0
> > 
> 

[1]: https://lists.tarantool.org/pipermail/tarantool-discussions/2020-December/000147.html
[2]: https://github.com/tarantool/vshard/issues/224#issuecomment-637632349
[3]: https://www.lua.org/manual/5.1/manual.html#5.7

-- 
Best regards,
Sergey Kaplun

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Tarantool-discussions] [RFC luajit v2] rfc: describe a LuaJIT memory profiler
  2020-12-20 22:37 ` Igor Munkin
@ 2020-12-25 13:36   ` Sergey Kaplun
  0 siblings, 0 replies; 5+ messages in thread
From: Sergey Kaplun @ 2020-12-25 13:36 UTC (permalink / raw)
  To: Igor Munkin; +Cc: tarantool-discussions

Hi, Igor!

Thanks for the review!

On 21.12.20, Igor Munkin wrote:
> Sergey,
> 
> Thanks for the new version! The design is fine, but please consider my
> minor comments regarding the document itself.

Please, see new version here [1].

> 
> On 16.12.20, Sergey Kaplun wrote:
> > Part of #5442
> > ---
> > 
> > Issues: https://github.com/tarantool/tarantool/issues/5442
> > Branch: https://github.com/tarantool/tarantool/tree/skaplun/gh-5442-luajit-memory-profiler
> > 
> > Changes in v2:
> > * Removed C API, Tarantool integration and description of additional
> >   features -- they will be added in another RFC if necessary.
> > * Removed checking profile is running from the public API.
> > * Added benchmarks and more meaningful example.
> > * Grammar fixes.
> > 
> >  doc/rfc/5442-luajit-memory-profiler.md | 306 +++++++++++++++++++++++++
> >  1 file changed, 306 insertions(+)
> >  create mode 100644 doc/rfc/5442-luajit-memory-profiler.md
> > 
> > diff --git a/doc/rfc/5442-luajit-memory-profiler.md b/doc/rfc/5442-luajit-memory-profiler.md
> > new file mode 100644
> > index 000000000..720105009
> > --- /dev/null
> > +++ b/doc/rfc/5442-luajit-memory-profiler.md
> > @@ -0,0 +1,306 @@
> 
> <snipped>
> 
> > +## Detailed design
> > +
> > +The whole toolchain of memory profiling will be divided by several parts:
> > +1) Prerequisites.
> > +2) Recording information about memory usage and saving it.
> > +3) Reading saved data and display it in human-readable format.
> 
> I believe the links would also be great here.

Done in the new version.

> 
> > +
> > +### Prerequisites
> > +
> > +This section describes additional changes in LuaJIT required to feature
> > +implementation. This version of LuaJIT memory profiler does not support
> > +reporting allocations from traces. But trace code semantics should be totally
> > +the same as for Lua interpreter. So profiling with `jit.off()` should be
> > +enough.
> 
> From this part I guess nothing works with enabled JIT. As a result of
> the offline discussion it occurs everything works but allocation on
> traces are attributed as internal ones. I think this is not a big deal
> since JIT is a black internal box. Please mention this explicitly above.
> 
> Minor: It's worth to mention that there might be less GC pressure while
> trace execution (considering allocation sinking optimization impact),
> but the traces itself are allocated in GC area.

Added, thanks!

> 
> > +
> 
> <snipped>
> 
> > +
> > +Fast function allocation events always belong to the previous frame with
> > +considering of tail call optimizations (TCO).
> 
> Still don't get it. In the previous version I asked to explicitly
> mention that allocations in builtins are attributed with their caller.
> And another case is a tail call optimization, when allocations are also
> attributes with the caller.
> 
> So, the example is fine (but comments are desired, IMHO), but I suggest
> to separate these cases in the passage above.

Reformulated in the new verison as we discussed offline.

> 
> > +
> > +Assume we have the following Lua chunk named <test.lua>:
> > +
> > +```
> > +1  jit.off()
> > +2  misc.memprof.start("memprof_new.bin")
> > +3  local function append(str, rep)
> > +4      return string.rep(str, rep)
> > +5  end
> > +6
> > +7  local t = {}
> > +8  for _ = 1, 1e5 do
> > +9      table.insert(t,
> > +10         append('q', _)
> > +11     )
> > +12 end
> > +13 misc.memprof.stop()
> > +```
> > +
> > +Profilers output is like the follows:
> > +```
> > +ALLOCATIONS
> > +@test.lua:0, line 10: 100007 5004638934      0
> > +@test.lua:0, line 5: 1       40      0
> > +@test.lua:0, line 7: 1       72      0
> > +@test.lua:0, line 9: 1       48      0
> > +
> > +REALLOCATIONS
> > +@test.lua:0, line 9: 16      4194496 2097376
> > +        Overrides:
> > +                @test.lua:0, line 9
> > +
> > +@test.lua:0, line 10: 12     262080  131040
> > +        Overrides:
> > +                @test.lua:0, line 10
> > +
> > +
> > +DEALLOCATIONS
> > +INTERNAL: 21    0       2463
> > +@test.lua:0, line 10: 8      0       1044480
> > +        Overrides:
> > +                @test.lua:0, line 10
> > +```
> 
> I have no idea what is dumped here. There is a legend below (in reading
> and displaying saved data section), so leave a link to it right here.
> Otherwise this is some kind of elvish.

Added link.

> 
> > +
> > +In Lua functions for profile events, we had to determine the line number of the
> > +function definition and corresponding `GCproto` address. For C functions only
> > +address will be enough. If Fast function is called from Lua function we had to
> > +report the Lua function for more meaningful output. Otherwise report the C
> > +function.
> 
> Still don't understand two last sentences. In the previous reply you
> told this relates to the part above (about the fast functions). You
> reworded that part, but this is still left unclear.

Reformulated in the new verison as we discussed offline.

> 
> > +
> > +So we need to know in what type of function CALL/RETURN virtual machine (VM)
> 
> AFAICS the function type is enough, isn't it? As you mentioned above
> allocation are attributed to the callers in case of tail call
> optimization.

Reformulated in the new verison as we discussed offline.

> 
> > +is. LuaJIT has already determined C function execution VM state but neither
> > +Fast functions nor Lua function. So corresponding VM states will be added.
> 
> <snipped>
> 
> > +### Information recording
> > +
> > +Each allocate/reallocate/free is considered as a type of event that are
> > +reported. Event stream has the following format:
> > +
> > +```c
> > +/*
> > +** Event stream format:
> > +**
> > +** stream         := symtab memprof
> > +** symtab         := see <ljp_symtab.h>
> > +** memprof        := prologue event* epilogue
> > +** prologue       := 'l' 'j' 'm' version reserved
> > +** version        := <BYTE>
> > +** reserved       := <BYTE> <BYTE> <BYTE>
> > +** event          := event-alloc | event-realloc | event-free
> > +** event-alloc    := event-header loc? naddr nsize
> > +** event-realloc  := event-header loc? oaddr osize naddr nsize
> > +** event-free     := event-header loc? oaddr osize
> > +** event-header   := <BYTE>
> > +** loc            := loc-lua | loc-c
> > +** loc-lua        := sym-addr line-no
> > +** loc-c          := sym-addr
> > +** sym-addr       := <ULEB128>
> > +** line-no        := <ULEB128>
> > +** oaddr          := <ULEB128>
> > +** naddr          := <ULEB128>
> > +** osize          := <ULEB128>
> > +** nsize          := <ULEB128>
> > +** epilogue       := event-header
> > +**
> > +** <BYTE>   :  A single byte (no surprises here)
> > +** <ULEB128>:  Unsigned integer represented in ULEB128 encoding
> > +**
> > +** (Order of bits below is hi -> lo)
> > +**
> > +** version: [VVVVVVVV]
> > +**  * VVVVVVVV: Byte interpreted as a plain integer version number
> > +**
> > +** event-header: [FTUUSSEE]
> > +**  * EE   : 2 bits for representing allocation event type (AEVENT_*)
> > +**  * SS   : 2 bits for representing allocation source type (ASOURCE_*)
> > +**  * UU   : 2 unused bits
> > +**  * T    : Reserved. 0 for regular events, 1 for the events marked with
> > +**           the timestamp mark. It is assumed that the time distance between
> > +**           two marked events is approximately the same and is equal
> > +**           to 1 second. Always zero for now.
> 
> It looks this is zero always in our case, so we have 3 unused bits,
> right? Then just drop this part.

I've dropped it according your and Sergos comments.

> 

<snipped>

> > +Starting profiler from Lua is quite simple:
> > +```lua
> > +local started, err = misc.memprof.start(fname)
> > +```
> > +Where `fname` is name of the file where profile events are written. Writer for
> > +this function perform `fwrite()` for each call retrying in case of `EINTR`.
> > +Final callback calls `fclose()` at the end of profiling. If it is impossible to
> > +open a file for writing or profiler fails to start, returns `nil` on failure
> > +(plus an error message as a second result and a system-dependent error code as
> > +a third result). Otherwise returns some true value.
> 
> What the heck is the third result and why is it ignored in the example
> above? I think this error should be incorporated in the second argument
> kinda <perror> does.

I've used the same formulation as in Lua Reference Manual [2].
It's errno, in fact, but IINM it is system dependent.

> 
> > +
> > +Stopping profiler from Lua is simple too:
> > +```lua
> > +local stopped, err = misc.memprof.stop()
> > +```
> > +
> > +If there is any error occurred at profiling stopping (an error when file
> > +descriptor was closed) `memprof.stop()` returns `nil` (plus an error message as
> > +a second result and a system-dependent error code as a third result). Returns
> > +`true` otherwise.
> 
> Ditto.

Same meaning as above.

> 
> > +
> 
> <snipped>
> 
> > +Memory profiler is expected to be thread safe, so it has a corresponding
> > +lock/unlock at internal mutex whenever you call `luaM_memprof_*`. If you want
> > +to build LuaJIT without thread safety use `-DLUAJIT_DISABLE_THREAD_SAFE`.
> 
> I see no luaM_memprof_* interfaces. So what is this thread safety for?

It is used if you've created different VMs and try to start profiling
from different threads. But unfortunately for now we have one static
structure for all threads. It is necessary to avoid polushion
from other profiled VMs.

> 
> > +
> 
> <snipped>
> 
> > +## Benchmarks
> > +
> > +Benchmarks were taken from repo:
> > +[LuaJIT-test-cleanup](https://github.com/LuaJIT/LuaJIT-test-cleanup).
> > +
> > +Example of usage:
> > +```bash
> > +/usr/bin/time -f"array3d %U" ./luajit $BENCH_DIR/array3d.lua 300 >/dev/null
> > +```
> > +
> > +Benchmark results before and after the patch (less is better):
> 
> There are some considerable differences in benchmark results, so I have
> several questions:
> * Mention whether you tested with enabled JIT or not

Enabled of course.
For each test I've added this preambule:
| local start = misc.memprof.start("/tmp/memprof_tmp.bin")
| assert(start)

and this postambule:
| local stop = misc.memprof.stop()
| assert(stop)

> * How many iterations did you made? What is the dispersion/noise for
>   these runs?

In the new version I've run 11 itterations for each test. And I've
added standard deviation for each test.

> * How come the values with the *enabled* memprof are less than values
>   with disabled memprof and even the vanilla LuaJIT? E.g. fasta or fast
>   fourier transform.

Good question. I've no answer for now.
My wild guess: garbage collector starts earlier (when amount of objects
is less) and so runs faster.

> 
> > +
> > +```
> > +               | BEFORE | AFTER,memprof off | AFTER,memprof on
> > +---------------+--------+-------------------+-----------------
> > +array3d        |  0.22  |        0.20       |       0.21
> > +binary-trees   |  3.32  |        3.33       |       3.94
> > +chameneos      |  2.92  |        3.18       |       3.12
> > +coroutine-ring |  0.99  |        1.00       |       0.99
> > +euler14-bit    |  1.04  |        1.05       |       1.03
> > +fannkuch       |  6.77  |        6.69       |       6.64
> > +fasta          |  8.27  |        8.30       |       8.25
> > +life           |  0.48  |        0.48       |       1.03
> > +mandelbrot     |  2.69  |        2.70       |       2.75
> > +mandelbrot-bit |  1.99  |        2.00       |       2.08
> > +md5            |  1.57  |        1.61       |       1.56
> > +nbody          |  1.35  |        1.38       |       1.33
> > +nsieve         |  2.11  |        2.19       |       2.09
> > +nsieve-bit     |  1.50  |        1.55       |       1.47
> > +nsieve-bit-fp  |  4.40  |        4.63       |       4.44
> > +partialsums    |  0.54  |        0.58       |       0.55
> > +pidigits-nogmp |  3.48  |        3.50       |       3.47
> > +ray            |  1.63  |        1.68       |       1.64
> > +recursive-ack  |  0.19  |        0.22       |       0.20
> > +recursive-fib  |  1.62  |        1.71       |       1.63
> > +scimark-fft    |  5.78  |        5.94       |       5.69
> > +scimark-lu     |  3.26  |        3.57       |       3.59
> > +scimark-sor    |  2.34  |        2.35       |       2.33
> > +scimark-sparse |  5.03  |        4.92       |       4.91
> > +series         |  0.94  |        0.96       |       0.95
> > +spectral-norm  |  0.96  |        0.96       |       0.95
> > +```
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Best regards,
> IM

[1]: https://lists.tarantool.org/pipermail/tarantool-discussions/2020-December/000147.html
[2]: https://www.lua.org/manual/5.1/manual.html#5.7

-- 
Best regards,
Sergey Kaplun

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-12-25 13:37 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-16 19:09 [Tarantool-discussions] [RFC luajit v2] rfc: describe a LuaJIT memory profiler Sergey Kaplun
2020-12-20 22:37 ` Igor Munkin
2020-12-25 13:36   ` Sergey Kaplun
2020-12-22 12:02 ` Sergey Ostanevich
2020-12-25 13:14   ` Sergey Kaplun

Tarantool discussions archive

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://lists.tarantool.org/tarantool-discussions/0 tarantool-discussions/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 tarantool-discussions tarantool-discussions/ https://lists.tarantool.org/tarantool-discussions \
		tarantool-discussions@dev.tarantool.org.
	public-inbox-index tarantool-discussions

Example config snippet for mirrors.


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git