Hi! 

Thanks for the patch, I've looked into 
https://github.com/tarantool/tarantool/blob/skaplun/gh-5442-luajit-memory-profiler-rfc/doc/rfc/5442-luajit-memory-profiler.md

in ‘Prerequisites’: 
Also all, deallocations are reported as internal too.

the comma is not needed

Lua developers can do nothing with allocations made inside the built-ins except reducing its usage.

‘its’ doesn’t explain exact matter. I would rephrase: "As for allocations made inside the built-ins user can do nothing but reduce use of these built-ins."

Currently VM state identifies C function execution only, so Fast and Lua functions states are added.

‘Currently’ -> ‘Originally’

Otherwise LGTM.
Sergos


On 20 Jan 2021, at 11:19, Sergey Kaplun <skaplun@tarantool.org> wrote:

Hi, Igor!

Thanks for the review!

On 15.01.21, Igor Munkin wrote:
Sergey,

Thanks for the changes. There is a bit of nitpicking below and I
believe we'll push the next version doc to the trunk.

I've fixed all your comments, plus added some insignificant fixes.
See two iterative patches below. Branch is force pushed.


On 25.12.20, Sergey Kaplun wrote:
Part of #5442
---

RFC on branch: https://github.com/tarantool/tarantool/blob/skaplun/gh-5442-luajit-memory-profiler/doc/rfc/5442-luajit-memory-profiler.md

Side note: branch name is updated.
New RFC version: https://github.com/tarantool/tarantool/blob/skaplun/gh-5442-luajit-memory-profiler-rfc/doc/rfc/5442-luajit-memory-profiler.md


Changes in v3:
* More comments in example.
* More verbose benchmark information.
* Grammar and spelling fixes.

Changes in v2:
* Removed C API, Tarantool integration and description of additional
 features -- they will be added in another RFC if necessary.
* Removed checking profile is running from the public API.
* Added benchmarks and more meaningful example.
* Grammar fixes.

doc/rfc/5442-luajit-memory-profiler.md | 314 +++++++++++++++++++++++++
1 file changed, 314 insertions(+)
create mode 100644 doc/rfc/5442-luajit-memory-profiler.md

diff --git a/doc/rfc/5442-luajit-memory-profiler.md b/doc/rfc/5442-luajit-memory-profiler.md
new file mode 100644
index 000000000..85a61462a
--- /dev/null
+++ b/doc/rfc/5442-luajit-memory-profiler.md
@@ -0,0 +1,314 @@

<snipped>

+### Prerequisites
+
+This section describes additional changes in LuaJIT required for the feature
+implementation. This version of LuaJIT memory profiler does not support verbose
+reporting allocations from traces. All allocation from traces are reported as

Typo: s/reporting allocations from/reporting for allocations made on/.

Fixed, thanks!


+internal. But trace code semantics should be totally the same as for the Lua
+interpreter (excluding sink optimizations). Also all deallocations reported as

Typo: s/deallocations reported/deallocation are reported/.

Fixed, thanks!


+internal too.
+
+There are two different representations of functions in LuaJIT: the function's
+prototype (`GCproto`) and the function object so called closure (`GCfunc`).
+The closures are represented as `GCfuncL` and `GCfuncC` for Lua and C closures
+correspondingly. Also LuaJIT has a special function's type aka Fast Function.

Typo: s/correspondingly/respectively/.

+It is used for LuaJIT builtins.

It's better to not split this sentence. Consider the rewording:
| Besides LuaJIT has a special function type a.k.a. Fast Function that
| is used for LuaJIT builtins.

Applied! Thanks!


+

<snipped>

+Usually developers are not interested in information about allocations inside
+builtins. So if fast function was called from a Lua function all
+allocations are attributed to this Lua function. Otherwise attribute this event
+to a C function.

I propose the following rewording:
| Lua developers can do nothing with allocations made inside the
| builtins except reducing its usage. So if fast function is called from
| a Lua function all allocations made in its scope are attributed to this
| Lua function (i.e. the builtin caller). Otherwise this event is
| attributed to a C function.


Applied, thanks!

+

<snipped>

+If one run the chunk above the profiler reports approximately the following

Typo: s/run/runs/.

Fixed.


+(see legend [here](#reading-and-displaying-saved-data)):

<snipped>

+So we need to know a type of function being executed by the virtual machine
+(VM). Currently VM state identifies C function execution only, so Fast and Lua
+functions states will be added.

Typo: s/will be/are/.

Sure, thanks!


+
+To determine currently allocating coroutine (that may not be equal to currently
+executed one) a new field called `mem_L` is added to `global_State` structure
+to keep the coroutine address. This field is set at each reallocation to

Typo: /at each reallocation to/on each reallocation to the/.

Fixed.


+corresponding `L` with which it was called.

Typo: s/it was/it is/.

Thanks, fixed!


+

<snipped>

+When the profiling is stopped the `fclose()` is called. If it is impossible to

Typo: s/the `fclose()`/`fclose()`/.

Fixed.


+open a file for writing or profiler fails to start, returns `nil` on failure

Typo: s/returns `nil`/`nil` is returned/.

Fixed.


+(plus an error message as a second result and a system-dependent error code as
+a third result). Otherwise returns some true value.

It would be nice to mention that the function contract is similar to
other standart io.* interfaces.

I glanced the source code: it's not "some" true value; it is exactly the
*true* value.

All right! Fixed.


+

<snipped>

+Memory profiler is expected to be thread safe, so it has a corresponding
+lock/unlock at internal mutex whenever you call corresponding memprof
+functions. If you want to build LuaJIT without thread safety use
+`-DLUAJIT_DISABLE_THREAD_SAFE`.

This is not implemented in scope of the MVP, so drop this part.

Done.


+
+### Reading and displaying saved data
+
+Binary data can be read by `lj-parse-memprof` utility. It parses the binary

Typo: s/lj-parse-memprof/luajit-parse-memprof/.

Fixed, thanks!


+format provided by memory profiler and render it on human-readable format.

Typo: s/it on/it to/.

Fixed, thanks!


+

<snipped>

+This table shows performance deviation in relation to REFerence value (before
+commit) with stopped and running profiler. The table shows the average value
+for 11 runs. The first field of the column indicates the change in the average
+time in seconds (less is better). The second field is the standard deviation
+for the found difference.
+
+```
+     Name       | REF  | AFTER, memprof off | AFTER, memprof on
+----------------+------+--------------------+------------------
+array3d         | 0.21 |    +0.00 (0.01)    |    +0.00 (0.01)
+binary-trees    | 3.25 |    -0.01 (0.06)    |    +0.53 (0.10)
+chameneos       | 2.97 |    +0.14 (0.04)    |    +0.13 (0.06)
+coroutine-ring  | 1.00 |    +0.01 (0.04)    |    +0.01 (0.04)
+euler14-bit     | 1.03 |    +0.01 (0.02)    |    +0.00 (0.02)
+fannkuch        | 6.81 |    -0.21 (0.06)    |    -0.20 (0.06)
+fasta           | 8.20 |    -0.07 (0.05)    |    -0.08 (0.03)

Side note: Still curious how this can happen. It looks OK when this is
negative difference in within its deviation. But this is sorta magic.

Yes, me too. Unfortunately, we have neither any benchmark tests nor
performance analisis for LuaJIT for now.


+life            | 0.46 |    +0.00 (0.01)    |    +0.35 (0.01)
+mandelbrot      | 2.65 |    +0.00 (0.01)    |    +0.01 (0.01)
+mandelbrot-bit  | 1.97 |    +0.00 (0.01)    |    +0.01 (0.02)
+md5             | 1.58 |    -0.01 (0.04)    |    -0.04 (0.04)
+nbody           | 1.34 |    +0.00 (0.01)    |    -0.02 (0.01)
+nsieve          | 2.07 |    -0.03 (0.03)    |    -0.01 (0.04)
+nsieve-bit      | 1.50 |    -0.02 (0.04)    |    +0.00 (0.04)
+nsieve-bit-fp   | 4.44 |    -0.03 (0.07)    |    -0.01 (0.07)
+partialsums     | 0.54 |    +0.00 (0.01)    |    +0.00 (0.01)
+pidigits-nogmp  | 3.47 |    -0.01 (0.02)    |    -0.10 (0.02)
+ray             | 1.62 |    -0.02 (0.03)    |    +0.00 (0.02)
+recursive-ack   | 0.20 |    +0.00 (0.01)    |    +0.00 (0.01)
+recursive-fib   | 1.63 |    +0.00 (0.01)    |    +0.01 (0.02)
+scimark-fft     | 5.72 |    +0.06 (0.09)    |    -0.01 (0.10)
+scimark-lu      | 3.47 |    +0.02 (0.27)    |    -0.03 (0.26)
+scimark-sor     | 2.34 |    +0.00 (0.01)    |    -0.01 (0.01)
+scimark-sparse  | 4.95 |    -0.02 (0.04)    |    -0.02 (0.04)
+series          | 0.95 |    +0.00 (0.02)    |    +0.00 (0.01)
+spectral-norm   | 0.96 |    +0.00 (0.02)    |    -0.01 (0.02)
+```
-- 
2.28.0


-- 
Best regards,
IM

===================================================================
diff --git a/doc/rfc/5442-luajit-memory-profiler.md b/doc/rfc/5442-luajit-memory-profiler.md
index 85a61462a..2721f1cc1 100644
--- a/doc/rfc/5442-luajit-memory-profiler.md
+++ b/doc/rfc/5442-luajit-memory-profiler.md
@@ -30,39 +30,39 @@ The whole toolchain of memory profiling will be divided into several parts:

This section describes additional changes in LuaJIT required for the feature
implementation. This version of LuaJIT memory profiler does not support verbose
-reporting allocations from traces. All allocation from traces are reported as
-internal. But trace code semantics should be totally the same as for the Lua
-interpreter (excluding sink optimizations). Also all deallocations reported as
-internal too.
+reporting for allocations made on traces. All allocation from traces are
+reported as internal. But trace code semantics should be totally the same as
+for the Lua interpreter (excluding sink optimizations). Also all, deallocations
+are reported as internal too.

There are two different representations of functions in LuaJIT: the function's
prototype (`GCproto`) and the function object so called closure (`GCfunc`).
The closures are represented as `GCfuncL` and `GCfuncC` for Lua and C closures
-correspondingly. Also LuaJIT has a special function's type aka Fast Function.
-It is used for LuaJIT builtins.
+respectively. Besides LuaJIT has a special function type, a.k.a. Fast Function
+that is used for LuaJIT built-ins

Tail call optimization does not create a new call frame, so all allocations
inside the function called via `CALLT`/`CALLMT` are attributed to its caller.

-Usually developers are not interested in information about allocations inside
-builtins. So if fast function was called from a Lua function all
-allocations are attributed to this Lua function. Otherwise attribute this event
-to a C function.
+Lua developers can do nothing with allocations made inside the built-ins except
+reducing its usage. So if fast function is called from a Lua function all
+allocations made in its scope are attributed to this Lua function (i.e. the
+built-in caller). Otherwise, this event is attributed to a C function.

Assume we have the following Lua chunk named <test.lua>:

-```
+```lua
1  jit.off()
2  misc.memprof.start("memprof_new.bin")
-3  -- Lua does not create a new frame to call string.rep and all allocations are
-4  -- attributed not to `append()` function but to the parent scope.
+3  -- Lua does not create a new frame to call string.rep() and all allocations
+4  -- are attributed not to append() function but to the parent scope.
5  local function append(str, rep)
6    return string.rep(str, rep)
7  end
8
9  local t = {}
10 for _ = 1, 1e5 do
-11   -- table.insert is a builtin and all corresponding allocations
+11   -- table.insert() is a built-in and all corresponding allocations
12   -- are reported in the scope of main chunk
13   table.insert(t,
14     append('q', _)
@@ -71,7 +71,7 @@ Assume we have the following Lua chunk named <test.lua>:
17 misc.memprof.stop()
```

-If one run the chunk above the profiler reports approximately the following
+If one runs the chunk above the profiler reports approximately the following
(see legend [here](#reading-and-displaying-saved-data)):
```
ALLOCATIONS
@@ -99,15 +99,15 @@ INTERNAL: 20    0       1481

So we need to know a type of function being executed by the virtual machine
(VM). Currently VM state identifies C function execution only, so Fast and Lua
-functions states will be added.
+functions states are added.

To determine currently allocating coroutine (that may not be equal to currently
executed one) a new field called `mem_L` is added to `global_State` structure
-to keep the coroutine address. This field is set at each reallocation to
-corresponding `L` with which it was called.
+to keep the coroutine address. This field is set on each reallocation to the
+corresponding `L` with which it is called.

There is a static function (`lj_debug_getframeline`) that returns line number
-for current `BCPos` in `lj_debug.c` already. It will be added to the debug
+for current `BCPos` in `lj_debug.c` already. It is added to the debug
module API to be used in memory profiler.

### Information recording
@@ -211,10 +211,11 @@ local started, err, errno = misc.memprof.start(fname)
```
where `fname` is name of the file where profile events are written. Writer for
this function perform `fwrite()` for each call retrying in case of `EINTR`.
-When the profiling is stopped the `fclose()` is called. If it is impossible to
-open a file for writing or profiler fails to start, returns `nil` on failure
+When the profiling is stopped `fclose()` is called. The profiler's function's
+contract is similar to standard `io.*` interfaces. If it is impossible to open
+a file for writing or profiler fails to start, `nil` is returned on failure
(plus an error message as a second result and a system-dependent error code as
-a third result). Otherwise returns some true value.
+a third result). Otherwise, returns `true` value.

Stopping profiler from Lua is simple too:
```lua
@@ -230,17 +231,12 @@ If you want to build LuaJIT without memory profiler, you should build it with
`-DLUAJIT_DISABLE_MEMPROF`. If it is disabled `misc.memprof.start()` and
`misc.memprof.stop()` always return `false`.

-Memory profiler is expected to be thread safe, so it has a corresponding
-lock/unlock at internal mutex whenever you call corresponding memprof
-functions. If you want to build LuaJIT without thread safety use
-`-DLUAJIT_DISABLE_THREAD_SAFE`.
-
### Reading and displaying saved data

-Binary data can be read by `lj-parse-memprof` utility. It parses the binary
-format provided by memory profiler and render it on human-readable format.
+Binary data can be read by `luajit-parse-memprof` utility. It parses the binary
+format provided by memory profiler and render it to human-readable format.

-The usage is very simple:
+The usage for LuaJIT itself is very simple:
```
$ ./luajit-parse-memprof --help
luajit-parse-memprof - parser of the memory usage profile collected
@@ -266,6 +262,12 @@ structures. Note that events are sorted from the most often to the least.

`Overrides` means what allocation this reallocation overrides.

+If you want to parse binary data via Tarantool only, use the following
+command (dash is important):
+```bash
+$ tarantool -e 'require("memprof")(arg[1])' - memprof.bin
+```
+
## Benchmarks

Benchmarks were taken from repo:
===================================================================

And one more iterative patch (over the previous one). Branch is
force pushed.
===================================================================
diff --git a/doc/rfc/5442-luajit-memory-profiler.md b/doc/rfc/5442-luajit-memory-profiler.md
index 2721f1cc1..f9c43f91f 100644
--- a/doc/rfc/5442-luajit-memory-profiler.md
+++ b/doc/rfc/5442-luajit-memory-profiler.md
@@ -5,7 +5,7 @@
* **Authors**: Sergey Kaplun @Buristan skaplun@tarantool.org,
               Igor Munkin @igormunkin imun@tarantool.org,
               Sergey Ostanevich @sergos sergos@tarantool.org
-* **Issues**: [#5442](https://github.com/tarantool/tarantool/issues/5442)
+* **Issues**: [#5442](https://github.com/tarantool/tarantool/issues/5442), [#5490](https://github.com/tarantool/tarantool/issues/5490)

## Summary
===================================================================
-- 
Best regards,
Sergey Kaplun