Hi!
Thanks for the review!
 
 
Hi, Maxim!
Thanks for the fixes!
Generally LGTM, but please consider my comments below.

On 10.07.23, Maxim Kokryashkin wrote:
> Sometimes, the Lua stack can be inconsistent during
> the FFUNC execution, which may lead to a sysprof
> crash during the stack unwinding.
>
> This patch replaces the `top_frame` property of `global_State`
> with `lj_sysprof_topframe` structure, which contains `top_frame`
> and `ffid` properties. `ffid` property makes sense only when the
> LuaJIT VM state is set to `FFUNC`. That property is set to the
> ffid of the fast function that VM is about to execute.
> In the same time, `top_frame` property is not updated now, so
> the top frame of the Lua stack can be streamed based on the ffid,
> and the rest of the Lua stack can be streamed as usual.
>
> Also, this patch fixes build with plain makefile, by adding

Nit: I suggest to rephrase it like "original Makefile" or
Makefile.original.

Feel free to ignore.
Fixed! The branch is force-pushed.

> the `LJ_HASSYSPROF` flag support to it.
>
> Resolves tarantool/tarantool#8594
> ---
> Changes in v3:
> - Fixed comments as per review by Sergey
>
> Branch: https://github.com/tarantool/luajit/tree/fckxorg/gh-8594-sysprof-ffunc-crash
> PR: https://github.com/tarantool/tarantool/pull/8737
> src/Makefile.original | 3 ++
> src/lj_obj.h | 7 +++-
> src/lj_sysprof.c | 26 ++++++++++++---
> src/vm_x64.dasc | 22 +++++++++++--
> src/vm_x86.dasc | 31 ++++++++++++++---
> .../gh-8594-sysprof-ffunc-crash.test.lua | 33 +++++++++++++++++++
> 6 files changed, 109 insertions(+), 13 deletions(-)
> create mode 100644 test/tarantool-tests/gh-8594-sysprof-ffunc-crash.test.lua
>
> diff --git a/src/Makefile.original b/src/Makefile.original
> index aedaaa73..e21a0e56 100644
> --- a/src/Makefile.original
> +++ b/src/Makefile.original

<snipped>

> diff --git a/test/tarantool-tests/gh-8594-sysprof-ffunc-crash.test.lua b/test/tarantool-tests/gh-8594-sysprof-ffunc-crash.test.lua
> new file mode 100644
> index 00000000..e5cdeb07
> --- /dev/null
> +++ b/test/tarantool-tests/gh-8594-sysprof-ffunc-crash.test.lua
> @@ -0,0 +1,33 @@
> +local tap = require('tap')
> +local test = tap.test('gh-8594-sysprof-ffunc-crash'):skipcond({
> + ['Sysprof is implemented for x86_64 only'] = jit.arch ~= 'x86' and
> + jit.arch ~= 'x64',
> + ['Sysprof is implemented for Linux only'] = jit.os ~= 'Linux',
> +})
> +
> +test:plan(1)
> +
> +jit.off()
> +-- XXX: Run JIT tuning functions in a safe frame to avoid errors
> +-- thrown when LuaJIT is compiled with JIT engine disabled.
> +pcall(jit.flush)
> +
> +local TMP_BINFILE = '/dev/null'
> +
> +local res, err = misc.sysprof.start{
> + mode = 'C',
> + interval = 3,
> + path = TMP_BINFILE,
> +}
> +assert(res, err)
> +
> +for i = 1, 1e5 do
> + tostring(i)
> +end

Within these (interval/iterations) changes I hardly can see assertion failure
on master branch for my laptop (1/10 cases).

| >>> git log -n1 --no-decorate --oneline
| 8e46d601 test: fix flaky <unit-jit-parse.test.lua>
| >>> LUA_PATH="src/?.lua;test/tarantool-tests/?.lua;;" src/luajit test/tarantool-tests/gh-8594-sysprof-ffunc-crash.test.lua
| TAP version 13
| 1..1
| ok - sysprof finished successfully

I don't know how to resolve this problem with our CI at death's door...
OTOH, we may consider that this value is enough for our CI, so I'll see
the problem there.
So, I'll agree with Sergey's opinion about this flaky test.
I’ve added a conditional parametrization for this test case, here is the diff:
==========================================================
diff --git a/test/tarantool-tests/gh-8594-sysprof-ffunc-crash.test.lua b/test/tarantool-tests/gh-8594-sysprof-ffunc-crash.test.lua
index a053e41c..347bd087 100644
--- a/test/tarantool-tests/gh-8594-sysprof-ffunc-crash.test.lua
+++ b/test/tarantool-tests/gh-8594-sysprof-ffunc-crash.test.lua
@@ -17,17 +17,30 @@ local TMP_BINFILE = '/dev/null'
 -- XXX: The best way to test the issue is to set the profile
 -- interval to be as short as possible. However, our CI is
 -- not capable of handling such intense testing, so it was a
--- forced decision to reduce the sampling frequency. As a
+-- forced decision to reduce the sampling frequency for it. As a
 -- result, it is now less likely to reproduce the issue
 -- statistically, but the test case is still valid.
+
+-- GitHub always sets[1] the `CI` environment variable to `true`
+-- for every step in a workflow.
+-- [1]: https://docs.github.com/en/actions/learn-github-actions/variables#default-environment-variables
+local CI = os.getenv('CI') == 'true'
+
+-- Profile interval and number of iterations for CI are
+-- empirical. Non-CI profile interval is set to be as short
+-- as possible, so the issue is more likely to reproduce.
+-- Non-CI number of iterations is greater for the same reason.
+local PROFILE_INTERVAL = CI and 3 or 1
+local N_ITERATIONS = CI and 1e5 or 1e6
+
 local res, err = misc.sysprof.start{
   mode = 'C',
-  interval = 3,
+  interval = PROFILE_INTERVAL,
   path = TMP_BINFILE,
 }
 assert(res, err)
 
-for i = 1, 1e5 do
+for i = 1, N_ITERATIONS do
   -- XXX: `tostring` is FFUNC.
   tostring(i)
 end
==========================================================
The branch is force-pushed.

> +
> +res, err = misc.sysprof.stop()
> +assert(res, err)
> +
> +test:ok(true, 'sysprof finished successfully')
> +
> +os.exit(test:check() and 0 or 1)
> --
> 2.40.1
>

--
Best regards,
Sergey Kaplun
--
Best regards,
Maxim Kokryashkin