Hi, Igor! Thanks for the fixes! LGTM -- Best regards, Maxim Kokryashkin     >  >>Max, >> >>Thanks for your review! >> >>On 03.12.23, Maxim Kokryashkin wrote: >>> Hi, Igor! >>> Thanks for the patch! >>> Please consider my comments below. >>> >>> Side note: This new testing approach is amazing, thanks a lot! >>> >>> On Tue, Nov 28, 2023 at 02:53:17PM +0000, Igor Munkin wrote: >>> > Often tests for sampling profiler require long running loops to be >>> Typo: s/Often/Often,/ >>> > executed, so a certain situation is likely to occur. Thus the test added >>> Typo: s/Thus/Thus,/ >>> > in the commit 285a1b0a16c1c3224ab93f7cf1fef17ab29e4ab4 ("sysprof: fix >>> Typo: s/in the/in/ >>> > crash during FFUNC stream") expects FFUNC VM state (and even the >>> Typo: s/expects FFUNC/expects the FFUNC/ >>> > particular instruction to be executed) at the moment when stacktrace is >>> > being collected. Unfortunately, it leads to test routine hang for >>> Typo: s/to test routine hang/to the test routine hanging/ >>> > several environments and if it does not we cannot guarantee that the >>> Typo: s/environments/environments,/ >>> Typo: s/and if it does not/and even if it does not,/ >>> > desired scenario is tested (only rely on statistics). As a result the >>> Typo: s/As a result/As a result,/ >>> > test for the aforementioned patch was disabled for Tarantool CI in the >>> > commit fef60a1052b8444be61e9a9932ab4875aff0b2ed ("test: prevent hanging >>> > Tarantool CI by sysprof test") until the issue is not resolved. >>> > >>> > This patch introduces the new approach for testing our sampling profiler >>> > via implementing precise managed execution of the profiled >>> Typo: s//,/ >>> > instance mentioned in tarantool/tarantool#9387. >>> > >>> > Instead of running around gazillion times we accurately step >>> Typo: s/times/times,/ >>> > to the exact place where the issue reproduces and manually emit SIGPROF >>> > to the Lua VM being profiled. The particular approach implemented in >>> > this patch is described below. >>> > >>> > As it was mentioned, the test makes sysprof to collect the particular >>> Typo: s/to collect/collect/ >>> > event (FFUNC) at the certain instruction in Lua VM () to >>> > reproduce the issue from tarantool/tarantool#8594. Hence it's enough to >>> Typo: s/Hence/Hence,/ >>> > call fast function in the profiled instance (i.e. "tracee"). >>> Typo: s/i.e./i.e.,/ >>> > To emit SIGPROF right at in scope of builtin, >>> Typo: s/in scope/in the scope/ >>> > the manager (i.e. "tracer") is implemented. >>> Typo: s/i.e./i.e.,/ >>> > >>> > Here are the main steps (see comments and `man 2 ptrace' for more info): >>> > 1. Poison instruction as the first instruction at >>> > to stop at the beginning of the fast function; >>> > 2. Resume the "tracee" from the "tracer"; >>> > 3. Hit the emitted interruption, restore the original instruction and >>> > "rewind" the RIP to "replay" the instruction at ; >>> > 4. Do the hack 1-3 for ; >>> > 5. Emit SIGPROF while resumimg the "tracee"; >>> Typo: s/resumimg/resuming/ >>> > >>> > As a result sysprof collects the full backtrace with fast >>> Typo: s/a result/a result,/ >>> > function as the topmost frame. >>> > >>> > Resolves tarantool/tarantool#9387 >>> > Follows up tarantool/tarantool#8594 >>> > >>> > Signed-off-by: Igor Munkin < imun@tarantool.org > >> >>Here is the updated commit message: >> >>================================================================================ >> >>test: rewrite sysprof test using managed execution >> >>Often, tests for sampling profiler require long running loops to be >>executed, so a certain situation is likely to occur. Thus, the test >>added in commit 285a1b0a16c1c3224ab93f7cf1fef17ab29e4ab4 ("sysprof: fix >>crash during FFUNC stream") expects the FFUNC VM state (and even the >>particular instruction to be executed) at the moment when stacktrace is >>being collected. Unfortunately, it leads to the test routine hang for >>several environments, and even if it does not, we cannot guarantee that >>the desired scenario is tested (only rely on statistics). As a result, >>the test for the aforementioned patch was disabled for Tarantool CI in >>the commit fef60a1052b8444be61e9a9932ab4875aff0b2ed ("test: prevent >>hanging Tarantool CI by sysprof test") until the issue is not resolved. >> >>This patch introduces the new approach for testing our sampling profiler >>via , implementing precise managed execution of the profiled >>instance mentioned in tarantool/tarantool#9387. >> >>Instead of running around gazillion times, we accurately step >>to the exact place where the issue reproduces and manually emit SIGPROF >>to the Lua VM being profiled. The particular approach implemented in >>this patch is described below. >> >>As it was mentioned, the test makes sysprof collect the particular event >>type (FFUNC) at the certain instruction in Lua VM () to >>reproduce the issue from tarantool/tarantool#8594. Hence, it's enough to >>call fast function in the profiled instance (i.e., "tracee"). >>To emit SIGPROF right at in the scope of >>builtin, the manager (i.e., "tracer") is implemented. >> >>Here are the main steps (see comments and `man 2 ptrace' for more info): >>  1. Poison instruction as the first instruction at >>      to stop at the beginning of the fast function; >>  2. Resume the "tracee" from the "tracer"; >>  3. Hit the emitted interruption, restore the original instruction and >>     "rewind" the RIP to "replay" the instruction at ; >>  4. Do the hack 1-3 for ; >>  5. Emit SIGPROF while resuming the "tracee"; >> >>As a result, sysprof collects the full backtrace with fast >>function as the topmost frame. >> >>Resolves tarantool/tarantool#9387 >>Follows up tarantool/tarantool#8594 >> >>================================================================================ >> >>> > --- >>> > .../gh-8594-sysprof-ffunc-crash.test.c | 269 ++++++++++++++++++ >>> > .../gh-8594-sysprof-ffunc-crash.test.lua | 55 ---- >>> > 2 files changed, 269 insertions(+), 55 deletions(-) >>> > create mode 100644 test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c >>> > delete mode 100644 test/tarantool-tests/gh-8594-sysprof-ffunc-crash.test.lua >>> > >>> > diff --git a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c >>> > new file mode 100644 >>> > index 00000000..4caed8cb >>> > --- /dev/null >>> > +++ b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c >>> > @@ -0,0 +1,269 @@ >>> > +#include "lauxlib.h" >>> > +#include "lmisclib.h" >>> > +#include "lua.h" >>> > + >>> > +#include "test.h" >>> > +#include "utils.h" >>> > + >>> > +#include >>> > +#include >>> > +#include >>> > +#include >>> > +#include >>> > + >>> > +/* XXX: Still need normal assert inside and helpers. */ >>> > +#undef NDEBUG >>> > +#include >>> > + >>> > +/* >>> > + * XXX: The test is *very* Linux/x86_64 specific. Fortunately, so >>> > + * does the sampling profiler. is needed for LUAJIT_OS >>> > + * and LUAJIT_TARGET. >>> > + */ >>> > +#include "lj_arch.h" >>> > + >>> > +#if LUAJIT_OS == LUAJIT_OS_LINUX && LUAJIT_TARGET == LUAJIT_ARCH_X64 >>> Why not use the `LJ_HASSYSPROF` flag? >> >>Since the test per se is Linux/x86_64 specific (consider the particular >>register name used below). E.g. sysprof is implemented for x86 either, >>but the test is not. If somebody (Sergey Kaplun) finally decided to >>introduce x86 testing environment, the test would crash. I prefer more >>developer-friendly approach here (considering peculiarities). >> >>> > + >>> > +/* >>> > + * XXX: The test makes sysprof to collect the particular event >>> Typo: s/sysprof to/sysprof/ >>> > + * (FFUNC) at the particular instruction () to >>> > + * reproduce the issue #8594. Hence it's enough to call >>> Typo: s/Hence/Hence,/ >>> > + * fast function (this is done in function). To emit >>> > + * SIGPROF right at in scope of fast >>> > + * function, the managed execution is implemented in the function >>> > + * : instruction is poisoned as the first >>> > + * instruction at to stop at the >>> > + * beginning of the fast function; resumes the ; >>> > + * the same hack is done for . When the hits >>> > + * the interruption at , SIGPROF is emitted while >>> > + * resuming the . As a result sysprof collects the full >>> Typo: s/a result/a result,/ >>> > + * backtrace with fast function as the topmost frame. >>> > + * >>> > + * See more info here: >>> > + * * https://man7.org/linux/man-pages/man2/ptrace.2.html >>> > + * * https://github.com/tarantool/tarantool/issues/8594 >>> > + * * https://github.com/tarantool/tarantool/issues/9387 >>> > + */ >> >>Fixed. The diff is below: >> >>================================================================================ >> >>diff --git a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c >>index 4caed8cb..f060760b 100644 >>--- a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c >>+++ b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c >>@@ -25,18 +25,18 @@ >> #if LUAJIT_OS == LUAJIT_OS_LINUX && LUAJIT_TARGET == LUAJIT_ARCH_X64 >>  >> /* >>- * XXX: The test makes sysprof to collect the particular event >>+ * XXX: The test makes sysprof collect the particular event >>  * (FFUNC) at the particular instruction () to >>- * reproduce the issue #8594. Hence it's enough to call >>- * fast function (this is done in function). To emit >>- * SIGPROF right at in scope of fast >>- * function, the managed execution is implemented in the function >>- * : instruction is poisoned as the first >>+ * reproduce the issue #8594. Hence, it's enough to call >>+ * fast function (this is done in function). >>+ * To emit SIGPROF right at in scope of >>+ * fast function, the managed execution is implemented in the >>+ * function : instruction is poisoned as the first >>  * instruction at to stop at the >>  * beginning of the fast function; resumes the ; >>  * the same hack is done for . When the hits >>  * the interruption at , SIGPROF is emitted while >>- * resuming the . As a result sysprof collects the full >>+ * resuming the . As a result, sysprof collects the full >>  * backtrace with fast function as the topmost frame. >>  * >>  * See more info here: >> >>================================================================================ >> >>> > + >> >> >> >>> > + >>> > + /* Allow tracing for this process */ >>> > + if (ptrace(PTRACE_TRACEME, 0, 0, 0) < 0) { >>> > + perror("Failed to turn the calling thread into a tracee"); >>> Typo?: s/thread/process/ >> >>Yeah, copy-pasted from the recipe. Fixed. >> >>> > + return EXIT_FAILURE; >>> > + } >> >> >> >>> > + /* Terminate profiler and Lua universe. */ >>> > + assert(luaM_sysprof_stop(L) == PROFILE_SUCCESS); >>> > + utils_lua_close(L); >>> It doesn't seem to be semantically correct to terminate Lua universe >>> before obtaining the counters, even though there is no obstacles to do >>> so in the API implementation itself. >> >>To[MAH]to vs.To[MAY]to: fixed. The diff is below. >> >>================================================================================ >> >>diff --git a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c >>index c0f9d1ca..5a2523ef 100644 >>--- a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c >>+++ b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c >>@@ -142,9 +142,8 @@ static int tracee(const char *luacode) { >>  assert(luaL_dostring(L, luacode) == LUA_OK); >>  assert(strcmp(lua_tostring(L, -1), MESSAGE) == 0); >>  >>- /* Terminate profiler and Lua universe. */ >>+ /* Terminate profiler. */ >>  assert(luaM_sysprof_stop(L) == PROFILE_SUCCESS); >>- utils_lua_close(L); >>  >>  /* >>  * XXX: The only event to be streamed must be FFUNC at >>@@ -155,6 +154,9 @@ static int tracee(const char *luacode) { >>  assert(counters.samples == 1); >>  assert(counters.vmst_ffunc == 1); >>  >>+ /* Terminate Lua universe. */ >>+ utils_lua_close(L); >>+ >>  return EXIT_SUCCESS; >> } >>  >> >>================================================================================ >> >>> > + >> >> >> >>> > + /* Resume tracee until SIGTRAP occurs. */ >>> > + ptrace(PTRACE_CONT, chpid, 0, 0); >>> > + /* Wait tracee signal-delivery-stop. */ >>> > + waitpid(chpid, &wstatus, 0); >>> Should we check the `wstatus` here too? >> >>Implemented via helper (see the details below). >> >>> > + >> >> >> >>> > + >>> > +static int tracer(pid_t chpid) { >>> > + int wstatus; >>> > + >>> > + /* Wait until tracee is ready. */ >>> > + waitpid(chpid, &wstatus, 0); >>> Should we check the process status here too? >> >>Implemented via helper (see the implementation below). >> >>================================================================================ >> >>diff --git a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c >>index 84c024f0..d17386dd 100644 >>--- a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c >>+++ b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c >>@@ -222,7 +222,7 @@ static int tracer(pid_t chpid) { >>  int wstatus; >>  >>  /* Wait until tracee is ready. */ >>- waitpid(chpid, &wstatus, 0); >>+ wait_alive(chpid); >>  >>  /* >>  * Resume tracee until . >> >>================================================================================ >> >>> > + >>> > + /* Resume tracee until . */ >>> > + wstatus = continue_until(chpid, lj_ff_tostring); >>> > + >>> > + /* The tracee has to be alive and stopped by SIGTRAP. */ >>> > + assert_false(WIFEXITED(wstatus)); >>> > + assert_true(WIFSTOPPED(wstatus)); >>> > + >>> > + /* Resume tracee until . */ >>> > + wstatus = continue_until(chpid, lj_fff_res1); >>> > + >>> > + /* The tracee has to be alive and stopped by SIGTRAP. */ >>> > + assert_false(WIFEXITED(wstatus)); >>> > + assert_true(WIFSTOPPED(wstatus)); >>> This part is duplicated, let's move it to a separate `healthcheck` >>> function, or something like that. >> >>It's only duplicated in sense of reusing this particular code, no more. >>However, let's consider these conditions are not parts of the test >>assertions set, but rather invariants for helper (i.e. >> has to yield the execution with the "alive" tracee). >> >>In this case, the patch for is the following: >> >>================================================================================ >> >>diff --git a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c >>index 5a2523ef..84c024f0 100644 >>--- a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c >>+++ b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c >>@@ -160,6 +160,18 @@ static int tracee(const char *luacode) { >>  return EXIT_SUCCESS; >> } >>  >>+static void wait_alive(pid_t chpid) { >>+ int wstatus; >>+ >>+ /* Wait tracee signal-delivery-stop. */ >>+ waitpid(chpid, &wstatus, 0); >>+ >>+ /* Check the tracee is still alive and just stopped. */ >>+ assert(!WIFEXITED(wstatus)); >>+ assert(!WIFSIGNALED(wstatus)); >>+ assert(WIFSTOPPED(wstatus)); >>+} >>+ >> const uint8_t INT3 = 0xCC; >> static inline unsigned long int3poison(unsigned long instruction) { >>  const size_t int3bits = sizeof(INT3) * 8; >>@@ -167,8 +179,7 @@ static inline unsigned long int3poison(unsigned long instruction) { >>  return (instruction & int3mask) | INT3; >> } >>  >>-static int continue_until(pid_t chpid, void *addr) { >>- int wstatus; >>+static void continue_until(pid_t chpid, void *addr) { >>  struct user_regs_struct regs; >>  >>  /* Obtain the instructions at the . */ >>@@ -182,8 +193,12 @@ static int continue_until(pid_t chpid, void *addr) { >>  >>  /* Resume tracee until SIGTRAP occurs. */ >>  ptrace(PTRACE_CONT, chpid, 0, 0); >>- /* Wait tracee signal-delivery-stop. */ >>- waitpid(chpid, &wstatus, 0); >>+ >>+ /* >>+ * Wait tracee signal-delivery-stop and check >>+ * whether it's still alive and just stopped via SIGTRAP. >>+ */ >>+ wait_alive(chpid); >>  >>  /* Obtain GPR set to tweak RIP for further execution. */ >>  ptrace(PTRACE_GETREGS, chpid, 0, ®s); >>@@ -201,9 +216,6 @@ static int continue_until(pid_t chpid, void *addr) { >>  regs.rip -= sizeof(INT3); >>  ptrace(PTRACE_SETREGS, chpid, 0, ®s); >>  ptrace(PTRACE_POKETEXT, chpid, addr, data); >>- >>- /* Return wait status to the caller for test checks. */ >>- return wstatus; >> } >>  >> static int tracer(pid_t chpid) { >>@@ -212,19 +224,19 @@ static int tracer(pid_t chpid) { >>  /* Wait until tracee is ready. */ >>  waitpid(chpid, &wstatus, 0); >>  >>- /* Resume tracee until . */ >>- wstatus = continue_until(chpid, lj_ff_tostring); >>- >>- /* The tracee has to be alive and stopped by SIGTRAP. */ >>- assert_false(WIFEXITED(wstatus)); >>- assert_true(WIFSTOPPED(wstatus)); >>- >>- /* Resume tracee until . */ >>- wstatus = continue_until(chpid, lj_fff_res1); >>+ /* >>+ * Resume tracee until . >>+ * The tracee is alive and stopped by SIGTRAP if >>+ * returns. >>+ */ >>+ continue_until(chpid, lj_ff_tostring); >>  >>- /* The tracee has to be alive and stopped by SIGTRAP. */ >>- assert_false(WIFEXITED(wstatus)); >>- assert_true(WIFSTOPPED(wstatus)); >>+ /* >>+ * Resume tracee until . >>+ * The tracee is alive and stopped by SIGTRAP if >>+ * returns. >>+ */ >>+ continue_until(chpid, lj_fff_res1); >>  >>  /* Send SIGPROF to make sysprof collect the event. */ >>  ptrace(PTRACE_CONT, chpid, 0, SIGPROF); >> >>================================================================================ >> >>> > + >> >> >> >>-- >>Best regards, >>IM >