[Tarantool-patches] [PATCH luajit 2/2] test: rewrite sysprof test using managed execution

Maxim Kokryashkin m.kokryashkin at tarantool.org
Tue Dec 5 16:23:17 MSK 2023


Hi, Igor!
Thanks for the fixes!
LGTM
--
Best regards,
Maxim Kokryashkin
 
 
> 
>>Max,
>>
>>Thanks for your review!
>>
>>On 03.12.23, Maxim Kokryashkin wrote:
>>> Hi, Igor!
>>> Thanks for the patch!
>>> Please consider my comments below.
>>>
>>> Side note: This new testing approach is amazing, thanks a lot!
>>>
>>> On Tue, Nov 28, 2023 at 02:53:17PM +0000, Igor Munkin wrote:
>>> > Often tests for sampling profiler require long running loops to be
>>> Typo: s/Often/Often,/
>>> > executed, so a certain situation is likely to occur. Thus the test added
>>> Typo: s/Thus/Thus,/
>>> > in the commit 285a1b0a16c1c3224ab93f7cf1fef17ab29e4ab4 ("sysprof: fix
>>> Typo: s/in the/in/
>>> > crash during FFUNC stream") expects FFUNC VM state (and even the
>>> Typo: s/expects FFUNC/expects the FFUNC/
>>> > particular instruction to be executed) at the moment when stacktrace is
>>> > being collected. Unfortunately, it leads to test routine hang for
>>> Typo: s/to test routine hang/to the test routine hanging/
>>> > several environments and if it does not we cannot guarantee that the
>>> Typo: s/environments/environments,/
>>> Typo: s/and if it does not/and even if it does not,/
>>> > desired scenario is tested (only rely on statistics). As a result the
>>> Typo: s/As a result/As a result,/
>>> > test for the aforementioned patch was disabled for Tarantool CI in the
>>> > commit fef60a1052b8444be61e9a9932ab4875aff0b2ed ("test: prevent hanging
>>> > Tarantool CI by sysprof test") until the issue is not resolved.
>>> >
>>> > This patch introduces the new approach for testing our sampling profiler
>>> > via <ptrace> implementing precise managed execution of the profiled
>>> Typo: s/<ptrace>/<ptrace>,/
>>> > instance mentioned in tarantool/tarantool#9387.
>>> >
>>> > Instead of running around <tostring> gazillion times we accurately step
>>> Typo: s/times/times,/
>>> > to the exact place where the issue reproduces and manually emit SIGPROF
>>> > to the Lua VM being profiled. The particular approach implemented in
>>> > this patch is described below.
>>> >
>>> > As it was mentioned, the test makes sysprof to collect the particular
>>> Typo: s/to collect/collect/
>>> > event (FFUNC) at the certain instruction in Lua VM (<lj_fff_res1>) to
>>> > reproduce the issue from tarantool/tarantool#8594. Hence it's enough to
>>> Typo: s/Hence/Hence,/
>>> > call <tostring> fast function in the profiled instance (i.e. "tracee").
>>> Typo: s/i.e./i.e.,/
>>> > To emit SIGPROF right at <lj_fff_res1> in scope of <tostring> builtin,
>>> Typo: s/in scope/in the scope/
>>> > the manager (i.e. "tracer") is implemented.
>>> Typo: s/i.e./i.e.,/
>>> >
>>> > Here are the main steps (see comments and `man 2 ptrace' for more info):
>>> > 1. Poison <int 3> instruction as the first instruction at
>>> > <lj_ff_tostring> to stop at the beginning of the fast function;
>>> > 2. Resume the "tracee" from the "tracer";
>>> > 3. Hit the emitted interruption, restore the original instruction and
>>> > "rewind" the RIP to "replay" the instruction at <lj_ff_tostring>;
>>> > 4. Do the hack 1-3 for <lj_fff_res1>;
>>> > 5. Emit SIGPROF while resumimg the "tracee";
>>> Typo: s/resumimg/resuming/
>>> >
>>> > As a result sysprof collects the full backtrace with <tostring> fast
>>> Typo: s/a result/a result,/
>>> > function as the topmost frame.
>>> >
>>> > Resolves tarantool/tarantool#9387
>>> > Follows up tarantool/tarantool#8594
>>> >
>>> > Signed-off-by: Igor Munkin < imun at tarantool.org >
>>
>>Here is the updated commit message:
>>
>>================================================================================
>>
>>test: rewrite sysprof test using managed execution
>>
>>Often, tests for sampling profiler require long running loops to be
>>executed, so a certain situation is likely to occur. Thus, the test
>>added in commit 285a1b0a16c1c3224ab93f7cf1fef17ab29e4ab4 ("sysprof: fix
>>crash during FFUNC stream") expects the FFUNC VM state (and even the
>>particular instruction to be executed) at the moment when stacktrace is
>>being collected. Unfortunately, it leads to the test routine hang for
>>several environments, and even if it does not, we cannot guarantee that
>>the desired scenario is tested (only rely on statistics). As a result,
>>the test for the aforementioned patch was disabled for Tarantool CI in
>>the commit fef60a1052b8444be61e9a9932ab4875aff0b2ed ("test: prevent
>>hanging Tarantool CI by sysprof test") until the issue is not resolved.
>>
>>This patch introduces the new approach for testing our sampling profiler
>>via <ptrace>, implementing precise managed execution of the profiled
>>instance mentioned in tarantool/tarantool#9387.
>>
>>Instead of running around <tostring> gazillion times, we accurately step
>>to the exact place where the issue reproduces and manually emit SIGPROF
>>to the Lua VM being profiled. The particular approach implemented in
>>this patch is described below.
>>
>>As it was mentioned, the test makes sysprof collect the particular event
>>type (FFUNC) at the certain instruction in Lua VM (<lj_fff_res1>) to
>>reproduce the issue from tarantool/tarantool#8594. Hence, it's enough to
>>call <tostring> fast function in the profiled instance (i.e., "tracee").
>>To emit SIGPROF right at <lj_fff_res1> in the scope of <tostring>
>>builtin, the manager (i.e., "tracer") is implemented.
>>
>>Here are the main steps (see comments and `man 2 ptrace' for more info):
>>  1. Poison <int 3> instruction as the first instruction at
>>     <lj_ff_tostring> to stop at the beginning of the fast function;
>>  2. Resume the "tracee" from the "tracer";
>>  3. Hit the emitted interruption, restore the original instruction and
>>     "rewind" the RIP to "replay" the instruction at <lj_ff_tostring>;
>>  4. Do the hack 1-3 for <lj_fff_res1>;
>>  5. Emit SIGPROF while resuming the "tracee";
>>
>>As a result, sysprof collects the full backtrace with <tostring> fast
>>function as the topmost frame.
>>
>>Resolves tarantool/tarantool#9387
>>Follows up tarantool/tarantool#8594
>>
>>================================================================================
>>
>>> > ---
>>> > .../gh-8594-sysprof-ffunc-crash.test.c | 269 ++++++++++++++++++
>>> > .../gh-8594-sysprof-ffunc-crash.test.lua | 55 ----
>>> > 2 files changed, 269 insertions(+), 55 deletions(-)
>>> > create mode 100644 test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
>>> > delete mode 100644 test/tarantool-tests/gh-8594-sysprof-ffunc-crash.test.lua
>>> >
>>> > diff --git a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
>>> > new file mode 100644
>>> > index 00000000..4caed8cb
>>> > --- /dev/null
>>> > +++ b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
>>> > @@ -0,0 +1,269 @@
>>> > +#include "lauxlib.h"
>>> > +#include "lmisclib.h"
>>> > +#include "lua.h"
>>> > +
>>> > +#include "test.h"
>>> > +#include "utils.h"
>>> > +
>>> > +#include <signal.h>
>>> > +#include <sys/ptrace.h>
>>> > +#include <sys/user.h>
>>> > +#include <sys/wait.h>
>>> > +#include <unistd.h>
>>> > +
>>> > +/* XXX: Still need normal assert inside <tracee> and helpers. */
>>> > +#undef NDEBUG
>>> > +#include <assert.h>
>>> > +
>>> > +/*
>>> > + * XXX: The test is *very* Linux/x86_64 specific. Fortunately, so
>>> > + * does the sampling profiler. <lj_arch.> is needed for LUAJIT_OS
>>> > + * and LUAJIT_TARGET.
>>> > + */
>>> > +#include "lj_arch.h"
>>> > +
>>> > +#if LUAJIT_OS == LUAJIT_OS_LINUX && LUAJIT_TARGET == LUAJIT_ARCH_X64
>>> Why not use the `LJ_HASSYSPROF` flag?
>>
>>Since the test per se is Linux/x86_64 specific (consider the particular
>>register name used below). E.g. sysprof is implemented for x86 either,
>>but the test is not. If somebody (Sergey Kaplun) finally decided to
>>introduce x86 testing environment, the test would crash. I prefer more
>>developer-friendly approach here (considering <ptrace> peculiarities).
>>
>>> > +
>>> > +/*
>>> > + * XXX: The test makes sysprof to collect the particular event
>>> Typo: s/sysprof to/sysprof/
>>> > + * (FFUNC) at the particular instruction (<lj_fff_res1>) to
>>> > + * reproduce the issue #8594. Hence it's enough to call <tostring>
>>> Typo: s/Hence/Hence,/
>>> > + * fast function (this is done in <tracee> function). To emit
>>> > + * SIGPROF right at <lj_fff_res1> in scope of <tostring> fast
>>> > + * function, the managed execution is implemented in the function
>>> > + * <tracer>: <int 3> instruction is poisoned as the first
>>> > + * instruction at <lj_ff_tostring> to stop <tracee> at the
>>> > + * beginning of the fast function; <tracer> resumes the <tracee>;
>>> > + * the same hack is done for <lj_fff_res1>. When the <tracee> hits
>>> > + * the interruption at <lj_fff_res1>, SIGPROF is emitted while
>>> > + * resuming the <tracee>. As a result sysprof collects the full
>>> Typo: s/a result/a result,/
>>> > + * backtrace with <tostring> fast function as the topmost frame.
>>> > + *
>>> > + * See more info here:
>>> > + * *  https://man7.org/linux/man-pages/man2/ptrace.2.html
>>> > + * *  https://github.com/tarantool/tarantool/issues/8594
>>> > + * *  https://github.com/tarantool/tarantool/issues/9387
>>> > + */
>>
>>Fixed. The diff is below:
>>
>>================================================================================
>>
>>diff --git a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
>>index 4caed8cb..f060760b 100644
>>--- a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
>>+++ b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
>>@@ -25,18 +25,18 @@
>> #if LUAJIT_OS == LUAJIT_OS_LINUX && LUAJIT_TARGET == LUAJIT_ARCH_X64
>> 
>> /*
>>- * XXX: The test makes sysprof to collect the particular event
>>+ * XXX: The test makes sysprof collect the particular event
>>  * (FFUNC) at the particular instruction (<lj_fff_res1>) to
>>- * reproduce the issue #8594. Hence it's enough to call <tostring>
>>- * fast function (this is done in <tracee> function). To emit
>>- * SIGPROF right at <lj_fff_res1> in scope of <tostring> fast
>>- * function, the managed execution is implemented in the function
>>- * <tracer>: <int 3> instruction is poisoned as the first
>>+ * reproduce the issue #8594. Hence, it's enough to call
>>+ * <tostring> fast function (this is done in <tracee> function).
>>+ * To emit SIGPROF right at <lj_fff_res1> in scope of <tostring>
>>+ * fast function, the managed execution is implemented in the
>>+ * function <tracer>: <int 3> instruction is poisoned as the first
>>  * instruction at <lj_ff_tostring> to stop <tracee> at the
>>  * beginning of the fast function; <tracer> resumes the <tracee>;
>>  * the same hack is done for <lj_fff_res1>. When the <tracee> hits
>>  * the interruption at <lj_fff_res1>, SIGPROF is emitted while
>>- * resuming the <tracee>. As a result sysprof collects the full
>>+ * resuming the <tracee>. As a result, sysprof collects the full
>>  * backtrace with <tostring> fast function as the topmost frame.
>>  *
>>  * See more info here:
>>
>>================================================================================
>>
>>> > +
>>
>><snipped>
>>
>>> > +
>>> > + /* Allow tracing for this process */
>>> > + if (ptrace(PTRACE_TRACEME, 0, 0, 0) < 0) {
>>> > + perror("Failed to turn the calling thread into a tracee");
>>> Typo?: s/thread/process/
>>
>>Yeah, copy-pasted from the <ptrace> recipe. Fixed.
>>
>>> > + return EXIT_FAILURE;
>>> > + }
>>
>><snipped>
>>
>>> > + /* Terminate profiler and Lua universe. */
>>> > + assert(luaM_sysprof_stop(L) == PROFILE_SUCCESS);
>>> > + utils_lua_close(L);
>>> It doesn't seem to be semantically correct to terminate Lua universe
>>> before obtaining the counters, even though there is no obstacles to do
>>> so in the API implementation itself.
>>
>>To[MAH]to vs.To[MAY]to: fixed. The diff is below.
>>
>>================================================================================
>>
>>diff --git a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
>>index c0f9d1ca..5a2523ef 100644
>>--- a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
>>+++ b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
>>@@ -142,9 +142,8 @@ static int tracee(const char *luacode) {
>>  assert(luaL_dostring(L, luacode) == LUA_OK);
>>  assert(strcmp(lua_tostring(L, -1), MESSAGE) == 0);
>> 
>>- /* Terminate profiler and Lua universe. */
>>+ /* Terminate profiler. */
>>  assert(luaM_sysprof_stop(L) == PROFILE_SUCCESS);
>>- utils_lua_close(L);
>> 
>>  /*
>>  * XXX: The only event to be streamed must be FFUNC at
>>@@ -155,6 +154,9 @@ static int tracee(const char *luacode) {
>>  assert(counters.samples == 1);
>>  assert(counters.vmst_ffunc == 1);
>> 
>>+ /* Terminate Lua universe. */
>>+ utils_lua_close(L);
>>+
>>  return EXIT_SUCCESS;
>> }
>> 
>>
>>================================================================================
>>
>>> > +
>>
>><snipped>
>>
>>> > + /* Resume <chpid> tracee until SIGTRAP occurs. */
>>> > + ptrace(PTRACE_CONT, chpid, 0, 0);
>>> > + /* Wait <chpid> tracee signal-delivery-stop. */
>>> > + waitpid(chpid, &wstatus, 0);
>>> Should we check the `wstatus` here too?
>>
>>Implemented via <wait_alive> helper (see the details below).
>>
>>> > +
>>
>><snipped>
>>
>>> > +
>>> > +static int tracer(pid_t chpid) {
>>> > + int wstatus;
>>> > +
>>> > + /* Wait until <chpid> tracee is ready. */
>>> > + waitpid(chpid, &wstatus, 0);
>>> Should we check the process status here too?
>>
>>Implemented via <wait_alive> helper (see the implementation below).
>>
>>================================================================================
>>
>>diff --git a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
>>index 84c024f0..d17386dd 100644
>>--- a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
>>+++ b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
>>@@ -222,7 +222,7 @@ static int tracer(pid_t chpid) {
>>  int wstatus;
>> 
>>  /* Wait until <chpid> tracee is ready. */
>>- waitpid(chpid, &wstatus, 0);
>>+ wait_alive(chpid);
>> 
>>  /*
>>  * Resume <chpid> tracee until <lj_ff_tostring>.
>>
>>================================================================================
>>
>>> > +
>>> > + /* Resume <chpid> tracee until <lj_ff_tostring>. */
>>> > + wstatus = continue_until(chpid, lj_ff_tostring);
>>> > +
>>> > + /* The tracee has to be alive and stopped by SIGTRAP. */
>>> > + assert_false(WIFEXITED(wstatus));
>>> > + assert_true(WIFSTOPPED(wstatus));
>>> > +
>>> > + /* Resume <chpid> tracee until <lj_fff_res1>. */
>>> > + wstatus = continue_until(chpid, lj_fff_res1);
>>> > +
>>> > + /* The tracee has to be alive and stopped by SIGTRAP. */
>>> > + assert_false(WIFEXITED(wstatus));
>>> > + assert_true(WIFSTOPPED(wstatus));
>>> This part is duplicated, let's move it to a separate `healthcheck`
>>> function, or something like that.
>>
>>It's only duplicated in sense of reusing this particular code, no more.
>>However, let's consider these conditions are not parts of the test
>>assertions set, but rather invariants for <continue_until> helper (i.e.
>><continue_until> has to yield the execution with the "alive" tracee).
>>
>>In this case, the patch for <continue_until> is the following:
>>
>>================================================================================
>>
>>diff --git a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
>>index 5a2523ef..84c024f0 100644
>>--- a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
>>+++ b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
>>@@ -160,6 +160,18 @@ static int tracee(const char *luacode) {
>>  return EXIT_SUCCESS;
>> }
>> 
>>+static void wait_alive(pid_t chpid) {
>>+ int wstatus;
>>+
>>+ /* Wait <chpid> tracee signal-delivery-stop. */
>>+ waitpid(chpid, &wstatus, 0);
>>+
>>+ /* Check the tracee is still alive and just stopped. */
>>+ assert(!WIFEXITED(wstatus));
>>+ assert(!WIFSIGNALED(wstatus));
>>+ assert(WIFSTOPPED(wstatus));
>>+}
>>+
>> const uint8_t INT3 = 0xCC;
>> static inline unsigned long int3poison(unsigned long instruction) {
>>  const size_t int3bits = sizeof(INT3) * 8;
>>@@ -167,8 +179,7 @@ static inline unsigned long int3poison(unsigned long instruction) {
>>  return (instruction & int3mask) | INT3;
>> }
>> 
>>-static int continue_until(pid_t chpid, void *addr) {
>>- int wstatus;
>>+static void continue_until(pid_t chpid, void *addr) {
>>  struct user_regs_struct regs;
>> 
>>  /* Obtain the instructions at the <addr>. */
>>@@ -182,8 +193,12 @@ static int continue_until(pid_t chpid, void *addr) {
>> 
>>  /* Resume <chpid> tracee until SIGTRAP occurs. */
>>  ptrace(PTRACE_CONT, chpid, 0, 0);
>>- /* Wait <chpid> tracee signal-delivery-stop. */
>>- waitpid(chpid, &wstatus, 0);
>>+
>>+ /*
>>+ * Wait <chpid> tracee signal-delivery-stop and check
>>+ * whether it's still alive and just stopped via SIGTRAP.
>>+ */
>>+ wait_alive(chpid);
>> 
>>  /* Obtain GPR set to tweak RIP for further execution. */
>>  ptrace(PTRACE_GETREGS, chpid, 0, &regs);
>>@@ -201,9 +216,6 @@ static int continue_until(pid_t chpid, void *addr) {
>>  regs.rip -= sizeof(INT3);
>>  ptrace(PTRACE_SETREGS, chpid, 0, &regs);
>>  ptrace(PTRACE_POKETEXT, chpid, addr, data);
>>-
>>- /* Return wait status to the caller for test checks. */
>>- return wstatus;
>> }
>> 
>> static int tracer(pid_t chpid) {
>>@@ -212,19 +224,19 @@ static int tracer(pid_t chpid) {
>>  /* Wait until <chpid> tracee is ready. */
>>  waitpid(chpid, &wstatus, 0);
>> 
>>- /* Resume <chpid> tracee until <lj_ff_tostring>. */
>>- wstatus = continue_until(chpid, lj_ff_tostring);
>>-
>>- /* The tracee has to be alive and stopped by SIGTRAP. */
>>- assert_false(WIFEXITED(wstatus));
>>- assert_true(WIFSTOPPED(wstatus));
>>-
>>- /* Resume <chpid> tracee until <lj_fff_res1>. */
>>- wstatus = continue_until(chpid, lj_fff_res1);
>>+ /*
>>+ * Resume <chpid> tracee until <lj_ff_tostring>.
>>+ * The tracee is alive and stopped by SIGTRAP if
>>+ * <continue_until> returns.
>>+ */
>>+ continue_until(chpid, lj_ff_tostring);
>> 
>>- /* The tracee has to be alive and stopped by SIGTRAP. */
>>- assert_false(WIFEXITED(wstatus));
>>- assert_true(WIFSTOPPED(wstatus));
>>+ /*
>>+ * Resume <chpid> tracee until <lj_fff_res1>.
>>+ * The tracee is alive and stopped by SIGTRAP if
>>+ * <continue_until> returns.
>>+ */
>>+ continue_until(chpid, lj_fff_res1);
>> 
>>  /* Send SIGPROF to make sysprof collect the event. */
>>  ptrace(PTRACE_CONT, chpid, 0, SIGPROF);
>>
>>================================================================================
>>
>>> > +
>>
>><snipped>
>>
>>--
>>Best regards,
>>IM
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.tarantool.org/pipermail/tarantool-patches/attachments/20231205/39a6ac97/attachment.htm>


More information about the Tarantool-patches mailing list