[Tarantool-patches] [PATCH luajit 2/2] test: rewrite sysprof test using managed execution

Igor Munkin imun at tarantool.org
Tue Dec 5 14:34:39 MSK 2023


Max,

Thanks for your review!

On 03.12.23, Maxim Kokryashkin wrote:
> Hi, Igor!
> Thanks for the patch!
> Please consider my comments below.
> 
> Side note: This new testing approach is amazing, thanks a lot!
> 
> On Tue, Nov 28, 2023 at 02:53:17PM +0000, Igor Munkin wrote:
> > Often tests for sampling profiler require long running loops to be
> Typo: s/Often/Often,/
> > executed, so a certain situation is likely to occur. Thus the test added
> Typo: s/Thus/Thus,/
> > in the commit 285a1b0a16c1c3224ab93f7cf1fef17ab29e4ab4 ("sysprof: fix
> Typo: s/in the/in/
> > crash during FFUNC stream") expects FFUNC VM state (and even the
> Typo: s/expects FFUNC/expects the FFUNC/
> > particular instruction to be executed) at the moment when stacktrace is
> > being collected. Unfortunately, it leads to test routine hang for
> Typo: s/to test routine hang/to the test routine hanging/
> > several environments and if it does not we cannot guarantee that the
> Typo: s/environments/environments,/
> Typo: s/and if it does not/and even if it does not,/
> > desired scenario is tested (only rely on statistics). As a result the
> Typo: s/As a result/As a result,/
> > test for the aforementioned patch was disabled for Tarantool CI in the
> > commit fef60a1052b8444be61e9a9932ab4875aff0b2ed ("test: prevent hanging
> > Tarantool CI by sysprof test") until the issue is not resolved.
> >
> > This patch introduces the new approach for testing our sampling profiler
> > via <ptrace> implementing precise managed execution of the profiled
> Typo: s/<ptrace>/<ptrace>,/
> > instance mentioned in tarantool/tarantool#9387.
> >
> > Instead of running around <tostring> gazillion times we accurately step
> Typo: s/times/times,/
> > to the exact place where the issue reproduces and manually emit SIGPROF
> > to the Lua VM being profiled. The particular approach implemented in
> > this patch is described below.
> >
> > As it was mentioned, the test makes sysprof to collect the particular
> Typo: s/to collect/collect/
> > event (FFUNC) at the certain instruction in Lua VM (<lj_fff_res1>) to
> > reproduce the issue from tarantool/tarantool#8594. Hence it's enough to
> Typo: s/Hence/Hence,/
> > call <tostring> fast function in the profiled instance (i.e. "tracee").
> Typo: s/i.e./i.e.,/
> > To emit SIGPROF right at <lj_fff_res1> in scope of <tostring> builtin,
> Typo: s/in scope/in the scope/
> > the manager (i.e. "tracer") is implemented.
> Typo: s/i.e./i.e.,/
> >
> > Here are the main steps (see comments and `man 2 ptrace' for more info):
> >   1. Poison <int 3> instruction as the first instruction at
> >      <lj_ff_tostring> to stop at the beginning of the fast function;
> >   2. Resume the "tracee" from the "tracer";
> >   3. Hit the emitted interruption, restore the original instruction and
> >      "rewind" the RIP to "replay" the instruction at <lj_ff_tostring>;
> >   4. Do the hack 1-3 for <lj_fff_res1>;
> >   5. Emit SIGPROF while resumimg the "tracee";
> Typo: s/resumimg/resuming/
> >
> > As a result sysprof collects the full backtrace with <tostring> fast
> Typo: s/a result/a result,/
> > function as the topmost frame.
> >
> > Resolves tarantool/tarantool#9387
> > Follows up tarantool/tarantool#8594
> >
> > Signed-off-by: Igor Munkin <imun at tarantool.org>

Here is the updated commit message:

================================================================================

test: rewrite sysprof test using managed execution

Often, tests for sampling profiler require long running loops to be
executed, so a certain situation is likely to occur. Thus, the test
added in commit 285a1b0a16c1c3224ab93f7cf1fef17ab29e4ab4 ("sysprof: fix
crash during FFUNC stream") expects the FFUNC VM state (and even the
particular instruction to be executed) at the moment when stacktrace is
being collected. Unfortunately, it leads to the test routine hang for
several environments, and even if it does not, we cannot guarantee that
the desired scenario is tested (only rely on statistics). As a result,
the test for the aforementioned patch was disabled for Tarantool CI in
the commit fef60a1052b8444be61e9a9932ab4875aff0b2ed ("test: prevent
hanging Tarantool CI by sysprof test") until the issue is not resolved.

This patch introduces the new approach for testing our sampling profiler
via <ptrace>, implementing precise managed execution of the profiled
instance mentioned in tarantool/tarantool#9387.

Instead of running around <tostring> gazillion times, we accurately step
to the exact place where the issue reproduces and manually emit SIGPROF
to the Lua VM being profiled. The particular approach implemented in
this patch is described below.

As it was mentioned, the test makes sysprof collect the particular event
type (FFUNC) at the certain instruction in Lua VM (<lj_fff_res1>) to
reproduce the issue from tarantool/tarantool#8594. Hence, it's enough to
call <tostring> fast function in the profiled instance (i.e., "tracee").
To emit SIGPROF right at <lj_fff_res1> in the scope of <tostring>
builtin, the manager (i.e., "tracer") is implemented.

Here are the main steps (see comments and `man 2 ptrace' for more info):
  1. Poison <int 3> instruction as the first instruction at
     <lj_ff_tostring> to stop at the beginning of the fast function;
  2. Resume the "tracee" from the "tracer";
  3. Hit the emitted interruption, restore the original instruction and
     "rewind" the RIP to "replay" the instruction at <lj_ff_tostring>;
  4. Do the hack 1-3 for <lj_fff_res1>;
  5. Emit SIGPROF while resuming the "tracee";

As a result, sysprof collects the full backtrace with <tostring> fast
function as the topmost frame.

Resolves tarantool/tarantool#9387
Follows up tarantool/tarantool#8594

================================================================================

> > ---
> >  .../gh-8594-sysprof-ffunc-crash.test.c        | 269 ++++++++++++++++++
> >  .../gh-8594-sysprof-ffunc-crash.test.lua      |  55 ----
> >  2 files changed, 269 insertions(+), 55 deletions(-)
> >  create mode 100644 test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
> >  delete mode 100644 test/tarantool-tests/gh-8594-sysprof-ffunc-crash.test.lua
> >
> > diff --git a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
> > new file mode 100644
> > index 00000000..4caed8cb
> > --- /dev/null
> > +++ b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
> > @@ -0,0 +1,269 @@
> > +#include "lauxlib.h"
> > +#include "lmisclib.h"
> > +#include "lua.h"
> > +
> > +#include "test.h"
> > +#include "utils.h"
> > +
> > +#include <signal.h>
> > +#include <sys/ptrace.h>
> > +#include <sys/user.h>
> > +#include <sys/wait.h>
> > +#include <unistd.h>
> > +
> > +/* XXX: Still need normal assert inside <tracee> and helpers. */
> > +#undef NDEBUG
> > +#include <assert.h>
> > +
> > +/*
> > + * XXX: The test is *very* Linux/x86_64 specific. Fortunately, so
> > + * does the sampling profiler. <lj_arch.> is needed for LUAJIT_OS
> > + * and LUAJIT_TARGET.
> > + */
> > +#include "lj_arch.h"
> > +
> > +#if LUAJIT_OS == LUAJIT_OS_LINUX && LUAJIT_TARGET == LUAJIT_ARCH_X64
> Why not use the `LJ_HASSYSPROF` flag?

Since the test per se is Linux/x86_64 specific (consider the particular
register name used below). E.g. sysprof is implemented for x86 either,
but the test is not. If somebody (Sergey Kaplun) finally decided to
introduce x86 testing environment, the test would crash. I prefer more
developer-friendly approach here (considering <ptrace> peculiarities).

> > +
> > +/*
> > + * XXX: The test makes sysprof to collect the particular event
> Typo: s/sysprof to/sysprof/
> > + * (FFUNC) at the particular instruction (<lj_fff_res1>) to
> > + * reproduce the issue #8594. Hence it's enough to call <tostring>
> Typo: s/Hence/Hence,/
> > + * fast function (this is done in <tracee> function). To emit
> > + * SIGPROF right at <lj_fff_res1> in scope of <tostring> fast
> > + * function, the managed execution is implemented in the function
> > + * <tracer>: <int 3> instruction is poisoned as the first
> > + * instruction at <lj_ff_tostring> to stop <tracee> at the
> > + * beginning of the fast function; <tracer> resumes the <tracee>;
> > + * the same hack is done for <lj_fff_res1>. When the <tracee> hits
> > + * the interruption at <lj_fff_res1>, SIGPROF is emitted while
> > + * resuming the <tracee>. As a result sysprof collects the full
> Typo: s/a result/a result,/
> > + * backtrace with <tostring> fast function as the topmost frame.
> > + *
> > + * See more info here:
> > + * * https://man7.org/linux/man-pages/man2/ptrace.2.html
> > + * * https://github.com/tarantool/tarantool/issues/8594
> > + * * https://github.com/tarantool/tarantool/issues/9387
> > + */

Fixed. The diff is below:

================================================================================

diff --git a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
index 4caed8cb..f060760b 100644
--- a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
+++ b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
@@ -25,18 +25,18 @@
 #if LUAJIT_OS == LUAJIT_OS_LINUX && LUAJIT_TARGET == LUAJIT_ARCH_X64
 
 /*
- * XXX: The test makes sysprof to collect the particular event
+ * XXX: The test makes sysprof collect the particular event
  * (FFUNC) at the particular instruction (<lj_fff_res1>) to
- * reproduce the issue #8594. Hence it's enough to call <tostring>
- * fast function (this is done in <tracee> function). To emit
- * SIGPROF right at <lj_fff_res1> in scope of <tostring> fast
- * function, the managed execution is implemented in the function
- * <tracer>: <int 3> instruction is poisoned as the first
+ * reproduce the issue #8594. Hence, it's enough to call
+ * <tostring> fast function (this is done in <tracee> function).
+ * To emit SIGPROF right at <lj_fff_res1> in scope of <tostring>
+ * fast function, the managed execution is implemented in the
+ * function <tracer>: <int 3> instruction is poisoned as the first
  * instruction at <lj_ff_tostring> to stop <tracee> at the
  * beginning of the fast function; <tracer> resumes the <tracee>;
  * the same hack is done for <lj_fff_res1>. When the <tracee> hits
  * the interruption at <lj_fff_res1>, SIGPROF is emitted while
- * resuming the <tracee>. As a result sysprof collects the full
+ * resuming the <tracee>. As a result, sysprof collects the full
  * backtrace with <tostring> fast function as the topmost frame.
  *
  * See more info here:

================================================================================

> > +

<snipped>

> > +
> > +	/* Allow tracing for this process */
> > +	if (ptrace(PTRACE_TRACEME, 0, 0, 0) < 0) {
> > +		perror("Failed to turn the calling thread into a tracee");
> Typo?: s/thread/process/

Yeah, copy-pasted from the <ptrace> recipe. Fixed.

> > +		return EXIT_FAILURE;
> > +	}

<snipped>

> > +	/* Terminate profiler and Lua universe. */
> > +	assert(luaM_sysprof_stop(L) == PROFILE_SUCCESS);
> > +	utils_lua_close(L);
> It doesn't seem to be semantically correct to terminate Lua universe
> before obtaining the counters, even though there is no obstacles to do
> so in the API implementation itself.

To[MAH]to vs.To[MAY]to: fixed. The diff is below.

================================================================================

diff --git a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
index c0f9d1ca..5a2523ef 100644
--- a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
+++ b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
@@ -142,9 +142,8 @@ static int tracee(const char *luacode) {
 	assert(luaL_dostring(L, luacode) == LUA_OK);
 	assert(strcmp(lua_tostring(L, -1), MESSAGE) == 0);
 
-	/* Terminate profiler and Lua universe. */
+	/* Terminate profiler. */
 	assert(luaM_sysprof_stop(L) == PROFILE_SUCCESS);
-	utils_lua_close(L);
 
 	/*
 	 * XXX: The only event to be streamed must be FFUNC at
@@ -155,6 +154,9 @@ static int tracee(const char *luacode) {
 	assert(counters.samples == 1);
 	assert(counters.vmst_ffunc == 1);
 
+	/* Terminate Lua universe. */
+	utils_lua_close(L);
+
 	return EXIT_SUCCESS;
 }
 

================================================================================

> > +

<snipped>

> > +	/* Resume <chpid> tracee until SIGTRAP occurs. */
> > +	ptrace(PTRACE_CONT, chpid, 0, 0);
> > +	/* Wait <chpid> tracee signal-delivery-stop. */
> > +	waitpid(chpid, &wstatus, 0);
> Should we check the `wstatus` here too?

Implemented via <wait_alive> helper (see the details below).

> > +

<snipped>

> > +
> > +static int tracer(pid_t chpid) {
> > +	int wstatus;
> > +
> > +	/* Wait until <chpid> tracee is ready. */
> > +	waitpid(chpid, &wstatus, 0);
> Should we check the process status here too?

Implemented via <wait_alive> helper (see the implementation below).

================================================================================

diff --git a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
index 84c024f0..d17386dd 100644
--- a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
+++ b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
@@ -222,7 +222,7 @@ static int tracer(pid_t chpid) {
 	int wstatus;
 
 	/* Wait until <chpid> tracee is ready. */
-	waitpid(chpid, &wstatus, 0);
+	wait_alive(chpid);
 
 	/*
 	 * Resume <chpid> tracee until <lj_ff_tostring>.

================================================================================

> > +
> > +	/* Resume <chpid> tracee until <lj_ff_tostring>. */
> > +	wstatus = continue_until(chpid, lj_ff_tostring);
> > +
> > +	/* The tracee has to be alive and stopped by SIGTRAP. */
> > +	assert_false(WIFEXITED(wstatus));
> > +	assert_true(WIFSTOPPED(wstatus));
> > +
> > +	/* Resume <chpid> tracee until <lj_fff_res1>. */
> > +	wstatus = continue_until(chpid, lj_fff_res1);
> > +
> > +	/* The tracee has to be alive and stopped by SIGTRAP. */
> > +	assert_false(WIFEXITED(wstatus));
> > +	assert_true(WIFSTOPPED(wstatus));
> This part is duplicated, let's move it to a separate `healthcheck`
> function, or something like that.

It's only duplicated in sense of reusing this particular code, no more.
However, let's consider these conditions are not parts of the test
assertions set, but rather invariants for <continue_until> helper (i.e.
<continue_until> has to yield the execution with the "alive" tracee).

In this case, the patch for <continue_until> is the following:

================================================================================

diff --git a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
index 5a2523ef..84c024f0 100644
--- a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
+++ b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c
@@ -160,6 +160,18 @@ static int tracee(const char *luacode) {
 	return EXIT_SUCCESS;
 }
 
+static void wait_alive(pid_t chpid) {
+	int wstatus;
+
+	/* Wait <chpid> tracee signal-delivery-stop. */
+	waitpid(chpid, &wstatus, 0);
+
+	/* Check the tracee is still alive and just stopped. */
+	assert(!WIFEXITED(wstatus));
+	assert(!WIFSIGNALED(wstatus));
+	assert(WIFSTOPPED(wstatus));
+}
+
 const uint8_t INT3 = 0xCC;
 static inline unsigned long int3poison(unsigned long instruction) {
 	const size_t int3bits = sizeof(INT3) * 8;
@@ -167,8 +179,7 @@ static inline unsigned long int3poison(unsigned long instruction) {
 	return (instruction & int3mask) | INT3;
 }
 
-static int continue_until(pid_t chpid, void *addr) {
-	int wstatus;
+static void continue_until(pid_t chpid, void *addr) {
 	struct user_regs_struct regs;
 
 	/* Obtain the instructions at the <addr>. */
@@ -182,8 +193,12 @@ static int continue_until(pid_t chpid, void *addr) {
 
 	/* Resume <chpid> tracee until SIGTRAP occurs. */
 	ptrace(PTRACE_CONT, chpid, 0, 0);
-	/* Wait <chpid> tracee signal-delivery-stop. */
-	waitpid(chpid, &wstatus, 0);
+
+	/*
+	 * Wait <chpid> tracee signal-delivery-stop and check
+	 * whether it's still alive and just stopped via SIGTRAP.
+	 */
+	wait_alive(chpid);
 
 	/* Obtain GPR set to tweak RIP for further execution. */
 	ptrace(PTRACE_GETREGS, chpid, 0, &regs);
@@ -201,9 +216,6 @@ static int continue_until(pid_t chpid, void *addr) {
 	regs.rip -= sizeof(INT3);
 	ptrace(PTRACE_SETREGS, chpid, 0, &regs);
 	ptrace(PTRACE_POKETEXT, chpid, addr, data);
-
-	/* Return wait status to the caller for test checks. */
-	return wstatus;
 }
 
 static int tracer(pid_t chpid) {
@@ -212,19 +224,19 @@ static int tracer(pid_t chpid) {
 	/* Wait until <chpid> tracee is ready. */
 	waitpid(chpid, &wstatus, 0);
 
-	/* Resume <chpid> tracee until <lj_ff_tostring>. */
-	wstatus = continue_until(chpid, lj_ff_tostring);
-
-	/* The tracee has to be alive and stopped by SIGTRAP. */
-	assert_false(WIFEXITED(wstatus));
-	assert_true(WIFSTOPPED(wstatus));
-
-	/* Resume <chpid> tracee until <lj_fff_res1>. */
-	wstatus = continue_until(chpid, lj_fff_res1);
+	/*
+	 * Resume <chpid> tracee until <lj_ff_tostring>.
+	 * The tracee is alive and stopped by SIGTRAP if
+	 * <continue_until> returns.
+	 */
+	continue_until(chpid, lj_ff_tostring);
 
-	/* The tracee has to be alive and stopped by SIGTRAP. */
-	assert_false(WIFEXITED(wstatus));
-	assert_true(WIFSTOPPED(wstatus));
+	/*
+	 * Resume <chpid> tracee until <lj_fff_res1>.
+	 * The tracee is alive and stopped by SIGTRAP if
+	 * <continue_until> returns.
+	 */
+	 continue_until(chpid, lj_fff_res1);
 
 	/* Send SIGPROF to make sysprof collect the event. */
 	ptrace(PTRACE_CONT, chpid, 0, SIGPROF);

================================================================================

> > +

<snipped>

-- 
Best regards,
IM


More information about the Tarantool-patches mailing list