From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from [87.239.111.99] (localhost [127.0.0.1]) by dev.tarantool.org (Postfix) with ESMTP id 4386070AC38; Tue, 28 Nov 2023 17:59:29 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org 4386070AC38 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tarantool.org; s=dev; t=1701183569; bh=/HQ6zDyWgM1IQQWuecja+Q18kKgZMxD99x22a+AjQEs=; h=To:Date:In-Reply-To:References:Subject:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=bO6tnj/7MmOeRPzcS0MAn22cdtl8f/+ec0/wgk1JCLdEmZaj3O/fEasBuMzPI+nh1 3oHgdc2DkwHo4ILXjGu9cDW/kTt1nA7dd6H45/fZuVqlpUzDn7OjLCrF2fI2AlI/3n D7sdF9m2N4cjtN/LpS0rXIFWc+5Ftl46SC3Fu8Yw= Received: from smtp34.i.mail.ru (smtp34.i.mail.ru [95.163.41.75]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 615D370AC2F for ; Tue, 28 Nov 2023 17:58:29 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org 615D370AC2F Received: by smtp34.i.mail.ru with esmtpa (envelope-from ) id 1r7zXg-004VGe-1o; Tue, 28 Nov 2023 17:58:29 +0300 To: Maxim Kokryashkin , Sergey Kaplun Date: Tue, 28 Nov 2023 14:53:17 +0000 Message-Id: X-Mailer: git-send-email 2.39.2 In-Reply-To: References: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Mailru-Src: smtp X-4EC0790: 10 X-7564579A: EEAE043A70213CC8 X-77F55803: 4F1203BC0FB41BD9D2A6479154BC7F41A126550EFD29F2664D8B4428834D7431182A05F5380850402C21319D55BC68DC9D3F0191CD8598BE5BFDD2F78E304E00825D81214FC59835 X-7FA49CB5: FF5795518A3D127A4AD6D5ED66289B5278DA827A17800CE78981306C6E927004EA1F7E6F0F101C67BD4B6F7A4D31EC0BCC500DACC3FED6E28638F802B75D45FF8AA50765F79006379A70878BADDF00B98638F802B75D45FF36EB9D2243A4F8B5A6FCA7DBDB1FC311F39EFFDF887939037866D6147AF826D848316BB79E4D89FF8955BD46736E79A3117882F4460429724CE54428C33FAD305F5C1EE8F4F765FC1BE95B8C87527B4BA471835C12D1D9774AD6D5ED66289B52BA9C0B312567BB23117882F44604297287769387670735201E561CDFBCA1751FBDFBBEFFF4125B51D2E47CDBA5A96583BA9C0B312567BB2376E601842F6C81A19E625A9149C048EE26055571C92BF10FC67AC315686ED4D3D8FC6C240DEA76429C9F4D5AE37F343AA9539A8B242431040A6AB1C7CE11FEE3BC0ADEB1C81BB3626136E347CC761E07C4224003CC836476E2F48590F00D11D6E2021AF6380DFAD1A18204E546F3947CC2B5EEE3591E0D352E808ACE2090B5E1725E5C173C3A84C3C5EA940A35A165FF2DBA43225CD8A89F616AD31D0D18CD5C35872C767BF85DA2F004C90652538430E4A6367B16DE6309 X-C1DE0DAB: 0D63561A33F958A5BA94FF6C1F880A0BBD0C6543F236F6CA3729D7C6F4FA30E4F87CCE6106E1FC07E67D4AC08A07B9B0A85A14DF5F041C999C5DF10A05D560A950611B66E3DA6D700B0A020F03D25A0997E3FB2386030E77 X-C8649E89: 1C3962B70DF3F0ADE00A9FD3E00BEEDF3FED46C3ACD6F73ED3581295AF09D3DF87807E0823442EA2ED31085941D9CD0AF7F820E7B07EA4CFCB969A802A94B9777FA8F12E26D31B4B999B1DAF0B384AA63798391F98334EE3B0F06D2F9487D0FC2A47D1427BC164A71AE53155BED5A2967F086D08C144C53DA74DFFEFA5DC0E7F02C26D483E81D6BEECAEF3E2CCC1ED8C383653B6C8D9AE0FD16FCAA6493B703A X-D57D3AED: 3ZO7eAau8CL7WIMRKs4sN3D3tLDjz0dLbV79QFUyzQ2Ujvy7cMT6pYYqY16iZVKkSc3dCLJ7zSJH7+u4VD18S7Vl4ZUrpaVfd2+vE6kuoey4m4VkSEu530nj6fImhcD4MUrOEAnl0W826KZ9Q+tr5ycPtXkTV4k65bRjmOUUP8cvGozZ33TWg5HZplvhhXbhDGzqmQDTd6OAevLeAnq3Ra9uf7zvY2zzsIhlcp/Y7m53TZgf2aB4JOg4gkr2biojcqZ8kcsJfPHtpjpEEhwnXg== X-Mailru-Sender: 2FEBA92C8E508479FE7B9A1DF348D5315AFA72D76FC8D60673693EBB3F469C145E6A6BA515280ED42326FE6F2A341ACE0FB9F97486540B4CD9E8847AB8CFED4D9ABF8A61C016C2CFB0DAF586E7D11B3E67EA787935ED9F1B X-Mras: Ok Subject: [Tarantool-patches] [PATCH luajit 2/2] test: rewrite sysprof test using managed execution X-BeenThere: tarantool-patches@dev.tarantool.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Igor Munkin via Tarantool-patches Reply-To: Igor Munkin Cc: tarantool-patches@dev.tarantool.org Errors-To: tarantool-patches-bounces@dev.tarantool.org Sender: "Tarantool-patches" Often tests for sampling profiler require long running loops to be executed, so a certain situation is likely to occur. Thus the test added in the commit 285a1b0a16c1c3224ab93f7cf1fef17ab29e4ab4 ("sysprof: fix crash during FFUNC stream") expects FFUNC VM state (and even the particular instruction to be executed) at the moment when stacktrace is being collected. Unfortunately, it leads to test routine hang for several environments and if it does not we cannot guarantee that the desired scenario is tested (only rely on statistics). As a result the test for the aforementioned patch was disabled for Tarantool CI in the commit fef60a1052b8444be61e9a9932ab4875aff0b2ed ("test: prevent hanging Tarantool CI by sysprof test") until the issue is not resolved. This patch introduces the new approach for testing our sampling profiler via implementing precise managed execution of the profiled instance mentioned in tarantool/tarantool#9387. Instead of running around gazillion times we accurately step to the exact place where the issue reproduces and manually emit SIGPROF to the Lua VM being profiled. The particular approach implemented in this patch is described below. As it was mentioned, the test makes sysprof to collect the particular event (FFUNC) at the certain instruction in Lua VM () to reproduce the issue from tarantool/tarantool#8594. Hence it's enough to call fast function in the profiled instance (i.e. "tracee"). To emit SIGPROF right at in scope of builtin, the manager (i.e. "tracer") is implemented. Here are the main steps (see comments and `man 2 ptrace' for more info): 1. Poison instruction as the first instruction at to stop at the beginning of the fast function; 2. Resume the "tracee" from the "tracer"; 3. Hit the emitted interruption, restore the original instruction and "rewind" the RIP to "replay" the instruction at ; 4. Do the hack 1-3 for ; 5. Emit SIGPROF while resumimg the "tracee"; As a result sysprof collects the full backtrace with fast function as the topmost frame. Resolves tarantool/tarantool#9387 Follows up tarantool/tarantool#8594 Signed-off-by: Igor Munkin --- .../gh-8594-sysprof-ffunc-crash.test.c | 269 ++++++++++++++++++ .../gh-8594-sysprof-ffunc-crash.test.lua | 55 ---- 2 files changed, 269 insertions(+), 55 deletions(-) create mode 100644 test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c delete mode 100644 test/tarantool-tests/gh-8594-sysprof-ffunc-crash.test.lua diff --git a/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c new file mode 100644 index 00000000..4caed8cb --- /dev/null +++ b/test/tarantool-c-tests/gh-8594-sysprof-ffunc-crash.test.c @@ -0,0 +1,269 @@ +#include "lauxlib.h" +#include "lmisclib.h" +#include "lua.h" + +#include "test.h" +#include "utils.h" + +#include +#include +#include +#include +#include + +/* XXX: Still need normal assert inside and helpers. */ +#undef NDEBUG +#include + +/* + * XXX: The test is *very* Linux/x86_64 specific. Fortunately, so + * does the sampling profiler. is needed for LUAJIT_OS + * and LUAJIT_TARGET. + */ +#include "lj_arch.h" + +#if LUAJIT_OS == LUAJIT_OS_LINUX && LUAJIT_TARGET == LUAJIT_ARCH_X64 + +/* + * XXX: The test makes sysprof to collect the particular event + * (FFUNC) at the particular instruction () to + * reproduce the issue #8594. Hence it's enough to call + * fast function (this is done in function). To emit + * SIGPROF right at in scope of fast + * function, the managed execution is implemented in the function + * : instruction is poisoned as the first + * instruction at to stop at the + * beginning of the fast function; resumes the ; + * the same hack is done for . When the hits + * the interruption at , SIGPROF is emitted while + * resuming the . As a result sysprof collects the full + * backtrace with fast function as the topmost frame. + * + * See more info here: + * * https://man7.org/linux/man-pages/man2/ptrace.2.html + * * https://github.com/tarantool/tarantool/issues/8594 + * * https://github.com/tarantool/tarantool/issues/9387 + */ + +#define MESSAGE "Canary is alive" +#define LUACALL "local a = tostring('" MESSAGE "') return a" +/* XXX: Resolve the necessary addresses from VM engine. */ +extern void *lj_ff_tostring(void); +extern void *lj_fff_res1(void); + +/* Sysprof "/dev/null" stream helpers. {{{ */ + +/* + * Yep, 8Mb. Tuned in order not to bother the platform with too + * often flushes. + */ +#define STREAM_BUFFER_SIZE (8 * 1024 * 1024) +#define DEVNULL -1 + +struct devnull_ctx { + /* + * XXX: Dummy file descriptor to be used as "/dev/null" + * context indicator in writer and on_stop callback. + */ + int fd; + /* Buffer for data recorded by sysprof. */ + uint8_t buf[STREAM_BUFFER_SIZE]; +}; + +static int stream_new(struct luam_Sysprof_Options *options) { + struct devnull_ctx *ctx = calloc(1, sizeof(struct devnull_ctx)); + if (ctx == NULL) + return PROFILE_ERRIO; + + /* Set "/dev/null" context indicator. */ + ctx->fd = DEVNULL; + options->ctx = ctx; + options->buf = ctx->buf; + options->len = STREAM_BUFFER_SIZE; + + return PROFILE_SUCCESS; +} + +static int stream_delete(void *rawctx, uint8_t *buf) { + struct devnull_ctx *ctx = rawctx; + assert(ctx->fd == DEVNULL); + free(ctx); + return PROFILE_SUCCESS; +} + +static size_t stream_writer(const void **buf_addr, size_t len, void *rawctx) { + struct devnull_ctx *ctx = rawctx; + assert(ctx->fd == DEVNULL); + /* Do nothing, just return back to the profiler. */ + return STREAM_BUFFER_SIZE; +} + +/* }}} Sysprof "/dev/null" stream helpers. */ + +static int tracee(const char *luacode) { + struct luam_Sysprof_Counters counters = {}; + struct luam_Sysprof_Options opt = { + /* Collect full backtraces per event. */ + .mode = LUAM_SYSPROF_CALLGRAPH, + /* + * XXX: Setting the "endless timer". The test + * requires the single event to be streamed at + * instruction, so to avoid spoiling + * the stream with other unwanted events, the + * timer is set to some unreachable point, so the + * profiler will be guaranteed to stop before any + * event is emitted. + */ + .interval = -1ULL, + }; + + /* Allow tracing for this process */ + if (ptrace(PTRACE_TRACEME, 0, 0, 0) < 0) { + perror("Failed to turn the calling thread into a tracee"); + return EXIT_FAILURE; + } + + /* + * XXX: Allow parent (which is our tracer now) to observe + * our signal-delivery-stop (i.e. the tracee is ready). + * For more info see ptrace(2), "Attaching and detaching". + */ + raise(SIGSTOP); + + lua_State *L = utils_lua_init(); + + /* Customize and start profiler. */ + assert(stream_new(&opt) == PROFILE_SUCCESS); + assert(luaM_sysprof_set_writer(stream_writer) == PROFILE_SUCCESS); + assert(luaM_sysprof_set_on_stop(stream_delete) == PROFILE_SUCCESS); + assert(luaM_sysprof_start(L, &opt) == PROFILE_SUCCESS); + + /* FIXME: Make this part test agnostic. */ + assert(luaL_dostring(L, luacode) == LUA_OK); + assert(strcmp(lua_tostring(L, -1), MESSAGE) == 0); + + /* Terminate profiler and Lua universe. */ + assert(luaM_sysprof_stop(L) == PROFILE_SUCCESS); + utils_lua_close(L); + + /* + * XXX: The only event to be streamed must be FFUNC at + * instruction. + * FIXME: Make this part test agnostic. + */ + assert(luaM_sysprof_report(&counters) == PROFILE_SUCCESS); + assert(counters.samples == 1); + assert(counters.vmst_ffunc == 1); + + return EXIT_SUCCESS; +} + +const uint8_t INT3 = 0xCC; +static inline unsigned long int3poison(unsigned long instruction) { + const size_t int3bits = sizeof(INT3) * 8; + const unsigned long int3mask = -1UL >> int3bits << int3bits; + return (instruction & int3mask) | INT3; +} + +static int continue_until(pid_t chpid, void *addr) { + int wstatus; + struct user_regs_struct regs; + + /* Obtain the instructions at the . */ + unsigned long data = ptrace(PTRACE_PEEKTEXT, chpid, addr, 0); + /* + * Emit the instruction to the . + * XXX: is poisoned as the LSB to the + * obtained from the above. + */ + ptrace(PTRACE_POKETEXT, chpid, addr, int3poison(data)); + + /* Resume tracee until SIGTRAP occurs. */ + ptrace(PTRACE_CONT, chpid, 0, 0); + /* Wait tracee signal-delivery-stop. */ + waitpid(chpid, &wstatus, 0); + + /* Obtain GPR set to tweak RIP for further execution. */ + ptrace(PTRACE_GETREGS, chpid, 0, ®s); + /* + * Make sure we indeed are stopped at . + * XXX: RIP points right after instruction. + */ + assert(regs.rip == (long)addr + sizeof(INT3)); + + /* + * XXX: Restore the original instruction at and + * "rewind" RIP by size to "replay" the poisoned + * instruction at the . + */ + regs.rip -= sizeof(INT3); + ptrace(PTRACE_SETREGS, chpid, 0, ®s); + ptrace(PTRACE_POKETEXT, chpid, addr, data); + + /* Return wait status to the caller for test checks. */ + return wstatus; +} + +static int tracer(pid_t chpid) { + int wstatus; + + /* Wait until tracee is ready. */ + waitpid(chpid, &wstatus, 0); + + /* Resume tracee until . */ + wstatus = continue_until(chpid, lj_ff_tostring); + + /* The tracee has to be alive and stopped by SIGTRAP. */ + assert_false(WIFEXITED(wstatus)); + assert_true(WIFSTOPPED(wstatus)); + + /* Resume tracee until . */ + wstatus = continue_until(chpid, lj_fff_res1); + + /* The tracee has to be alive and stopped by SIGTRAP. */ + assert_false(WIFEXITED(wstatus)); + assert_true(WIFSTOPPED(wstatus)); + + /* Send SIGPROF to make sysprof collect the event. */ + ptrace(PTRACE_CONT, chpid, 0, SIGPROF); + + /* Wait until tracee successfully exits. */ + waitpid(chpid, &wstatus, 0); + assert_true(WIFEXITED(wstatus)); + + return TEST_EXIT_SUCCESS; +} + +static int test_tostring_call(void *ctx) { + pid_t chpid = fork(); + switch(chpid) { + case -1: + bail_out("Tracee fork failed"); + case 0: + /* + * XXX: Tracee has to instead of + * to avoid duplicate reports in . + * Test assertions are used only in the , + * so the ought to report whether the + * test succeeded or not. + */ + exit(tracee(LUACALL)); + default: + return tracer(chpid); + } +} + +#else /* LUAJIT_OS == LUAJIT_OS_LINUX && LUAJIT_TARGET == LUAJIT_ARCH_X64 */ + +static int test_tostring_call(void *ctx) { + return skip_all("sysprof is implemented for Linux/x86_64 only"); +} + +#endif /* LUAJIT_OS == LUAJIT_OS_LINUX && LUAJIT_TARGET == LUAJIT_ARCH_X64 */ + +int main(void) { + const struct test_unit tgroup[] = { + test_unit_def(test_tostring_call), + }; + return test_run_group(tgroup, NULL); +} diff --git a/test/tarantool-tests/gh-8594-sysprof-ffunc-crash.test.lua b/test/tarantool-tests/gh-8594-sysprof-ffunc-crash.test.lua deleted file mode 100644 index f8b409ae..00000000 --- a/test/tarantool-tests/gh-8594-sysprof-ffunc-crash.test.lua +++ /dev/null @@ -1,55 +0,0 @@ -local tap = require('tap') -local test = tap.test('gh-8594-sysprof-ffunc-crash'):skipcond({ - ['Sysprof is implemented for x86_64 only'] = jit.arch ~= 'x86' and - jit.arch ~= 'x64', - ['Sysprof is implemented for Linux only'] = jit.os ~= 'Linux', - -- luacheck: no global - ['Prevent hanging Tarantool CI due to #9387'] = _TARANTOOL, -}) - -test:plan(1) - -jit.off() --- XXX: Run JIT tuning functions in a safe frame to avoid errors --- thrown when LuaJIT is compiled with JIT engine disabled. -pcall(jit.flush) - -local TMP_BINFILE = '/dev/null' - --- XXX: The best way to test the issue is to set the profile --- interval to be as short as possible. However, our CI is --- not capable of handling such intense testing, so it was a --- forced decision to reduce the sampling frequency for it. As a --- result, it is now less likely to reproduce the issue --- statistically, but the test case is still valid. - --- GitHub always sets[1] the `CI` environment variable to `true` --- for every step in a workflow. --- [1]: https://docs.github.com/en/actions/learn-github-actions/variables#default-environment-variables -local CI = os.getenv('CI') == 'true' - --- Profile interval and number of iterations for CI are --- empirical. Non-CI profile interval is set to be as short --- as possible, so the issue is more likely to reproduce. --- Non-CI number of iterations is greater for the same reason. -local PROFILE_INTERVAL = CI and 3 or 1 -local N_ITERATIONS = CI and 1e5 or 1e6 - -local res, err = misc.sysprof.start{ - mode = 'C', - interval = PROFILE_INTERVAL, - path = TMP_BINFILE, -} -assert(res, err) - -for i = 1, N_ITERATIONS do - -- XXX: `tostring` is FFUNC. - tostring(i) -end - -res, err = misc.sysprof.stop() -assert(res, err) - -test:ok(true, 'FFUNC frames were streamed correctly') - -os.exit(test:check() and 0 or 1) -- 2.39.2