[Tarantool-patches] [PATCH v3 0/4] fiber: introduce creation backtrace

Vladimir Davydov vdavydov at tarantool.org
Tue Jul 27 14:53:41 MSK 2021


On Wed, Jul 14, 2021 at 02:12:48PM +0300, Egor Elchinov via Tarantool-patches wrote:
> https://github.com/tarantool/tarantool/tree/Egor2001/gh-4002-fiber-creation-backtrace
> 
> Egor Elchinov (4):
>   fiber: add PoC for fiber creation backtrace
>   fiber: add option and PoC for Lua parent backtrace
>   fiber: refactor lua backtrace routines
>   fiber: refactor C backtrace and add changelog

Kirill asked me to take a look at this series. The branch contains a
different set of patches. I'm going to review what you pushed to the
branch.

Side note: the series is not very well split IMO. The goal of splitting
is to speed up the review process so it's good to split a series in such
a way that each patch can be viewed at and applied independently of the
following patches. In particular, all code introduced in a patch should
be fully covered by existing tests or tests added in the patch so that
it can be applied before the rest of the series. In your case it isn't
so:

 - fiber: add fiber creation Lua backtrace
   Adds some C functions that aren't used, nor tested.
 - fiber: add fiber creation C backtrace with option
   Adds a knob that does nothing + some struct members that aren't used.
 - backtrace: add RIP tracing and resolving API
   Introduces the rest of the feature.

As a result, a reviewer has to jump between patch 3 and patches 1, 2 or
ignore the splitting and review the resulting diff instead. I'm going to
do the latter.

You do a lot of refactoring in your patch, e.g. factor out helpers to
work with proc name cache. I'd do each independent piece of refactoring
in a separate, preparatory patch.

There are two main concerns regarding this patch, which should be
addressed if possible:

 - I think that we should include backtraces of all ancestors into a
   fiber's backtrace, not just its parent. Without it, the feature
   doesn't seem to be complete.

 - There's a lot of code duplication between the code handling child and
   parent backtraces. This makes the code difficult for understanding
   and future support. It would be nice to have one function that
   captures a backtrace and stores it in some format and one function
   that dumps a captured backtrace to Lua. And use these two functions
   for capturing and showing both child and parent backtraces. Can it be
   achieved?

Looked through the patch. Some minor comments below.

> diff --git a/changelogs/unreleased/gh-4002-fiber-creation-backtrace.md b/changelogs/unreleased/gh-4002-fiber-creation-backtrace.md
> new file mode 100644
> index 000000000000..1e1ac177a6c6
> --- /dev/null
> +++ b/changelogs/unreleased/gh-4002-fiber-creation-backtrace.md
> @@ -0,0 +1,8 @@
> +## feature/fiber

Should be feature/core I think.

> +
> + * Added new subtable `parent_backtrace` to the `fiber.info()`
> +   containing C and Lua backtrace chunks of fiber creation.
> + * Added `fiber.parent_bt_enable()` and `fiber.parent_bt_disable()`
> +   options in order to switch on/off the ability to collect
> +   parent backtraces for newly created fibers and to
> +   show/hide `parent_backtrace` subtables in the `fiber.info()`.

Please rephrase and put under one bullet point - it's one feature that
can be enabled/disabled, not two features.

> diff --git a/src/lib/core/backtrace.cc b/src/lib/core/backtrace.cc
> index 85c1aaefd009..a59b77053582 100644
> --- a/src/lib/core/backtrace.cc
> +++ b/src/lib/core/backtrace.cc
> @@ -74,12 +78,68 @@ backtrace_proc_cache_clear(void)
>  	proc_cache = NULL;
>  }
>  
> +int
> +backtrace_proc_cache_find(unw_word_t ip, const char **name, unw_word_t *offset)

Should be static.

> +int
> +backtrace_proc_cache_put(unw_word_t ip, const char *name, unw_word_t offset)

Should be static. The return value is never used, please remove.

Adding these two functions would be a good preparatory patch that could
be applied before the rest of the series is even reviewed.

> +/**
> + * Collect up to `limit' IP register values
> + * for frames of the current stack into `ip_buf'.
> + * Must be by far faster than usual backtrace according to the
> + * libunwind doc for unw_backtrace().
> + */
> +void NOINLINE
> +backtrace_collect_ip(void **ip_buf, int limit)
> +{
> +	memset(ip_buf, 0, limit * sizeof(*ip_buf));
> +#ifndef TARGET_OS_DARWIN
> +	unw_backtrace(ip_buf, limit);
> +#else
> +	/*
> +	 * This dumb implementation was chosen because the DARWIN
> +	 * lacks unw_backtrace() routine from libunwind and
> +	 * usual backtrace() from <execinfo.h> has less capabilities
> +	 * than the libunwind version which uses DWARF.
> +	 */
> +	unw_cursor_t unw_cur;
> +	unw_context_t unw_ctx;
> +	int frame_no = 0;
> +	unw_word_t ip;
> +
> +	unw_getcontext(&unw_ctx);
> +	unw_init_local(&unw_cur, &unw_ctx);
> +
> +	while (frame_no < limit && unw_step(&unw_cur) > 0) {
> +		unw_get_reg(&unw_cur, UNW_REG_IP, &ip);
> +		ip_buf[frame_no] = (void *)ip;
> +		++frame_no;
> +	}
> +#endif
> +}
> +
> +/**
> + * Call `cb' callback for not more than
> + * first `limit' frames present in the `ip_buf'.
> + *
> + * The implementation uses poorly documented `get_proc_name' callback
> + * from the `unw_accessors_t' to get procedure names via `ip_buf' values.
> + * Although `get_proc_name' is present on most architectures, it's an optional
> + * field, so procedure name is allowed to be absent (NULL) in `cb' call.
> + *
> + * TODO: to add cache and demangling support

Who's going to do that and when? Is there a ticket for that? What won't
work without "cache and demangling support"?

It would be best to avoid introducing any TODOs in the code if possible.
If not possible, please explain thoroughly what doesn't work, what
should be done, create a ticket, and add a reference to the ticket to
the TODO, e.g.

  TODO(gh-12345): ...

> + */
> +void
> +backtrace_foreach_ip(backtrace_cb cb, void **ip_buf, int limit,
> +		     void *cb_ctx)

Please make cb next to last argument, near cb_ctx:

  backtrace_foreach_ip(ip_buf, limit, cb, cb_ctx)

> +{
> +	int demangle_status;
> +	char *demangle_buf = NULL;
> +	size_t demangle_buf_len = 0;
> +#ifndef TARGET_OS_DARWIN
> +	char proc_name[BACKTRACE_NAME_MAX];
> +	unw_word_t ip = 0, offset = 0;
> +	unw_proc_info_t pi;
> +	int frame_no, ret = 0;
> +	const char *proc = NULL;
> +
> +	unw_accessors_t *acc = unw_get_accessors(unw_local_addr_space);
> +
> +	/*
> +	 * RIPs collecting comes from inside a helper routine
> +	 * so we skip the collector function address itself thus
> +	 * start fetching functions with frame number = 1.
> +	 */
> +	for (frame_no = 1; frame_no < limit && ip_buf[frame_no] != NULL;

Can we somehow not return the frame corresponding to
backtrace_collect_ip, since we don't need it anyway?

> +	     frame_no++) {
> +		ip = (unw_word_t)ip_buf[frame_no];
> +
> +		if (backtrace_proc_cache_find(ip, &proc, &offset) == 0) {
> +			ret = 0;
> +		} else if (acc->get_proc_name == NULL) {
> +			ret = unw_get_proc_info_by_ip(unw_local_addr_space,
> +						      ip, &pi, NULL);
> +			offset = ip - pi.start_ip;
> +			proc = NULL;
> +			backtrace_proc_cache_put(ip, proc, offset);
> +		} else {
> +			ret = acc->get_proc_name(unw_local_addr_space, ip,
> +						 proc_name, sizeof(proc_name),
> +						 &offset, NULL);
> +			proc = proc_name;
> +			backtrace_proc_cache_put(ip, proc, offset);
> +		}
> +
> +		if (proc != NULL) {
> +			char *cxxname = abi::__cxa_demangle(proc, demangle_buf,
> +							    &demangle_buf_len,
> +							    &demangle_status);
> +			if (cxxname != NULL) {
> +				demangle_buf = cxxname;
> +				proc = cxxname;
> +			}
> +		}
> +		if (ret != 0 || cb(frame_no - 1, (void *)ip, proc,
> +				   (size_t)offset, cb_ctx) != 0)
> +			break;
> +	}
> +
> +	free(demangle_buf);
> +	if (ret != 0)
> +		say_debug("unwinding error: %s", unw_strerror(ret));
> +#else
> +	int frame_no, ret = 1;
> +	void *ip = NULL;
> +	size_t offset = 0;
> +	Dl_info dli;
> +	const char *proc = NULL;
> +
> +	for (frame_no = 1; frame_no < limit && ip_buf[frame_no] != NULL;
> +	     ++frame_no) {
> +		ip = ip_buf[frame_no];
> +		if (backtrace_proc_cache_find((unw_word_t)ip, &proc,
> +					      &offset) == 0) {
> +			ret = 1;
> +		} else {
> +			ret = dladdr(ip, &dli);
> +			if (ret == 0)
> +				break;
> +			offset = (char *)ip - (char *)dli.dli_saddr;
> +			proc = dli.dli_sname;
> +			backtrace_proc_cache_put((unw_word_t)ip, proc, offset);
> +		}
> +
> +		if (proc != NULL) {
> +			char *cxxname = abi::__cxa_demangle(proc, demangle_buf,
> +							    &demangle_buf_len,
> +							    &demangle_status);
> +			if (cxxname != NULL) {
> +				demangle_buf = cxxname;
> +				proc = cxxname;
> +			}
> +		}

Looks like code duplication between #if TARGET_OS_DARWIN and #else
blocks. Is it possible to factor out only those parts that are actually
different between those two cases and put them in a separate function?

> +		if (cb(frame_no - 1, ip, proc, offset, cb_ctx) != 0)
> +			break;
> +	}
> +
> +	free(demangle_buf);
> +	if (ret == 0)
> +		say_debug("unwinding error: %i", ret);
> +#endif
> +}
> +
>  void
>  print_backtrace(void)
>  {
> diff --git a/src/lib/core/fiber.c b/src/lib/core/fiber.c
> index 588b78504f38..f0fe6b893cd3 100644
> --- a/src/lib/core/fiber.c
> +++ b/src/lib/core/fiber.c
> @@ -45,6 +45,11 @@
>  
>  extern void cord_on_yield(void);
>  
> +#if ENABLE_BACKTRACE
> +#include "backtrace.h" /* fast_trace */
> +
> +#endif /* ENABLE_BACKTRACE */
> +

The #if/endif isn't needed here.

>  #if ENABLE_FIBER_TOP
>  #include <x86intrin.h> /* __rdtscp() */
>  
> @@ -215,6 +220,10 @@ fiber_mprotect(void *addr, size_t len, int prot)
>  static __thread bool fiber_top_enabled = false;
>  #endif /* ENABLE_FIBER_TOP */
>  
> +#if ENABLE_BACKTRACE
> +static __thread bool fiber_parent_bt_enabled = false;
> +#endif /* ENABLE_BACKTRACE */
> +

Why __thread?

> diff --git a/src/lua/fiber.c b/src/lua/fiber.c
> index 38394666b41d..d82a3a20c132 100644
> --- a/src/lua/fiber.c
> +++ b/src/lua/fiber.c
> @@ -210,6 +226,69 @@ dump_lua_frame(struct lua_State *L, lua_Debug *ar, int tb_frame)
>  	lua_settable(L, -3);
>  }
>  
> +static void
> +lua_backtrace_foreach(struct lua_State *L, lua_backtrace_cb cb, void *cb_ctx)

This helper function could be introduced in a separate patch.

> @@ -220,38 +299,60 @@ fiber_backtrace_cb(int frameno, void *frameret, const char *func, size_t offset,
>  	 * https://github.com/tarantool/tarantool/issues/5326
>  	 * will not resolved, but is possible afterwards.
>  	 */
> -	if (func != NULL && strstr(func, "lj_BC_FUNCC") == func) {
> +	if (func != NULL && tb_ctx->R && strstr(func, "lj_BC_FUNCC") == func) {
>  		/* We are in the LUA vm. */
> -		lua_Debug ar;
> -		while (tb_ctx->R && lua_getstack(tb_ctx->R, tb_ctx->lua_frame, &ar) > 0) {
> -			/* Skip all following C-frames. */
> -			lua_getinfo(tb_ctx->R, "Sln", &ar);
> -			if (*ar.what != 'C')
> -				break;
> -			if (ar.name != NULL) {
> -				/* Dump frame if it is a C built-in call. */
> -				tb_ctx->tb_frame++;
> -				dump_lua_frame(L, &ar, tb_ctx->tb_frame);
> -			}
> -			tb_ctx->lua_frame++;
> -		}
> -		while (tb_ctx->R && lua_getstack(tb_ctx->R, tb_ctx->lua_frame, &ar) > 0) {
> -			/* Trace Lua frame. */
> -			lua_getinfo(tb_ctx->R, "Sln", &ar);
> -			if (*ar.what == 'C') {
> -				break;
> -			}
> +		lua_backtrace_foreach(tb_ctx->R, dump_lua_frame_cb, tb_ctx);
> +	}
> +	char buf[512];
> +	int l = snprintf(buf, sizeof(buf), "#%-2d %p in ",
> +			 frameno, frameret);
> +	if (func)
> +		snprintf(buf + l, sizeof(buf) - l, "%s+%zu", func, offset);
> +	else
> +		snprintf(buf + l, sizeof(buf) - l, "?");
> +	buf[sizeof(buf) - 1] = 0;
> +	tb_ctx->tb_frame++;
> +	lua_pushnumber(L, tb_ctx->tb_frame);
> +	lua_newtable(L);
> +	lua_pushstring(L, "C");
> +	lua_pushstring(L, buf);
> +	lua_settable(L, -3);
> +	lua_settable(L, -3);

Duplication with the code from fiber_backtrace_cb below. Should be
factored out to a separate helper function.

> +	return 0;
> +}
> +
> +static int
> +fiber_parent_backtrace_cb(int frameno, void *frameret, const char *func,
> +			  size_t offset, void *cb_ctx)
> +{
> +	int lua_frame = 0;
> +	struct lua_parent_tb_ctx *tb_ctx = (struct lua_parent_tb_ctx *)cb_ctx;
> +	struct parent_bt_lua *bt = tb_ctx->bt;
> +	struct lua_State *L = tb_ctx->L;
> +	/*
> +	 * There is impossible to get func == NULL until
> +	 * https://github.com/tarantool/tarantool/issues/5326
> +	 * will not resolved, but is possible afterwards.
> +	 */
> +	if (bt != NULL && func != NULL && strstr(func, "lj_BC_FUNCC") == func) {
> +		/* We are in the LUA vm. */
> +		while (lua_frame < bt->cnt) {
>  			tb_ctx->tb_frame++;
> -			dump_lua_frame(L, &ar, tb_ctx->tb_frame);
> -			tb_ctx->lua_frame++;
> +			dump_lua_frame(L, bt->names[lua_frame],
> +				       bt->sources[lua_frame],
> +				       bt->lines[lua_frame],
> +				       tb_ctx->tb_frame);
> +			lua_frame++;
>  		}
>  	}
>  	char buf[512];
> -	int l = snprintf(buf, sizeof(buf), "#%-2d %p in ", frameno, frameret);
> +	int l = snprintf(buf, sizeof(buf), "#%-2d %p in ",
> +			 frameno, frameret);
>  	if (func)
>  		snprintf(buf + l, sizeof(buf) - l, "%s+%zu", func, offset);
>  	else
>  		snprintf(buf + l, sizeof(buf) - l, "?");
> +	buf[sizeof(buf) - 1] = 0;
>  	tb_ctx->tb_frame++;
>  	lua_pushnumber(L, tb_ctx->tb_frame);
>  	lua_newtable(L);
> @@ -433,6 +569,22 @@ lbox_do_backtrace(struct lua_State *L)
>  	}
>  	return true;
>  }
> +
> +static int
> +lbox_fiber_parent_bt_enable(struct lua_State *L)
> +{
> +	(void) L;
> +	fiber_parent_bt_enable();
> +	return 0;
> +}
> +
> +static int
> +lbox_fiber_parent_bt_disable(struct lua_State *L)
> +{
> +	(void) L;
> +	fiber_parent_bt_disable();
> +	return 0;
> +}
>  #endif /* ENABLE_BACKTRACE */

Better have one function which would take true/false IMO, but I think we
have to wind up with two, for consistency with fiber_top_enable...

> @@ -498,6 +654,12 @@ fiber_create(struct lua_State *L)
>  		luaT_error(L);
>  	}
>  
> +#ifdef ENABLE_BACKTRACE
> +	// TODO: error handling

Please fix in this patch. Either with luaT_error() or with panic(),
which is okay, because malloc doesn't normally fail and if it does
there isn't much we can do.

> +	if (fiber_parent_bt_is_enabled())
> +		fiber_parent_bt_init(f, L);
> +#endif
> +
> diff --git a/src/lua/fiber.h b/src/lua/fiber.h
> index e298987062c5..450840fb045b 100644
> --- a/src/lua/fiber.h
> +++ b/src/lua/fiber.h
> @@ -36,6 +36,25 @@ extern "C" {
>  
>  struct lua_State;
>  
> +/**
> + * Maximal name length (including '\0')
> + * and backtrace length.
> + */
> +enum {
> +	PARENT_BT_LUA_NAME_MAX = 64,
> +	PARENT_BT_LUA_LEN_MAX = 8

I think we should capture more frames, especially if we capture
backtraces of all ancestors. I think 64 would be an adequate limit.

> +};
> +
> +/**
> + * Stores lua parent backtrace for fiber.
> + */
> +struct parent_bt_lua {
> +	int cnt;
> +	char names[PARENT_BT_LUA_LEN_MAX][PARENT_BT_LUA_NAME_MAX];
> +	char sources[PARENT_BT_LUA_LEN_MAX][PARENT_BT_LUA_NAME_MAX];
> +	int lines[PARENT_BT_LUA_LEN_MAX];
> +};

Instead of having three arrays, better introduce a separate struct and
create one array of those structs.


More information about the Tarantool-patches mailing list