[Tarantool-patches] [PATCH v4 4/4] crash: report crash data to the feedback server

Cyrill Gorcunov gorcunov at gmail.com
Thu Dec 10 19:18:32 MSK 2020


We have a feedback server which gathers information about a running instnace.
While general info is enough for now we may loose a precious information about
crashes (such as call backtrace which caused the issue, type of build and etc).

In the commit we add support of sending this kind of infomation to the feedback
server. Internally we gather the reason of failure, pack it into base64 form
and then run another tarantool instance which sends it out.

A typical report might look like

 | {
 |   "crashdump": {
 |     "version": "1",
 |     "data": "eyJ1bmFtZSI6eyJzeXNuYW1lIjoiTGludXgiLCJyZWxlYXNlIjoiNS"
 |             "45LjExLTEwMC5mYzMyLng4Nl82NCIsInZlcnNpb24iOiIjMSBTTVAg"
 |             "VHVlIE5vdiAyNCAxOToxNjo1MyBVVEMgMjAyMCIsIm1hY2hpbmUiOi"
 |             "J4ODZfNjQifSwiYnVpbGQiOnsidmVyc2lvbiI6IjIuNy4wLTgzLWc5"
 |             "ZTg1MDM4ZmIiLCJjbWFrZV90eXBlIjoiTGludXgteDg2XzY0LURlYn"
 |             "VnIn0sInNpZ25hbCI6eyJzaWdubyI6MTEsInNpX2NvZGUiOjAsInNp"
 |             "X2FkZHIiOiIweDNlODAwMDVjOGE5IiwiYmFja3RyYWNlIjoiSXpBZ0"
 |             "lEQjROak14WkRBeklHbHVJR055WVhOb1gyTnZiR3hsWTNRclltWUtJ"
 |             "ekVnSURCNE5qTXlaV1JoSUdsdUlHTnlZWE5vWDNOcFoyNWhiRjlqWW"
 |             "lzMU1Rb2pNaUFnTUhnM1pqRmtORGt5TXpsaE9UQWdhVzRnWm5WdWJH"
 |             "OWphMlpwYkdVck5qQUtJek1nSURCNE4yWXhaRFE0WkdReFl6VmxJR2"
 |             "x1SUdWd2IyeHNYM2RoYVhRck5XVUtJelFnSURCNE9ETTNOekV6SUds"
 |             "dUlHVndiMnhzWDNCdmJHd3JPVGtLSXpVZ0lEQjRPRE5oWkRRNElHbH"
 |             "VJR1YyWDNKMWJpczBPRGNLSXpZZ0lEQjROakJtTUdGbElHbHVJSFJo"
 |             "Y21GdWRHOXZiRjlzZFdGZmNuVnVYM05qY21sd2RDc3hNemdLSXpjZ0"
 |             "lEQjRORFEyTldZMUlHbHVJRzFoYVc0ck5XRmpDaU00SUNBd2VEZG1N"
 |             "V1EwT0dObU56QTBNaUJwYmlCZlgyeHBZbU5mYzNSaGNuUmZiV0ZwYm"
 |             "l0bU1nb2pPU0FnTUhnME5EUmhNR1VnYVc0Z1gzTjBZWEowS3pKbENn"
 |             "PT0iLCJ0aW1lc3RhbXAiOiIyMDIwLTEyLTEwIDE3OjA4OjQyIE1TSyJ9fQ=="
 |   }
 | }

The `data` value is a single string I wrapped it for commit message only.
When `data` is decoded it consists of

 | {
 |   "uname": {
 |     "sysname": "Linux",
 |     "release": "5.9.11-100.fc32.x86_64",
 |     "version": "#1 SMP Tue Nov 24 19:16:53 UTC 2020",
 |     "machine": "x86_64"
 |   },
 |   "build": {
 |     "version": "2.7.0-83-g9e85038fb",
 |     "cmake_type": "Linux-x86_64-Debug"
 |   },
 |   "signal": {
 |     "signo": 11,
 |     "si_code": 0,
 |     "si_addr": "0x3e80005c8a9",
 |     "backtrace": "IzAgIDB4NjMxZDAzIGluIGNyYXNoX2NvbGxlY3QrYmYKIzEgID"
 |                  "B4NjMyZWRhIGluIGNyYXNoX3NpZ25hbF9jYis1MQojMiAgMHg3ZjFkNDkyMzlhO"
 |                  "TAgaW4gZnVubG9ja2ZpbGUrNjAKIzMgIDB4N2YxZDQ4ZGQxYzVlIGluIGVwb2xs"
 |                  "X3dhaXQrNWUKIzQgIDB4ODM3NzEzIGluIGVwb2xsX3BvbGwrOTkKIzUgIDB4ODN"
 |                  "hZDQ4IGluIGV2X3J1bis0ODcKIzYgIDB4NjBmMGFlIGluIHRhcmFudG9vbF9sdW"
 |                  "FfcnVuX3NjcmlwdCsxMzgKIzcgIDB4NDQ2NWY1IGluIG1haW4rNWFjCiM4ICAwe"
 |                  "DdmMWQ0OGNmNzA0MiBpbiBfX2xpYmNfc3RhcnRfbWFpbitmMgojOSAgMHg0NDRh"
 |                  "MGUgaW4gX3N0YXJ0KzJlCg==",
 |     "timestamp": "2020-12-10 17:08:42 MSK"
 |   }
 | }

The `backtrace` istself is encoded as base64 because of newline symbols
(and may comprise some nonascii symbols as well).

There is no simple way to test this so I did it manually:
1) Run instance with

	box.cfg{log_level = 8, feedback_host="127.0.0.1:1500"}

2) Run listener shell as

	while true ; do nc -l -p 1500 -c 'echo -e "HTTP/1.1 200 OK\n\n $(date)"'; done

3) Send SIGSEGV

	kill -11 `pidof tarantool`

Once SIGSEGV is delivered the crashinfo data is generated and sent out. For
debug purpose this data is also printed to the terminal on debug log level.

Closes #5261

Signed-off-by: Cyrill Gorcunov <gorcunov at gmail.com>

@TarantoolBot document
Title: Configuration update, allow to disable sending crash information

For better analisys of program crashes the information associated with
the crash such as

 - utsname (similar to `uname -a` output except the network name)
 - build information
 - reason for a crash
 - call backtrace

is sent to the feedback server. To disable it set `feedback_no_crashinfo`
to `false`.
---
 src/box/lua/feedback_daemon.lua |   7 +
 src/box/lua/load_cfg.lua        |   3 +
 src/exports.h                   |   1 +
 src/lib/core/CMakeLists.txt     |   2 +-
 src/lib/core/crash.c            | 263 ++++++++++++++++++++++++++++++++
 src/lib/core/crash.h            |  12 ++
 src/main.cc                     |   1 +
 test/box/admin.result           |   2 +
 test/box/cfg.result             |   4 +
 9 files changed, 294 insertions(+), 1 deletion(-)

diff --git a/src/box/lua/feedback_daemon.lua b/src/box/lua/feedback_daemon.lua
index 1d39012ed..ae4d1a25f 100644
--- a/src/box/lua/feedback_daemon.lua
+++ b/src/box/lua/feedback_daemon.lua
@@ -1,5 +1,6 @@
 -- feedback_daemon.lua (internal file)
 --
+local ffi   = require('ffi')
 local log   = require('log')
 local json  = require('json')
 local fiber = require('fiber')
@@ -360,12 +361,18 @@ local function reload(self)
     self:start()
 end
 
+ffi.cdef[[
+void
+crash_cfg(void);
+]]
+
 setmetatable(daemon, {
     __index = {
         set_feedback_params = function()
             daemon.enabled  = box.cfg.feedback_enabled
             daemon.host     = box.cfg.feedback_host
             daemon.interval = box.cfg.feedback_interval
+            ffi.C.crash_cfg()
             reload(daemon)
             return
         end,
diff --git a/src/box/lua/load_cfg.lua b/src/box/lua/load_cfg.lua
index 770442052..1878c31bd 100644
--- a/src/box/lua/load_cfg.lua
+++ b/src/box/lua/load_cfg.lua
@@ -99,6 +99,7 @@ local default_cfg = {
     replication_skip_conflict = false,
     replication_anon      = false,
     feedback_enabled      = true,
+    feedback_no_crashinfo = false,
     feedback_host         = "https://feedback.tarantool.io",
     feedback_interval     = 3600,
     net_msg_max           = 768,
@@ -179,6 +180,7 @@ local template_cfg = {
     replication_skip_conflict = 'boolean',
     replication_anon      = 'boolean',
     feedback_enabled      = ifdef_feedback('boolean'),
+    feedback_no_crashinfo = ifdef_feedback('boolean'),
     feedback_host         = ifdef_feedback('string'),
     feedback_interval     = ifdef_feedback('number'),
     net_msg_max           = 'number',
@@ -277,6 +279,7 @@ local dynamic_cfg = {
     checkpoint_wal_threshold = private.cfg_set_checkpoint_wal_threshold,
     worker_pool_threads     = private.cfg_set_worker_pool_threads,
     feedback_enabled        = ifdef_feedback_set_params,
+    feedback_no_crashinfo   = ifdef_feedback_set_params,
     feedback_host           = ifdef_feedback_set_params,
     feedback_interval       = ifdef_feedback_set_params,
     -- do nothing, affects new replicas, which query this value on start
diff --git a/src/exports.h b/src/exports.h
index 867a027dc..a91edbc1a 100644
--- a/src/exports.h
+++ b/src/exports.h
@@ -113,6 +113,7 @@ EXPORT(coio_wait)
 EXPORT(console_get_output_format)
 EXPORT(console_set_output_format)
 EXPORT(cord_slab_cache)
+EXPORT(crash_cfg)
 EXPORT(crc32_calc)
 EXPORT(crypto_EVP_MD_CTX_free)
 EXPORT(crypto_EVP_MD_CTX_new)
diff --git a/src/lib/core/CMakeLists.txt b/src/lib/core/CMakeLists.txt
index 06b2b91e1..d057c855d 100644
--- a/src/lib/core/CMakeLists.txt
+++ b/src/lib/core/CMakeLists.txt
@@ -40,7 +40,7 @@ endif()
 
 add_library(core STATIC ${core_sources})
 
-target_link_libraries(core salad small uri decNumber ${LIBEV_LIBRARIES}
+target_link_libraries(core salad small uri decNumber misc ${LIBEV_LIBRARIES}
                       ${LIBEIO_LIBRARIES} ${LIBCORO_LIBRARIES}
                       ${MSGPUCK_LIBRARIES} ${ICU_LIBRARIES})
 
diff --git a/src/lib/core/crash.c b/src/lib/core/crash.c
index 9572a023c..06cc240a4 100644
--- a/src/lib/core/crash.c
+++ b/src/lib/core/crash.c
@@ -7,16 +7,28 @@
 #include <string.h>
 #include <unistd.h>
 #include <time.h>
+#include <sys/types.h>
+#include <sys/utsname.h>
 
+#include "third_party/base64.h"
+#include "small/static.h"
 #include "trivia/util.h"
 #include "backtrace.h"
 #include "crash.h"
+#include "cfg.h"
 #include "say.h"
 
 #define pr_fmt(fmt)		"crash: " fmt
+#define pr_debug(fmt, ...)	say_debug(pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_info(fmt, ...)	say_info(pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_err(fmt, ...)	say_error(pr_fmt(fmt), ##__VA_ARGS__)
 #define pr_syserr(fmt, ...)	say_syserror(pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_crit(fmt, ...)	fprintf(stderr, pr_fmt(fmt) "\n", ##__VA_ARGS__)
 #define pr_panic(fmt, ...)	panic(pr_fmt(fmt), ##__VA_ARGS__)
 
+/* Use strlcpy with destination as an array */
+#define strlcpy_a(dst, src) strlcpy(dst, src, sizeof(dst))
+
 #ifdef TARGET_OS_LINUX
 #ifndef __x86_64__
 # error "Non x86-64 architectures are not supported"
@@ -67,6 +79,10 @@ static struct crash_info {
 	 */
 	struct crash_greg greg;
 #endif
+	/**
+	 * Timestamp in nanoseconds (realtime).
+	 */
+	uint64_t timestamp_rt;
 	/**
 	 * Faulting address.
 	 */
@@ -88,6 +104,72 @@ static struct crash_info {
 #endif
 } crash_info;
 
+static char tarantool_path[PATH_MAX];
+static char feedback_host[2048];
+static bool send_crashinfo = true;
+
+static inline uint64_t
+timespec_to_ns(struct timespec *ts)
+{
+	return (uint64_t)ts->tv_sec * 1000000000 + (uint64_t)ts->tv_nsec;
+}
+
+static char *
+ns_to_localtime(uint64_t timestamp, char *buf, ssize_t len)
+{
+	time_t sec = timestamp / 1000000000;
+	char *start = buf;
+	struct tm tm;
+
+	/*
+	 * Use similar format as say_x logger. Except plain
+	 * seconds should be enough.
+	 */
+	localtime_r(&sec, &tm);
+	ssize_t total = strftime(start, len, "%F %T %Z", &tm);
+	start += total;
+	if (total < len)
+		return buf;
+	buf[len - 1] = '\0';
+	return buf;
+}
+
+void
+crash_init(const char *tarantool_bin)
+{
+	strlcpy_a(tarantool_path, tarantool_bin);
+	if (strlen(tarantool_path) < strlen(tarantool_bin))
+		pr_panic("executable path is trimmed");
+}
+
+void
+crash_cfg(void)
+{
+	const char *host = cfg_gets("feedback_host");
+	bool is_enabled = cfg_getb("feedback_enabled");
+	bool no_crashinfo = cfg_getb("feedback_no_crashinfo");
+
+	if (host == NULL || !is_enabled || no_crashinfo) {
+		if (send_crashinfo) {
+			pr_info("disable sedning crashinfo feedback");
+			send_crashinfo = false;
+			feedback_host[0] = '\0';
+		}
+		return;
+	}
+
+	if (strcmp(feedback_host, host) != 0) {
+		strlcpy_a(feedback_host, host);
+		if (strlen(feedback_host) < strlen(host))
+			pr_panic("feedback_host is too long");
+	}
+
+	if (!send_crashinfo) {
+		pr_info("enable sedning crashinfo feedback");
+		send_crashinfo = true;
+	}
+}
+
 /**
  * The routine is called inside crash signal handler so
  * be careful to not cause additional signals inside.
@@ -96,6 +178,12 @@ static struct crash_info *
 crash_collect(int signo, siginfo_t *siginfo, void *ucontext)
 {
 	struct crash_info *cinfo = &crash_info;
+	struct timespec ts;
+
+	if (clock_gettime(CLOCK_REALTIME, &ts) == 0)
+		cinfo->timestamp_rt = timespec_to_ns(&ts);
+	else
+		cinfo->timestamp_rt = 0;
 
 	cinfo->signo = signo;
 	cinfo->sicode = siginfo->si_code;
@@ -126,6 +214,179 @@ crash_collect(int signo, siginfo_t *siginfo, void *ucontext)
 	return cinfo;
 }
 
+/**
+ * Report crash information to the feedback daemon
+ * (ie send it to feedback daemon).
+ */
+static void
+crash_report_feedback_daemon(struct crash_info *cinfo)
+{
+	/*
+	 * Update to a new number if format get changed.
+	 */
+	static const int crashinfo_version = 1;
+
+	char *p = static_alloc(SMALL_STATIC_SIZE);
+	char *e = &p[SMALL_STATIC_SIZE - 1];
+	char *head = p;
+	char *tail = &p[SMALL_STATIC_SIZE - 8];
+
+	/*
+	 * Note that while we encode the report we
+	 * intensively use a tail of the allocated
+	 * buffer as a temporary store.
+	 */
+
+#define snprintf_safe(fmt, ...)					\
+	do {							\
+		p += snprintf(p, e - p, fmt, ##__VA_ARGS__);	\
+		if (p >= e)					\
+			goto out;				\
+	} while (0)
+
+	/*
+	 * On linux there is new_utsname structure which
+	 * encodes each field to __NEW_UTS_LEN + 1 => 64 + 1 = 65.
+	 *
+	 * So lets just reserve more data in advance: 5 fields
+	 * 128 bytes each => 640 bytes.
+	 */
+	static_assert(sizeof(struct utsname) <= 640,
+		      "utsname is bigger than expected");
+
+	/*
+	 * Lets reuse tail of the buffer as a temp space.
+	 */
+	struct utsname *uname_ptr = (void *)&tail[-640];
+	if (p > (char *)(void *)uname_ptr)
+		goto out;
+
+	if (uname(uname_ptr) != 0) {
+		pr_syserr("uname call failed, ignore");
+		memset(uname_ptr, 0, 640);
+	}
+
+	snprintf_safe("{");
+	snprintf_safe("\"uname\":{");
+	snprintf_safe("\"sysname\":\"%s\",", uname_ptr->sysname);
+	/*
+	 * nodename might a sensitive information, skip.
+	 */
+	snprintf_safe("\"release\":\"%s\",", uname_ptr->release);
+	snprintf_safe("\"version\":\"%s\",", uname_ptr->version);
+	snprintf_safe("\"machine\":\"%s\"", uname_ptr->machine);
+	snprintf_safe("},");
+
+	if (p >= (char *)(void *)uname_ptr)
+		goto out;
+
+	snprintf_safe("\"build\":{");
+	snprintf_safe("\"version\":\"%s\",", PACKAGE_VERSION);
+	snprintf_safe("\"cmake_type\":\"%s\"", BUILD_INFO);
+	snprintf_safe("},");
+
+	snprintf_safe("\"signal\":{");
+	snprintf_safe("\"signo\":%d,", cinfo->signo);
+	snprintf_safe("\"si_code\":%d,", cinfo->sicode);
+	if (cinfo->signo == SIGSEGV) {
+		if (cinfo->sicode == SEGV_MAPERR) {
+			snprintf_safe("\"si_code_str\":\"%s\",",
+				      "SEGV_MAPERR");
+		} else if (cinfo->sicode == SEGV_ACCERR) {
+			snprintf_safe("\"si_code_str\":\"%s\",",
+				      "SEGV_ACCERR");
+		}
+		snprintf_safe("\"si_addr\":\"0x%llx\",",
+			      (long long)cinfo->siaddr);
+	}
+
+#ifdef ENABLE_BACKTRACE
+	/*
+	 * The backtrace itself is encoded into base64 form
+	 * since it might have arbitrary symbols not suitable
+	 * for json encoding (newlines and etc).
+	 */
+	size_t bt_len = strlen(cinfo->backtrace_buf);
+	size_t bt_elen = base64_bufsize(bt_len, BASE64_NOWRAP);
+	char *bt_base64 = &tail[-bt_elen-8];
+	if (p >= bt_base64)
+		goto out;
+	base64_encode(cinfo->backtrace_buf, bt_len,
+		      bt_base64, bt_elen, BASE64_NOWRAP);
+	bt_base64[bt_elen] = '\0';
+	snprintf_safe("\"backtrace\":\"%s\",", bt_base64);
+#endif
+
+	char *timestamp_rt_str = &tail[-128];
+	if (p >= timestamp_rt_str)
+		goto out;
+	ns_to_localtime(cinfo->timestamp_rt, timestamp_rt_str, 128);
+	snprintf_safe("\"timestamp\":\"%s\"", timestamp_rt_str);
+	snprintf_safe("}");
+	snprintf_safe("}");
+
+	size_t report_len = p - head;
+	size_t report_elen = base64_bufsize(report_len, BASE64_NOWRAP);
+
+	char *report_base64 = &tail[-report_elen-8];
+	if (p >= report_base64)
+		goto out;
+	base64_encode(head, report_len, report_base64,
+		      report_elen, BASE64_NOWRAP);
+	report_base64[report_elen] = '\0';
+
+	/*
+	 * Encoded report now sits at report_base64 position,
+	 * at the tail of 'small' static buffer. Lets prepare
+	 * the script to run (make sure the strings do not
+	 * overlap).
+	 */
+	p = head;
+	snprintf_safe("require(\'http.client\').post(\'%s\',"
+		      "'{\"crashdump\":{\"version\":\"%d\","
+		      "\"data\":", feedback_host,
+		      crashinfo_version);
+	if (report_base64 - p <= (long)(report_elen - 2))
+		goto out;
+	snprintf_safe("\"%s\"", report_base64);
+	snprintf_safe("}}',{timeout=1}); os.exit(1);");
+
+	pr_debug("crashinfo script: %s", head);
+
+	char *exec_argv[4] = {
+		[0] = tarantool_path,
+		[1] = "-e",
+		[2] = head,
+		[3] = NULL,
+	};
+
+	/*
+	 * Can't use fork here because libev has own
+	 * at_fork helpers with mutex where we might
+	 * stuck (see popen code).
+	 */
+	pid_t pid = vfork();
+	if (pid == 0) {
+		/*
+		 * The script must exit at the end but there
+		 * is no simple way to make sure from inside
+		 * of a signal crash handler. So just hope it
+		 * is running fine.
+		 */
+		execve(exec_argv[0], exec_argv, NULL);
+		pr_crit("exec(%s,[%s,%s,%s,NULL]) failed",
+			exec_argv[0], exec_argv[0],
+			exec_argv[1], exec_argv[2]);
+		_exit(1);
+	} else if (pid < 0) {
+		pr_crit("unable to vfork (errno %d)", errno);
+	}
+
+	return;
+out:
+	pr_crit("unable to prepare a crash report");
+}
+
 /**
  * Report crash information to the stderr
  * (usually a current console).
@@ -232,6 +493,8 @@ crash_signal_cb(int signo, siginfo_t *siginfo, void *context)
 		in_cb = 1;
 		cinfo = crash_collect(signo, siginfo, context);
 		crash_report_stderr(cinfo);
+		if (send_crashinfo)
+			crash_report_feedback_daemon(cinfo);
 	} else {
 		/* Got a signal while running the handler. */
 		fprintf(stderr, "Fatal %d while backtracing", signo);
diff --git a/src/lib/core/crash.h b/src/lib/core/crash.h
index d107cd953..b005e3ec4 100644
--- a/src/lib/core/crash.h
+++ b/src/lib/core/crash.h
@@ -15,12 +15,24 @@
 extern "C" {
 #endif /* defined(__cplusplus) */
 
+/**
+ * Initialize crash subsystem.
+ */
+void
+crash_init(const char *tarantool_bin);
+
 /**
  * Initialize crash signal handlers.
  */
 void
 crash_signal_init(void);
 
+/**
+ * Configure crash engine from box.cfg.
+ */
+void
+crash_cfg(void);
+
 /**
  * Reset crash signal handlers.
  */
diff --git a/src/main.cc b/src/main.cc
index 391e0f878..2fce81bb3 100644
--- a/src/main.cc
+++ b/src/main.cc
@@ -687,6 +687,7 @@ main(int argc, char **argv)
 		title_set_script_name(argv[0]);
 	}
 
+	crash_init(tarantool_bin);
 	export_syms();
 
 	random_init();
diff --git a/test/box/admin.result b/test/box/admin.result
index 8c5626c36..b8fa68749 100644
--- a/test/box/admin.result
+++ b/test/box/admin.result
@@ -47,6 +47,8 @@ cfg_filter(box.cfg)
     - https://feedback.tarantool.io
   - - feedback_interval
     - 3600
+  - - feedback_no_crashinfo
+    - false
   - - force_recovery
     - false
   - - hot_standby
diff --git a/test/box/cfg.result b/test/box/cfg.result
index 5ca6ce72b..68c45b4cc 100644
--- a/test/box/cfg.result
+++ b/test/box/cfg.result
@@ -35,6 +35,8 @@ cfg_filter(box.cfg)
  |     - https://feedback.tarantool.io
  |   - - feedback_interval
  |     - 3600
+ |   - - feedback_no_crashinfo
+ |     - false
  |   - - force_recovery
  |     - false
  |   - - hot_standby
@@ -148,6 +150,8 @@ cfg_filter(box.cfg)
  |     - https://feedback.tarantool.io
  |   - - feedback_interval
  |     - 3600
+ |   - - feedback_no_crashinfo
+ |     - false
  |   - - force_recovery
  |     - false
  |   - - hot_standby
-- 
2.26.2



More information about the Tarantool-patches mailing list