From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lj1-f177.google.com (mail-lj1-f177.google.com [209.85.208.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 9E93745C304 for ; Thu, 10 Dec 2020 19:19:24 +0300 (MSK) Received: by mail-lj1-f177.google.com with SMTP id m13so7290712ljo.11 for ; Thu, 10 Dec 2020 08:19:24 -0800 (PST) From: Cyrill Gorcunov Date: Thu, 10 Dec 2020 19:18:32 +0300 Message-Id: <20201210161832.729439-5-gorcunov@gmail.com> In-Reply-To: <20201210161832.729439-1-gorcunov@gmail.com> References: <20201210161832.729439-1-gorcunov@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Subject: [Tarantool-patches] [PATCH v4 4/4] crash: report crash data to the feedback server List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: tml Cc: Mons Anderson , Vladislav Shpilevoy We have a feedback server which gathers information about a running instnace. While general info is enough for now we may loose a precious information about crashes (such as call backtrace which caused the issue, type of build and etc). In the commit we add support of sending this kind of infomation to the feedback server. Internally we gather the reason of failure, pack it into base64 form and then run another tarantool instance which sends it out. A typical report might look like | { | "crashdump": { | "version": "1", | "data": "eyJ1bmFtZSI6eyJzeXNuYW1lIjoiTGludXgiLCJyZWxlYXNlIjoiNS" | "45LjExLTEwMC5mYzMyLng4Nl82NCIsInZlcnNpb24iOiIjMSBTTVAg" | "VHVlIE5vdiAyNCAxOToxNjo1MyBVVEMgMjAyMCIsIm1hY2hpbmUiOi" | "J4ODZfNjQifSwiYnVpbGQiOnsidmVyc2lvbiI6IjIuNy4wLTgzLWc5" | "ZTg1MDM4ZmIiLCJjbWFrZV90eXBlIjoiTGludXgteDg2XzY0LURlYn" | "VnIn0sInNpZ25hbCI6eyJzaWdubyI6MTEsInNpX2NvZGUiOjAsInNp" | "X2FkZHIiOiIweDNlODAwMDVjOGE5IiwiYmFja3RyYWNlIjoiSXpBZ0" | "lEQjROak14WkRBeklHbHVJR055WVhOb1gyTnZiR3hsWTNRclltWUtJ" | "ekVnSURCNE5qTXlaV1JoSUdsdUlHTnlZWE5vWDNOcFoyNWhiRjlqWW" | "lzMU1Rb2pNaUFnTUhnM1pqRmtORGt5TXpsaE9UQWdhVzRnWm5WdWJH" | "OWphMlpwYkdVck5qQUtJek1nSURCNE4yWXhaRFE0WkdReFl6VmxJR2" | "x1SUdWd2IyeHNYM2RoYVhRck5XVUtJelFnSURCNE9ETTNOekV6SUds" | "dUlHVndiMnhzWDNCdmJHd3JPVGtLSXpVZ0lEQjRPRE5oWkRRNElHbH" | "VJR1YyWDNKMWJpczBPRGNLSXpZZ0lEQjROakJtTUdGbElHbHVJSFJo" | "Y21GdWRHOXZiRjlzZFdGZmNuVnVYM05qY21sd2RDc3hNemdLSXpjZ0" | "lEQjRORFEyTldZMUlHbHVJRzFoYVc0ck5XRmpDaU00SUNBd2VEZG1N" | "V1EwT0dObU56QTBNaUJwYmlCZlgyeHBZbU5mYzNSaGNuUmZiV0ZwYm" | "l0bU1nb2pPU0FnTUhnME5EUmhNR1VnYVc0Z1gzTjBZWEowS3pKbENn" | "PT0iLCJ0aW1lc3RhbXAiOiIyMDIwLTEyLTEwIDE3OjA4OjQyIE1TSyJ9fQ==" | } | } The `data` value is a single string I wrapped it for commit message only. When `data` is decoded it consists of | { | "uname": { | "sysname": "Linux", | "release": "5.9.11-100.fc32.x86_64", | "version": "#1 SMP Tue Nov 24 19:16:53 UTC 2020", | "machine": "x86_64" | }, | "build": { | "version": "2.7.0-83-g9e85038fb", | "cmake_type": "Linux-x86_64-Debug" | }, | "signal": { | "signo": 11, | "si_code": 0, | "si_addr": "0x3e80005c8a9", | "backtrace": "IzAgIDB4NjMxZDAzIGluIGNyYXNoX2NvbGxlY3QrYmYKIzEgID" | "B4NjMyZWRhIGluIGNyYXNoX3NpZ25hbF9jYis1MQojMiAgMHg3ZjFkNDkyMzlhO" | "TAgaW4gZnVubG9ja2ZpbGUrNjAKIzMgIDB4N2YxZDQ4ZGQxYzVlIGluIGVwb2xs" | "X3dhaXQrNWUKIzQgIDB4ODM3NzEzIGluIGVwb2xsX3BvbGwrOTkKIzUgIDB4ODN" | "hZDQ4IGluIGV2X3J1bis0ODcKIzYgIDB4NjBmMGFlIGluIHRhcmFudG9vbF9sdW" | "FfcnVuX3NjcmlwdCsxMzgKIzcgIDB4NDQ2NWY1IGluIG1haW4rNWFjCiM4ICAwe" | "DdmMWQ0OGNmNzA0MiBpbiBfX2xpYmNfc3RhcnRfbWFpbitmMgojOSAgMHg0NDRh" | "MGUgaW4gX3N0YXJ0KzJlCg==", | "timestamp": "2020-12-10 17:08:42 MSK" | } | } The `backtrace` istself is encoded as base64 because of newline symbols (and may comprise some nonascii symbols as well). There is no simple way to test this so I did it manually: 1) Run instance with box.cfg{log_level = 8, feedback_host="127.0.0.1:1500"} 2) Run listener shell as while true ; do nc -l -p 1500 -c 'echo -e "HTTP/1.1 200 OK\n\n $(date)"'; done 3) Send SIGSEGV kill -11 `pidof tarantool` Once SIGSEGV is delivered the crashinfo data is generated and sent out. For debug purpose this data is also printed to the terminal on debug log level. Closes #5261 Signed-off-by: Cyrill Gorcunov @TarantoolBot document Title: Configuration update, allow to disable sending crash information For better analisys of program crashes the information associated with the crash such as - utsname (similar to `uname -a` output except the network name) - build information - reason for a crash - call backtrace is sent to the feedback server. To disable it set `feedback_no_crashinfo` to `false`. --- src/box/lua/feedback_daemon.lua | 7 + src/box/lua/load_cfg.lua | 3 + src/exports.h | 1 + src/lib/core/CMakeLists.txt | 2 +- src/lib/core/crash.c | 263 ++++++++++++++++++++++++++++++++ src/lib/core/crash.h | 12 ++ src/main.cc | 1 + test/box/admin.result | 2 + test/box/cfg.result | 4 + 9 files changed, 294 insertions(+), 1 deletion(-) diff --git a/src/box/lua/feedback_daemon.lua b/src/box/lua/feedback_daemon.lua index 1d39012ed..ae4d1a25f 100644 --- a/src/box/lua/feedback_daemon.lua +++ b/src/box/lua/feedback_daemon.lua @@ -1,5 +1,6 @@ -- feedback_daemon.lua (internal file) -- +local ffi = require('ffi') local log = require('log') local json = require('json') local fiber = require('fiber') @@ -360,12 +361,18 @@ local function reload(self) self:start() end +ffi.cdef[[ +void +crash_cfg(void); +]] + setmetatable(daemon, { __index = { set_feedback_params = function() daemon.enabled = box.cfg.feedback_enabled daemon.host = box.cfg.feedback_host daemon.interval = box.cfg.feedback_interval + ffi.C.crash_cfg() reload(daemon) return end, diff --git a/src/box/lua/load_cfg.lua b/src/box/lua/load_cfg.lua index 770442052..1878c31bd 100644 --- a/src/box/lua/load_cfg.lua +++ b/src/box/lua/load_cfg.lua @@ -99,6 +99,7 @@ local default_cfg = { replication_skip_conflict = false, replication_anon = false, feedback_enabled = true, + feedback_no_crashinfo = false, feedback_host = "https://feedback.tarantool.io", feedback_interval = 3600, net_msg_max = 768, @@ -179,6 +180,7 @@ local template_cfg = { replication_skip_conflict = 'boolean', replication_anon = 'boolean', feedback_enabled = ifdef_feedback('boolean'), + feedback_no_crashinfo = ifdef_feedback('boolean'), feedback_host = ifdef_feedback('string'), feedback_interval = ifdef_feedback('number'), net_msg_max = 'number', @@ -277,6 +279,7 @@ local dynamic_cfg = { checkpoint_wal_threshold = private.cfg_set_checkpoint_wal_threshold, worker_pool_threads = private.cfg_set_worker_pool_threads, feedback_enabled = ifdef_feedback_set_params, + feedback_no_crashinfo = ifdef_feedback_set_params, feedback_host = ifdef_feedback_set_params, feedback_interval = ifdef_feedback_set_params, -- do nothing, affects new replicas, which query this value on start diff --git a/src/exports.h b/src/exports.h index 867a027dc..a91edbc1a 100644 --- a/src/exports.h +++ b/src/exports.h @@ -113,6 +113,7 @@ EXPORT(coio_wait) EXPORT(console_get_output_format) EXPORT(console_set_output_format) EXPORT(cord_slab_cache) +EXPORT(crash_cfg) EXPORT(crc32_calc) EXPORT(crypto_EVP_MD_CTX_free) EXPORT(crypto_EVP_MD_CTX_new) diff --git a/src/lib/core/CMakeLists.txt b/src/lib/core/CMakeLists.txt index 06b2b91e1..d057c855d 100644 --- a/src/lib/core/CMakeLists.txt +++ b/src/lib/core/CMakeLists.txt @@ -40,7 +40,7 @@ endif() add_library(core STATIC ${core_sources}) -target_link_libraries(core salad small uri decNumber ${LIBEV_LIBRARIES} +target_link_libraries(core salad small uri decNumber misc ${LIBEV_LIBRARIES} ${LIBEIO_LIBRARIES} ${LIBCORO_LIBRARIES} ${MSGPUCK_LIBRARIES} ${ICU_LIBRARIES}) diff --git a/src/lib/core/crash.c b/src/lib/core/crash.c index 9572a023c..06cc240a4 100644 --- a/src/lib/core/crash.c +++ b/src/lib/core/crash.c @@ -7,16 +7,28 @@ #include #include #include +#include +#include +#include "third_party/base64.h" +#include "small/static.h" #include "trivia/util.h" #include "backtrace.h" #include "crash.h" +#include "cfg.h" #include "say.h" #define pr_fmt(fmt) "crash: " fmt +#define pr_debug(fmt, ...) say_debug(pr_fmt(fmt), ##__VA_ARGS__) +#define pr_info(fmt, ...) say_info(pr_fmt(fmt), ##__VA_ARGS__) +#define pr_err(fmt, ...) say_error(pr_fmt(fmt), ##__VA_ARGS__) #define pr_syserr(fmt, ...) say_syserror(pr_fmt(fmt), ##__VA_ARGS__) +#define pr_crit(fmt, ...) fprintf(stderr, pr_fmt(fmt) "\n", ##__VA_ARGS__) #define pr_panic(fmt, ...) panic(pr_fmt(fmt), ##__VA_ARGS__) +/* Use strlcpy with destination as an array */ +#define strlcpy_a(dst, src) strlcpy(dst, src, sizeof(dst)) + #ifdef TARGET_OS_LINUX #ifndef __x86_64__ # error "Non x86-64 architectures are not supported" @@ -67,6 +79,10 @@ static struct crash_info { */ struct crash_greg greg; #endif + /** + * Timestamp in nanoseconds (realtime). + */ + uint64_t timestamp_rt; /** * Faulting address. */ @@ -88,6 +104,72 @@ static struct crash_info { #endif } crash_info; +static char tarantool_path[PATH_MAX]; +static char feedback_host[2048]; +static bool send_crashinfo = true; + +static inline uint64_t +timespec_to_ns(struct timespec *ts) +{ + return (uint64_t)ts->tv_sec * 1000000000 + (uint64_t)ts->tv_nsec; +} + +static char * +ns_to_localtime(uint64_t timestamp, char *buf, ssize_t len) +{ + time_t sec = timestamp / 1000000000; + char *start = buf; + struct tm tm; + + /* + * Use similar format as say_x logger. Except plain + * seconds should be enough. + */ + localtime_r(&sec, &tm); + ssize_t total = strftime(start, len, "%F %T %Z", &tm); + start += total; + if (total < len) + return buf; + buf[len - 1] = '\0'; + return buf; +} + +void +crash_init(const char *tarantool_bin) +{ + strlcpy_a(tarantool_path, tarantool_bin); + if (strlen(tarantool_path) < strlen(tarantool_bin)) + pr_panic("executable path is trimmed"); +} + +void +crash_cfg(void) +{ + const char *host = cfg_gets("feedback_host"); + bool is_enabled = cfg_getb("feedback_enabled"); + bool no_crashinfo = cfg_getb("feedback_no_crashinfo"); + + if (host == NULL || !is_enabled || no_crashinfo) { + if (send_crashinfo) { + pr_info("disable sedning crashinfo feedback"); + send_crashinfo = false; + feedback_host[0] = '\0'; + } + return; + } + + if (strcmp(feedback_host, host) != 0) { + strlcpy_a(feedback_host, host); + if (strlen(feedback_host) < strlen(host)) + pr_panic("feedback_host is too long"); + } + + if (!send_crashinfo) { + pr_info("enable sedning crashinfo feedback"); + send_crashinfo = true; + } +} + /** * The routine is called inside crash signal handler so * be careful to not cause additional signals inside. @@ -96,6 +178,12 @@ static struct crash_info * crash_collect(int signo, siginfo_t *siginfo, void *ucontext) { struct crash_info *cinfo = &crash_info; + struct timespec ts; + + if (clock_gettime(CLOCK_REALTIME, &ts) == 0) + cinfo->timestamp_rt = timespec_to_ns(&ts); + else + cinfo->timestamp_rt = 0; cinfo->signo = signo; cinfo->sicode = siginfo->si_code; @@ -126,6 +214,179 @@ crash_collect(int signo, siginfo_t *siginfo, void *ucontext) return cinfo; } +/** + * Report crash information to the feedback daemon + * (ie send it to feedback daemon). + */ +static void +crash_report_feedback_daemon(struct crash_info *cinfo) +{ + /* + * Update to a new number if format get changed. + */ + static const int crashinfo_version = 1; + + char *p = static_alloc(SMALL_STATIC_SIZE); + char *e = &p[SMALL_STATIC_SIZE - 1]; + char *head = p; + char *tail = &p[SMALL_STATIC_SIZE - 8]; + + /* + * Note that while we encode the report we + * intensively use a tail of the allocated + * buffer as a temporary store. + */ + +#define snprintf_safe(fmt, ...) \ + do { \ + p += snprintf(p, e - p, fmt, ##__VA_ARGS__); \ + if (p >= e) \ + goto out; \ + } while (0) + + /* + * On linux there is new_utsname structure which + * encodes each field to __NEW_UTS_LEN + 1 => 64 + 1 = 65. + * + * So lets just reserve more data in advance: 5 fields + * 128 bytes each => 640 bytes. + */ + static_assert(sizeof(struct utsname) <= 640, + "utsname is bigger than expected"); + + /* + * Lets reuse tail of the buffer as a temp space. + */ + struct utsname *uname_ptr = (void *)&tail[-640]; + if (p > (char *)(void *)uname_ptr) + goto out; + + if (uname(uname_ptr) != 0) { + pr_syserr("uname call failed, ignore"); + memset(uname_ptr, 0, 640); + } + + snprintf_safe("{"); + snprintf_safe("\"uname\":{"); + snprintf_safe("\"sysname\":\"%s\",", uname_ptr->sysname); + /* + * nodename might a sensitive information, skip. + */ + snprintf_safe("\"release\":\"%s\",", uname_ptr->release); + snprintf_safe("\"version\":\"%s\",", uname_ptr->version); + snprintf_safe("\"machine\":\"%s\"", uname_ptr->machine); + snprintf_safe("},"); + + if (p >= (char *)(void *)uname_ptr) + goto out; + + snprintf_safe("\"build\":{"); + snprintf_safe("\"version\":\"%s\",", PACKAGE_VERSION); + snprintf_safe("\"cmake_type\":\"%s\"", BUILD_INFO); + snprintf_safe("},"); + + snprintf_safe("\"signal\":{"); + snprintf_safe("\"signo\":%d,", cinfo->signo); + snprintf_safe("\"si_code\":%d,", cinfo->sicode); + if (cinfo->signo == SIGSEGV) { + if (cinfo->sicode == SEGV_MAPERR) { + snprintf_safe("\"si_code_str\":\"%s\",", + "SEGV_MAPERR"); + } else if (cinfo->sicode == SEGV_ACCERR) { + snprintf_safe("\"si_code_str\":\"%s\",", + "SEGV_ACCERR"); + } + snprintf_safe("\"si_addr\":\"0x%llx\",", + (long long)cinfo->siaddr); + } + +#ifdef ENABLE_BACKTRACE + /* + * The backtrace itself is encoded into base64 form + * since it might have arbitrary symbols not suitable + * for json encoding (newlines and etc). + */ + size_t bt_len = strlen(cinfo->backtrace_buf); + size_t bt_elen = base64_bufsize(bt_len, BASE64_NOWRAP); + char *bt_base64 = &tail[-bt_elen-8]; + if (p >= bt_base64) + goto out; + base64_encode(cinfo->backtrace_buf, bt_len, + bt_base64, bt_elen, BASE64_NOWRAP); + bt_base64[bt_elen] = '\0'; + snprintf_safe("\"backtrace\":\"%s\",", bt_base64); +#endif + + char *timestamp_rt_str = &tail[-128]; + if (p >= timestamp_rt_str) + goto out; + ns_to_localtime(cinfo->timestamp_rt, timestamp_rt_str, 128); + snprintf_safe("\"timestamp\":\"%s\"", timestamp_rt_str); + snprintf_safe("}"); + snprintf_safe("}"); + + size_t report_len = p - head; + size_t report_elen = base64_bufsize(report_len, BASE64_NOWRAP); + + char *report_base64 = &tail[-report_elen-8]; + if (p >= report_base64) + goto out; + base64_encode(head, report_len, report_base64, + report_elen, BASE64_NOWRAP); + report_base64[report_elen] = '\0'; + + /* + * Encoded report now sits at report_base64 position, + * at the tail of 'small' static buffer. Lets prepare + * the script to run (make sure the strings do not + * overlap). + */ + p = head; + snprintf_safe("require(\'http.client\').post(\'%s\'," + "'{\"crashdump\":{\"version\":\"%d\"," + "\"data\":", feedback_host, + crashinfo_version); + if (report_base64 - p <= (long)(report_elen - 2)) + goto out; + snprintf_safe("\"%s\"", report_base64); + snprintf_safe("}}',{timeout=1}); os.exit(1);"); + + pr_debug("crashinfo script: %s", head); + + char *exec_argv[4] = { + [0] = tarantool_path, + [1] = "-e", + [2] = head, + [3] = NULL, + }; + + /* + * Can't use fork here because libev has own + * at_fork helpers with mutex where we might + * stuck (see popen code). + */ + pid_t pid = vfork(); + if (pid == 0) { + /* + * The script must exit at the end but there + * is no simple way to make sure from inside + * of a signal crash handler. So just hope it + * is running fine. + */ + execve(exec_argv[0], exec_argv, NULL); + pr_crit("exec(%s,[%s,%s,%s,NULL]) failed", + exec_argv[0], exec_argv[0], + exec_argv[1], exec_argv[2]); + _exit(1); + } else if (pid < 0) { + pr_crit("unable to vfork (errno %d)", errno); + } + + return; +out: + pr_crit("unable to prepare a crash report"); +} + /** * Report crash information to the stderr * (usually a current console). @@ -232,6 +493,8 @@ crash_signal_cb(int signo, siginfo_t *siginfo, void *context) in_cb = 1; cinfo = crash_collect(signo, siginfo, context); crash_report_stderr(cinfo); + if (send_crashinfo) + crash_report_feedback_daemon(cinfo); } else { /* Got a signal while running the handler. */ fprintf(stderr, "Fatal %d while backtracing", signo); diff --git a/src/lib/core/crash.h b/src/lib/core/crash.h index d107cd953..b005e3ec4 100644 --- a/src/lib/core/crash.h +++ b/src/lib/core/crash.h @@ -15,12 +15,24 @@ extern "C" { #endif /* defined(__cplusplus) */ +/** + * Initialize crash subsystem. + */ +void +crash_init(const char *tarantool_bin); + /** * Initialize crash signal handlers. */ void crash_signal_init(void); +/** + * Configure crash engine from box.cfg. + */ +void +crash_cfg(void); + /** * Reset crash signal handlers. */ diff --git a/src/main.cc b/src/main.cc index 391e0f878..2fce81bb3 100644 --- a/src/main.cc +++ b/src/main.cc @@ -687,6 +687,7 @@ main(int argc, char **argv) title_set_script_name(argv[0]); } + crash_init(tarantool_bin); export_syms(); random_init(); diff --git a/test/box/admin.result b/test/box/admin.result index 8c5626c36..b8fa68749 100644 --- a/test/box/admin.result +++ b/test/box/admin.result @@ -47,6 +47,8 @@ cfg_filter(box.cfg) - https://feedback.tarantool.io - - feedback_interval - 3600 + - - feedback_no_crashinfo + - false - - force_recovery - false - - hot_standby diff --git a/test/box/cfg.result b/test/box/cfg.result index 5ca6ce72b..68c45b4cc 100644 --- a/test/box/cfg.result +++ b/test/box/cfg.result @@ -35,6 +35,8 @@ cfg_filter(box.cfg) | - https://feedback.tarantool.io | - - feedback_interval | - 3600 + | - - feedback_no_crashinfo + | - false | - - force_recovery | - false | - - hot_standby @@ -148,6 +150,8 @@ cfg_filter(box.cfg) | - https://feedback.tarantool.io | - - feedback_interval | - 3600 + | - - feedback_no_crashinfo + | - false | - - force_recovery | - false | - - hot_standby -- 2.26.2