[Tarantool-patches] [PATCH v4 4/4] crash: report crash data to the feedback server
Vladislav Shpilevoy
v.shpilevoy at tarantool.org
Sun Dec 20 17:48:14 MSK 2020
>>> | "backtrace": "IzAgIDB4NjMxZDAzIGluIGNyYXNoX2NvbGxlY3QrYmYKIzEgID"
>>> | "B4NjMyZWRhIGluIGNyYXNoX3NpZ25hbF9jYis1MQojMiAgMHg3ZjFkNDkyMzlhO"
>>> | "TAgaW4gZnVubG9ja2ZpbGUrNjAKIzMgIDB4N2YxZDQ4ZGQxYzVlIGluIGVwb2xs"
>>> | "X3dhaXQrNWUKIzQgIDB4ODM3NzEzIGluIGVwb2xsX3BvbGwrOTkKIzUgIDB4ODN"
>>> | "hZDQ4IGluIGV2X3J1bis0ODcKIzYgIDB4NjBmMGFlIGluIHRhcmFudG9vbF9sdW"
>>> | "FfcnVuX3NjcmlwdCsxMzgKIzcgIDB4NDQ2NWY1IGluIG1haW4rNWFjCiM4ICAwe"
>>> | "DdmMWQ0OGNmNzA0MiBpbiBfX2xpYmNfc3RhcnRfbWFpbitmMgojOSAgMHg0NDRh"
>>> | "MGUgaW4gX3N0YXJ0KzJlCg==",
>>> | "timestamp": "2020-12-10 17:08:42 MSK"
>>> | }
>>> | }
>>
>> 3. And what is the reason you encode it all into base64 second time?
>
> Because backtrace contains newlines and other special symbols, this is very inconvenient
> when you print the json-dump into console.
I didn't ask about backtrace base64. I ask about why do you encode crashdump.data
field into base64 again. Currently you do base64 encoding 2 times: one for
signal.backtrace field, and second time you encode the whole
{uname: ..., build: ..., signal: ...}. Signal.backtrace here is already in base64.
I ask about the second encoding.
>>> +
>>> + char *p = static_alloc(SMALL_STATIC_SIZE);
>>> + char *e = &p[SMALL_STATIC_SIZE - 1];
>>> + char *head = p;
>>> + char *tail = &p[SMALL_STATIC_SIZE - 8];
>>
>> 10. Why -8?
>
> To have at least a gap before the end of a buffer.
Sorry, but this is not an answer. Why not -7, -100, -3?
I don't see where these 8 bytes are used for anything,
even as a special 8-byte gap.
>>> +
>>> + /*
>>> + * Lets reuse tail of the buffer as a temp space.
>>> + */
>>> + struct utsname *uname_ptr = (void *)&tail[-sizeof(struct utsname)];
>>> + if (p >= (char *)(void *)uname_ptr)
>>
>> 11. Why do you need double cast? I can't imagine that it does anything
>> except adds confusion.
>
> Compiler doesn't like to compare a char pointer with a structure pointer
>
> src/lib/core/crash.c:251:8: error: comparison of distinct pointer types lacks a cast [-Werror]
> 251 | if (p >= (void *)uname_ptr)
> | ^~
Can you just use (char *) cast instead of (void *)? Could you simply
try that on your own without me suggesting it?
>>> + if (uname(uname_ptr) != 0) {
>>> + pr_syserr("uname call failed, ignore");
>>> + memset(uname_ptr, 0, sizeof(struct utsname));
>>> + }
>>> +
>>> + snprintf_safe("{");
>>
>> 12. snprintf_safe uses memory [begin, e]. Where 'begin' is the initial
>> value of 'p'. 'uname_ptr' occupies [e - 7 - sizeof(*uname_ptr), e - 7].
>> Assume static size is 12KB, and sizeof(*uname_ptr) is 640.
>>
>> [begin = 0, snprintf e = 12287]
>> [11640 uname 12280]
>>
>> So snprintf and uname buffers intersect. If there is a big enough dump,
>> it will corrupt uname object.
>>
>> Is it correct?
>
> This is undefined behaviour. I though about it and I don't like this but
> there is no other way if we want to use a single buffer for all. This was
> the reason while I've been using preallocated memory for all this and you
> said to reuse static alloc instead. Another option is to adjust @e pointer
> to be sure that we're not jumping into the tail. Letme think maybe I manage
> to make this more clear and readable...
It is not about clear/readable, for now. You just use buffer which can be
corrupted. It is a bug. Regardless of how clear will it look.
>>> + if (cinfo->signo == SIGSEGV) {
>>> + if (cinfo->sicode == SEGV_MAPERR) {
>>> + snprintf_safe("\"si_code_str\":\"%s\",",
>>> + "SEGV_MAPERR");
>>> + } else if (cinfo->sicode == SEGV_ACCERR) {
>>> + snprintf_safe("\"si_code_str\":\"%s\",",
>>> + "SEGV_ACCERR");
>>> + }
>>> + snprintf_safe("\"si_addr\":\"0x%llx\",",
>>> + (long long)cinfo->siaddr);
>>> + }
>>
>> 13. Instead of using all these global many KB buffers you
>> could also create a pipe to the new process, which the
>> child would read as stdin from Lua, without any tiny limits
>> on how many KBs it can use. And do write() to it directly.
>
> Pipes use kernel memory for own buffers and what is worse this
> memory is not accountable as far as I remember.
I don't see any problem with kernel using memory to support pipes.
But I do see a problem in the code above, which can simply return
garbage or crash again if the dump will be big enough.
>>> +
>>> + char *timestamp_rt_str = &tail[-128];
>>> + if (p >= timestamp_rt_str)
>>> + goto out;
>>> + ns_to_localtime(cinfo->timestamp_rt, timestamp_rt_str, 128);
>>
>> 14. Why 128? And why is it hardcoded?
>
> This should fit any possible timestring I think. The timestring is bound
> to locale and I think it might vary in size. I choose a big value but
> to be honest I'm not 100% if this will be enough forever.
Ok, then the second question remains - why is it hardcoded in 2 places?
I remember you sent some commits to the other source files where you
changed such places to use sizeof() or const variable or something like
that. How is it different here?
>>> + snprintf_safe("\"timestamp\":\"%s\"", timestamp_rt_str);
>>> + snprintf_safe("}");
>>> + snprintf_safe("}");
>>> +
>>> + size_t report_len = p - head;
>>> + size_t report_elen = base64_bufsize(report_len, BASE64_NOWRAP);
>>> +
>>> + char *report_base64 = &tail[-report_elen-8];
>>
>> 15. Why -8?
>
> Already explained.
Didn't understand, sorry.
>>> + pid_t pid = vfork();
>>> + if (pid == 0) {
>>> + /*
>>> + * The script must exit at the end but there
>>> + * is no simple way to make sure from inside
>>> + * of a signal crash handler. So just hope it
>>> + * is running fine.
>>> + */
>>> + execve(exec_argv[0], exec_argv, NULL);
>>> + pr_crit("exec(%s,[%s,%s,%s,NULL]) failed",
>>> + exec_argv[0], exec_argv[0],
>>> + exec_argv[1], exec_argv[2]);
>>
>> 17. What will happen if this tarantool also will crash? Will there
>> be an infinite chain of crashes? For instance, if we will have an
>> issue with curl again. Or somebody else will build with his own
>> curl leading to a crash on any usage, like it already happened one
>> time.
>
> Strictly speaking -- yes, you're right. We might have a chain of crashes
> and I think we might need a command line option for this?
If it will crash before checking the command line option, it won't
help.
I see that feedback_daemon is disabled until first box.cfg. I suggest
you do the same here - disable crash sending by default. Enable it
with first box.cfg. In your script you don't use box.cfg, so a chain
crash won't happen. Unless some bug will override the flag responsible
for crash send.
I also realized, that before box.cfg you don't know the URL. So if I
crash the process before it, I get this:
SystemError curl: URL using bad/illegal format or missing URL: Invalid argument
fatal error, exiting the event loop
This is because you try to send the dump before the crash module is
configured.
More information about the Tarantool-patches
mailing list