[Tarantool-patches] [PATCH v4 4/4] crash: report crash data to the feedback server

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Sun Dec 20 17:48:14 MSK 2020


>>>  |     "backtrace": "IzAgIDB4NjMxZDAzIGluIGNyYXNoX2NvbGxlY3QrYmYKIzEgID"
>>>  |                  "B4NjMyZWRhIGluIGNyYXNoX3NpZ25hbF9jYis1MQojMiAgMHg3ZjFkNDkyMzlhO"
>>>  |                  "TAgaW4gZnVubG9ja2ZpbGUrNjAKIzMgIDB4N2YxZDQ4ZGQxYzVlIGluIGVwb2xs"
>>>  |                  "X3dhaXQrNWUKIzQgIDB4ODM3NzEzIGluIGVwb2xsX3BvbGwrOTkKIzUgIDB4ODN"
>>>  |                  "hZDQ4IGluIGV2X3J1bis0ODcKIzYgIDB4NjBmMGFlIGluIHRhcmFudG9vbF9sdW"
>>>  |                  "FfcnVuX3NjcmlwdCsxMzgKIzcgIDB4NDQ2NWY1IGluIG1haW4rNWFjCiM4ICAwe"
>>>  |                  "DdmMWQ0OGNmNzA0MiBpbiBfX2xpYmNfc3RhcnRfbWFpbitmMgojOSAgMHg0NDRh"
>>>  |                  "MGUgaW4gX3N0YXJ0KzJlCg==",
>>>  |     "timestamp": "2020-12-10 17:08:42 MSK"
>>>  |   }
>>>  | }
>>
>> 3. And what is the reason you encode it all into base64 second time?
> 
> Because the backtrace contains newlines and other special symbols, which is very
> inconvenient when you print the json dump to the console.

I didn't ask about the backtrace base64. I asked why you encode the crashdump.data
field into base64 again. Currently you do base64 encoding twice: once for the
signal.backtrace field, and a second time for the whole
{uname: ..., build: ..., signal: ...} object. Signal.backtrace there is already in base64.
I am asking about the second encoding.

>>> +
>>> +     char *p = static_alloc(SMALL_STATIC_SIZE);
>>> +     char *e = &p[SMALL_STATIC_SIZE - 1];
>>> +     char *head = p;
>>> +     char *tail = &p[SMALL_STATIC_SIZE - 8];
>>
>> 10. Why -8?
> 
> To have at least a gap before the end of a buffer.

Sorry, but this is not an answer. Why not -7, -100, or -3?
I don't see these 8 bytes used for anything, not even as a
special 8-byte gap.

>>> +
>>> +     /*
>>> +      * Lets reuse tail of the buffer as a temp space.
>>> +      */
>>> +     struct utsname *uname_ptr = (void *)&tail[-sizeof(struct utsname)];
>>> +     if (p >= (char *)(void *)uname_ptr)
>>
>> 11. Why do you need double cast? I can't imagine that it does anything
>> except adds confusion.
> 
> The compiler doesn't like comparing a char pointer with a structure pointer
> 
> src/lib/core/crash.c:251:8: error: comparison of distinct pointer types lacks a cast [-Werror]
>   251 |  if (p >= (void *)uname_ptr)
>       |        ^~

Can you just use a (char *) cast instead of (void *)? Could you simply
try that on your own without me suggesting it?

>>> +     if (uname(uname_ptr) != 0) {
>>> +             pr_syserr("uname call failed, ignore");
>>> +             memset(uname_ptr, 0, sizeof(struct utsname));
>>> +     }
>>> +
>>> +     snprintf_safe("{");
>>
>> 12. snprintf_safe uses memory [begin, e]. Where 'begin' is the initial
>> value of 'p'. 'uname_ptr' occupies [e - 7 - sizeof(*uname_ptr), e - 7].
>> Assume static size is 12KB, and sizeof(*uname_ptr) is 640.
>>
>> [begin = 0,              snprintf                          e = 12287]
>>                                     [11640      uname        12280]
>>
>> So snprintf and uname buffers intersect. If there is a big enough dump,
>> it will corrupt uname object.
>>
>> Is it correct?
> 
> This is undefined behaviour. I thought about it and I don't like this, but
> there is no other way if we want to use a single buffer for all. This was
> the reason why I've been using preallocated memory for all of this, and you
> said to reuse static alloc instead. Another option is to adjust the @e pointer
> to be sure that we're not jumping into the tail. Let me think, maybe I can
> make this clearer and more readable...

It is not about clear/readable, for now. You just use a buffer which can be
corrupted. It is a bug, regardless of how clear it will look.

>>> +     if (cinfo->signo == SIGSEGV) {
>>> +             if (cinfo->sicode == SEGV_MAPERR) {
>>> +                     snprintf_safe("\"si_code_str\":\"%s\",",
>>> +                                   "SEGV_MAPERR");
>>> +             } else if (cinfo->sicode == SEGV_ACCERR) {
>>> +                     snprintf_safe("\"si_code_str\":\"%s\",",
>>> +                                   "SEGV_ACCERR");
>>> +             }
>>> +             snprintf_safe("\"si_addr\":\"0x%llx\",",
>>> +                           (long long)cinfo->siaddr);
>>> +     }
>>
>> 13. Instead of using all these global many KB buffers you
>> could also create a pipe to the new process, which the
>> child would read as stdin from Lua, without any tiny limits
>> on how many KBs it can use. And do write() to it directly.
> 
> Pipes use kernel memory for their own buffers and, what is worse, this
> memory is not accountable, as far as I remember.

I don't see any problem with the kernel using memory to support pipes.
But I do see a problem in the code above, which can simply return
garbage or crash again if the dump is big enough.

>>> +
>>> +     char *timestamp_rt_str = &tail[-128];
>>> +     if (p >= timestamp_rt_str)
>>> +             goto out;
>>> +     ns_to_localtime(cinfo->timestamp_rt, timestamp_rt_str, 128);
>>
>> 14. Why 128? And why is it hardcoded?
> 
> This should fit any possible timestring, I think. The timestring is bound
> to the locale and I think it might vary in size. I chose a big value, but
> to be honest I'm not 100% sure this will be enough forever.

Ok, then the second question remains: why is it hardcoded in two places?
I remember you sent commits to other source files where you changed such
places to use sizeof() or a const variable or something like that. How is
it different here?

>>> +     snprintf_safe("\"timestamp\":\"%s\"", timestamp_rt_str);
>>> +     snprintf_safe("}");
>>> +     snprintf_safe("}");
>>> +
>>> +     size_t report_len = p - head;
>>> +     size_t report_elen = base64_bufsize(report_len, BASE64_NOWRAP);
>>> +
>>> +     char *report_base64 = &tail[-report_elen-8];
>>
>> 15. Why -8?
> 
> Already explained.

Didn't understand, sorry.

>>> +     pid_t pid = vfork();
>>> +     if (pid == 0) {
>>> +             /*
>>> +              * The script must exit at the end but there
>>> +              * is no simple way to make sure from inside
>>> +              * of a signal crash handler. So just hope it
>>> +              * is running fine.
>>> +              */
>>> +             execve(exec_argv[0], exec_argv, NULL);
>>> +             pr_crit("exec(%s,[%s,%s,%s,NULL]) failed",
>>> +                     exec_argv[0], exec_argv[0],
>>> +                     exec_argv[1], exec_argv[2]);
>>
>> 17. What will happen if this tarantool also crashes? Will there
>> be an infinite chain of crashes? For instance, if we have an
>> issue with curl again. Or somebody builds with his own curl
>> leading to a crash on any usage, like already happened once.
> 
> Strictly speaking -- yes, you're right. We might have a chain of crashes
> and I think we might need a command line option for this?

If it crashes before checking the command line option, the option
won't help.

I see that feedback_daemon is disabled until the first box.cfg. I suggest
you do the same here: disable crash sending by default and enable it
with the first box.cfg. Your script doesn't use box.cfg, so a chain
crash won't happen. Unless some bug overrides the flag responsible
for the crash send.

I also realized that before box.cfg you don't know the URL. So if I
crash the process before it, I get this:

	SystemError curl: URL using bad/illegal format or missing URL: Invalid argument
	fatal error, exiting the event loop

This is because you try to send the dump before the crash module is
configured.
