[Tarantool-patches] [PATCH v4 4/4] crash: report crash data to the feedback server

Wed Dec 16 14:16:06 MSK 2020

On Mon, Dec 14, 2020 at 11:54:26PM +0100, Vladislav Shpilevoy wrote:
> 
> 2. infomation -> information.
> 
> It helps to write a commit message in an editor like Google Docs or your
> local offline editor, and see all the highlighted errors, if you can't
> get a spell checker working in vim or whatever you usually use.

Sorry, I've been checking the patch with spell checker but happen to miss
merging fixed verion in. Thank you!

> >  |     "backtrace": "IzAgIDB4NjMxZDAzIGluIGNyYXNoX2NvbGxlY3QrYmYKIzEgID"
> >  |                  "B4NjMyZWRhIGluIGNyYXNoX3NpZ25hbF9jYis1MQojMiAgMHg3ZjFkNDkyMzlhO"
> >  |                  "TAgaW4gZnVubG9ja2ZpbGUrNjAKIzMgIDB4N2YxZDQ4ZGQxYzVlIGluIGVwb2xs"
> >  |                  "X3dhaXQrNWUKIzQgIDB4ODM3NzEzIGluIGVwb2xsX3BvbGwrOTkKIzUgIDB4ODN"
> >  |                  "hZDQ4IGluIGV2X3J1bis0ODcKIzYgIDB4NjBmMGFlIGluIHRhcmFudG9vbF9sdW"
> >  |                  "FfcnVuX3NjcmlwdCsxMzgKIzcgIDB4NDQ2NWY1IGluIG1haW4rNWFjCiM4ICAwe"
> >  |                  "DdmMWQ0OGNmNzA0MiBpbiBfX2xpYmNfc3RhcnRfbWFpbitmMgojOSAgMHg0NDRh"
> >  |                  "MGUgaW4gX3N0YXJ0KzJlCg==",
> >  |     "timestamp": "2020-12-10 17:08:42 MSK"
> >  |   }
> >  | }
> 
> 3. And what is the reason you encode it all into base64 second time?

Because backtrace contains newlines and other special symbols, this is very inconvenient
when you print the json-dump into console.

> 
> > The `backtrace` istself is encoded as base64 because of newline symbols
> > (and may comprise some nonascii symbols as well).

^^^ here I explained why

> 
> 4. analisys -> analysis.

yup, thanks!

> > 
> > is sent to the feedback server. To disable it set `feedback_no_crashinfo`
> > to `false`.
> 
> 5. Doesn't the name look wrong to you? To disable something you propose
> to set 'no' to 'false'. No_crashinfo = false. It is the same as
> crashinfo = true. Vice versa.
> 
> I propose not to use 'no' in option names, as it is always confusing.
> crashinfo = false means don't send anything, true means send.

ok, will do, thanks!

> >  setmetatable(daemon, {
> >      __index = {
> >          set_feedback_params = function()
> >              daemon.enabled  = box.cfg.feedback_enabled
> >              daemon.host     = box.cfg.feedback_host
> >              daemon.interval = box.cfg.feedback_interval
> > +            ffi.C.crash_cfg()
> 
> 6. I realized just now that your new option has nothing to do with
> the feedback daemon except the option name and the URL.
> 
> You probably should configure it via Lua C, just like almost all
> the other box.cfg options do.

I'll try, thanks!

> >  
> > +#include "third_party/base64.h"
> > +#include "small/static.h"
> >  #include "trivia/util.h"
> >  #include "backtrace.h"
> >  #include "crash.h"
> > +#include "cfg.h"
> 
> 7. By using 'cfg' here and linking with 'misc' you introduce a cyclic
> dependency. Because 'cfg' module is a part of 'server' library.
> 'Server' depends on 'core', which has the new crash handling system.
> Also you start depending on Lua here, because 'cfg' uses Lua inside.
> 
> In addition, 'cfg' depends on box, although implicitly. Because it
> takes the config options from box.cfg. I suggest you to configure this
> module like other modules - via crash_cfg_*() functions. Or pass
> the options as arguments into one crash_cfg().

OK, thanks!

> >  
> > +static char tarantool_path[PATH_MAX];
> > +static char feedback_host[2048];
> 
> 8. I don't like that you just added +6KB to the process
> size, not counting tons of other memory such as backtrace
> buffer (+4KB), lots of 8 byte registers, and so on, when
> we fight for each byte. At least +10KB in total. But I
> have no a good idea how to fix it.

Personally I don't like that we use static buffer inside
crash handler. I would prefer to have the whole memory we
need being preallocated, our executable file is about 11M
(for debug build) so I don't think that even 16K of
preallocated memory matter somethow. And would like to
have a separate preallocated stack for crash handler (which
requires to disable simultaneous signals handling as we have
now, and frankly I don't understand why we need SA_NODEFER
flag). I think we can rethink about memory usage later since
this is definitely not a critical part.

> > +static void
> > +crash_report_feedback_daemon(struct crash_info *cinfo)
> > +{
> > +     /*
> > +      * Update to a new number if format get changed.
> > +      */
> > +     static const int crashinfo_version = 1;
> 
> 9. Why is it static? I suspect we discussed that at least 4
> times ...

To use stack less, still 8 bytes won't make much difference
so I thik indeed we can simply se nonstatick value.

> > +
> > +     char *p = static_alloc(SMALL_STATIC_SIZE);
> > +     char *e = &p[SMALL_STATIC_SIZE - 1];
> > +     char *head = p;
> > +     char *tail = &p[SMALL_STATIC_SIZE - 8];
> 
> 10. Why -8?

To have at least a gap before the end of a buffer. Note
that this addresses are not used for snprintf (where we have
-1 as the last addressable pointer) but rather for arbitrary
data. For safety in short.

> > +
> > +     /*
> > +      * Lets reuse tail of the buffer as a temp space.
> > +      */
> > +     struct utsname *uname_ptr = (void *)&tail[-sizeof(struct utsname)];
> > +     if (p >= (char *)(void *)uname_ptr)
> 
> 11. Why do you need double cast? I can't imagine that it does anything
> except adds confusion.

Compiler doesn't like to compare a char pointer with a structure pointer

src/lib/core/crash.c:251:8: error: comparison of distinct pointer types lacks a cast [-Werror]
  251 |  if (p >= (void *)uname_ptr)
      |        ^~

> > +     if (uname(uname_ptr) != 0) {
> > +             pr_syserr("uname call failed, ignore");
> > +             memset(uname_ptr, 0, sizeof(struct utsname));
> > +     }
> > +
> > +     snprintf_safe("{");
> 
> 12. snprintf_safe uses memory [begin, e]. Where 'begin' is the initial
> value of 'p'. 'uname_ptr' occupies [e - 7 - sizeof(*uname_ptr), e - 7].
> Assume static size is 12KB, and sizeof(*uname_ptr) is 640.
> 
> [begin = 0,              snprintf                          e = 12287]
>                                     [11640      uname        12280]
> 
> So snprintf and uname buffers intersect. If there is a big enough dump,
> it will corrupt uname object.
> 
> Is it correct?

This is undefined behaviour. I though about it and I don't like this but
there is no other way if we want to use a single buffer for all. This was
the reason while I've been using preallocated memory for all this and you
said to reuse static alloc instead. Another option is to adjust @e pointer
to be sure that we're not jumping into the tail. Letme think maybe I manage
to make this more clear and readable...

> > +     if (cinfo->signo == SIGSEGV) {
> > +             if (cinfo->sicode == SEGV_MAPERR) {
> > +                     snprintf_safe("\"si_code_str\":\"%s\",",
> > +                                   "SEGV_MAPERR");
> > +             } else if (cinfo->sicode == SEGV_ACCERR) {
> > +                     snprintf_safe("\"si_code_str\":\"%s\",",
> > +                                   "SEGV_ACCERR");
> > +             }
> > +             snprintf_safe("\"si_addr\":\"0x%llx\",",
> > +                           (long long)cinfo->siaddr);
> > +     }
> 
> 13. Instead of using all these global many KB buffers you
> could also create a pipe to the new process, which the
> child would read as stdin from Lua, without any tiny limits
> on how many KBs it can use. And do write() to it directly.

Pipes use kernel memory for own buffers and what is worse this
memory is not accountable as far as I remember. I'm not sure how
critical this would be but I like the idea in general. Gimme some
time to figure out the details.

> > +
> > +     char *timestamp_rt_str = &tail[-128];
> > +     if (p >= timestamp_rt_str)
> > +             goto out;
> > +     ns_to_localtime(cinfo->timestamp_rt, timestamp_rt_str, 128);
> 
> 14. Why 128? And why is it hardcoded?

This should fit any possible timestring I think. The timestring is bound
to locale and I think it might vary in size. I choose a big value but
to be honest I'm not 100% if this will be enough forever.

> > +     snprintf_safe("\"timestamp\":\"%s\"", timestamp_rt_str);
> > +     snprintf_safe("}");
> > +     snprintf_safe("}");
> > +
> > +     size_t report_len = p - head;
> > +     size_t report_elen = base64_bufsize(report_len, BASE64_NOWRAP);
> > +
> > +     char *report_base64 = &tail[-report_elen-8];
> 
> 15. Why -8?

Already explained.

> 
> 16. What if report_elen is bigger than *uname_ptr? Will you override
> the output buffer even more?

Need to recheck, thanks!

> 
> > +     if (p >= report_base64)
> > +             goto out;
> > +     base64_encode(head, report_len, report_base64,
> > +                   report_elen, BASE64_NOWRAP);
> > +     report_base64[report_elen] = '\0';
> > +
...
> > +     pid_t pid = vfork();
> > +     if (pid == 0) {
> > +             /*
> > +              * The script must exit at the end but there
> > +              * is no simple way to make sure from inside
> > +              * of a signal crash handler. So just hope it
> > +              * is running fine.
> > +              */
> > +             execve(exec_argv[0], exec_argv, NULL);
> > +             pr_crit("exec(%s,[%s,%s,%s,NULL]) failed",
> > +                     exec_argv[0], exec_argv[0],
> > +                     exec_argv[1], exec_argv[2]);
> 
> 17. What will happen if this tarantool also will crash? Will there
> be an infinite chain of crashes? For instance, if we will have an
> issue with curl again. Or somebody else will build with his own
> curl leading to a crash on any usage, like it already happened one
> time.

Strictly speaking -- yes, you're right. We might have a chain of crashes
and I think we might need a command line option for this?