Re: [Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance

Tarantool development patches archive
 help / color / mirror / Atom feed

From: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
To: Sergey Nikiforov <void@tarantool.org>,
	tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance
Date: Sun, 20 Dec 2020 17:27:32 +0100	[thread overview]
Message-ID: <03e5806c-9439-e724-be22-fd8aee278bf3@tarantool.org> (raw)
In-Reply-To: <3f4a382a-f74e-1829-2d94-83b66da59e6c@tarantool.org>

>> This test now crashes:
>>
>> ====================
>> --- a/test/unit/base64.c
>> +++ b/test/unit/base64.c
>> @@ -7,7 +7,7 @@ static void
>>   base64_test(const char *str, int options, const char *no_symbols,
>>           int no_symbols_len)
>>   {
>> -    plan(3 + no_symbols_len);
>> +    plan(4 + no_symbols_len);
>>         int len = strlen(str);
>>       int base64_buflen = base64_bufsize(len + 1, options);
>> @@ -34,6 +34,11 @@ base64_test(const char *str, int options, const char *no_symbols,
>>       free(base64_buf);
>>       free(strbuf);
>>   +    const char *in = "sIIpHw==";
>> +    int in_len = strlen(in);
>> +    rc = base64_decode(in, in_len, NULL, 0);
>> +    is(rc, 0, "no space in out buffer");
>> +
>>       check_plan();
>>   }
>> ====================
>>
>> It didn't crash while the checks were in place.
> 
> I knew about this "problem" even before first patch version. Glad someone noticed. More on that below.
> 
>> I would suggest to
>> add this test to the previous commit. Because technically it is
>> also 'buffer overrun'. And it will crash 3069 bug even without ASAN.
> 
> I do not think this is a good idea.
> 
> Both old code and your proposed change (below) behave like this: if output buffer size if 0, stop everything and return zero. No input chars are decoded even when this is possible (no "stored" bits in state - aka "step_a" - and only one valid input char with 6 new bits). My "optimized" code already handles such situation properly. And this is "free" (no extra if's).
> 
> It is of course possible to process "zero output buffer size" situation properly but it would require additional "if" in base64_decode_block() beginning and some logic - and in every case except "step_a with no more than one useful input char" would cause some input characters to be lost anyway. Existing API for base64_decode() completely ignores situation when output buffer is too small (it is easy to calculate worst-case output buffer size) - not-yet-processed input data is ignored in such case. No reasonable programmer would use "0" as output buffer size. Such requirement is even DOCUMENTED in base64.h.
> 
> What would you say?

I don't mind leaving it as is. But then I suggest to add an assertion, that
the out-buffer length >= in-buffer length * 3 / 4. If it will crash
anywhere in the tests, it will mean the function is used not only with
the buffers of the given size. And you can't assume it can be not big
enough to fit the whole output, but never 0.

If it will not crash, it means you can drop the out-buffer length
parameter as it does not matter. Also you need to validate all the
usages of base64_decode() just in case.

Talking of DOCUMENTATION - it is for base64_decode(), not for
base64_decode_block(), which you patch. This function does not have
documentation, but clearly it tries not to overflow the out-buffer
somewhy. From its name it seems it can be called multiple times to
decode a buffer step-by-step.

>> For this patch you may want to add a zero length in the beginning.
>> Also we can keep the number of 'if's unchanged like this:
>>
>> ====================
>> --- a/third_party/base64.c
>> +++ b/third_party/base64.c
>> @@ -256,6 +256,12 @@ base64_decode_block(const char *in_base64, int in_len,
>>           while (1)
>>           {
>>       case step_a:
>> +            if (out_pos >= out_end)
>> +            {
>> +                state->step = step_a;
>> +                state->result = curr_byte;
>> +                return out_pos - out_bin;
>> +            }
>>               do {
>>                   if (in_pos >= in_end)
>>                   {
>> @@ -316,12 +322,7 @@ base64_decode_block(const char *in_base64, int in_len,
>>               } while (fragment < 0);
>>               curr_byte |= (fragment & 0x03f);
>>               *out_pos = curr_byte;
>> -            if (++out_pos >= out_end)
>> -            {
>> -                state->step = step_a;
>> -                state->result = curr_byte;
>> -                return out_pos - out_bin;
>> -            }
>> +            ++out_pos;
>>           }
>>       }
>>       /* control should not reach here */
>> ====================
>>
>> Will that work?
> 
> Yes, but it would lose data when (out_len == 0), (state->step == step_a) and there is one useful char in input buffer. Not that it matter in real life.

This can be fixed, nothing scary here.

next prev parent reply	other threads:[~2020-12-20 16:27 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-15 14:22 [Tarantool-patches] [PATCH v2 0/2] base64: Fix decoder, improve its performance Sergey Nikiforov
2020-12-15 14:22 ` [Tarantool-patches] [PATCH v2 1/2] base64: Fix decoder output buffer overrun (reads) Sergey Nikiforov
2020-12-15 14:22 ` [Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance Sergey Nikiforov
2020-12-16 23:22   ` Vladislav Shpilevoy
2020-12-17 12:43     ` Sergey Nikiforov
2020-12-20 16:27       ` Vladislav Shpilevoy [this message]
2020-12-22 10:30         ` Sergey Nikiforov
2020-12-22 15:05           ` Vladislav Shpilevoy
2020-12-22 16:08             ` Sergey Nikiforov
2020-12-22 16:39               ` Vladislav Shpilevoy
2020-12-25 10:39                 ` Sergey Nikiforov
2020-12-26 13:25                   ` Vladislav Shpilevoy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=03e5806c-9439-e724-be22-fd8aee278bf3@tarantool.org \
    --to=v.shpilevoy@tarantool.org \
    --cc=tarantool-patches@dev.tarantool.org \
    --cc=void@tarantool.org \
    --subject='Re: [Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox