From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtpng3.m.smailru.net (smtpng3.m.smailru.net [94.100.177.149]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 170334765E0 for ; Tue, 22 Dec 2020 13:30:33 +0300 (MSK) References: <33050dea-c4b8-44cf-02fb-67928d5f8a93@tarantool.org> <3f4a382a-f74e-1829-2d94-83b66da59e6c@tarantool.org> <03e5806c-9439-e724-be22-fd8aee278bf3@tarantool.org> From: Sergey Nikiforov Message-ID: <164c0e1c-e4b0-f9ba-7e42-ab6c259bb4fd@tarantool.org> Date: Tue, 22 Dec 2020 13:30:32 +0300 MIME-Version: 1.0 In-Reply-To: <03e5806c-9439-e724-be22-fd8aee278bf3@tarantool.org> Content-Type: text/plain; charset="utf-8"; format="flowed" Content-Language: en-US Content-Transfer-Encoding: 8bit Subject: Re: [Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Vladislav Shpilevoy , tarantool-patches@dev.tarantool.org Hi! On 20.12.2020 19:27, Vladislav Shpilevoy wrote: >>> This test now crashes: >>> >>> ==================== >>> --- a/test/unit/base64.c >>> +++ b/test/unit/base64.c >>> @@ -7,7 +7,7 @@ static void >>>   base64_test(const char *str, int options, const char *no_symbols, >>>           int no_symbols_len) >>>   { >>> -    plan(3 + no_symbols_len); >>> +    plan(4 + no_symbols_len); >>>         int len = strlen(str); >>>       int base64_buflen = base64_bufsize(len + 1, options); >>> @@ -34,6 +34,11 @@ base64_test(const char *str, int options, const char *no_symbols, >>>       free(base64_buf); >>>       free(strbuf); >>>   +    const char *in = "sIIpHw=="; >>> +    int in_len = strlen(in); >>> +    rc = base64_decode(in, in_len, NULL, 0); >>> +    is(rc, 0, "no space in out buffer"); >>> + >>>       check_plan(); >>>   } >>> ==================== >>> >>> It didn't crash while the checks were in place. >> >> I knew about this "problem" even before first patch version. Glad someone noticed. More on that below. >> >>> I would suggest to >>> add this test to the previous commit. Because technically it is >>> also 'buffer overrun'. And it will crash 3069 bug even without ASAN. >> >> I do not think this is a good idea. >> >> Both old code and your proposed change (below) behave like this: if output buffer size if 0, stop everything and return zero. No input chars are decoded even when this is possible (no "stored" bits in state - aka "step_a" - and only one valid input char with 6 new bits). My "optimized" code already handles such situation properly. And this is "free" (no extra if's). >> >> It is of course possible to process "zero output buffer size" situation properly but it would require additional "if" in base64_decode_block() beginning and some logic - and in every case except "step_a with no more than one useful input char" would cause some input characters to be lost anyway. Existing API for base64_decode() completely ignores situation when output buffer is too small (it is easy to calculate worst-case output buffer size) - not-yet-processed input data is ignored in such case. No reasonable programmer would use "0" as output buffer size. Such requirement is even DOCUMENTED in base64.h. >> >> What would you say? > > I don't mind leaving it as is. But then I suggest to add an assertion, that > the out-buffer length >= in-buffer length * 3 / 4. If it will crash > anywhere in the tests, it will mean the function is used not only with > the buffers of the given size. And you can't assume it can be not big > enough to fit the whole output, but never 0. This would break code which accepts base64 string from whatever and decodes it into fixed-sized buffer - extra characters are silently ignored now. > If it will not crash, it means you can drop the out-buffer length > parameter as it does not matter. Also you need to validate all the > usages of base64_decode() just in case. I believe that overhauling API is outside of the task assigned to me. I have changed out_pos checks to be performed before *out_pos writes, now code correctly processes out_len==0 and it even works faster. Differential patch is below, I will e-mail everything from git in a new thread. Note that there no point in saving state at all in places marked as "We are losing some data" - our caller would have to guess the offset from in_base64 where we have stopped decoding. diff --git a/third_party/base64.c b/third_party/base64.c index f4fbbf477..93442c04b 100644 --- a/third_party/base64.c +++ b/third_party/base64.c @@ -260,7 +260,8 @@ base64_decode_block(const char *in_base64, int in_len, if (in_pos >= in_end) { state->step = step_a; - state->result = curr_byte; + /* curr_byte is useless now */ + /* state->result = curr_byte; */ return out_pos - out_bin; } fragment = base64_decode_value(*in_pos++); @@ -276,15 +277,16 @@ base64_decode_block(const char *in_base64, int in_len, } fragment = base64_decode_value(*in_pos++); } while (fragment < 0); - curr_byte |= (fragment & 0x030) >> 4; - *out_pos = curr_byte; - curr_byte = (fragment & 0x00f) << 4; - if (++out_pos >= out_end) + if (out_pos >= out_end) { - state->step = step_c; + /* We are losing some data */ + state->step = step_b; state->result = curr_byte; return out_pos - out_bin; } + curr_byte |= (fragment & 0x030) >> 4; + *out_pos++ = curr_byte; + curr_byte = (fragment & 0x00f) << 4; case step_c: do { if (in_pos >= in_end) @@ -295,15 +297,16 @@ base64_decode_block(const char *in_base64, int in_len, } fragment = base64_decode_value(*in_pos++); } while (fragment < 0); - curr_byte |= (fragment & 0x03c) >> 2; - *out_pos = curr_byte; - curr_byte = (fragment & 0x003) << 6; - if (++out_pos >= out_end) + if (out_pos >= out_end) { - state->step = step_d; + /* We are losing some data */ + state->step = step_c; state->result = curr_byte; return out_pos - out_bin; } + curr_byte |= (fragment & 0x03c) >> 2; + *out_pos++ = curr_byte; + curr_byte = (fragment & 0x003) << 6; case step_d: do { if (in_pos >= in_end) @@ -314,14 +317,15 @@ base64_decode_block(const char *in_base64, int in_len, } fragment = base64_decode_value(*in_pos++); } while (fragment < 0); - curr_byte |= (fragment & 0x03f); - *out_pos = curr_byte; - if (++out_pos >= out_end) + if (out_pos >= out_end) { - state->step = step_a; + /* We are losing some data */ + state->step = step_d; state->result = curr_byte; return out_pos - out_bin; } + curr_byte |= (fragment & 0x03f); + *out_pos++ = curr_byte; } } /* control should not reach here */