* [Tarantool-patches] [PATCH v2 0/2] base64: Fix decoder, improve its performance @ 2020-12-15 14:22 Sergey Nikiforov 2020-12-15 14:22 ` [Tarantool-patches] [PATCH v2 1/2] base64: Fix decoder output buffer overrun (reads) Sergey Nikiforov 2020-12-15 14:22 ` [Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance Sergey Nikiforov 0 siblings, 2 replies; 12+ messages in thread From: Sergey Nikiforov @ 2020-12-15 14:22 UTC (permalink / raw) To: tarantool-patches; +Cc: Vladislav Shpilevoy First patch fixes #3069, second one improves base64 decoder performance Branch: https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v2 Issue: https://github.com/tarantool/tarantool/issues/3069 Sergey Nikiforov (2): base64: Fix decoder output buffer overrun (reads) base64: Improve decoder performance third_party/base64.c | 54 ++++++++++++++++++++++++++++++-------------- 1 file changed, 37 insertions(+), 17 deletions(-) -- 2.25.1 ^ permalink raw reply [flat|nested] 12+ messages in thread
* [Tarantool-patches] [PATCH v2 1/2] base64: Fix decoder output buffer overrun (reads) 2020-12-15 14:22 [Tarantool-patches] [PATCH v2 0/2] base64: Fix decoder, improve its performance Sergey Nikiforov @ 2020-12-15 14:22 ` Sergey Nikiforov 2020-12-15 14:22 ` [Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance Sergey Nikiforov 1 sibling, 0 replies; 12+ messages in thread From: Sergey Nikiforov @ 2020-12-15 14:22 UTC (permalink / raw) To: tarantool-patches; +Cc: Vladislav Shpilevoy Was caught by base64 test with enabled ASAN. It also caused data corruption - garbage instead of "extra bits" was saved into state->result if there was no space in output buffer. Fixes: #3069 --- Branch: https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v2 Issue: https://github.com/tarantool/tarantool/issues/3069 third_party/base64.c | 28 +++++++++++++++++----------- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/third_party/base64.c b/third_party/base64.c index 8ecab23eb..3350a98ff 100644 --- a/third_party/base64.c +++ b/third_party/base64.c @@ -247,8 +247,9 @@ base64_decode_block(const char *in_base64, int in_len, char *out_pos = out_bin; char *out_end = out_bin + out_len; int fragment; + char curr_byte; - *out_pos = state->result; + curr_byte = state->result; switch (state->step) { @@ -259,49 +260,54 @@ base64_decode_block(const char *in_base64, int in_len, if (in_pos == in_end || out_pos >= out_end) { state->step = step_a; - state->result = *out_pos; + state->result = curr_byte; return out_pos - out_bin; } fragment = base64_decode_value(*in_pos++); } while (fragment < 0); - *out_pos = (fragment & 0x03f) << 2; + curr_byte = (fragment & 0x03f) << 2; case step_b: do { if (in_pos == in_end || out_pos >= out_end) { state->step = step_b; - state->result = *out_pos; + state->result = curr_byte; return out_pos - out_bin; } fragment = base64_decode_value(*in_pos++); } while (fragment < 0); - *out_pos++ |= (fragment & 0x030) >> 4; + curr_byte |= (fragment & 0x030) >> 4; + *out_pos++ = curr_byte; + curr_byte = (fragment & 0x00f) << 4; if (out_pos < out_end) - *out_pos = (fragment & 0x00f) << 4; + *out_pos = curr_byte; case step_c: do { if (in_pos == in_end || out_pos >= out_end) { state->step = step_c; - state->result = *out_pos; + state->result = curr_byte; return out_pos - out_bin; } fragment = base64_decode_value(*in_pos++); } while (fragment < 0); - *out_pos++ |= (fragment & 0x03c) >> 2; + curr_byte |= (fragment & 0x03c) >> 2; + *out_pos++ = curr_byte; + curr_byte = (fragment & 0x003) << 6; if (out_pos < out_end) - *out_pos = (fragment & 0x003) << 6; + *out_pos = curr_byte; case step_d: do { if (in_pos == in_end || out_pos >= out_end) { state->step = step_d; - state->result = *out_pos; + state->result = curr_byte; return out_pos - out_bin; } fragment = base64_decode_value(*in_pos++); } while (fragment < 0); - *out_pos++ |= (fragment & 0x03f); + curr_byte |= (fragment & 0x03f); + *out_pos++ = curr_byte; } } /* control should not reach here */ -- 2.25.1 ^ permalink raw reply [flat|nested] 12+ messages in thread
* [Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance 2020-12-15 14:22 [Tarantool-patches] [PATCH v2 0/2] base64: Fix decoder, improve its performance Sergey Nikiforov 2020-12-15 14:22 ` [Tarantool-patches] [PATCH v2 1/2] base64: Fix decoder output buffer overrun (reads) Sergey Nikiforov @ 2020-12-15 14:22 ` Sergey Nikiforov 2020-12-16 23:22 ` Vladislav Shpilevoy 1 sibling, 1 reply; 12+ messages in thread From: Sergey Nikiforov @ 2020-12-15 14:22 UTC (permalink / raw) To: tarantool-patches; +Cc: Vladislav Shpilevoy Unnecessary checks were removed from internal loops. Benchmark shows that performance is now ~1.15 times higher (release build, Intel Core I7-9700K, only one thread). --- Branch: https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v2 third_party/base64.c | 36 +++++++++++++++++++++++++----------- 1 file changed, 25 insertions(+), 11 deletions(-) diff --git a/third_party/base64.c b/third_party/base64.c index 3350a98ff..f4fbbf477 100644 --- a/third_party/base64.c +++ b/third_party/base64.c @@ -257,7 +257,7 @@ base64_decode_block(const char *in_base64, int in_len, { case step_a: do { - if (in_pos == in_end || out_pos >= out_end) + if (in_pos >= in_end) { state->step = step_a; state->result = curr_byte; @@ -268,7 +268,7 @@ base64_decode_block(const char *in_base64, int in_len, curr_byte = (fragment & 0x03f) << 2; case step_b: do { - if (in_pos == in_end || out_pos >= out_end) + if (in_pos >= in_end) { state->step = step_b; state->result = curr_byte; @@ -277,13 +277,17 @@ base64_decode_block(const char *in_base64, int in_len, fragment = base64_decode_value(*in_pos++); } while (fragment < 0); curr_byte |= (fragment & 0x030) >> 4; - *out_pos++ = curr_byte; + *out_pos = curr_byte; curr_byte = (fragment & 0x00f) << 4; - if (out_pos < out_end) - *out_pos = curr_byte; + if (++out_pos >= out_end) + { + state->step = step_c; + state->result = curr_byte; + return out_pos - out_bin; + } case step_c: do { - if (in_pos == in_end || out_pos >= out_end) + if (in_pos >= in_end) { state->step = step_c; state->result = curr_byte; @@ -292,13 +296,17 @@ base64_decode_block(const char *in_base64, int in_len, fragment = base64_decode_value(*in_pos++); } while (fragment < 0); curr_byte |= (fragment & 0x03c) >> 2; - *out_pos++ = curr_byte; + *out_pos = curr_byte; curr_byte = (fragment & 0x003) << 6; - if (out_pos < out_end) - *out_pos = curr_byte; + if (++out_pos >= out_end) + { + state->step = step_d; + state->result = curr_byte; + return out_pos - out_bin; + } case step_d: do { - if (in_pos == in_end || out_pos >= out_end) + if (in_pos >= in_end) { state->step = step_d; state->result = curr_byte; @@ -307,7 +315,13 @@ base64_decode_block(const char *in_base64, int in_len, fragment = base64_decode_value(*in_pos++); } while (fragment < 0); curr_byte |= (fragment & 0x03f); - *out_pos++ = curr_byte; + *out_pos = curr_byte; + if (++out_pos >= out_end) + { + state->step = step_a; + state->result = curr_byte; + return out_pos - out_bin; + } } } /* control should not reach here */ -- 2.25.1 ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance 2020-12-15 14:22 ` [Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance Sergey Nikiforov @ 2020-12-16 23:22 ` Vladislav Shpilevoy 2020-12-17 12:43 ` Sergey Nikiforov 0 siblings, 1 reply; 12+ messages in thread From: Vladislav Shpilevoy @ 2020-12-16 23:22 UTC (permalink / raw) To: Sergey Nikiforov, tarantool-patches Hi! Thanks for the patch! On 15.12.2020 15:22, Sergey Nikiforov wrote: > Unnecessary checks were removed from internal loops. > Benchmark shows that performance is now ~1.15 times higher > (release build, Intel Core I7-9700K, only one thread). > --- > > Branch: https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v2 > > third_party/base64.c | 36 +++++++++++++++++++++++++----------- > 1 file changed, 25 insertions(+), 11 deletions(-) > > diff --git a/third_party/base64.c b/third_party/base64.c > index 3350a98ff..f4fbbf477 100644 > --- a/third_party/base64.c > +++ b/third_party/base64.c > @@ -257,7 +257,7 @@ base64_decode_block(const char *in_base64, int in_len, > { > case step_a: > do { > - if (in_pos == in_end || out_pos >= out_end) > + if (in_pos >= in_end) This test now crashes: ==================== --- a/test/unit/base64.c +++ b/test/unit/base64.c @@ -7,7 +7,7 @@ static void base64_test(const char *str, int options, const char *no_symbols, int no_symbols_len) { - plan(3 + no_symbols_len); + plan(4 + no_symbols_len); int len = strlen(str); int base64_buflen = base64_bufsize(len + 1, options); @@ -34,6 +34,11 @@ base64_test(const char *str, int options, const char *no_symbols, free(base64_buf); free(strbuf); + const char *in = "sIIpHw=="; + int in_len = strlen(in); + rc = base64_decode(in, in_len, NULL, 0); + is(rc, 0, "no space in out buffer"); + check_plan(); } ==================== It didn't crash while the checks were in place. I would suggest to add this test to the previous commit. Because technically it is also 'buffer overrun'. And it will crash 3069 bug even without ASAN. For this patch you may want to add a zero length in the beginning. Also we can keep the number of 'if's unchanged like this: ==================== --- a/third_party/base64.c +++ b/third_party/base64.c @@ -256,6 +256,12 @@ base64_decode_block(const char *in_base64, int in_len, while (1) { case step_a: + if (out_pos >= out_end) + { + state->step = step_a; + state->result = curr_byte; + return out_pos - out_bin; + } do { if (in_pos >= in_end) { @@ -316,12 +322,7 @@ base64_decode_block(const char *in_base64, int in_len, } while (fragment < 0); curr_byte |= (fragment & 0x03f); *out_pos = curr_byte; - if (++out_pos >= out_end) - { - state->step = step_a; - state->result = curr_byte; - return out_pos - out_bin; - } + ++out_pos; } } /* control should not reach here */ ==================== Will that work? ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance 2020-12-16 23:22 ` Vladislav Shpilevoy @ 2020-12-17 12:43 ` Sergey Nikiforov 2020-12-20 16:27 ` Vladislav Shpilevoy 0 siblings, 1 reply; 12+ messages in thread From: Sergey Nikiforov @ 2020-12-17 12:43 UTC (permalink / raw) To: Vladislav Shpilevoy, tarantool-patches Hi. On 17.12.2020 2:22, Vladislav Shpilevoy wrote: > Hi! Thanks for the patch! > > On 15.12.2020 15:22, Sergey Nikiforov wrote: >> Unnecessary checks were removed from internal loops. >> Benchmark shows that performance is now ~1.15 times higher >> (release build, Intel Core I7-9700K, only one thread). >> --- >> >> Branch: https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v2 >> >> third_party/base64.c | 36 +++++++++++++++++++++++++----------- >> 1 file changed, 25 insertions(+), 11 deletions(-) >> >> diff --git a/third_party/base64.c b/third_party/base64.c >> index 3350a98ff..f4fbbf477 100644 >> --- a/third_party/base64.c >> +++ b/third_party/base64.c >> @@ -257,7 +257,7 @@ base64_decode_block(const char *in_base64, int in_len, >> { >> case step_a: >> do { >> - if (in_pos == in_end || out_pos >= out_end) >> + if (in_pos >= in_end) > > This test now crashes: > > ==================== > --- a/test/unit/base64.c > +++ b/test/unit/base64.c > @@ -7,7 +7,7 @@ static void > base64_test(const char *str, int options, const char *no_symbols, > int no_symbols_len) > { > - plan(3 + no_symbols_len); > + plan(4 + no_symbols_len); > > int len = strlen(str); > int base64_buflen = base64_bufsize(len + 1, options); > @@ -34,6 +34,11 @@ base64_test(const char *str, int options, const char *no_symbols, > free(base64_buf); > free(strbuf); > > + const char *in = "sIIpHw=="; > + int in_len = strlen(in); > + rc = base64_decode(in, in_len, NULL, 0); > + is(rc, 0, "no space in out buffer"); > + > check_plan(); > } > ==================== > > It didn't crash while the checks were in place. I knew about this "problem" even before first patch version. Glad someone noticed. More on that below. > I would suggest to > add this test to the previous commit. Because technically it is > also 'buffer overrun'. And it will crash 3069 bug even without ASAN. I do not think this is a good idea. Both old code and your proposed change (below) behave like this: if output buffer size if 0, stop everything and return zero. No input chars are decoded even when this is possible (no "stored" bits in state - aka "step_a" - and only one valid input char with 6 new bits). My "optimized" code already handles such situation properly. And this is "free" (no extra if's). It is of course possible to process "zero output buffer size" situation properly but it would require additional "if" in base64_decode_block() beginning and some logic - and in every case except "step_a with no more than one useful input char" would cause some input characters to be lost anyway. Existing API for base64_decode() completely ignores situation when output buffer is too small (it is easy to calculate worst-case output buffer size) - not-yet-processed input data is ignored in such case. No reasonable programmer would use "0" as output buffer size. Such requirement is even DOCUMENTED in base64.h. What would you say? > For this patch you may want to add a zero length in the beginning. > Also we can keep the number of 'if's unchanged like this: > > ==================== > --- a/third_party/base64.c > +++ b/third_party/base64.c > @@ -256,6 +256,12 @@ base64_decode_block(const char *in_base64, int in_len, > while (1) > { > case step_a: > + if (out_pos >= out_end) > + { > + state->step = step_a; > + state->result = curr_byte; > + return out_pos - out_bin; > + } > do { > if (in_pos >= in_end) > { > @@ -316,12 +322,7 @@ base64_decode_block(const char *in_base64, int in_len, > } while (fragment < 0); > curr_byte |= (fragment & 0x03f); > *out_pos = curr_byte; > - if (++out_pos >= out_end) > - { > - state->step = step_a; > - state->result = curr_byte; > - return out_pos - out_bin; > - } > + ++out_pos; > } > } > /* control should not reach here */ > ==================== > > Will that work? Yes, but it would lose data when (out_len == 0), (state->step == step_a) and there is one useful char in input buffer. Not that it matter in real life. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance 2020-12-17 12:43 ` Sergey Nikiforov @ 2020-12-20 16:27 ` Vladislav Shpilevoy 2020-12-22 10:30 ` Sergey Nikiforov 0 siblings, 1 reply; 12+ messages in thread From: Vladislav Shpilevoy @ 2020-12-20 16:27 UTC (permalink / raw) To: Sergey Nikiforov, tarantool-patches >> This test now crashes: >> >> ==================== >> --- a/test/unit/base64.c >> +++ b/test/unit/base64.c >> @@ -7,7 +7,7 @@ static void >> base64_test(const char *str, int options, const char *no_symbols, >> int no_symbols_len) >> { >> - plan(3 + no_symbols_len); >> + plan(4 + no_symbols_len); >> int len = strlen(str); >> int base64_buflen = base64_bufsize(len + 1, options); >> @@ -34,6 +34,11 @@ base64_test(const char *str, int options, const char *no_symbols, >> free(base64_buf); >> free(strbuf); >> + const char *in = "sIIpHw=="; >> + int in_len = strlen(in); >> + rc = base64_decode(in, in_len, NULL, 0); >> + is(rc, 0, "no space in out buffer"); >> + >> check_plan(); >> } >> ==================== >> >> It didn't crash while the checks were in place. > > I knew about this "problem" even before first patch version. Glad someone noticed. More on that below. > >> I would suggest to >> add this test to the previous commit. Because technically it is >> also 'buffer overrun'. And it will crash 3069 bug even without ASAN. > > I do not think this is a good idea. > > Both old code and your proposed change (below) behave like this: if output buffer size if 0, stop everything and return zero. No input chars are decoded even when this is possible (no "stored" bits in state - aka "step_a" - and only one valid input char with 6 new bits). My "optimized" code already handles such situation properly. And this is "free" (no extra if's). > > It is of course possible to process "zero output buffer size" situation properly but it would require additional "if" in base64_decode_block() beginning and some logic - and in every case except "step_a with no more than one useful input char" would cause some input characters to be lost anyway. Existing API for base64_decode() completely ignores situation when output buffer is too small (it is easy to calculate worst-case output buffer size) - not-yet-processed input data is ignored in such case. No reasonable programmer would use "0" as output buffer size. Such requirement is even DOCUMENTED in base64.h. > > What would you say? I don't mind leaving it as is. But then I suggest to add an assertion, that the out-buffer length >= in-buffer length * 3 / 4. If it will crash anywhere in the tests, it will mean the function is used not only with the buffers of the given size. And you can't assume it can be not big enough to fit the whole output, but never 0. If it will not crash, it means you can drop the out-buffer length parameter as it does not matter. Also you need to validate all the usages of base64_decode() just in case. Talking of DOCUMENTATION - it is for base64_decode(), not for base64_decode_block(), which you patch. This function does not have documentation, but clearly it tries not to overflow the out-buffer somewhy. From its name it seems it can be called multiple times to decode a buffer step-by-step. >> For this patch you may want to add a zero length in the beginning. >> Also we can keep the number of 'if's unchanged like this: >> >> ==================== >> --- a/third_party/base64.c >> +++ b/third_party/base64.c >> @@ -256,6 +256,12 @@ base64_decode_block(const char *in_base64, int in_len, >> while (1) >> { >> case step_a: >> + if (out_pos >= out_end) >> + { >> + state->step = step_a; >> + state->result = curr_byte; >> + return out_pos - out_bin; >> + } >> do { >> if (in_pos >= in_end) >> { >> @@ -316,12 +322,7 @@ base64_decode_block(const char *in_base64, int in_len, >> } while (fragment < 0); >> curr_byte |= (fragment & 0x03f); >> *out_pos = curr_byte; >> - if (++out_pos >= out_end) >> - { >> - state->step = step_a; >> - state->result = curr_byte; >> - return out_pos - out_bin; >> - } >> + ++out_pos; >> } >> } >> /* control should not reach here */ >> ==================== >> >> Will that work? > > Yes, but it would lose data when (out_len == 0), (state->step == step_a) and there is one useful char in input buffer. Not that it matter in real life. This can be fixed, nothing scary here. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance 2020-12-20 16:27 ` Vladislav Shpilevoy @ 2020-12-22 10:30 ` Sergey Nikiforov 2020-12-22 15:05 ` Vladislav Shpilevoy 0 siblings, 1 reply; 12+ messages in thread From: Sergey Nikiforov @ 2020-12-22 10:30 UTC (permalink / raw) To: Vladislav Shpilevoy, tarantool-patches Hi! On 20.12.2020 19:27, Vladislav Shpilevoy wrote: >>> This test now crashes: >>> >>> ==================== >>> --- a/test/unit/base64.c >>> +++ b/test/unit/base64.c >>> @@ -7,7 +7,7 @@ static void >>> base64_test(const char *str, int options, const char *no_symbols, >>> int no_symbols_len) >>> { >>> - plan(3 + no_symbols_len); >>> + plan(4 + no_symbols_len); >>> int len = strlen(str); >>> int base64_buflen = base64_bufsize(len + 1, options); >>> @@ -34,6 +34,11 @@ base64_test(const char *str, int options, const char *no_symbols, >>> free(base64_buf); >>> free(strbuf); >>> + const char *in = "sIIpHw=="; >>> + int in_len = strlen(in); >>> + rc = base64_decode(in, in_len, NULL, 0); >>> + is(rc, 0, "no space in out buffer"); >>> + >>> check_plan(); >>> } >>> ==================== >>> >>> It didn't crash while the checks were in place. >> >> I knew about this "problem" even before first patch version. Glad someone noticed. More on that below. >> >>> I would suggest to >>> add this test to the previous commit. Because technically it is >>> also 'buffer overrun'. And it will crash 3069 bug even without ASAN. >> >> I do not think this is a good idea. >> >> Both old code and your proposed change (below) behave like this: if output buffer size if 0, stop everything and return zero. No input chars are decoded even when this is possible (no "stored" bits in state - aka "step_a" - and only one valid input char with 6 new bits). My "optimized" code already handles such situation properly. And this is "free" (no extra if's). >> >> It is of course possible to process "zero output buffer size" situation properly but it would require additional "if" in base64_decode_block() beginning and some logic - and in every case except "step_a with no more than one useful input char" would cause some input characters to be lost anyway. Existing API for base64_decode() completely ignores situation when output buffer is too small (it is easy to calculate worst-case output buffer size) - not-yet-processed input data is ignored in such case. No reasonable programmer would use "0" as output buffer size. Such requirement is even DOCUMENTED in base64.h. >> >> What would you say? > > I don't mind leaving it as is. But then I suggest to add an assertion, that > the out-buffer length >= in-buffer length * 3 / 4. If it will crash > anywhere in the tests, it will mean the function is used not only with > the buffers of the given size. And you can't assume it can be not big > enough to fit the whole output, but never 0. This would break code which accepts base64 string from whatever and decodes it into fixed-sized buffer - extra characters are silently ignored now. > If it will not crash, it means you can drop the out-buffer length > parameter as it does not matter. Also you need to validate all the > usages of base64_decode() just in case. I believe that overhauling API is outside of the task assigned to me. I have changed out_pos checks to be performed before *out_pos writes, now code correctly processes out_len==0 and it even works faster. Differential patch is below, I will e-mail everything from git in a new thread. Note that there no point in saving state at all in places marked as "We are losing some data" - our caller would have to guess the offset from in_base64 where we have stopped decoding. diff --git a/third_party/base64.c b/third_party/base64.c index f4fbbf477..93442c04b 100644 --- a/third_party/base64.c +++ b/third_party/base64.c @@ -260,7 +260,8 @@ base64_decode_block(const char *in_base64, int in_len, if (in_pos >= in_end) { state->step = step_a; - state->result = curr_byte; + /* curr_byte is useless now */ + /* state->result = curr_byte; */ return out_pos - out_bin; } fragment = base64_decode_value(*in_pos++); @@ -276,15 +277,16 @@ base64_decode_block(const char *in_base64, int in_len, } fragment = base64_decode_value(*in_pos++); } while (fragment < 0); - curr_byte |= (fragment & 0x030) >> 4; - *out_pos = curr_byte; - curr_byte = (fragment & 0x00f) << 4; - if (++out_pos >= out_end) + if (out_pos >= out_end) { - state->step = step_c; + /* We are losing some data */ + state->step = step_b; state->result = curr_byte; return out_pos - out_bin; } + curr_byte |= (fragment & 0x030) >> 4; + *out_pos++ = curr_byte; + curr_byte = (fragment & 0x00f) << 4; case step_c: do { if (in_pos >= in_end) @@ -295,15 +297,16 @@ base64_decode_block(const char *in_base64, int in_len, } fragment = base64_decode_value(*in_pos++); } while (fragment < 0); - curr_byte |= (fragment & 0x03c) >> 2; - *out_pos = curr_byte; - curr_byte = (fragment & 0x003) << 6; - if (++out_pos >= out_end) + if (out_pos >= out_end) { - state->step = step_d; + /* We are losing some data */ + state->step = step_c; state->result = curr_byte; return out_pos - out_bin; } + curr_byte |= (fragment & 0x03c) >> 2; + *out_pos++ = curr_byte; + curr_byte = (fragment & 0x003) << 6; case step_d: do { if (in_pos >= in_end) @@ -314,14 +317,15 @@ base64_decode_block(const char *in_base64, int in_len, } fragment = base64_decode_value(*in_pos++); } while (fragment < 0); - curr_byte |= (fragment & 0x03f); - *out_pos = curr_byte; - if (++out_pos >= out_end) + if (out_pos >= out_end) { - state->step = step_a; + /* We are losing some data */ + state->step = step_d; state->result = curr_byte; return out_pos - out_bin; } + curr_byte |= (fragment & 0x03f); + *out_pos++ = curr_byte; } } /* control should not reach here */ ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance 2020-12-22 10:30 ` Sergey Nikiforov @ 2020-12-22 15:05 ` Vladislav Shpilevoy 2020-12-22 16:08 ` Sergey Nikiforov 0 siblings, 1 reply; 12+ messages in thread From: Vladislav Shpilevoy @ 2020-12-22 15:05 UTC (permalink / raw) To: Sergey Nikiforov, tarantool-patches >>>> This test now crashes: >>>> >>>> ==================== >>>> --- a/test/unit/base64.c >>>> +++ b/test/unit/base64.c >>>> @@ -7,7 +7,7 @@ static void >>>> base64_test(const char *str, int options, const char *no_symbols, >>>> int no_symbols_len) >>>> { >>>> - plan(3 + no_symbols_len); >>>> + plan(4 + no_symbols_len); >>>> int len = strlen(str); >>>> int base64_buflen = base64_bufsize(len + 1, options); >>>> @@ -34,6 +34,11 @@ base64_test(const char *str, int options, const char *no_symbols, >>>> free(base64_buf); >>>> free(strbuf); >>>> + const char *in = "sIIpHw=="; >>>> + int in_len = strlen(in); >>>> + rc = base64_decode(in, in_len, NULL, 0); >>>> + is(rc, 0, "no space in out buffer"); >>>> + >>>> check_plan(); >>>> } >>>> ==================== >>>> >>>> It didn't crash while the checks were in place. >>> >>> I knew about this "problem" even before first patch version. Glad someone noticed. More on that below. >>> >>>> I would suggest to >>>> add this test to the previous commit. Because technically it is >>>> also 'buffer overrun'. And it will crash 3069 bug even without ASAN. >>> >>> I do not think this is a good idea. >>> >>> Both old code and your proposed change (below) behave like this: if output buffer size if 0, stop everything and return zero. No input chars are decoded even when this is possible (no "stored" bits in state - aka "step_a" - and only one valid input char with 6 new bits). My "optimized" code already handles such situation properly. And this is "free" (no extra if's). >>> >>> It is of course possible to process "zero output buffer size" situation properly but it would require additional "if" in base64_decode_block() beginning and some logic - and in every case except "step_a with no more than one useful input char" would cause some input characters to be lost anyway. Existing API for base64_decode() completely ignores situation when output buffer is too small (it is easy to calculate worst-case output buffer size) - not-yet-processed input data is ignored in such case. No reasonable programmer would use "0" as output buffer size. Such requirement is even DOCUMENTED in base64.h. >>> >>> What would you say? >> >> I don't mind leaving it as is. But then I suggest to add an assertion, that >> the out-buffer length >= in-buffer length * 3 / 4. If it will crash >> anywhere in the tests, it will mean the function is used not only with >> the buffers of the given size. And you can't assume it can be not big >> enough to fit the whole output, but never 0. > > This would break code which accepts base64 string from whatever and decodes it into fixed-sized buffer - extra characters are silently ignored now. > >> If it will not crash, it means you can drop the out-buffer length >> parameter as it does not matter. Also you need to validate all the >> usages of base64_decode() just in case. > > I believe that overhauling API is outside of the task assigned to me. The problem is that this whole 'optimization' is also outside of the task assigned to you. But you decided to do it anyway. So lets do it properly then. > I have changed out_pos checks to be performed before *out_pos writes, now code correctly processes out_len==0 and it even works faster. Differential patch is below, I will e-mail everything from git in a new thread. Note that there no point in saving state at all in places marked as "We are losing some data" - our caller would have to guess the offset from in_base64 where we have stopped decoding. > > diff --git a/third_party/base64.c b/third_party/base64.c > index f4fbbf477..93442c04b 100644 > --- a/third_party/base64.c > +++ b/third_party/base64.c > @@ -260,7 +260,8 @@ base64_decode_block(const char *in_base64, int in_len, > if (in_pos >= in_end) > { > state->step = step_a; > - state->result = curr_byte; > + /* curr_byte is useless now */ > + /* state->result = curr_byte; */ > return out_pos - out_bin; > } > fragment = base64_decode_value(*in_pos++); > @@ -276,15 +277,16 @@ base64_decode_block(const char *in_base64, int in_len, > } > fragment = base64_decode_value(*in_pos++); > } while (fragment < 0); > - curr_byte |= (fragment & 0x030) >> 4; > - *out_pos = curr_byte; > - curr_byte = (fragment & 0x00f) << 4; > - if (++out_pos >= out_end) > + if (out_pos >= out_end) > { > - state->step = step_c; > + /* We are losing some data */ > + state->step = step_b; > state->result = curr_byte; > return out_pos - out_bin; Can you just not loose the data? For example, move this check one line below, after curr_byte |= ? I don't know if it helps though. Since you decided to change this function, I would suggest to make it work properly for all inputs. And even better - expose it in base64.h and test separately. It was initially taken from there: https://github.com/libb64/libb64/blob/master/src/cdecode.c in this commit: https://github.com/tarantool/tarantool/commit/4e707c0f989434b74eb3a859a1b26822a87c42e2#diff-4fd3a92107ebfe91b691f08cd53a84eccbcf89f1a7bad47e20f950d6c4214e6cR216 Readme of libb64 says it uses 'coroutines' implemented via switch-while-case. Which means the function is supposed to be called multiple times. Although in the source repository it does not accept output buffer length at all. The reenterability means the function should be allowed to process all the input step by step, and must not loose data. Otherwise it makes no sense to have this 'state machine' via 'coroutines', and it also can be removed in order to pursue more performance. You don't even need the state variable. The bottom line: you either need to fix the function not to loose data + keep it reenterable + test properly; or remove the state entirely because if it looses information, it is useless anyway. > } > + curr_byte |= (fragment & 0x030) >> 4; > + *out_pos++ = curr_byte; > + curr_byte = (fragment & 0x00f) << 4; > case step_c: > do { > if (in_pos >= in_end) ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance 2020-12-22 15:05 ` Vladislav Shpilevoy @ 2020-12-22 16:08 ` Sergey Nikiforov 2020-12-22 16:39 ` Vladislav Shpilevoy 0 siblings, 1 reply; 12+ messages in thread From: Sergey Nikiforov @ 2020-12-22 16:08 UTC (permalink / raw) To: Vladislav Shpilevoy, tarantool-patches; +Cc: Alexander Turenko Hi! On 22.12.2020 18:05, Vladislav Shpilevoy wrote: >>>>> This test now crashes: >>>>> >>>>> ==================== >>>>> --- a/test/unit/base64.c >>>>> +++ b/test/unit/base64.c >>>>> @@ -7,7 +7,7 @@ static void >>>>> base64_test(const char *str, int options, const char *no_symbols, >>>>> int no_symbols_len) >>>>> { >>>>> - plan(3 + no_symbols_len); >>>>> + plan(4 + no_symbols_len); >>>>> int len = strlen(str); >>>>> int base64_buflen = base64_bufsize(len + 1, options); >>>>> @@ -34,6 +34,11 @@ base64_test(const char *str, int options, const char *no_symbols, >>>>> free(base64_buf); >>>>> free(strbuf); >>>>> + const char *in = "sIIpHw=="; >>>>> + int in_len = strlen(in); >>>>> + rc = base64_decode(in, in_len, NULL, 0); >>>>> + is(rc, 0, "no space in out buffer"); >>>>> + >>>>> check_plan(); >>>>> } >>>>> ==================== >>>>> >>>>> It didn't crash while the checks were in place. >>>> >>>> I knew about this "problem" even before first patch version. Glad someone noticed. More on that below. >>>> >>>>> I would suggest to >>>>> add this test to the previous commit. Because technically it is >>>>> also 'buffer overrun'. And it will crash 3069 bug even without ASAN. >>>> >>>> I do not think this is a good idea. >>>> >>>> Both old code and your proposed change (below) behave like this: if output buffer size if 0, stop everything and return zero. No input chars are decoded even when this is possible (no "stored" bits in state - aka "step_a" - and only one valid input char with 6 new bits). My "optimized" code already handles such situation properly. And this is "free" (no extra if's). >>>> >>>> It is of course possible to process "zero output buffer size" situation properly but it would require additional "if" in base64_decode_block() beginning and some logic - and in every case except "step_a with no more than one useful input char" would cause some input characters to be lost anyway. Existing API for base64_decode() completely ignores situation when output buffer is too small (it is easy to calculate worst-case output buffer size) - not-yet-processed input data is ignored in such case. No reasonable programmer would use "0" as output buffer size. Such requirement is even DOCUMENTED in base64.h. >>>> >>>> What would you say? >>> >>> I don't mind leaving it as is. But then I suggest to add an assertion, that >>> the out-buffer length >= in-buffer length * 3 / 4. If it will crash >>> anywhere in the tests, it will mean the function is used not only with >>> the buffers of the given size. And you can't assume it can be not big >>> enough to fit the whole output, but never 0. >> >> This would break code which accepts base64 string from whatever and decodes it into fixed-sized buffer - extra characters are silently ignored now. >> >>> If it will not crash, it means you can drop the out-buffer length >>> parameter as it does not matter. Also you need to validate all the >>> usages of base64_decode() just in case. >> >> I believe that overhauling API is outside of the task assigned to me. > > The problem is that this whole 'optimization' is also outside of the > task assigned to you. But you decided to do it anyway. So lets do it > properly then. I already did. Code is now faster, fixes several bugs and introduces no new ones. >> I have changed out_pos checks to be performed before *out_pos writes, now code correctly processes out_len==0 and it even works faster. Differential patch is below, I will e-mail everything from git in a new thread. Note that there no point in saving state at all in places marked as "We are losing some data" - our caller would have to guess the offset from in_base64 where we have stopped decoding. >> >> diff --git a/third_party/base64.c b/third_party/base64.c >> index f4fbbf477..93442c04b 100644 >> --- a/third_party/base64.c >> +++ b/third_party/base64.c >> @@ -260,7 +260,8 @@ base64_decode_block(const char *in_base64, int in_len, >> if (in_pos >= in_end) >> { >> state->step = step_a; >> - state->result = curr_byte; >> + /* curr_byte is useless now */ >> + /* state->result = curr_byte; */ >> return out_pos - out_bin; >> } >> fragment = base64_decode_value(*in_pos++); >> @@ -276,15 +277,16 @@ base64_decode_block(const char *in_base64, int in_len, >> } >> fragment = base64_decode_value(*in_pos++); >> } while (fragment < 0); >> - curr_byte |= (fragment & 0x030) >> 4; >> - *out_pos = curr_byte; >> - curr_byte = (fragment & 0x00f) << 4; >> - if (++out_pos >= out_end) >> + if (out_pos >= out_end) >> { >> - state->step = step_c; >> + /* We are losing some data */ >> + state->step = step_b; >> state->result = curr_byte; >> return out_pos - out_bin; > > Can you just not loose the data? Just to be clear: this is NOT degradation, old code loses even more data ("intermediate" bits when output buffer size is barely enough). > For example, move this check > one line below, after curr_byte |= ? I don't know if it helps though. It would help if we would add 3 new states like "step_b and there was not enough space in output buffer on last iteration" AND (if something is still present in the input buffer) somehow report to the caller the value of final in_base64 offset. > Since you decided to change this function, I would suggest to make > it work properly for all inputs. It does - as long as out_len is large enough. > And even better - expose it in base64.h and test separately. What is the point in wasting more time on this? Right now we do not need to process base64 input in chunks. > It was initially taken from there: > https://github.com/libb64/libb64/blob/master/src/cdecode.c > in this commit: > https://github.com/tarantool/tarantool/commit/4e707c0f989434b74eb3a859a1b26822a87c42e2#diff-4fd3a92107ebfe91b691f08cd53a84eccbcf89f1a7bad47e20f950d6c4214e6cR216 > > Readme of libb64 says it uses 'coroutines' implemented via > switch-while-case. Which means the function is supposed to be > called multiple times. Although in the source repository it > does not accept output buffer length at all. So clearly someone just quickly added protection against output buffer overrun. libb64 required properly sized output buffer unconditionally, our current implementation supports ignoring "extra" input data for fixes-sized output buffer instead. > The reenterability means the function should be allowed to > process all the input step by step, and must not loose data. > Otherwise it makes no sense to have this 'state machine' via > 'coroutines', and it also can be removed in order to pursue > more performance. You don't even need the state variable. State is being read and written only once so we would gain literally nothing. It does not lose data if output buffer is properly sized. Removing state variable would make it harder for someone to implement processing input data in chunks if we would ever need this. And it is extremely unlikely that we would processing input data in chunks WITHOUT properly sized output buffer. Worst-case output buffer size is easily calculated from input data size. > The bottom line: you either need to fix the function not to loose > data + keep it reenterable + test properly; or remove the state > entirely because if it looses information, it is useless anyway. It only loses information in "atypical" cases when output buffer is not sized properly. >> } >> + curr_byte |= (fragment & 0x030) >> 4; >> + *out_pos++ = curr_byte; >> + curr_byte = (fragment & 0x00f) << 4; >> case step_c: >> do { >> if (in_pos >= in_end) ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance 2020-12-22 16:08 ` Sergey Nikiforov @ 2020-12-22 16:39 ` Vladislav Shpilevoy 2020-12-25 10:39 ` Sergey Nikiforov 0 siblings, 1 reply; 12+ messages in thread From: Vladislav Shpilevoy @ 2020-12-22 16:39 UTC (permalink / raw) To: Sergey Nikiforov, tarantool-patches; +Cc: Alexander Turenko >>>>>> This test now crashes: >>>>>> >>>>>> ==================== >>>>>> --- a/test/unit/base64.c >>>>>> +++ b/test/unit/base64.c >>>>>> @@ -7,7 +7,7 @@ static void >>>>>> base64_test(const char *str, int options, const char *no_symbols, >>>>>> int no_symbols_len) >>>>>> { >>>>>> - plan(3 + no_symbols_len); >>>>>> + plan(4 + no_symbols_len); >>>>>> int len = strlen(str); >>>>>> int base64_buflen = base64_bufsize(len + 1, options); >>>>>> @@ -34,6 +34,11 @@ base64_test(const char *str, int options, const char *no_symbols, >>>>>> free(base64_buf); >>>>>> free(strbuf); >>>>>> + const char *in = "sIIpHw=="; >>>>>> + int in_len = strlen(in); >>>>>> + rc = base64_decode(in, in_len, NULL, 0); >>>>>> + is(rc, 0, "no space in out buffer"); >>>>>> + >>>>>> check_plan(); >>>>>> } >>>>>> ==================== >>>>>> >>>>>> It didn't crash while the checks were in place. >>>>> >>>>> I knew about this "problem" even before first patch version. Glad someone noticed. More on that below. >>>>> >>>>>> I would suggest to >>>>>> add this test to the previous commit. Because technically it is >>>>>> also 'buffer overrun'. And it will crash 3069 bug even without ASAN. >>>>> >>>>> I do not think this is a good idea. >>>>> >>>>> Both old code and your proposed change (below) behave like this: if output buffer size if 0, stop everything and return zero. No input chars are decoded even when this is possible (no "stored" bits in state - aka "step_a" - and only one valid input char with 6 new bits). My "optimized" code already handles such situation properly. And this is "free" (no extra if's). >>>>> >>>>> It is of course possible to process "zero output buffer size" situation properly but it would require additional "if" in base64_decode_block() beginning and some logic - and in every case except "step_a with no more than one useful input char" would cause some input characters to be lost anyway. Existing API for base64_decode() completely ignores situation when output buffer is too small (it is easy to calculate worst-case output buffer size) - not-yet-processed input data is ignored in such case. No reasonable programmer would use "0" as output buffer size. Such requirement is even DOCUMENTED in base64.h. >>>>> >>>>> What would you say? >>>> >>>> I don't mind leaving it as is. But then I suggest to add an assertion, that >>>> the out-buffer length >= in-buffer length * 3 / 4. If it will crash >>>> anywhere in the tests, it will mean the function is used not only with >>>> the buffers of the given size. And you can't assume it can be not big >>>> enough to fit the whole output, but never 0. >>> >>> This would break code which accepts base64 string from whatever and decodes it into fixed-sized buffer - extra characters are silently ignored now. >>> >>>> If it will not crash, it means you can drop the out-buffer length >>>> parameter as it does not matter. Also you need to validate all the >>>> usages of base64_decode() just in case. >>> >>> I believe that overhauling API is outside of the task assigned to me. >> >> The problem is that this whole 'optimization' is also outside of the >> task assigned to you. But you decided to do it anyway. So lets do it >> properly then. > > I already did. Code is now faster, fixes several bugs and introduces no new ones. You did it in 2 other commits, which look good. This commit is your own initiative not related to any bugs. Talking of 'no new ones' - there are no proofs really. We can only *hope* that there are no new bugs and that the existing tests cover this code enough. >> For example, move this check >> one line below, after curr_byte |= ? I don't know if it helps though. > > It would help if we would add 3 new states like "step_b and there was not enough space in output buffer on last iteration" AND (if something is still present in the input buffer) somehow report to the caller the value of final in_base64 offset. > >> Since you decided to change this function, I would suggest to make >> it work properly for all inputs. > > It does - as long as out_len is large enough. > >> And even better - expose it in base64.h and test separately. > > What is the point in wasting more time on this? Right now we do not need to process base64 input in chunks. The point is that you want to optimize this function. But somewhy you don't want to go the end. I can even ask the same, look: What is the point in wasting more time on this? Right now base64 is not present in any flamegraphs as something expensive. >> It was initially taken from there: >> https://github.com/libb64/libb64/blob/master/src/cdecode.c >> in this commit: >> https://github.com/tarantool/tarantool/commit/4e707c0f989434b74eb3a859a1b26822a87c42e2#diff-4fd3a92107ebfe91b691f08cd53a84eccbcf89f1a7bad47e20f950d6c4214e6cR216 >> >> Readme of libb64 says it uses 'coroutines' implemented via >> switch-while-case. Which means the function is supposed to be >> called multiple times. Although in the source repository it >> does not accept output buffer length at all. > > So clearly someone just quickly added protection against output buffer overrun. And introduced a bug. Now you quickly optimized it leaving some dead code hang here (the state variable). Don't you think that these 2 cases look very similar? > libb64 required properly sized output buffer unconditionally, our current implementation supports ignoring "extra" input data for fixes-sized output buffer instead. Yes, libb64 does require a full buffer. When it was ported to our code base, the bug was introduced. The reason I have shown you libb64 is that originally this function was supposed not to loose data. It simply couldn't loose it due to output buffer being too short. But the user must have been able to receive 'input' data in chunks, get an 'out' buffer of the big enough size, and fill it. If some last bytes of 'input' were not translated, they were saved into the 'state'. In our code the state is not needed now. It is dead code, if you don't want to fix it to work properly in chunks. >> The reenterability means the function should be allowed to >> process all the input step by step, and must not loose data. >> Otherwise it makes no sense to have this 'state machine' via >> 'coroutines', and it also can be removed in order to pursue >> more performance. You don't even need the state variable. > > State is being read and written only once so we would gain literally nothing. It does not lose data if output buffer is properly sized. I never said it looses data when it is properly sized. > Removing state variable would make it harder for someone to implement processing input data in chunks if we would ever need this. Then fix it :) I don't understand, really. You want to optimize this code - lets optimize it. Delete the state, delete the dead code. What you did now is half-measure. Your patch even adds more dead code, because you added 3! new places where you update the state. Which is useless now anyway. If somebody will need to implement chunked parsing, I am sure he will be able to google libb64 and take the source code from there again. Or use git history to find the original code. Or take another implementation, because there is no any proof that libb64 is the fastest solution written in C. Anyway, a non-chunked function would be enough for most of the cases, and would be faster and easier for sure. > And it is extremely unlikely that we would processing input data in chunks WITHOUT properly sized output buffer. Worst-case output buffer size is easily calculated from input data size. > >> The bottom line: you either need to fix the function not to loose >> data + keep it reenterable + test properly; or remove the state >> entirely because if it looses information, it is useless anyway. > > It only loses information in "atypical" cases when output buffer is not sized properly. Which means the function can't be called again on the same input buffer. And that makes the state useless in all cases. Not only in some. But I see we are not going anywhere here. You don't really need LGTM from me on this patch, if you don't want to finish it. I am not strictly against these changes, because *probably* they don't add new bugs, and seem to be a tiny bit better for perf. I only don't like it being not finished. You can ask 2 other people making decisions to review it in its current state and still get it committed. For example, Nikita or Alexander Tu. or Alexander L. As for the other 2 commits - all looks fine to me except the comment I gave about the same test being run multiple times. Fix that comment, I will give LGTM, and you will need just +1 LGTM for the first 2 commits. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance 2020-12-22 16:39 ` Vladislav Shpilevoy @ 2020-12-25 10:39 ` Sergey Nikiforov 2020-12-26 13:25 ` Vladislav Shpilevoy 0 siblings, 1 reply; 12+ messages in thread From: Sergey Nikiforov @ 2020-12-25 10:39 UTC (permalink / raw) To: Vladislav Shpilevoy, tarantool-patches; +Cc: Alexander Turenko Hi! On 22.12.2020 19:39, Vladislav Shpilevoy wrote: >>> For example, move this check >>> one line below, after curr_byte |= ? I don't know if it helps though. >> >> It would help if we would add 3 new states like "step_b and there was not enough space in output buffer on last iteration" AND (if something is still present in the input buffer) somehow report to the caller the value of final in_base64 offset. >> >>> Since you decided to change this function, I would suggest to make >>> it work properly for all inputs. >> >> It does - as long as out_len is large enough. >> >>> And even better - expose it in base64.h and test separately. >> >> What is the point in wasting more time on this? Right now we do not need to process base64 input in chunks. > > The point is that you want to optimize this function. Optimization happened "automatically" while I was investigating bug and cleaning up code to better understand and simplify logic. > But somewhy you don't want to go the end. Because base64 decode performance is hardly matters for Tarantool use case. > I can even ask the same, look: > > What is the point in wasting more time on this? > Right now base64 is not present in any flamegraphs as > something expensive. > >>> It was initially taken from there: >>> https://github.com/libb64/libb64/blob/master/src/cdecode.c >>> in this commit: >>> https://github.com/tarantool/tarantool/commit/4e707c0f989434b74eb3a859a1b26822a87c42e2#diff-4fd3a92107ebfe91b691f08cd53a84eccbcf89f1a7bad47e20f950d6c4214e6cR216 >>> >>> Readme of libb64 says it uses 'coroutines' implemented via >>> switch-while-case. Which means the function is supposed to be >>> called multiple times. Although in the source repository it >>> does not accept output buffer length at all. >> >> So clearly someone just quickly added protection against output buffer overrun. > > And introduced a bug. Now you quickly optimized it leaving some dead > code hang here (the state variable). Don't you think that these 2 cases > look very similar? > >> libb64 required properly sized output buffer unconditionally, our current implementation supports ignoring "extra" input data for fixes-sized output buffer instead. > > Yes, libb64 does require a full buffer. When it was ported to our > code base, the bug was introduced. > > The reason I have shown you libb64 is that originally this function > was supposed not to loose data. It simply couldn't loose it due to > output buffer being too short. But the user must have been able to > receive 'input' data in chunks, get an 'out' buffer of the big > enough size, and fill it. If some last bytes of 'input' were not > translated, they were saved into the 'state'. > > In our code the state is not needed now. It is dead code, if you > don't want to fix it to work properly in chunks. > >>> The reenterability means the function should be allowed to >>> process all the input step by step, and must not loose data. >>> Otherwise it makes no sense to have this 'state machine' via >>> 'coroutines', and it also can be removed in order to pursue >>> more performance. You don't even need the state variable. >> >> State is being read and written only once so we would gain literally nothing. It does not lose data if output buffer is properly sized. > > I never said it looses data when it is properly sized. > >> Removing state variable would make it harder for someone to implement processing input data in chunks if we would ever need this. > > Then fix it :) > > I don't understand, really. You want to optimize this code - lets > optimize it. Delete the state, delete the dead code. What you did > now is half-measure. Your patch even adds more dead code, because > you added 3! new places where you update the state. Which is useless > now anyway. > > If somebody will need to implement chunked parsing, I am sure > he will be able to google libb64 and take the source code from > there again. Or use git history to find the original code. Or take > another implementation, because there is no any proof that libb64 > is the fastest solution written in C. > > Anyway, a non-chunked function would be enough for most of the cases, > and would be faster and easier for sure. ok, I have just wasted some more time on optimization and benchmarking. v4 (I have pushed this branch to github - https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v4 - but do not plan to post patches into the mailing list) is SLOWER than v3 despite cleaned up "state". Results: unoptimized: It took 6189848309 ns to decode 7087960000 bytes, speed is 1145094297 bps v3: It took 5189617171 ns to decode 7087960000 bytes, speed is 1365796313 bps v4 when output buffer size may be not large enough (as far as decoder can check without decoding whole input buffer): It took 6243847147 ns to decode 7087960000 bytes, speed is 1135191146 bps v4 when output buffer size is large enough: It took 5976677666 ns to decode 7087960000 bytes, speed is 1185936467 bps As you can see, v4 can be slower than unoptimized code when output buffer size is "suspicious" (in benchmarks I have used large enough buffer but decoder can't be sure of that due to newlines in base64 data). v3 is unconditionally faster. You can ask GCC why. And I am sure that on another platform and/or buffer sizes results would be different. >> And it is extremely unlikely that we would processing input data in chunks WITHOUT properly sized output buffer. Worst-case output buffer size is easily calculated from input data size. >> >>> The bottom line: you either need to fix the function not to loose >>> data + keep it reenterable + test properly; or remove the state >>> entirely because if it looses information, it is useless anyway. >> >> It only loses information in "atypical" cases when output buffer is not sized properly. > > Which means the function can't be called again on the same input buffer. > And that makes the state useless in all cases. Not only in some. > > > But I see we are not going anywhere here. You don't really need LGTM > from me on this patch, if you don't want to finish it. I am not strictly > against these changes, because *probably* they don't add new bugs, and > seem to be a tiny bit better for perf. I only don't like it being not > finished. > > You can ask 2 other people making decisions to review it in its current > state and still get it committed. For example, Nikita or Alexander Tu. or > Alexander L. > > As for the other 2 commits - all looks fine to me except the comment > I gave about the same test being run multiple times. Fix that comment, > I will give LGTM, and you will need just +1 LGTM for the first 2 commits. I have made v5 which is v3 with cosmetic changes per review - test moved into first commit etc. Can we please move on? ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance 2020-12-25 10:39 ` Sergey Nikiforov @ 2020-12-26 13:25 ` Vladislav Shpilevoy 0 siblings, 0 replies; 12+ messages in thread From: Vladislav Shpilevoy @ 2020-12-26 13:25 UTC (permalink / raw) To: Sergey Nikiforov, tarantool-patches; +Cc: Alexander Turenko On 25.12.2020 13:39, Sergey Nikiforov wrote: > Hi! > > On 22.12.2020 19:39, Vladislav Shpilevoy wrote: >>>> For example, move this check >>>> one line below, after curr_byte |= ? I don't know if it helps though. >>> >>> It would help if we would add 3 new states like "step_b and there was not enough space in output buffer on last iteration" AND (if something is still present in the input buffer) somehow report to the caller the value of final in_base64 offset. >>> >>>> Since you decided to change this function, I would suggest to make >>>> it work properly for all inputs. >>> >>> It does - as long as out_len is large enough. >>> >>>> And even better - expose it in base64.h and test separately. >>> >>> What is the point in wasting more time on this? Right now we do not need to process base64 input in chunks. >> >> The point is that you want to optimize this function. > > Optimization happened "automatically" while I was investigating bug and cleaning up code to better understand and simplify logic. And this led to the commits being non-atomic. I couldn't see what was the fix and what was refactoring. >> But somewhy you don't want to go the end. > > Because base64 decode performance is hardly matters for Tarantool use case. And yet you try to optimize it :) We can go infinitely round and round here. >> I can even ask the same, look: >> >> What is the point in wasting more time on this? >> Right now base64 is not present in any flamegraphs as >> something expensive. >> >>>> It was initially taken from there: >>>> https://github.com/libb64/libb64/blob/master/src/cdecode.c >>>> in this commit: >>>> https://github.com/tarantool/tarantool/commit/4e707c0f989434b74eb3a859a1b26822a87c42e2#diff-4fd3a92107ebfe91b691f08cd53a84eccbcf89f1a7bad47e20f950d6c4214e6cR216 >>>> >>>> Readme of libb64 says it uses 'coroutines' implemented via >>>> switch-while-case. Which means the function is supposed to be >>>> called multiple times. Although in the source repository it >>>> does not accept output buffer length at all. >>> >>> So clearly someone just quickly added protection against output buffer overrun. >> >> And introduced a bug. Now you quickly optimized it leaving some dead >> code hang here (the state variable). Don't you think that these 2 cases >> look very similar? >> >>> libb64 required properly sized output buffer unconditionally, our current implementation supports ignoring "extra" input data for fixes-sized output buffer instead. >> >> Yes, libb64 does require a full buffer. When it was ported to our >> code base, the bug was introduced. >> >> The reason I have shown you libb64 is that originally this function >> was supposed not to loose data. It simply couldn't loose it due to >> output buffer being too short. But the user must have been able to >> receive 'input' data in chunks, get an 'out' buffer of the big >> enough size, and fill it. If some last bytes of 'input' were not >> translated, they were saved into the 'state'. >> >> In our code the state is not needed now. It is dead code, if you >> don't want to fix it to work properly in chunks. >> >>>> The reenterability means the function should be allowed to >>>> process all the input step by step, and must not loose data. >>>> Otherwise it makes no sense to have this 'state machine' via >>>> 'coroutines', and it also can be removed in order to pursue >>>> more performance. You don't even need the state variable. >>> >>> State is being read and written only once so we would gain literally nothing. It does not lose data if output buffer is properly sized. >> >> I never said it looses data when it is properly sized. >> >>> Removing state variable would make it harder for someone to implement processing input data in chunks if we would ever need this. >> >> Then fix it :) >> >> I don't understand, really. You want to optimize this code - lets >> optimize it. Delete the state, delete the dead code. What you did >> now is half-measure. Your patch even adds more dead code, because >> you added 3! new places where you update the state. Which is useless >> now anyway. >> >> If somebody will need to implement chunked parsing, I am sure >> he will be able to google libb64 and take the source code from >> there again. Or use git history to find the original code. Or take >> another implementation, because there is no any proof that libb64 >> is the fastest solution written in C. >> >> Anyway, a non-chunked function would be enough for most of the cases, >> and would be faster and easier for sure. > > ok, I have just wasted some more time on optimization and benchmarking. v4 (I have pushed this branch to github - https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v4 - but do not plan to post patches into the mailing list) is SLOWER than v3 despite cleaned up "state". > > Results: > > unoptimized: > It took 6189848309 ns to decode 7087960000 bytes, speed is 1145094297 bps > > v3: > It took 5189617171 ns to decode 7087960000 bytes, speed is 1365796313 bps > > v4 when output buffer size may be not large enough (as far as decoder can check without decoding whole input buffer): > It took 6243847147 ns to decode 7087960000 bytes, speed is 1135191146 bps > > v4 when output buffer size is large enough: > It took 5976677666 ns to decode 7087960000 bytes, speed is 1185936467 bps > > As you can see, v4 can be slower than unoptimized code when output buffer size is "suspicious" (in benchmarks I have used large enough buffer but decoder can't be sure of that due to newlines in base64 data). v3 is unconditionally faster. You can ask GCC why. And I am sure that on another platform and/or buffer sizes results would be different. So this whole optimizations can be just some jitter, basically. I didn't look at the code too hard though. Probably it just not efficient enough in its implementation (the one with state removed). >>> And it is extremely unlikely that we would processing input data in chunks WITHOUT properly sized output buffer. Worst-case output buffer size is easily calculated from input data size. >>> >>>> The bottom line: you either need to fix the function not to loose >>>> data + keep it reenterable + test properly; or remove the state >>>> entirely because if it looses information, it is useless anyway. >>> >>> It only loses information in "atypical" cases when output buffer is not sized properly. >> >> Which means the function can't be called again on the same input buffer. >> And that makes the state useless in all cases. Not only in some. >> >> >> But I see we are not going anywhere here. You don't really need LGTM >> from me on this patch, if you don't want to finish it. I am not strictly >> against these changes, because *probably* they don't add new bugs, and >> seem to be a tiny bit better for perf. I only don't like it being not >> finished. >> >> You can ask 2 other people making decisions to review it in its current >> state and still get it committed. For example, Nikita or Alexander Tu. or >> Alexander L. >> >> As for the other 2 commits - all looks fine to me except the comment >> I gave about the same test being run multiple times. Fix that comment, >> I will give LGTM, and you will need just +1 LGTM for the first 2 commits. > > I have made v5 which is v3 with cosmetic changes per review - test moved into first commit etc. Can we please move on? As I said - please, move on to another 2 reviewers, if you don't want to finish the version with state removed. You are not blocked on me. You already have 1 LGTM from Leonid. Second one should be from some lead/senior with push-rights such as Alexander Tu. or Nikita. Maybe Alexander L. also can push, but I don't remember him pushing any patches, tbh. ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2020-12-26 13:25 UTC | newest] Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-12-15 14:22 [Tarantool-patches] [PATCH v2 0/2] base64: Fix decoder, improve its performance Sergey Nikiforov 2020-12-15 14:22 ` [Tarantool-patches] [PATCH v2 1/2] base64: Fix decoder output buffer overrun (reads) Sergey Nikiforov 2020-12-15 14:22 ` [Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance Sergey Nikiforov 2020-12-16 23:22 ` Vladislav Shpilevoy 2020-12-17 12:43 ` Sergey Nikiforov 2020-12-20 16:27 ` Vladislav Shpilevoy 2020-12-22 10:30 ` Sergey Nikiforov 2020-12-22 15:05 ` Vladislav Shpilevoy 2020-12-22 16:08 ` Sergey Nikiforov 2020-12-22 16:39 ` Vladislav Shpilevoy 2020-12-25 10:39 ` Sergey Nikiforov 2020-12-26 13:25 ` Vladislav Shpilevoy
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox