[Tarantool-patches] [PATCH v6 2/2] base64: improve decoder performance
Sergey Nikiforov
void at tarantool.org
Tue Jan 26 19:37:04 MSK 2021
On 21.01.2021 18:31, Alexander Turenko wrote:
> On Mon, Jan 11, 2021 at 12:45:01PM +0300, Sergey Nikiforov wrote:
>> Unnecessary checks were removed from internal loops.
>> Benchmark shows that performance is now ~1.19 times higher
>> (release build, Intel Core I7-9700K, only one thread).
>> ---
>>
>> Branch: https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v6
>>
>> third_party/base64.c | 35 ++++++++++++++++++++++++++---------
>> 1 file changed, 26 insertions(+), 9 deletions(-)
>>
>> diff --git a/third_party/base64.c b/third_party/base64.c
>> index 3ecd0843a..74260971b 100644
>> --- a/third_party/base64.c
>> +++ b/third_party/base64.c
>> @@ -258,10 +258,10 @@ base64_decode_block(const char *in_base64, int in_len,
>> {
>> case step_a:
>> do {
>> - if (in_pos == in_end || out_pos >= out_end)
>> + if (in_pos >= in_end)
>
> Okay, it is the loop invariant and so we can move it outside the loop.
> But why do we sink it down rather than hoisting it up? We know that
> there is no space in the output buffer -- so there is no sense in
> reading the input, right? Or would it break 'correctness' of the
> hypothetical use of the `state`?
Yes. Hoisting the check would also prevent us from skipping newlines and
other ignored characters at the end of the input.
> So we should imagine how the state would be used if it would be
> possible. From which input position the processing would be resumed if
> the code would provide such ability? How a caller supposed to
> distinguish end of the output buffer from successful processing the
> whole input?
>
> Aside of this, skipping unrecognized input bytes looks as another
> incorrectness. We should be ready to work with untrusted input, perform
> appropriate validation and report errors.
There is no point: "legal" characters can be corrupted too. Integrity
checking (CRCs, digital signatures) is outside the scope of a base64
decoder.
> We can skip newlines, but I guess there is no sense to do this in a
> loop: only one newline in a row is allowed (but it is worth verifying
> this guess against the standard).
This would make the code complex and slow (we would have to track the
line position, which can be split between chunks). It would also break
compatibility: a lot of real-life code violates the standard.
> It is hard to discuss an optimization until we define how the result
> should look in the corner cases like 'short output buffer', 'unexpected
> end of an input buffer' and so on. The current code does not report
> errors: it silently skips invalid input,
As it should, in my opinion.
> accepts trimmed input,
There is no way to determine whether the input is trimmed. Our caller
can determine whether enough bytes were decoded.
> silently trims output at the buffer boundary.
Which in our case (no chunks) means we have received more input than
necessary. Just look at the base64_decode() call sites: even the return
value is not always checked. Should we fix this right now?
> I don't understand how it is supposed to
> work with untrusted input.
>
> I propose to stop attempts to do some 'quick fixups' and either work on
> the code or don't.
I have not attempted to fix anything in this particular patch; it is
about performance only. We can drop it entirely.
> If we would define behaviour first, I would spend MUCH less time trying
> to deduce it.
This patch is about keeping the existing behavior to avoid breaking
something (right now extra "legal" characters are silently ignored).
> It is really hard to discuss the correctness of dead code. To be
> honest, it looks easier to me to finish the chained processing
> implementation, test it and remove it than to try to interpret the code
> (both existing and imaginary) in my mind.
>
> BTW, it seems we are able to entirely eliminate the output buffer
> length checks for the first (out_len * 3 / 4) input bytes and then
> process the tail with full checks. (After we are done with a strict and
> correct API.)
This means processing the input in two chunks, so we would need state
and "final" input position reporting (and a lot of tests for corner
cases). The extra logic would hurt performance for small buffers (our
typical use case) - even the simple check in base64_decode() from patch
v4 does; I have tested this. The benchmark code with the "barely enough
output buffer" and "definitely large enough output buffer" cases (it
accounted for the newlines added by the encoder) got lost somewhere, but
I still have the numbers from patch v4:
Without the length check at the beginning of base64_decode() (output
buffer size is checked on all iterations):
It took 6170225078 ns to decode 7087960000 bytes, speed is 1148736052 bps
It took 6141109481 ns to decode 7087960000 bytes, speed is 1154182321 bps
It took 6142448824 ns to decode 7087960000 bytes, speed is 1153930655 bps
It took 6165578853 ns to decode 7087960000 bytes, speed is 1149601711 bps
It took 6126750494 ns to decode 7087960000 bytes, speed is 1156887326 bps
It took 6146041244 ns to decode 7087960000 bytes, speed is 1153256172 bps
With the length check at the beginning of base64_decode() (output buffer
size is still checked on all iterations because it is "suspicious", i.e.
may not be large enough):
It took 6310650551 ns to decode 7087960000 bytes, speed is 1123174218 bps
It took 6243847147 ns to decode 7087960000 bytes, speed is 1135191146 bps
It took 6263236313 ns to decode 7087960000 bytes, speed is 1131676923 bps
It took 6260555388 ns to decode 7087960000 bytes, speed is 1132161535 bps
With the length check at the beginning of base64_decode() (output buffer
is definitely large enough, so it is not checked within the loops):
It took 6037055181 ns to decode 7087960000 bytes, speed is 1174075735 bps
It took 5993200086 ns to decode 7087960000 bytes, speed is 1182667005 bps
It took 5976677666 ns to decode 7087960000 bytes, speed is 1185936467 bps
It took 5990214242 ns to decode 7087960000 bytes, speed is 1183256510 bps
Or should we do as in patch v4 and eliminate chunk support completely?
Then at least in_len should be made size_t instead of int (the easiest
way to avoid integer overflow), and tests for corner cases (in_len to
out_len ratio close to 4/3) should probably be added. The loop invariant
check can be hoisted in that case.
Vladislav Shpilevoy is clearly voting for the patch v4 approach (quote:
"As I said - please, move on to another 2 reviewers, if you don't want
to finish the version with state removed.")
Leonid Vasiliev is happy even with v5.
Please make up your mind.