[Tarantool-patches] [PATCH v6 2/2] base64: improve decoder performance
Sergey Nikiforov
void at tarantool.org
Tue Jan 26 19:37:04 MSK 2021
On 21.01.2021 18:31, Alexander Turenko wrote:
> On Mon, Jan 11, 2021 at 12:45:01PM +0300, Sergey Nikiforov wrote:
>> Unnecessary checks were removed from internal loops.
>> Benchmark shows that performance is now ~1.19 times higher
>> (release build, Intel Core I7-9700K, only one thread).
>> ---
>>
>> Branch: https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v6
>>
>> third_party/base64.c | 35 ++++++++++++++++++++++++++---------
>> 1 file changed, 26 insertions(+), 9 deletions(-)
>>
>> diff --git a/third_party/base64.c b/third_party/base64.c
>> index 3ecd0843a..74260971b 100644
>> --- a/third_party/base64.c
>> +++ b/third_party/base64.c
>> @@ -258,10 +258,10 @@ base64_decode_block(const char *in_base64, int in_len,
>> {
>> case step_a:
>> do {
>> - if (in_pos == in_end || out_pos >= out_end)
>> + if (in_pos >= in_end)
>
> Okay, it is the loop invariant and so we can move it outside the loop.
> But why do we sink it down rather than hoisting it up? We know that
> there is no space in the output buffer -- so there is no sense in
> reading the input, right? Or would it break 'correctness' of the
> hypothetical use of the `state`?
Yes. Hoisting the check would also prevent us from skipping newlines and
other ignored characters at the end of the input.
> So we should imagine how the state would be used if it would be
> possible. From which input position the processing would be resumed if
> the code would provide such ability? How a caller supposed to
> distinguish end of the output buffer from successful processing the
> whole input?
>
> Aside of this, skipping unrecognized input bytes looks as another
> incorrectness. We should be ready to work with untrusted input, perform
> appropriate validation and report errors.
There is no point: "legal" characters can be corrupted too. Integrity
checking (CRCs, digital signatures) is outside the scope of a base64
decoder.
> We can skip newlines, but I guess there is no sense to do this in a
> loop: only one newline in a row is allowed (but it is worth verifying
> this guess against the standard).
This would make the code complex and slow (we would have to track the
line position, which can be split between chunks). It would also break
compatibility: a lot of real-life code violates the standard.
> It is hard to discuss an optimization until we define how the result
> should look in the corner cases like 'short output buffer', 'unexpected
> end of an input buffer' and so on. The current code does not report
> errors: it silently skips invalid input,
As it should, in my opinion.
> accepts trimmed input,
There is no way to determine whether the input is trimmed. Our caller
can determine whether enough bytes were decoded.
> silently trims output at the buffer boundary.
Which in our case (no chunks) means we have received more input than
necessary. Just look at the base64_decode() call sites: even the return
value is not always checked. Should we fix this right now?
> I don't understand how it is supposed to
> work with untrusted input.
>
> I propose to stop attempts to do some 'quick fixups' and either work on
> the code or don't.
I have not attempted to fix anything in this particular patch; it is
about performance only. We can drop it entirely.
> If we would define behaviour first, I would spend MUCH less time trying
> to deduce it.
This patch is about keeping the existing behavior to avoid breaking
something (right now extra "legal" characters are silently ignored).
> It is really hard to discuss the correctness of dead code. To be
> honest, it looks easier to me to finish the chained processing
> implementation, test it and remove it than to try to interpret the code
> (both existing and imaginary) in my mind.
>
> BTW, it seems we are able to entirely eliminate the output buffer
> length checks for the first (out_len * 3 / 4) input bytes and then
> process the tail with full checks. (After we are done with a strict and
> correct API.)
This means processing the input in two chunks, so we would need state
and "final" input position reporting (and a lot of tests for corner
cases). The extra logic would hurt performance for small buffers (our
typical use case) - even the simple check in base64_decode() from patch
v4 does; I have tested this. The benchmark code with the "barely enough
output buffer" and "definitely large enough output buffer" cases (it
accounted for the newlines added by the encoder) got lost somewhere, but
I still have the numbers from patch v4:
Without the length check at the beginning of base64_decode() (output
buffer size is checked on all iterations):
It took 6170225078 ns to decode 7087960000 bytes, speed is 1148736052 bps
It took 6141109481 ns to decode 7087960000 bytes, speed is 1154182321 bps
It took 6142448824 ns to decode 7087960000 bytes, speed is 1153930655 bps
It took 6165578853 ns to decode 7087960000 bytes, speed is 1149601711 bps
It took 6126750494 ns to decode 7087960000 bytes, speed is 1156887326 bps
It took 6146041244 ns to decode 7087960000 bytes, speed is 1153256172 bps
With the length check at the beginning of base64_decode() (output buffer
size is still checked on all iterations because it is "suspicious", i.e.
may not be large enough):
It took 6310650551 ns to decode 7087960000 bytes, speed is 1123174218 bps
It took 6243847147 ns to decode 7087960000 bytes, speed is 1135191146 bps
It took 6263236313 ns to decode 7087960000 bytes, speed is 1131676923 bps
It took 6260555388 ns to decode 7087960000 bytes, speed is 1132161535 bps
With the length check at the beginning of base64_decode() (output buffer
is definitely large enough, so it is not checked within the loops):
It took 6037055181 ns to decode 7087960000 bytes, speed is 1174075735 bps
It took 5993200086 ns to decode 7087960000 bytes, speed is 1182667005 bps
It took 5976677666 ns to decode 7087960000 bytes, speed is 1185936467 bps
It took 5990214242 ns to decode 7087960000 bytes, speed is 1183256510 bps
Or should we do as in patch v4 and eliminate chunk support completely?
Then at least in_len should be made size_t instead of int (the easiest
way to avoid integer overflow), and tests for corner cases (in_len to
out_len ratio close to 4/3) should probably be added. The loop invariant
check can be hoisted in that case.
Vladislav Shpilevoy is clearly voting for the patch v4 approach (quote:
"As I said - please, move on to another 2 reviewers, if you don't want
to finish the version with state removed.")
Leonid Vasiliev is happy even with v5.
Please make up your mind.