[Tarantool-patches] [PATCH v2 2/2] base64: Improve decoder performance

Fri Dec 25 13:39:21 MSK 2020

Hi!

On 22.12.2020 19:39, Vladislav Shpilevoy wrote:
>>> For example, move this check
>>> one line below, after curr_byte |= ? I don't know if it helps though.
>>
>> It would help if we would add 3 new states like "step_b and there was not enough space in output buffer on last iteration" AND (if something is still present in the input buffer) somehow report to the caller the value of final in_base64 offset.
>>
>>> Since you decided to change this function, I would suggest to make
>>> it work properly for all inputs.
>>
>> It does - as long as out_len is large enough.
>>
>>> And even better - expose it in base64.h and test separately.
>>
>> What is the point in wasting more time on this? Right now we do not need to process base64 input in chunks.
> 
> The point is that you want to optimize this function.

Optimization happened "automatically" while I was investigating bug and 
cleaning up code to better understand and simplify logic.

> But somewhy you don't want to go the end.

Because base64 decode performance is hardly matters for Tarantool use case.

> I can even ask the same, look:
> 
> 	What is the point in wasting more time on this?
> 	Right now base64 is not present in any flamegraphs as
> 	something expensive.
> 
>>> It was initially taken from there:
>>> https://github.com/libb64/libb64/blob/master/src/cdecode.c
>>> in this commit:
>>> https://github.com/tarantool/tarantool/commit/4e707c0f989434b74eb3a859a1b26822a87c42e2#diff-4fd3a92107ebfe91b691f08cd53a84eccbcf89f1a7bad47e20f950d6c4214e6cR216
>>>
>>> Readme of libb64 says it uses 'coroutines' implemented via
>>> switch-while-case. Which means the function is supposed to be
>>> called multiple times. Although in the source repository it
>>> does not accept output buffer length at all.
>>
>> So clearly someone just quickly added protection against output buffer overrun.
> 
> And introduced a bug. Now you quickly optimized it leaving some dead
> code hang here (the state variable). Don't you think that these 2 cases
> look very similar?
> 
>> libb64 required properly sized output buffer unconditionally, our current implementation supports ignoring "extra" input data for fixes-sized output buffer instead.
> 
> Yes, libb64 does require a full buffer. When it was ported to our
> code base, the bug was introduced.
> 
> The reason I have shown you libb64 is that originally this function
> was supposed not to loose data. It simply couldn't loose it due to
> output buffer being too short. But the user must have been able to
> receive 'input' data in chunks, get an 'out' buffer of the big
> enough size, and fill it. If some last bytes of 'input' were not
> translated, they were saved into the 'state'.
> 
> In our code the state is not needed now. It is dead code, if you
> don't want to fix it to work properly in chunks.
> 
>>> The reenterability means the function should be allowed to
>>> process all the input step by step, and must not loose data.
>>> Otherwise it makes no sense to have this 'state machine' via
>>> 'coroutines', and it also can be removed in order to pursue
>>> more performance. You don't even need the state variable.
>>
>> State is being read and written only once so we would gain literally nothing. It does not lose data if output buffer is properly sized.
> 
> I never said it looses data when it is properly sized.
> 
>> Removing state variable would make it harder for someone to implement processing input data in chunks if we would ever need this.
> 
> Then fix it :)
> 
> I don't understand, really. You want to optimize this code - lets
> optimize it. Delete the state, delete the dead code. What you did
> now is half-measure. Your patch even adds more dead code, because
> you added 3! new places where you update the state. Which is useless
> now anyway.
> 
> If somebody will need to implement chunked parsing, I am sure
> he will be able to google libb64 and take the source code from
> there again. Or use git history to find the original code. Or take
> another implementation, because there is no any proof that libb64
> is the fastest solution written in C.
> 
> Anyway, a non-chunked function would be enough for most of the cases,
> and would be faster and easier for sure.

ok, I have just wasted some more time on optimization and benchmarking. 
v4 (I have pushed this branch to github - 
https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v4 
- but do not plan to post patches into the mailing list) is SLOWER than 
v3 despite cleaned up "state".

Results:

unoptimized:
It took 6189848309 ns to decode 7087960000 bytes, speed is 1145094297 bps

v3:
It took 5189617171 ns to decode 7087960000 bytes, speed is 1365796313 bps

v4 when output buffer size may be not large enough (as far as decoder 
can check without decoding whole input buffer):
It took 6243847147 ns to decode 7087960000 bytes, speed is 1135191146 bps

v4 when output buffer size is large enough:
It took 5976677666 ns to decode 7087960000 bytes, speed is 1185936467 bps

As you can see, v4 can be slower than unoptimized code when output 
buffer size is "suspicious" (in benchmarks I have used large enough 
buffer but decoder can't be sure of that due to newlines in base64 
data). v3 is unconditionally faster. You can ask GCC why. And I am sure 
that on another platform and/or buffer sizes results would be different.

>> And it is extremely unlikely that we would processing input data in chunks WITHOUT properly sized output buffer. Worst-case output buffer size is easily calculated from input data size.
>>
>>> The bottom line: you either need to fix the function not to loose
>>> data + keep it reenterable + test properly; or remove the state
>>> entirely because if it looses information, it is useless anyway.
>>
>> It only loses information in "atypical" cases when output buffer is not sized properly.
> 
> Which means the function can't be called again on the same input buffer.
> And that makes the state useless in all cases. Not only in some.
> 
> 
> But I see we are not going anywhere here. You don't really need LGTM
> from me on this patch, if you don't want to finish it. I am not strictly
> against these changes, because *probably* they don't add new bugs, and
> seem to be a tiny bit better for perf. I only don't like it being not
> finished.
> 
> You can ask 2 other people making decisions to review it in its current
> state and still get it committed. For example, Nikita or Alexander Tu. or
> Alexander L.
> 
> As for the other 2 commits - all looks fine to me except the comment
> I gave about the same test being run multiple times. Fix that comment,
> I will give LGTM, and you will need just +1 LGTM for the first 2 commits.

I have made v5 which is v3 with cosmetic changes per review - test moved 
into first commit etc. Can we please move on?