[Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads)

Thu Feb 25 11:25:21 MSK 2021

On 20.02.2021 15:49, Alexander Turenko wrote:
>>>> It also caused data corruption - garbage instead of "extra bits" was
>>>> saved into state->result if there was no space in output buffer.
>>>
>>> We have the dead code and it appears to be broken. Why not remove it?
>>> (AFAIS, the rest of the code does not read off the buffer.)
>>>
>>> Is it because of the small chance that we'll decide to support chunked
>>> decoding and to implement it in exactly this way (and not by just
>>> leaving undecoded bytes in, say, ibuf)?
>>>
>>> The other side of this small chance is:
>>>
>>> * The code complexity, and so wasted time for anyone who needs to
>>>     dive into it.
>>> * Having untested code in the code base that may give us more surprises.
>>> * Extra def-use dependencies may hide optimization opportunities and
>>>     increase register pressure.
>>
>> And yet you are yourself proposing to improve performance:
>> "entirely eliminate the output buffer length checks for the first (out_len *
>> 3 / 4) input bytes" (quoted from your e-mail about 2/2). This means saving
>> state and reporting input buffer stop position. So: do we want complexity
>> (and performance) or simplicity?
> 
> Nope, it does not. Just a kind of loop unrolling optimization:
> 
>   | for (int i = 0; i < count; ++i) {
>   |         <..processing..>
>   | }
> 
> ->
> 
>   | for (int i = 0; i + 4 <= count; i += 4) {
>   |     <..processing by 4 bytes..>
>   | }
>   |
>   | /* Remainder. */
>   | for (int i = count - count % 4; i < count; ++i) {
>   |         <..processing..>
>   | }
> 
> (I didn't verify the arithmetic, just for the idea.)
> 
> (Sure, it is meaningful only for large inputs.)
> 
> The remainder always starts from 'state_a'.

Alas, it is not applicable: the input may contain skipped characters, so
the decoder state after a fixed number of input bytes is unknown. The
alternative, an unknown stop position in the processed input, does not
look good either, because without extra checks inside the loop we would
overrun the input.
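
To illustrate the point about skipped characters, here is a minimal
sketch (not the actual decoder; b64_value() and phase_after() are names
made up for this e-mail, and '=' padding is simply treated as a skipped
character). It only tracks which of the four decoder states we are in
after consuming n input bytes:

```c
#include <stddef.h>

/*
 * Hypothetical helper: map a character to its 6-bit value, or return -1
 * if it is not in the standard base64 alphabet (such characters are
 * skipped by the decoder).
 */
static int
b64_value(unsigned char c)
{
	if (c >= 'A' && c <= 'Z')
		return c - 'A';
	if (c >= 'a' && c <= 'z')
		return c - 'a' + 26;
	if (c >= '0' && c <= '9')
		return c - '0' + 52;
	if (c == '+')
		return 62;
	if (c == '/')
		return 63;
	return -1;
}

/*
 * Decoder phase (0..3, i.e. state_a..state_d) after consuming the first
 * n bytes of the input. Only recognized symbols advance the phase.
 */
static int
phase_after(const char *in, size_t n)
{
	int phase = 0;
	for (size_t i = 0; i < n; i++) {
		if (b64_value((unsigned char)in[i]) >= 0)
			phase = (phase + 1) % 4;
	}
	return phase;
}
```

For "QUJD" the phase after 4 input bytes is 0 (state_a), but for
"QU\nJD" it is 3 after the same 4 bytes, so a remainder loop that
assumes state_a at a fixed input offset would be wrong.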

> But here we don't discuss optimizations. You completely ignored my
> main point. I'll repeat:
> 
>   | We have the dead code and it appears to be broken. Why not remove it?
>   | (AFAIS, the rest of the code does not read off the buffer.)
> 
> (I verified it locally with ASAN before I wrote to you the last time.)
> 
> Please, confirm that it is true or show that it is wrong.

Confirmed.

> It is not a dialogue if you respond only to those parts that look
> interesting to you.

There are two parallel e-mail threads, and I am trying to avoid
duplicating my responses.
There was no request to verify your findings in the earlier e-mails.

>>> This is not the question regarding the patch, but this code looks broken
>>> in several other ways. At least:
>>>
>>> * It skips unrecognized symbols and does not report a decoding error.
>>
>> Some of these symbols should "legally" be skipped - like newlines in e-mail
>> use case. And there are probably other use cases which use some other
>> symbols. We could break something and gain nothing.
> 
> Sure, backward compatibility is the question.
> 
> However 'some symbols' and 'all unrecognized symbols' are not the same
> thing. I may want to skip newlines, but catch ill-formed incoming data.
> For example, it may be part of functional requirements for an API of my
> service that receives or reads base64 encoded data. Say, when an
> external service sends me text in ASCII instead of base64, I want to
> report an error and decline the request rather than accept the request
> that is known to be ill-formed.
> 
>> Nothing prevents somebody from corrupting "legal" base64 characters; this
>> is not detected, nor should it be. These issues are outside the base64
>> decoder's scope: CRCs or digital signatures should be used when data can
>> be accidentally or intentionally corrupted.
> 
> There are cases when a user wants to follow the robustness principle
> and cases when (s)he wants the opposite. There is no silver bullet here;
> different usage scenarios are different.

I still think CRCs or signatures should be used when the input is
untrusted. Extra validation is hardly useful (even taking compatibility
issues out of the picture), yet it increases code complexity and degrades
performance.
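
For reference, the stricter policy discussed above (skip line breaks,
reject everything else) would amount to roughly the following per-byte
check. This is only a sketch to make the cost and the limits concrete;
b64_classify() and b64_check_strict() are hypothetical names, not
existing functions, and the choice of '\r' and '\n' as the only
skippable characters is an assumption:

```c
#include <stddef.h>

enum b64_char_class {
	B64_DATA,	/* alphabet symbol or '=' padding */
	B64_SKIP,	/* line break, silently skipped */
	B64_INVALID,	/* anything else */
};

/* Hypothetical classifier for a single input byte. */
static enum b64_char_class
b64_classify(unsigned char c)
{
	if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
	    (c >= '0' && c <= '9') || c == '+' || c == '/' || c == '=')
		return B64_DATA;
	if (c == '\r' || c == '\n')
		return B64_SKIP;
	return B64_INVALID;
}

/*
 * Return 0 if the input consists only of base64 symbols, padding and
 * line breaks, -1 otherwise.
 */
static int
b64_check_strict(const char *in, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (b64_classify((unsigned char)in[i]) == B64_INVALID)
			return -1;
	}
	return 0;
}
```

Such a check rejects "hello, world" because of the comma and the space,
but it still accepts plain text like "hello", since every letter is a
legal base64 symbol.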

>> Summary: I propose committing this particular patch (1/2) "as is" (it was
>> posted unmodified several times already) and discussing performance patch
>> (2/2) a little further.
> 
> First I need a response to the question above regarding the unused code.

I can easily create a patch that fixes the overrun bug and gets rid of
the state in one stroke. Should I do that?
Note that we may need the state to implement optimizations later.
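
A minimal sketch of what a stateless, bounds-checked decoding loop could
look like. This is not the actual patch; decode_bounded() and
sym_value() are hypothetical names, and '=' padding is simply skipped
here. The output space is checked before every write, and nothing is
saved when the buffer is full:

```c
#include <stddef.h>

/* Hypothetical helper: 6-bit value of a base64 symbol, -1 otherwise. */
static int
sym_value(unsigned char c)
{
	if (c >= 'A' && c <= 'Z')
		return c - 'A';
	if (c >= 'a' && c <= 'z')
		return c - 'a' + 26;
	if (c >= '0' && c <= '9')
		return c - '0' + 52;
	if (c == '+')
		return 62;
	if (c == '/')
		return 63;
	return -1;
}

/*
 * Decode in[0..in_len) into out[0..out_len) and return the number of
 * bytes written. Never writes past out_len; keeps no state between
 * calls, so leftover bits are dropped when the output is full.
 */
static size_t
decode_bounded(const char *in, size_t in_len, char *out, size_t out_len)
{
	unsigned int acc = 0;	/* bit accumulator */
	int bits = 0;		/* number of valid bits in acc */
	size_t n = 0;

	for (size_t i = 0; i < in_len; i++) {
		int v = sym_value((unsigned char)in[i]);
		if (v < 0)
			continue;	/* unrecognized character: skip */
		acc = (acc << 6) | (unsigned int)v;
		bits += 6;
		if (bits >= 8) {
			if (n == out_len)
				break;	/* no space left: stop here */
			bits -= 8;
			out[n++] = (char)((acc >> bits) & 0xff);
			acc &= (1u << bits) - 1;
		}
	}
	return n;
}
```

With a 3-byte buffer, "QUJD" decodes to "ABC"; with a 2-byte buffer the
same call returns 2 and leaves the third byte untouched.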

I will postpone responding in the other e-mail thread (API changes for
robustness; optimizations).