[Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads)

Sat Feb 20 15:49:33 MSK 2021

> > > It also caused data corruption - garbage instead of "extra bits" was
> > > saved into state->result if there was no space in output buffer.
> > 
> > We have the dead code and it appears to be broken. Why don't remove it?
> > (AFAIS, the rest of the code does not read off the buffer.)
> > 
> > Is it due to a little probability that we'll decide to support chunked
> > decoding and we'll decide to implement it in exactly this way (and not
> > just leaving undecoded bytes in, say, ibuf)?
> > 
> > Another side of this little probability is:
> > 
> > * The code complexity and so waste of the time for anyone who need to
> >    dive into it.
> > * Have untested code in the code base that may give us more surprises.
> > * Extra def-use dependencies may hide optimization opportunities and
> >    increase register pressure.
> 
> And yet you are youself proposing to improve performance:
> "entirely eliminate the output buffer lengthchecks for the first (out_len *
> 3 / 4) input bytes" (quoted from your e-mail about 2/2). This means saving
> state and reporting input buffer stop position. So: do we want complexity
> (and performance) or simplicity?

Nope, it does not. Just a kind of loop unrolling optimization:

 | for (int i = 0; i < count; ++i) {
 |         <..processing..>
 | }

->

 | for (int i = 0; i < count; i += 4) {
 |     <..processing by 4 bytes..>
 | }
 |
 | /* Remainder. */
 | for (int i = count - count % 4 - 1; i < count; ++i) {
 |         <..processing..>
 | }

(I didn't verify the arithmetic, just for the idea.)

(Sure, it is meaningful only for large inputs.)

The remainder always starts from 'state_a'.

But here we don't discuss optimizations. You completely ignored my
main point. I'll repeat:

 | We have the dead code and it appears to be broken. Why don't remove it?
 | (AFAIS, the rest of the code does not read off the buffer.)

(I verified it locally with ASAN before I wrote to you the past time.)

Please, confirm that it is true or show that it is wrong.

It is not a dialogue if you respond only to those parts that looks
interesting for you.

> > This is not the question regarding the patch, but this code looks broken
> > in several other ways. At least:
> > 
> > * It skips unrecognized symbols and does not report a decoding error.
> 
> Some of these symbols should "legally" be skipped - like newlines in e-mail
> use case. And there are probably other use cases which use some other
> symbols. We could break something and gain nothing.

Sure, backward compatibility is the question.

However 'some symbols' and 'all unrecognized symbols' is not the same
thing. I may want to skip newlines, but catch ill-formed incoming data.
For example, it may be part of functional requirements for an API of my
service that receives or reads base64 encoded data. Say, when an
external service send me a text in ascii instead of base64, I want to
report an error and decline the request rather than accept the request
that is known to be ill-formed.

> Nothing prevents somebody from corrupting "legal" base64 characters, this is
> not detected nor should be. These issues are outside base64 decoder scope,
> CRCs or digital signatures should be used when data can be accidentally or
> intentionally corrupted.

There are cases, when a user wants to follow the robustness principle
and when (s)he wants the opposite. There is no silver bullet here,
different usage scenarious are different.

> Summary: I propose commiting this particular patch (1/2) "as is" (it was
> posted unmodified several times already) and discussing performance patch
> (2/2) a little further.

First I need a response to the question above regarding the unused code.