Re: [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads)

Tarantool development patches archive
 help / color / mirror / Atom feed

From: Alexander Turenko via Tarantool-patches <tarantool-patches@dev.tarantool.org>
To: Sergey Nikiforov <void@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org,
	Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Subject: Re: [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads)
Date: Sat, 27 Feb 2021 21:47:51 +0300	[thread overview]
Message-ID: <20210227184751.xsq34g6tfkr57z3s@tkn_work_nb> (raw)
In-Reply-To: <d81a7ad0-fd98-1798-3102-83bfc93eea76@tarantool.org>

On Thu, Feb 25, 2021 at 11:25:21AM +0300, Sergey Nikiforov wrote:
> On 20.02.2021 15:49, Alexander Turenko wrote:
> > > > > It also caused data corruption - garbage instead of "extra bits" was
> > > > > saved into state->result if there was no space in output buffer.
> > > > 
> > > > We have the dead code and it appears to be broken. Why don't remove it?
> > > > (AFAIS, the rest of the code does not read off the buffer.)
> > > > 
> > > > Is it due to a little probability that we'll decide to support chunked
> > > > decoding and we'll decide to implement it in exactly this way (and not
> > > > just leaving undecoded bytes in, say, ibuf)?
> > > > 
> > > > Another side of this little probability is:
> > > > 
> > > > * The code complexity and so waste of the time for anyone who need to
> > > >     dive into it.
> > > > * Have untested code in the code base that may give us more surprises.
> > > > * Extra def-use dependencies may hide optimization opportunities and
> > > >     increase register pressure.
> > > 
> > > And yet you are youself proposing to improve performance:
> > > "entirely eliminate the output buffer lengthchecks for the first (out_len *
> > > 3 / 4) input bytes" (quoted from your e-mail about 2/2). This means saving
> > > state and reporting input buffer stop position. So: do we want complexity
> > > (and performance) or simplicity?
> > 
> > Nope, it does not. Just a kind of loop unrolling optimization:
> > 
> >   | for (int i = 0; i < count; ++i) {
> >   |         <..processing..>
> >   | }
> > 
> > ->
> > 
> >   | for (int i = 0; i < count; i += 4) {
> >   |     <..processing by 4 bytes..>
> >   | }
> >   |
> >   | /* Remainder. */
> >   | for (int i = count - count % 4 - 1; i < count; ++i) {
> >   |         <..processing..>
> >   | }
> > 
> > (I didn't verify the arithmetic, just for the idea.)
> > 
> > (Sure, it is meaningful only for large inputs.)
> > 
> > The remainder always starts from 'state_a'.
> 
> Alas, it is not applicable - there could be skipped characters in input. So
> we get unknown intermediate state. The alternative - unknown processed input
> stop position - does not look good because w/o extra checks within loop we
> would overrun input.

Okay, some optimization ways may require the state (but locally, not as
part of the function contract). Some does not. I have some thoughts in
the mind, but I'll hold myself to don't discuss optimizations.

> 
> > But here we don't discuss optimizations. You completely ignored my
> > main point. I'll repeat:
> > 
> >   | We have the dead code and it appears to be broken. Why don't remove it?
> >   | (AFAIS, the rest of the code does not read off the buffer.)
> > 
> > (I verified it locally with ASAN before I wrote to you the past time.)
> > 
> > Please, confirm that it is true or show that it is wrong.
> 
> Confirm.

So I would remove it.

There is no confidence that we'll need it in a future.

> > > > This is not the question regarding the patch, but this code looks broken
> > > > in several other ways. At least:
> > > > 
> > > > * It skips unrecognized symbols and does not report a decoding error.
> > > 
> > > Some of these symbols should "legally" be skipped - like newlines in e-mail
> > > use case. And there are probably other use cases which use some other
> > > symbols. We could break something and gain nothing.
> > 
> > Sure, backward compatibility is the question.
> > 
> > However 'some symbols' and 'all unrecognized symbols' is not the same
> > thing. I may want to skip newlines, but catch ill-formed incoming data.
> > For example, it may be part of functional requirements for an API of my
> > service that receives or reads base64 encoded data. Say, when an
> > external service send me a text in ascii instead of base64, I want to
> > report an error and decline the request rather than accept the request
> > that is known to be ill-formed.
> > 
> > > Nothing prevents somebody from corrupting "legal" base64 characters, this is
> > > not detected nor should be. These issues are outside base64 decoder scope,
> > > CRCs or digital signatures should be used when data can be accidentally or
> > > intentionally corrupted.
> > 
> > There are cases, when a user wants to follow the robustness principle
> > and when (s)he wants the opposite. There is no silver bullet here,
> > different usage scenarious are different.
> 
> I still think CRC/signatures should be used when input is untrusted. Extra
> validation is hardly useful (even taking compatibility issues out of the
> picture) but increases code complexity and degrades performance.

CRC is not a silver validation bullet. An external service may just send
data in a wrong format (not base64), because of a developer mistake
(say, forgotten encode call). There is no question whether validation is
useful or not, when it is part of requirements for an application that
you develop.

Implementation of a particular format (without skipping unknown symbols,
without skipping newlines, without accepting more than one newline in
row, with certain amount of symbols in a line) may be simpler and more
performant than the common implementation.

> 
> > > Summary: I propose commiting this particular patch (1/2) "as is" (it was
> > > posted unmodified several times already) and discussing performance patch
> > > (2/2) a little further.
> > 
> > First I need a response to the question above regarding the unused code.
> 
> I can easily create patch which fixes overrun bug and gets rid of state in
> one stroke. Should I do that?
> We may need state to implement optimizations later.
> 
> I will postpone responding in another e-mail thread (API changes for
> robustness; optimizations).

I'm sure we should remove the unused code now. We may need some way to
store state in a future, but there is no confidence now.

next prev parent reply	other threads:[~2021-02-27 18:47 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-11  9:44 [Tarantool-patches] [PATCH v6 0/2] base64: Fix decoder, improve its performance Sergey Nikiforov via Tarantool-patches
2021-01-11  9:45 ` [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads) Sergey Nikiforov via Tarantool-patches
2021-01-21  2:16   ` Alexander Turenko via Tarantool-patches
2021-01-26 16:37     ` Sergey Nikiforov via Tarantool-patches
2021-01-26 16:48       ` Sergey Nikiforov via Tarantool-patches
2021-02-20 11:30         ` Alexander Turenko via Tarantool-patches
2021-02-20 12:49       ` Alexander Turenko via Tarantool-patches
2021-02-25  8:25         ` Sergey Nikiforov via Tarantool-patches
2021-02-27 18:47           ` Alexander Turenko via Tarantool-patches [this message]
2021-01-11  9:45 ` [Tarantool-patches] [PATCH v6 2/2] base64: improve decoder performance Sergey Nikiforov via Tarantool-patches
2021-01-21 15:31   ` Alexander Turenko via Tarantool-patches
2021-01-26 16:37     ` Sergey Nikiforov via Tarantool-patches
2021-02-20 12:51       ` Alexander Turenko via Tarantool-patches
2021-01-12 19:39 ` [Tarantool-patches] [PATCH v6 0/2] base64: Fix decoder, improve its performance Alexander Turenko via Tarantool-patches

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210227184751.xsq34g6tfkr57z3s@tkn_work_nb \
    --to=tarantool-patches@dev.tarantool.org \
    --cc=alexander.turenko@tarantool.org \
    --cc=v.shpilevoy@tarantool.org \
    --cc=void@tarantool.org \
    --subject='Re: [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads)' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox