From: Alexander Turenko via Tarantool-patches <tarantool-patches@dev.tarantool.org> To: Sergey Nikiforov <void@tarantool.org> Cc: tarantool-patches@dev.tarantool.org, Vladislav Shpilevoy <v.shpilevoy@tarantool.org> Subject: Re: [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads) Date: Sat, 27 Feb 2021 21:47:51 +0300 [thread overview] Message-ID: <20210227184751.xsq34g6tfkr57z3s@tkn_work_nb> (raw) In-Reply-To: <d81a7ad0-fd98-1798-3102-83bfc93eea76@tarantool.org> On Thu, Feb 25, 2021 at 11:25:21AM +0300, Sergey Nikiforov wrote: > On 20.02.2021 15:49, Alexander Turenko wrote: > > > > > It also caused data corruption - garbage instead of "extra bits" was > > > > > saved into state->result if there was no space in output buffer. > > > > > > > > We have the dead code and it appears to be broken. Why don't remove it? > > > > (AFAIS, the rest of the code does not read off the buffer.) > > > > > > > > Is it due to a little probability that we'll decide to support chunked > > > > decoding and we'll decide to implement it in exactly this way (and not > > > > just leaving undecoded bytes in, say, ibuf)? > > > > > > > > Another side of this little probability is: > > > > > > > > * The code complexity and so waste of the time for anyone who need to > > > > dive into it. > > > > * Have untested code in the code base that may give us more surprises. > > > > * Extra def-use dependencies may hide optimization opportunities and > > > > increase register pressure. > > > > > > And yet you are youself proposing to improve performance: > > > "entirely eliminate the output buffer lengthchecks for the first (out_len * > > > 3 / 4) input bytes" (quoted from your e-mail about 2/2). This means saving > > > state and reporting input buffer stop position. So: do we want complexity > > > (and performance) or simplicity? > > > > Nope, it does not. Just a kind of loop unrolling optimization: > > > > | for (int i = 0; i < count; ++i) { > > | <..processing..> > > | } > > > > -> > > > > | for (int i = 0; i < count; i += 4) { > > | <..processing by 4 bytes..> > > | } > > | > > | /* Remainder. */ > > | for (int i = count - count % 4 - 1; i < count; ++i) { > > | <..processing..> > > | } > > > > (I didn't verify the arithmetic, just for the idea.) > > > > (Sure, it is meaningful only for large inputs.) > > > > The remainder always starts from 'state_a'. > > Alas, it is not applicable - there could be skipped characters in input. So > we get unknown intermediate state. The alternative - unknown processed input > stop position - does not look good because w/o extra checks within loop we > would overrun input. Okay, some optimization ways may require the state (but locally, not as part of the function contract). Some does not. I have some thoughts in the mind, but I'll hold myself to don't discuss optimizations. > > > But here we don't discuss optimizations. You completely ignored my > > main point. I'll repeat: > > > > | We have the dead code and it appears to be broken. Why don't remove it? > > | (AFAIS, the rest of the code does not read off the buffer.) > > > > (I verified it locally with ASAN before I wrote to you the past time.) > > > > Please, confirm that it is true or show that it is wrong. > > Confirm. So I would remove it. There is no confidence that we'll need it in a future. > > > > This is not the question regarding the patch, but this code looks broken > > > > in several other ways. At least: > > > > > > > > * It skips unrecognized symbols and does not report a decoding error. > > > > > > Some of these symbols should "legally" be skipped - like newlines in e-mail > > > use case. And there are probably other use cases which use some other > > > symbols. We could break something and gain nothing. > > > > Sure, backward compatibility is the question. > > > > However 'some symbols' and 'all unrecognized symbols' is not the same > > thing. I may want to skip newlines, but catch ill-formed incoming data. > > For example, it may be part of functional requirements for an API of my > > service that receives or reads base64 encoded data. Say, when an > > external service send me a text in ascii instead of base64, I want to > > report an error and decline the request rather than accept the request > > that is known to be ill-formed. > > > > > Nothing prevents somebody from corrupting "legal" base64 characters, this is > > > not detected nor should be. These issues are outside base64 decoder scope, > > > CRCs or digital signatures should be used when data can be accidentally or > > > intentionally corrupted. > > > > There are cases, when a user wants to follow the robustness principle > > and when (s)he wants the opposite. There is no silver bullet here, > > different usage scenarious are different. > > I still think CRC/signatures should be used when input is untrusted. Extra > validation is hardly useful (even taking compatibility issues out of the > picture) but increases code complexity and degrades performance. CRC is not a silver validation bullet. An external service may just send data in a wrong format (not base64), because of a developer mistake (say, forgotten encode call). There is no question whether validation is useful or not, when it is part of requirements for an application that you develop. Implementation of a particular format (without skipping unknown symbols, without skipping newlines, without accepting more than one newline in row, with certain amount of symbols in a line) may be simpler and more performant than the common implementation. > > > > Summary: I propose commiting this particular patch (1/2) "as is" (it was > > > posted unmodified several times already) and discussing performance patch > > > (2/2) a little further. > > > > First I need a response to the question above regarding the unused code. > > I can easily create patch which fixes overrun bug and gets rid of state in > one stroke. Should I do that? > We may need state to implement optimizations later. > > I will postpone responding in another e-mail thread (API changes for > robustness; optimizations). I'm sure we should remove the unused code now. We may need some way to store state in a future, but there is no confidence now.
next prev parent reply other threads:[~2021-02-27 18:47 UTC|newest] Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top 2021-01-11 9:44 [Tarantool-patches] [PATCH v6 0/2] base64: Fix decoder, improve its performance Sergey Nikiforov via Tarantool-patches 2021-01-11 9:45 ` [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads) Sergey Nikiforov via Tarantool-patches 2021-01-21 2:16 ` Alexander Turenko via Tarantool-patches 2021-01-26 16:37 ` Sergey Nikiforov via Tarantool-patches 2021-01-26 16:48 ` Sergey Nikiforov via Tarantool-patches 2021-02-20 11:30 ` Alexander Turenko via Tarantool-patches 2021-02-20 12:49 ` Alexander Turenko via Tarantool-patches 2021-02-25 8:25 ` Sergey Nikiforov via Tarantool-patches 2021-02-27 18:47 ` Alexander Turenko via Tarantool-patches [this message] 2021-01-11 9:45 ` [Tarantool-patches] [PATCH v6 2/2] base64: improve decoder performance Sergey Nikiforov via Tarantool-patches 2021-01-21 15:31 ` Alexander Turenko via Tarantool-patches 2021-01-26 16:37 ` Sergey Nikiforov via Tarantool-patches 2021-02-20 12:51 ` Alexander Turenko via Tarantool-patches 2021-01-12 19:39 ` [Tarantool-patches] [PATCH v6 0/2] base64: Fix decoder, improve its performance Alexander Turenko via Tarantool-patches
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20210227184751.xsq34g6tfkr57z3s@tkn_work_nb \ --to=tarantool-patches@dev.tarantool.org \ --cc=alexander.turenko@tarantool.org \ --cc=v.shpilevoy@tarantool.org \ --cc=void@tarantool.org \ --subject='Re: [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads)' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox