From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tarantool-patches-bounces@dev.tarantool.org>
Received: from [87.239.111.99] (localhost [127.0.0.1])
	by dev.tarantool.org (Postfix) with ESMTP id 0CBA87030F;
	Thu, 25 Feb 2021 11:25:23 +0300 (MSK)
DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org 0CBA87030F
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tarantool.org; s=dev;
	t=1614241523; bh=s5UkEVNo9hU2FMcaJLd9Q/QzFf9IXW1iGkRLNu5pYXU=;
	h=To:References:Date:In-Reply-To:Subject:List-Id:List-Unsubscribe:
	 List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc:
	 From;
	b=MvPrcogKOaDAz5QyXrr1DvM9nkubzMUVuyQ2HVDEkR0TOTJup4uz25uZZCwOPj+2o
	 EcFdeRA3F5kfZ4WBCC0rPUygemKrCQ4OzDbcD7C1ROhl61ODn1nnMpJoMQMNJPz4cr
	 w7ZgqtDun01Q3XxcSsUlNAn1HFIeJwkz1FPx1qqQ=
Received: from smtpng3.m.smailru.net (smtpng3.m.smailru.net [94.100.177.149])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256
 bits)) (No client certificate requested)
 by dev.tarantool.org (Postfix) with ESMTPS id BF3D87030F
 for <tarantool-patches@dev.tarantool.org>;
 Thu, 25 Feb 2021 11:25:21 +0300 (MSK)
DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org BF3D87030F
Received: by smtpng3.m.smailru.net with esmtpa (envelope-from
 <void@tarantool.org>)
 id 1lFBxX-0004dJ-Or; Thu, 25 Feb 2021 11:25:20 +0300
To: Alexander Turenko <alexander.turenko@tarantool.org>
References: <cover.1610357625.git.void@tarantool.org>
 <7f5ce3321cda6d4305282af486ed2f493b920d1f.1610357625.git.void@tarantool.org>
 <20210121021627.2dzyh7fho2cvjlrz@tkn_work_nb>
 <a015ce9d-5c46-a3f6-0fa0-b776b01d82b3@tarantool.org>
 <20210220124933.rg4fg4ouvljuv6dq@tkn_work_nb>
Message-ID: <d81a7ad0-fd98-1798-3102-83bfc93eea76@tarantool.org>
Date: Thu, 25 Feb 2021 11:25:21 +0300
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
 Thunderbird/78.8.0
MIME-Version: 1.0
In-Reply-To: <20210220124933.rg4fg4ouvljuv6dq@tkn_work_nb>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-7564579A: B8F34718100C35BD
X-77F55803: 4F1203BC0FB41BD975C3EC174F5669228C980786C9DBA186ED94931C77F7596E182A05F538085040795ACC7E988D4B97F3727126C9D6528178A9306EEF3DD1FDD89D2B3E425821BB
X-7FA49CB5: FF5795518A3D127A4AD6D5ED66289B5278DA827A17800CE7492D3E4238663367EA1F7E6F0F101C67BD4B6F7A4D31EC0BCC500DACC3FED6E28638F802B75D45FF8AA50765F7900637CF20B9B7F5DD35A68638F802B75D45FF914D58D5BE9E6BC131B5C99E7648C95CDDE882590F889B1C31E9B675738C0B0D4B0BA4D8DD99ACD4A471835C12D1D9774AD6D5ED66289B5278DA827A17800CE7BEE62E5629C982429FA2833FD35BB23D2EF20D2F80756B5F868A13BD56FB6657A471835C12D1D977725E5C173C3A84C3FCE5E68AF59B345D117882F4460429728AD0CFFFB425014E868A13BD56FB6657A7F4EDE966BC389F9E8FC8737B5C2249A45692FFBBD75A6A089D37D7C0E48F6CCF19DD082D7633A0E7DDDDC251EA7DABAAAE862A0553A39223F8577A6DFFEA7CC415C329B279CF9D43847C11F186F3C5E7DDDDC251EA7DABCC89B49CDF41148FBC2A4A7A8370ED8B3C9F3DD0FB1AF5EB4E70A05D1297E1BBCB5012B2E24CD356
X-C1DE0DAB: 0D63561A33F958A5189E5237E21DED7D145D5F2B9C045A9B6EF468F2368210D9D59269BC5F550898D99A6476B3ADF6B47008B74DF8BB9EF7333BD3B22AA88B938A852937E12ACA7557E988E9157162368E8E86DC7131B365E7726E8460B7C23C
X-C8649E89: 4E36BF7865823D7055A7F0CF078B5EC49A30900B95165D34D8C933888226C841F79542B064D8993E301B142BC83A4B7F86EB41BD2077A64F8796511C79F67FF71D7E09C32AA3244CFCD3CD3E6E72ED3C09C083BEC85C67E43A76366E8A9DE7CA83B48618A63566E0
X-D57D3AED: 3ZO7eAau8CL7WIMRKs4sN3D3tLDjz0dLbV79QFUyzQ2Ujvy7cMT6pYYqY16iZVKkSc3dCLJ7zSJH7+u4VD18S7Vl4ZUrpaVfd2+vE6kuoey4m4VkSEu530nj6fImhcD4MUrOEAnl0W826KZ9Q+tr5ycPtXkTV4k65bRjmOUUP8cvGozZ33TWg5HZplvhhXbhDGzqmQDTd6OAevLeAnq3Ra9uf7zvY2zzsIhlcp/Y7m53TZgf2aB4JOg4gkr2biojvz1c9SWJtj+PwzO7Yg7E8w==
X-Mailru-Sender: 689FA8AB762F73936BC43F508A0638224F659996EB0051A4AE1CC7022BBB4109DD675A873A6B1A573284F99205A65E8EFB559BB5D741EB966AABCD5B59A9F6DF9ABAAAF6BC5F075B67EA787935ED9F1B
X-Mras: Ok
Subject: Re: [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output
 buffer overrun (reads)
X-BeenThere: tarantool-patches@dev.tarantool.org
X-Mailman-Version: 2.1.34
Precedence: list
List-Id: Tarantool development patches <tarantool-patches.dev.tarantool.org>
List-Unsubscribe: <https://lists.tarantool.org/mailman/options/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=unsubscribe>
List-Archive: <https://lists.tarantool.org/pipermail/tarantool-patches/>
List-Post: <mailto:tarantool-patches@dev.tarantool.org>
List-Help: <mailto:tarantool-patches-request@dev.tarantool.org?subject=help>
List-Subscribe: <https://lists.tarantool.org/mailman/listinfo/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=subscribe>
From: Sergey Nikiforov via Tarantool-patches
 <tarantool-patches@dev.tarantool.org>
Reply-To: Sergey Nikiforov <void@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org,
 Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Errors-To: tarantool-patches-bounces@dev.tarantool.org
Sender: "Tarantool-patches" <tarantool-patches-bounces@dev.tarantool.org>

On 20.02.2021 15:49, Alexander Turenko wrote:
>>>> It also caused data corruption - garbage instead of "extra bits" was
>>>> saved into state->result if there was no space in output buffer.
>>>
>>> We have the dead code and it appears to be broken. Why don't remove it?
>>> (AFAIS, the rest of the code does not read off the buffer.)
>>>
>>> Is it due to a little probability that we'll decide to support chunked
>>> decoding and we'll decide to implement it in exactly this way (and not
>>> just leaving undecoded bytes in, say, ibuf)?
>>>
>>> Another side of this little probability is:
>>>
>>> * The code complexity and so waste of the time for anyone who need to
>>>     dive into it.
>>> * Have untested code in the code base that may give us more surprises.
>>> * Extra def-use dependencies may hide optimization opportunities and
>>>     increase register pressure.
>>
>> And yet you are youself proposing to improve performance:
>> "entirely eliminate the output buffer lengthchecks for the first (out_len *
>> 3 / 4) input bytes" (quoted from your e-mail about 2/2). This means saving
>> state and reporting input buffer stop position. So: do we want complexity
>> (and performance) or simplicity?
> 
> Nope, it does not. Just a kind of loop unrolling optimization:
> 
>   | for (int i = 0; i < count; ++i) {
>   |         <..processing..>
>   | }
> 
> ->
> 
>   | for (int i = 0; i < count; i += 4) {
>   |     <..processing by 4 bytes..>
>   | }
>   |
>   | /* Remainder. */
>   | for (int i = count - count % 4 - 1; i < count; ++i) {
>   |         <..processing..>
>   | }
> 
> (I didn't verify the arithmetic, just for the idea.)
> 
> (Sure, it is meaningful only for large inputs.)
> 
> The remainder always starts from 'state_a'.

Alas, it is not applicable - there could be skipped characters in input. 
So we get unknown intermediate state. The alternative - unknown 
processed input stop position - does not look good because w/o extra 
checks within loop we would overrun input.

> But here we don't discuss optimizations. You completely ignored my
> main point. I'll repeat:
> 
>   | We have the dead code and it appears to be broken. Why don't remove it?
>   | (AFAIS, the rest of the code does not read off the buffer.)
> 
> (I verified it locally with ASAN before I wrote to you the past time.)
> 
> Please, confirm that it is true or show that it is wrong.

Confirm.

> It is not a dialogue if you respond only to those parts that looks
> interesting for you.

There are two parallel e-mail threads, I am trying to avoid duplicating 
my responses.
There was no request to verify your findings in earlier e-mails.

>>> This is not the question regarding the patch, but this code looks broken
>>> in several other ways. At least:
>>>
>>> * It skips unrecognized symbols and does not report a decoding error.
>>
>> Some of these symbols should "legally" be skipped - like newlines in e-mail
>> use case. And there are probably other use cases which use some other
>> symbols. We could break something and gain nothing.
> 
> Sure, backward compatibility is the question.
> 
> However 'some symbols' and 'all unrecognized symbols' is not the same
> thing. I may want to skip newlines, but catch ill-formed incoming data.
> For example, it may be part of functional requirements for an API of my
> service that receives or reads base64 encoded data. Say, when an
> external service send me a text in ascii instead of base64, I want to
> report an error and decline the request rather than accept the request
> that is known to be ill-formed.
> 
>> Nothing prevents somebody from corrupting "legal" base64 characters, this is
>> not detected nor should be. These issues are outside base64 decoder scope,
>> CRCs or digital signatures should be used when data can be accidentally or
>> intentionally corrupted.
> 
> There are cases, when a user wants to follow the robustness principle
> and when (s)he wants the opposite. There is no silver bullet here,
> different usage scenarious are different.

I still think CRC/signatures should be used when input is untrusted. 
Extra validation is hardly useful (even taking compatibility issues out 
of the picture) but increases code complexity and degrades performance.

>> Summary: I propose commiting this particular patch (1/2) "as is" (it was
>> posted unmodified several times already) and discussing performance patch
>> (2/2) a little further.
> 
> First I need a response to the question above regarding the unused code.

I can easily create patch which fixes overrun bug and gets rid of state 
in one stroke. Should I do that?
We may need state to implement optimizations later.

I will postpone responding in another e-mail thread (API changes for 
robustness; optimizations).