Tarantool development patches archive
* [Tarantool-patches] [PATCH v6 0/2] base64: Fix decoder, improve its performance
@ 2021-01-11  9:44 Sergey Nikiforov via Tarantool-patches
  2021-01-11  9:45 ` [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads) Sergey Nikiforov via Tarantool-patches
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Sergey Nikiforov via Tarantool-patches @ 2021-01-11  9:44 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Vladislav Shpilevoy, Alexander Turenko

The first patch fixes #3069 and adds a test for a zero-sized output
buffer; the second one improves base64 decoder performance.

The second patch is optional to merge.

v6 is v5 rebased: there was a merge conflict in the base64 test, and
base64_no_space_test() is now implemented as a subtest per review.
No other changes.

Branch: https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v6
Issue: https://github.com/tarantool/tarantool/issues/3069

Sergey Nikiforov (2):
  base64: fix decoder output buffer overrun (reads)
  base64: improve decoder performance

 test/unit/base64.c      | 16 +++++++++++-
 test/unit/base64.result |  5 +++-
 third_party/base64.c    | 57 +++++++++++++++++++++++++++++------------
 3 files changed, 59 insertions(+), 19 deletions(-)

-- 
2.25.1



* [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads)
  2021-01-11  9:44 [Tarantool-patches] [PATCH v6 0/2] base64: Fix decoder, improve its performance Sergey Nikiforov via Tarantool-patches
@ 2021-01-11  9:45 ` Sergey Nikiforov via Tarantool-patches
  2021-01-21  2:16   ` Alexander Turenko via Tarantool-patches
  2021-01-11  9:45 ` [Tarantool-patches] [PATCH v6 2/2] base64: improve decoder performance Sergey Nikiforov via Tarantool-patches
  2021-01-12 19:39 ` [Tarantool-patches] [PATCH v6 0/2] base64: Fix decoder, improve its performance Alexander Turenko via Tarantool-patches
  2 siblings, 1 reply; 14+ messages in thread
From: Sergey Nikiforov via Tarantool-patches @ 2021-01-11  9:45 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Vladislav Shpilevoy, Alexander Turenko

Was caught by the base64 test with ASAN enabled.

It also caused data corruption: garbage instead of "extra bits" was
saved into state->result if there was no space in the output buffer.

Added a test for the "zero-sized output buffer" case.

Fixes: #3069
---

Branch: https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v6
Issue: https://github.com/tarantool/tarantool/issues/3069

 test/unit/base64.c      | 16 +++++++++++++++-
 test/unit/base64.result |  5 ++++-
 third_party/base64.c    | 28 +++++++++++++++++-----------
 3 files changed, 36 insertions(+), 13 deletions(-)

diff --git a/test/unit/base64.c b/test/unit/base64.c
index cc74f64d1..508877217 100644
--- a/test/unit/base64.c
+++ b/test/unit/base64.c
@@ -75,9 +75,22 @@ base64_invalid_chars_test(void)
 	check_plan();
 }
 
+static void
+base64_no_space_test(void)
+{
+	plan(1);
+
+	const char *const in = "sIIpHw==";
+	const int in_len = strlen(in);
+	const int rc = base64_decode(in, in_len, NULL, 0);
+	is(rc, 0, "no space in out buffer");
+
+	check_plan();
+}
+
 int main(int argc, char *argv[])
 {
-	plan(29);
+	plan(30);
 	header();
 
 	const char *option_tests[] = {
@@ -96,6 +109,7 @@ int main(int argc, char *argv[])
 	}
 
 	base64_invalid_chars_test();
+	base64_no_space_test();
 
 	footer();
 	return check_plan();
diff --git a/test/unit/base64.result b/test/unit/base64.result
index 3bc2c2275..495e2d0a2 100644
--- a/test/unit/base64.result
+++ b/test/unit/base64.result
@@ -1,4 +1,4 @@
-1..29
+1..30
 	*** main ***
     1..3
     ok 1 - length
@@ -178,4 +178,7 @@ ok 28 - subtests
     1..1
     ok 1 - ignoring invalid chars
 ok 29 - subtests
+    1..1
+    ok 1 - no space in out buffer
+ok 30 - subtests
 	*** main: done ***
diff --git a/third_party/base64.c b/third_party/base64.c
index 7c69315ea..3ecd0843a 100644
--- a/third_party/base64.c
+++ b/third_party/base64.c
@@ -248,8 +248,9 @@ base64_decode_block(const char *in_base64, int in_len,
 	char *out_pos = out_bin;
 	char *out_end = out_bin + out_len;
 	int fragment;
+	char curr_byte;
 
-	*out_pos = state->result;
+	curr_byte = state->result;
 
 	switch (state->step)
 	{
@@ -260,49 +261,54 @@ base64_decode_block(const char *in_base64, int in_len,
 				if (in_pos == in_end || out_pos >= out_end)
 				{
 					state->step = step_a;
-					state->result = *out_pos;
+					state->result = curr_byte;
 					return out_pos - out_bin;
 				}
 				fragment = base64_decode_value(*in_pos++);
 			} while (fragment < 0);
-			*out_pos    = (fragment & 0x03f) << 2;
+			curr_byte = (fragment & 0x03f) << 2;
 	case step_b:
 			do {
 				if (in_pos == in_end || out_pos >= out_end)
 				{
 					state->step = step_b;
-					state->result = *out_pos;
+					state->result = curr_byte;
 					return out_pos - out_bin;
 				}
 				fragment = base64_decode_value(*in_pos++);
 			} while (fragment < 0);
-			*out_pos++ |= (fragment & 0x030) >> 4;
+			curr_byte |= (fragment & 0x030) >> 4;
+			*out_pos++ = curr_byte;
+			curr_byte = (fragment & 0x00f) << 4;
 			if (out_pos < out_end)
-				*out_pos = (fragment & 0x00f) << 4;
+				*out_pos = curr_byte;
 	case step_c:
 			do {
 				if (in_pos == in_end || out_pos >= out_end)
 				{
 					state->step = step_c;
-					state->result = *out_pos;
+					state->result = curr_byte;
 					return out_pos - out_bin;
 				}
 				fragment = base64_decode_value(*in_pos++);
 			} while (fragment < 0);
-			*out_pos++ |= (fragment & 0x03c) >> 2;
+			curr_byte |= (fragment & 0x03c) >> 2;
+			*out_pos++ = curr_byte;
+			curr_byte = (fragment & 0x003) << 6;
 			if (out_pos < out_end)
-				*out_pos = (fragment & 0x003) << 6;
+				*out_pos = curr_byte;
 	case step_d:
 			do {
 				if (in_pos == in_end || out_pos >= out_end)
 				{
 					state->step = step_d;
-					state->result = *out_pos;
+					state->result = curr_byte;
 					return out_pos - out_bin;
 				}
 				fragment = base64_decode_value(*in_pos++);
 			} while (fragment < 0);
-			*out_pos++   |= (fragment & 0x03f);
+			curr_byte |= (fragment & 0x03f);
+			*out_pos++ = curr_byte;
 		}
 	}
 	/* control should not reach here */
-- 
2.25.1



* [Tarantool-patches] [PATCH v6 2/2] base64: improve decoder performance
  2021-01-11  9:44 [Tarantool-patches] [PATCH v6 0/2] base64: Fix decoder, improve its performance Sergey Nikiforov via Tarantool-patches
  2021-01-11  9:45 ` [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads) Sergey Nikiforov via Tarantool-patches
@ 2021-01-11  9:45 ` Sergey Nikiforov via Tarantool-patches
  2021-01-21 15:31   ` Alexander Turenko via Tarantool-patches
  2021-01-12 19:39 ` [Tarantool-patches] [PATCH v6 0/2] base64: Fix decoder, improve its performance Alexander Turenko via Tarantool-patches
  2 siblings, 1 reply; 14+ messages in thread
From: Sergey Nikiforov via Tarantool-patches @ 2021-01-11  9:45 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Vladislav Shpilevoy, Alexander Turenko

Unnecessary checks were removed from the internal loops.
A benchmark shows that performance is now ~1.19 times higher
(release build, Intel Core i7-9700K, only one thread).
---

Branch: https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v6

 third_party/base64.c | 35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/third_party/base64.c b/third_party/base64.c
index 3ecd0843a..74260971b 100644
--- a/third_party/base64.c
+++ b/third_party/base64.c
@@ -258,10 +258,10 @@ base64_decode_block(const char *in_base64, int in_len,
 		{
 	case step_a:
 			do {
-				if (in_pos == in_end || out_pos >= out_end)
+				if (in_pos >= in_end)
 				{
 					state->step = step_a;
-					state->result = curr_byte;
+					/* curr_byte is useless now. */
 					return out_pos - out_bin;
 				}
 				fragment = base64_decode_value(*in_pos++);
@@ -269,7 +269,7 @@ base64_decode_block(const char *in_base64, int in_len,
 			curr_byte = (fragment & 0x03f) << 2;
 	case step_b:
 			do {
-				if (in_pos == in_end || out_pos >= out_end)
+				if (in_pos >= in_end)
 				{
 					state->step = step_b;
 					state->result = curr_byte;
@@ -277,14 +277,19 @@ base64_decode_block(const char *in_base64, int in_len,
 				}
 				fragment = base64_decode_value(*in_pos++);
 			} while (fragment < 0);
+			if (out_pos >= out_end)
+			{
+				/* We are losing some data. */
+				state->step = step_b;
+				state->result = curr_byte;
+				return out_pos - out_bin;
+			}
 			curr_byte |= (fragment & 0x030) >> 4;
 			*out_pos++ = curr_byte;
 			curr_byte = (fragment & 0x00f) << 4;
-			if (out_pos < out_end)
-				*out_pos = curr_byte;
 	case step_c:
 			do {
-				if (in_pos == in_end || out_pos >= out_end)
+				if (in_pos >= in_end)
 				{
 					state->step = step_c;
 					state->result = curr_byte;
@@ -292,14 +297,19 @@ base64_decode_block(const char *in_base64, int in_len,
 				}
 				fragment = base64_decode_value(*in_pos++);
 			} while (fragment < 0);
+			if (out_pos >= out_end)
+			{
+				/* We are losing some data. */
+				state->step = step_c;
+				state->result = curr_byte;
+				return out_pos - out_bin;
+			}
 			curr_byte |= (fragment & 0x03c) >> 2;
 			*out_pos++ = curr_byte;
 			curr_byte = (fragment & 0x003) << 6;
-			if (out_pos < out_end)
-				*out_pos = curr_byte;
 	case step_d:
 			do {
-				if (in_pos == in_end || out_pos >= out_end)
+				if (in_pos >= in_end)
 				{
 					state->step = step_d;
 					state->result = curr_byte;
@@ -307,6 +317,13 @@ base64_decode_block(const char *in_base64, int in_len,
 				}
 				fragment = base64_decode_value(*in_pos++);
 			} while (fragment < 0);
+			if (out_pos >= out_end)
+			{
+				/* We are losing some data. */
+				state->step = step_d;
+				state->result = curr_byte;
+				return out_pos - out_bin;
+			}
 			curr_byte |= (fragment & 0x03f);
 			*out_pos++ = curr_byte;
 		}
-- 
2.25.1



* Re: [Tarantool-patches] [PATCH v6 0/2] base64: Fix decoder, improve its performance
  2021-01-11  9:44 [Tarantool-patches] [PATCH v6 0/2] base64: Fix decoder, improve its performance Sergey Nikiforov via Tarantool-patches
  2021-01-11  9:45 ` [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads) Sergey Nikiforov via Tarantool-patches
  2021-01-11  9:45 ` [Tarantool-patches] [PATCH v6 2/2] base64: improve decoder performance Sergey Nikiforov via Tarantool-patches
@ 2021-01-12 19:39 ` Alexander Turenko via Tarantool-patches
  2 siblings, 0 replies; 14+ messages in thread
From: Alexander Turenko via Tarantool-patches @ 2021-01-12 19:39 UTC (permalink / raw)
  To: Sergey Nikiforov; +Cc: tarantool-patches, Vladislav Shpilevoy

On Mon, Jan 11, 2021 at 12:44:59PM +0300, Sergey Nikiforov wrote:
> The first patch fixes #3069 and adds a test for a zero-sized output
> buffer; the second one improves base64 decoder performance.
> 
> The second patch is optional to merge.
> 
> v6 is v5 rebased: there was a merge conflict in the base64 test, and
> base64_no_space_test() is now implemented as a subtest per review.
> No other changes.
> 
> Branch: https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v6
> Issue: https://github.com/tarantool/tarantool/issues/3069
> 
> Sergey Nikiforov (2):
>   base64: fix decoder output buffer overrun (reads)
>   base64: improve decoder performance
> 
>  test/unit/base64.c      | 16 +++++++++++-
>  test/unit/base64.result |  5 +++-
>  third_party/base64.c    | 57 +++++++++++++++++++++++++++++------------
>  3 files changed, 59 insertions(+), 19 deletions(-)
> 
> -- 
> 2.25.1
> 

I was surprised by the benchmarking results [1] and asked Sergey for
the workload to reproduce them. I placed the workload at the end of the
email.

For now I just blindly ran the benchmark and looked at the results.

In short, the simplified code is faster on my system, so maybe it is
worth looking in this direction?

[1]: https://lists.tarantool.org/tarantool-patches/661ec8ed-ff23-b67d-9177-20eb55ba8a92@tarantool.org/

Hardware
--------

Laptop: SMP, 8 logical cores (with HT enabled, so 4 physical ones),
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz.

Gentoo Linux, GCC 10.1.0, glibc 2.31-r3.

Method
------

I prepared three builds:

* baseline:
  - master (2.8.0-0-gefc30ccf8)
  - perf test
* optimized:
  - master (2.8.0-0-gefc30ccf8)
  - two commits from v6: 7f5ce3321cda6d4305282af486ed2f493b920d1f and
    b3f6a37aef7bc59668e9f1e6b9638920a8f52846
  - perf test
* simplified:
  - master (2.8.0-0-gefc30ccf8)
  - top commit from v4 (d759647f9f68644d031498ae6a540756dcc4228e)
  - perf test

And built them as in the RPM / Deb packages (RelWithDebInfo cmake build
type); disabled the libcurl bundling because it does not matter here:

$ cmake . -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_BACKTRACE=ON \
      -DENABLE_DIST=ON -DENABLE_BUNDLED_LIBCURL=OFF && make -j

I prepared my laptop for benchmarking:

* Ensure that intel_pstate is disabled:
  | $ grep -o 'intel_pstate=[^ ]*' /proc/cmdline 
  | intel_pstate=disable
* Set a fixed CPU frequency (mainly to ensure that Turbo Boost will not
  kick in):
  | # cpupower -c all frequency-set -g userspace
  | # cpupower -c all frequency-set -f 2.9GHz
* Closed all applications that may significantly use CPU (firefox and so
  on).

Then I ran the following steps three times:

 | # sync && echo 3 > /proc/sys/vm/drop_caches  # from root
 | $ (cd baseline && for i in $(seq 1 10); do taskset 0x2 ./test/unit/perf-base64.test; done >>res.txt 2>&1)
 | # sync && echo 3 > /proc/sys/vm/drop_caches  # from root
 | $ (cd optimized && for i in $(seq 1 10); do taskset 0x2 ./test/unit/perf-base64.test; done >>res.txt 2>&1)
 | # sync && echo 3 > /proc/sys/vm/drop_caches  # from root
 | $ (cd simplified && for i in $(seq 1 10); do taskset 0x2 ./test/unit/perf-base64.test; done >>res.txt 2>&1)

Raw results are at the end of the email.

Scripts to accumulate the results:

 | # It is in ~/.bashrc on my machine.
 |
 | function s() {
 |     awk '
 |         BEGIN {
 |             min = "NaN"
 |             max = "NaN"
 |             cnt = 0
 |             var = "NaN"
 |             sum = "NaN"
 |             avg = "NaN"
 |         }
 |         {
 |             min = (NR==1 || $1<min ? $1 : min)
 |             max = (NR==1 || $1>max ? $1 : max)
 |             cnt += 1
 |             sum += $1
 |         }
 |         END {
 |             avg = sum / cnt
 |             var = max - min
 |             var_rel = 100 * var / avg
 |             print "cnt:", cnt
 |             print "min:", min
 |             print "max:", max
 |             printf "avg: %f\n", avg
 |             print "var:", var, "("var_rel"%)"
 |             print ""
 |         }
 |     ';
 | }

 | (cd baseline; grep ^It res.txt | cut -d' ' -f11 | s >stat.txt)
 | (cd optimized; grep ^It res.txt | cut -d' ' -f11 | s >stat.txt)
 | (cd simplified; grep ^It res.txt | cut -d' ' -f11 | s >stat.txt)
 | (for metric in min max avg; do                                                                                              \
 |     baseline="$(grep "${metric}:" baseline/stat.txt | cut -d' ' -f2)";                                                      \
 |     optimized="$(grep "${metric}:" optimized/stat.txt | cut -d' ' -f2)";                                                    \
 |     simplified="$(grep "${metric}:" simplified/stat.txt | cut -d' ' -f2)";                                                  \
 |     echo "${metric}: baseline: ${baseline}";                                                                                \
 |     echo "${metric}: optimized: ${optimized} (+$(echo -e "scale=2\n(${optimized}-${baseline})*100/${baseline}" | bc)%)";    \
 |     echo "${metric}: simplified: ${simplified} (+$(echo -e "scale=2\n(${simplified}-${baseline})*100/${baseline}" | bc)%)"; \
 | done) | column -t

Results
-------

min:  baseline:    883238644          
min:  optimized:   936618311          (+6.04%)
min:  simplified:  1029621785         (+16.57%)
max:  baseline:    932895993          
max:  optimized:   974532472          (+4.46%)
max:  simplified:  1075078359         (+15.24%)
avg:  baseline:    898423016.633333   
avg:  optimized:   948646528.166667   (+5.59%)
avg:  simplified:  1044391403.466667  (+16.24%)

Interpretation
--------------

On the given hardware and toolchain, the given performance test shows
definitely better results for the 'simplified' case (Sergey's patch v4).

What I am curious about: results are better for the first several
iterations (when they run one right after another): 1-2, 11-12, 21-22. I
saw this effect in different benchmarking results on this machine (not
only for this workload), but I have no good interpretation for it. So I
looked not only at the average, but also at the minimal and maximal
results.

Workload
--------

Made by Sergey N.

diff --git a/test/unit/CMakeLists.txt b/test/unit/CMakeLists.txt
index eaa97e63c..8a4e6e9be 100644
--- a/test/unit/CMakeLists.txt
+++ b/test/unit/CMakeLists.txt
@@ -50,6 +50,8 @@ add_executable(bitset_index.test bitset_index.c)
 target_link_libraries(bitset_index.test bitset)
 add_executable(base64.test base64.c)
 target_link_libraries(base64.test misc unit)
+add_executable(perf-base64.test perf-base64.c)
+target_link_libraries(perf-base64.test misc unit)
 add_executable(uuid.test uuid.c core_test_utils.c)
 target_link_libraries(uuid.test uuid unit)
 add_executable(random.test random.c core_test_utils.c)
diff --git a/test/unit/perf-base64.c b/test/unit/perf-base64.c
new file mode 100644
index 000000000..d8bab4951
--- /dev/null
+++ b/test/unit/perf-base64.c
@@ -0,0 +1,70 @@
+#include <third_party/base64.h>
+#include "unit.h"
+#include "trivia/util.h"
+#include <string.h>
+
+/* Not too large, it should fit into CPU cache(s) */
+#define RAW_BUFFER_BYTES 1048576
+#define TEST_ITERATIONS 5000
+
+static void
+bench_base64_decoder(void)
+{
+	plan(1);
+
+	char *const raw_buffer = (char *)malloc(RAW_BUFFER_BYTES);
+	/* It is more realistic to encode/decode random data than zeroes */
+	for (size_t pos = 0; pos < RAW_BUFFER_BYTES; pos += 2) {
+		*(uint16_t *)((char *)raw_buffer + pos) = rand();
+	}
+
+	const size_t base64_buffer_bytes = 1 +
+		((BASE64_CHARS_PER_LINE + 1) *
+		(((RAW_BUFFER_BYTES << 2) / 3) + (BASE64_CHARS_PER_LINE - 1)) /
+		BASE64_CHARS_PER_LINE);
+	char *const base64_buffer =
+		(char *)malloc(base64_buffer_bytes);
+	const size_t base64_bytes = base64_encode(raw_buffer, RAW_BUFFER_BYTES,
+		       base64_buffer, base64_buffer_bytes, 0);
+
+	/* Warm up caches */
+	for (int i = 0; i < 3; ++i) {
+		base64_decode(base64_buffer, base64_bytes,
+			raw_buffer, RAW_BUFFER_BYTES);
+	}
+
+	/* Actual test */
+	struct timespec start_time, end_time;
+
+	clock_gettime(CLOCK_MONOTONIC, &start_time);
+	for (int i = 0; i < TEST_ITERATIONS; ++i) {
+		base64_decode(base64_buffer, base64_bytes,
+			raw_buffer, RAW_BUFFER_BYTES);
+	}
+	clock_gettime(CLOCK_MONOTONIC, &end_time);
+
+	const uint64_t passed_ns =
+		(end_time.tv_sec - start_time.tv_sec) * 1000ULL * 1000 * 1000 +
+		end_time.tv_nsec - start_time.tv_nsec;
+	const unsigned long long total_decoded_bytes =
+		(unsigned long long)base64_buffer_bytes * TEST_ITERATIONS;
+	fprintf(stderr, "It took %llu ns to decode %llu bytes, speed is "
+			"%llu bps\n",
+			(unsigned long long)passed_ns,
+			total_decoded_bytes,
+			1000 * 1000 * 1000 * total_decoded_bytes / passed_ns);
+	ok(1, "base64 decode benchmark");
+
+	check_plan();
+}
+
+int main(int argc, char *argv[])
+{
+	plan(1);
+	header();
+
+	bench_base64_decoder();
+
+	footer();
+	return check_plan();
+}

Raw results
-----------

$ (cd baseline; grep ^It res.txt | cat -n)
     1	It took 7597803020 ns to decode 7087960000 bytes, speed is 932895993 bps
     2	It took 7748740885 ns to decode 7087960000 bytes, speed is 914724095 bps
     3	It took 7777420988 ns to decode 7087960000 bytes, speed is 911350949 bps
     4	It took 7806469047 ns to decode 7087960000 bytes, speed is 907959790 bps
     5	It took 7849334625 ns to decode 7087960000 bytes, speed is 903001380 bps
     6	It took 7888755755 ns to decode 7087960000 bytes, speed is 898488965 bps
     7	It took 7913204578 ns to decode 7087960000 bytes, speed is 895712973 bps
     8	It took 7934173972 ns to decode 7087960000 bytes, speed is 893345674 bps
     9	It took 7941474974 ns to decode 7087960000 bytes, speed is 892524376 bps
    10	It took 7972236306 ns to decode 7087960000 bytes, speed is 889080519 bps
    11	It took 7682013524 ns to decode 7087960000 bytes, speed is 922669555 bps
    12	It took 7774033320 ns to decode 7087960000 bytes, speed is 911748085 bps
    13	It took 7860418423 ns to decode 7087960000 bytes, speed is 901728078 bps
    14	It took 7886080928 ns to decode 7087960000 bytes, speed is 898793718 bps
    15	It took 7952493586 ns to decode 7087960000 bytes, speed is 891287735 bps
    16	It took 7974309710 ns to decode 7087960000 bytes, speed is 888849349 bps
    17	It took 7979781814 ns to decode 7087960000 bytes, speed is 888239824 bps
    18	It took 8003638574 ns to decode 7087960000 bytes, speed is 885592213 bps
    19	It took 8014734769 ns to decode 7087960000 bytes, speed is 884366133 bps
    20	It took 8010969247 ns to decode 7087960000 bytes, speed is 884781826 bps
    21	It took 7728000646 ns to decode 7087960000 bytes, speed is 917179012 bps
    22	It took 7758351072 ns to decode 7087960000 bytes, speed is 913591036 bps
    23	It took 7829038607 ns to decode 7087960000 bytes, speed is 905342323 bps
    24	It took 7897805166 ns to decode 7087960000 bytes, speed is 897459465 bps
    25	It took 7915229547 ns to decode 7087960000 bytes, speed is 895483821 bps
    26	It took 7978044397 ns to decode 7087960000 bytes, speed is 888433260 bps
    27	It took 7991806664 ns to decode 7087960000 bytes, speed is 886903337 bps
    28	It took 8024965897 ns to decode 7087960000 bytes, speed is 883238644 bps
    29	It took 8020449161 ns to decode 7087960000 bytes, speed is 883736042 bps
    30	It took 8016400879 ns to decode 7087960000 bytes, speed is 884182329 bps

$ (cd optimized; grep ^It res.txt | cat -n)
     1	It took 7273190169 ns to decode 7087960000 bytes, speed is 974532472 bps
     2	It took 7335598812 ns to decode 7087960000 bytes, speed is 966241500 bps
     3	It took 7418727074 ns to decode 7087960000 bytes, speed is 955414578 bps
     4	It took 7477115808 ns to decode 7087960000 bytes, speed is 947953754 bps
     5	It took 7484032590 ns to decode 7087960000 bytes, speed is 947077650 bps
     6	It took 7498435941 ns to decode 7087960000 bytes, speed is 945258458 bps
     7	It took 7491035013 ns to decode 7087960000 bytes, speed is 946192346 bps
     8	It took 7506550074 ns to decode 7087960000 bytes, speed is 944236690 bps
     9	It took 7509484245 ns to decode 7087960000 bytes, speed is 943867750 bps
    10	It took 7520255639 ns to decode 7087960000 bytes, speed is 942515831 bps
    11	It took 7369634873 ns to decode 7087960000 bytes, speed is 961778992 bps
    12	It took 7420389508 ns to decode 7087960000 bytes, speed is 955200531 bps
    13	It took 7479596162 ns to decode 7087960000 bytes, speed is 947639397 bps
    14	It took 7479660125 ns to decode 7087960000 bytes, speed is 947631293 bps
    15	It took 7495067598 ns to decode 7087960000 bytes, speed is 945683265 bps
    16	It took 7514497457 ns to decode 7087960000 bytes, speed is 943238059 bps
    17	It took 7502681984 ns to decode 7087960000 bytes, speed is 944723502 bps
    18	It took 7544322070 ns to decode 7087960000 bytes, speed is 939509200 bps
    19	It took 7545605619 ns to decode 7087960000 bytes, speed is 939349385 bps
    20	It took 7567607757 ns to decode 7087960000 bytes, speed is 936618311 bps
    21	It took 7375056849 ns to decode 7087960000 bytes, speed is 961071913 bps
    22	It took 7430044250 ns to decode 7087960000 bytes, speed is 953959325 bps
    23	It took 7474667860 ns to decode 7087960000 bytes, speed is 948264208 bps
    24	It took 7465094193 ns to decode 7087960000 bytes, speed is 949480316 bps
    25	It took 7479047135 ns to decode 7087960000 bytes, speed is 947708962 bps
    26	It took 7486999663 ns to decode 7087960000 bytes, speed is 946702326 bps
    27	It took 7488959403 ns to decode 7087960000 bytes, speed is 946454589 bps
    28	It took 7502416145 ns to decode 7087960000 bytes, speed is 944756977 bps
    29	It took 7514547630 ns to decode 7087960000 bytes, speed is 943231761 bps
    30	It took 7515577542 ns to decode 7087960000 bytes, speed is 943102504 bps

$ (cd simplified; grep ^It res.txt | cat -n)
     1	It took 6592970584 ns to decode 7087960000 bytes, speed is 1075078359 bps
     2	It took 6707664613 ns to decode 7087960000 bytes, speed is 1056695647 bps
     3	It took 6764450628 ns to decode 7087960000 bytes, speed is 1047824929 bps
     4	It took 6791082089 ns to decode 7087960000 bytes, speed is 1043715847 bps
     5	It took 6784496437 ns to decode 7087960000 bytes, speed is 1044728973 bps
     6	It took 6784578988 ns to decode 7087960000 bytes, speed is 1044716262 bps
     7	It took 6801729770 ns to decode 7087960000 bytes, speed is 1042081976 bps
     8	It took 6804953495 ns to decode 7087960000 bytes, speed is 1041588308 bps
     9	It took 6848572538 ns to decode 7087960000 bytes, speed is 1034954358 bps
    10	It took 6841976407 ns to decode 7087960000 bytes, speed is 1035952125 bps
    11	It took 6670393290 ns to decode 7087960000 bytes, speed is 1062600013 bps
    12	It took 6734840978 ns to decode 7087960000 bytes, speed is 1052431679 bps
    13	It took 6764724852 ns to decode 7087960000 bytes, speed is 1047782453 bps
    14	It took 6783498439 ns to decode 7087960000 bytes, speed is 1044882675 bps
    15	It took 6775755346 ns to decode 7087960000 bytes, speed is 1046076730 bps
    16	It took 6783469382 ns to decode 7087960000 bytes, speed is 1044887151 bps
    17	It took 6809182585 ns to decode 7087960000 bytes, speed is 1040941392 bps
    18	It took 6834388578 ns to decode 7087960000 bytes, speed is 1037102283 bps
    19	It took 6845617417 ns to decode 7087960000 bytes, speed is 1035401128 bps
    20	It took 6843915566 ns to decode 7087960000 bytes, speed is 1035658598 bps
    21	It took 6707754592 ns to decode 7087960000 bytes, speed is 1056681472 bps
    22	It took 6754254986 ns to decode 7087960000 bytes, speed is 1049406635 bps
    23	It took 6787549608 ns to decode 7087960000 bytes, speed is 1044259034 bps
    24	It took 6779319691 ns to decode 7087960000 bytes, speed is 1045526737 bps
    25	It took 6769543930 ns to decode 7087960000 bytes, speed is 1047036561 bps
    26	It took 6808031086 ns to decode 7087960000 bytes, speed is 1041117455 bps
    27	It took 6849752166 ns to decode 7087960000 bytes, speed is 1034776124 bps
    28	It took 6859630114 ns to decode 7087960000 bytes, speed is 1033286034 bps
    29	It took 6884042373 ns to decode 7087960000 bytes, speed is 1029621785 bps
    30	It took 6848737822 ns to decode 7087960000 bytes, speed is 1034929381 bps


* Re: [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads)
  2021-01-11  9:45 ` [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads) Sergey Nikiforov via Tarantool-patches
@ 2021-01-21  2:16   ` Alexander Turenko via Tarantool-patches
  2021-01-26 16:37     ` Sergey Nikiforov via Tarantool-patches
  0 siblings, 1 reply; 14+ messages in thread
From: Alexander Turenko via Tarantool-patches @ 2021-01-21  2:16 UTC (permalink / raw)
  To: Sergey Nikiforov; +Cc: tarantool-patches, Vladislav Shpilevoy

On Mon, Jan 11, 2021 at 12:45:00PM +0300, Sergey Nikiforov wrote:
> Was caught by the base64 test with ASAN enabled.

It seems we have a problem in CI, otherwise it would have been
detected. At least, I don't see an explicit suppression relevant to the
base64 code or the test being disabled under this CI job.

Can you please investigate how to enable it in CI, so we'll catch a
similar problem next time if it appears?

> 
> It also caused data corruption: garbage instead of "extra bits" was
> saved into state->result if there was no space in the output buffer.

We have dead code and it appears to be broken. Why not remove it?
(AFAIS, the rest of the code does not read off the buffer.)

Is it due to the small probability that we'll decide to support chunked
decoding and that we'll decide to implement it in exactly this way (and
not just leave undecoded bytes in, say, an ibuf)?

The other side of this small probability is:

* The code complexity, and so wasted time for anyone who needs to
  dive into it.
* Untested code in the code base that may give us more surprises.
* Extra def-use dependencies that may hide optimization opportunities
  and increase register pressure.

This is not a question regarding the patch, but this code looks broken
in several other ways. At least:

* It skips unrecognized symbols and does not report a decoding error.
* If the output buffer is too short, it reports neither an error nor
  the required buffer length (like snprintf() does). There is no way to
  distinguish successful and 'interrupted' processing (a possible
  strict interface is sketched below).
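
For illustration, a strict interface could look something like this (a
hypothetical sketch, not an API that exists in the tree):

 | #include <sys/types.h>
 |
 | /*
 |  * Decode the whole input or fail: return the number of decoded
 |  * bytes, or -1 when an input byte is neither base64 nor ignorable,
 |  * or when out_len is too short for the whole decoded output.
 |  */
 | ssize_t
 | base64_decode_strict(const char *in_base64, size_t in_len,
 |                      char *out_bin, size_t out_len);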

> 
> Added a test for the "zero-sized output buffer" case.

Nice catch.

> 
> Fixes: #3069
> ---
> 
> Branch: https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v6
> Issue: https://github.com/tarantool/tarantool/issues/3069


* Re: [Tarantool-patches] [PATCH v6 2/2] base64: improve decoder performance
  2021-01-11  9:45 ` [Tarantool-patches] [PATCH v6 2/2] base64: improve decoder performance Sergey Nikiforov via Tarantool-patches
@ 2021-01-21 15:31   ` Alexander Turenko via Tarantool-patches
  2021-01-26 16:37     ` Sergey Nikiforov via Tarantool-patches
  0 siblings, 1 reply; 14+ messages in thread
From: Alexander Turenko via Tarantool-patches @ 2021-01-21 15:31 UTC (permalink / raw)
  To: Sergey Nikiforov; +Cc: tarantool-patches, Vladislav Shpilevoy

On Mon, Jan 11, 2021 at 12:45:01PM +0300, Sergey Nikiforov wrote:
> Unnecessary checks were removed from the internal loops.
> A benchmark shows that performance is now ~1.19 times higher
> (release build, Intel Core i7-9700K, only one thread).
> ---
> 
> Branch: https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v6
> 
>  third_party/base64.c | 35 ++++++++++++++++++++++++++---------
>  1 file changed, 26 insertions(+), 9 deletions(-)
> 
> diff --git a/third_party/base64.c b/third_party/base64.c
> index 3ecd0843a..74260971b 100644
> --- a/third_party/base64.c
> +++ b/third_party/base64.c
> @@ -258,10 +258,10 @@ base64_decode_block(const char *in_base64, int in_len,
>  		{
>  	case step_a:
>  			do {
> -				if (in_pos == in_end || out_pos >= out_end)
> +				if (in_pos >= in_end)

Okay, it is a loop invariant and so we can move it outside the loop.
But why do we sink it down rather than hoist it up? We know that there
is no space in the output buffer -- so there is no sense in reading the
input, right? Or would it break the 'correctness' of the hypothetical
use of the `state`?

So we should imagine how the state would be used if that were possible.
From which input position would the processing be resumed if the code
provided such an ability? How is a caller supposed to distinguish the
end of the output buffer from successful processing of the whole
input?
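
For reference, the 'hoisted' variant being asked about would look
roughly like this (an untested sketch of the step_b block only):

 | 	case step_b:
 | 		/* Hoisted: bail out before consuming any input. */
 | 		if (out_pos >= out_end) {
 | 			state->step = step_b;
 | 			state->result = curr_byte;
 | 			return out_pos - out_bin;
 | 		}
 | 		do {
 | 			if (in_pos >= in_end) {
 | 				state->step = step_b;
 | 				state->result = curr_byte;
 | 				return out_pos - out_bin;
 | 			}
 | 			fragment = base64_decode_value(*in_pos++);
 | 		} while (fragment < 0);

The difference from the sunk variant in the patch is that the sunk one
still consumes (and skips) ignorable input bytes even when the output
is already full.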

Aside from this, skipping unrecognized input bytes looks like another
incorrectness. We should be ready to work with untrusted input, perform
appropriate validation and report errors.

We can skip newlines, but I guess there is no sense in doing this in a
loop: only one newline in a row is allowed (but it is worth verifying
the guess against the standard).

It is hard to discuss an optimization until we define how the result
should look in corner cases like 'short output buffer', 'unexpected
end of an input buffer' and so on. The current code does not report
errors: it silently skips invalid input, accepts trimmed input, and
silently trims output at the buffer boundary. I don't understand how it
is supposed to work with untrusted input.

I propose to stop attempting 'quick fixups' and either work on the code
or don't.

If we defined the behaviour first, I would spend MUCH less time trying
to deduce it.

It is really hard to discuss the correctness of dead code. To be
honest, it looks easier to me to finish the chained processing
implementation, test it and remove it, rather than trying to interpret
the code (both existing and imaginary) in my mind.

BTW, it seems we are able to entirely eliminate the output buffer
length checks for the first (out_len * 3 / 4) input bytes and then
process the tail with full checks. (After we're done with a strict and
correct API.)
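
The shape of that optimization might be as follows (a hypothetical
sketch with made-up helper names; it also glosses over the mid-group
state hand-off, since skipped input bytes can leave the boundary in the
middle of a 4-byte group):

 | /*
 |  * An input prefix of at most (out_len / 3) * 4 bytes decodes to at
 |  * most out_len bytes, so it needs no per-byte output checks; the
 |  * tail keeps the full checks.
 |  */
 | size_t bulk_in = (out_len / 3) * 4;
 | if (bulk_in > in_len)
 | 	bulk_in = in_len;
 | size_t decoded = decode_bulk_unchecked(in, bulk_in, out);
 | decoded += decode_tail_checked(in + bulk_in, in_len - bulk_in,
 | 			       out + decoded, out_len - decoded);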


* Re: [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads)
  2021-01-21  2:16   ` Alexander Turenko via Tarantool-patches
@ 2021-01-26 16:37     ` Sergey Nikiforov via Tarantool-patches
  2021-01-26 16:48       ` Sergey Nikiforov via Tarantool-patches
  2021-02-20 12:49       ` Alexander Turenko via Tarantool-patches
  0 siblings, 2 replies; 14+ messages in thread
From: Sergey Nikiforov via Tarantool-patches @ 2021-01-26 16:37 UTC (permalink / raw)
  To: Alexander Turenko; +Cc: tarantool-patches, Vladislav Shpilevoy

On 21.01.2021 5:16, Alexander Turenko wrote:
> On Mon, Jan 11, 2021 at 12:45:00PM +0300, Sergey Nikiforov wrote:
>> Was caught by the base64 test with ASAN enabled.
> 
> It seems we have a problem in CI, otherwise it would have been
> detected. At least, I don't see an explicit suppression relevant to
> the base64 code or the test being disabled under this CI job.
> 
> Can you please investigate how to enable it in CI, so we'll catch a
> similar problem next time if it appears?

ASAN is not used in CI now, which is clearly wrong.
Right now a lot of tests fail if LeakSanitizer is enabled (the default
for ASAN), but only 1 test (unit/guard.test) fails if LeakSanitizer is
disabled. So it is quite straightforward:

CC=clang CXX=clang++ cmake . -DENABLE_ASAN=ON && make -j
ASAN_OPTIONS=detect_leaks=0 test/test-run.py

(test-run.py is launched from several Makefiles)

I propose creating tasks to make unit/guard.test "ASAN-tolerant" (ASAN
prints a warning which causes a .result mismatch) and to add ASAN
targets to CI. Should it be GitLab or GitHub Actions?

We should probably also look at LeakSanitizer issues; some of them are
probably real bugs and not just test sloppiness.

>>
>> It also caused data corruption: garbage instead of "extra bits" was
>> saved into state->result if there was no space in the output buffer.
> 
> We have dead code and it appears to be broken. Why not remove it?
> (AFAIS, the rest of the code does not read off the buffer.)
> 
> Is it due to the small probability that we'll decide to support chunked
> decoding and that we'll decide to implement it in exactly this way (and
> not just leave undecoded bytes in, say, an ibuf)?
> 
> The other side of this small probability is:
> 
> * The code complexity, and so wasted time for anyone who needs to
>    dive into it.
> * Untested code in the code base that may give us more surprises.
> * Extra def-use dependencies that may hide optimization opportunities
>    and increase register pressure.

And yet you are yourself proposing to improve performance:
"entirely eliminate the output buffer length checks for the first
(out_len * 3 / 4) input bytes" (quoted from your e-mail about 2/2). This
means saving state and reporting the input buffer stop position. So: do
we want complexity (and performance) or simplicity?

> This is not a question regarding the patch, but this code looks broken
> in several other ways. At least:
> 
> * It skips unrecognized symbols and does not report a decoding error.

Some of these symbols should "legally" be skipped, like newlines in
the e-mail use case. And there are probably other use cases which use
some other symbols. We could break something and gain nothing.
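
For instance, with the decoder as it is today, both of the following
calls should produce the same four bytes, since the '\n' is silently
skipped (a sketch reusing the test vector from this thread; assumes
<assert.h>, <string.h> and third_party/base64.h):

char a[16], b[16];
int rc_a = base64_decode("sIIpHw==", 8, a, sizeof(a));
int rc_b = base64_decode("sIIp\nHw==", 9, b, sizeof(b));
assert(rc_a == 4 && rc_b == 4 && memcmp(a, b, 4) == 0);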

Nothing prevents somebody from corrupting "legal" base64 characters;
this is not detected, nor should it be. These issues are outside the
base64 decoder's scope: CRCs or digital signatures should be used when
data can be accidentally or intentionally corrupted.

> * If the output buffer is too short, it reports neither an error nor
>    the required buffer length (like snprintf() does). There is no way
>    to distinguish successful and 'interrupted' processing.

Here I agree.
There were no output buffer length checks in the code imported into
Tarantool. That is clearly a stopgap to prevent a buffer overrun.

Summary: I propose committing this particular patch (1/2) "as is" (it
was posted unmodified several times already) and discussing the
performance patch (2/2) a little further.

>>
>> Added a test for the "zero-sized output buffer" case.
> 
> Nice catch.
> 
>>
>> Fixes: #3069
>> ---
>>
>> Branch: https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v6
>> Issue: https://github.com/tarantool/tarantool/issues/3069



* Re: [Tarantool-patches] [PATCH v6 2/2] base64: improve decoder performance
  2021-01-21 15:31   ` Alexander Turenko via Tarantool-patches
@ 2021-01-26 16:37     ` Sergey Nikiforov via Tarantool-patches
  2021-02-20 12:51       ` Alexander Turenko via Tarantool-patches
  0 siblings, 1 reply; 14+ messages in thread
From: Sergey Nikiforov via Tarantool-patches @ 2021-01-26 16:37 UTC (permalink / raw)
  To: Alexander Turenko; +Cc: tarantool-patches, Vladislav Shpilevoy

On 21.01.2021 18:31, Alexander Turenko wrote:
> On Mon, Jan 11, 2021 at 12:45:01PM +0300, Sergey Nikiforov wrote:
>> Unnecessary checks were removed from the internal loops.
>> A benchmark shows that performance is now ~1.19 times higher
>> (release build, Intel Core i7-9700K, only one thread).
>> ---
>>
>> Branch: https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v6
>>
>>   third_party/base64.c | 35 ++++++++++++++++++++++++++---------
>>   1 file changed, 26 insertions(+), 9 deletions(-)
>>
>> diff --git a/third_party/base64.c b/third_party/base64.c
>> index 3ecd0843a..74260971b 100644
>> --- a/third_party/base64.c
>> +++ b/third_party/base64.c
>> @@ -258,10 +258,10 @@ base64_decode_block(const char *in_base64, int in_len,
>>   		{
>>   	case step_a:
>>   			do {
>> -				if (in_pos == in_end || out_pos >= out_end)
>> +				if (in_pos >= in_end)
> 
> Okay, it is a loop invariant and so we can move it outside the loop.
> But why do we sink it down rather than hoist it up? We know that there
> is no space in the output buffer -- so there is no sense in reading
> the input, right? Or would it break the 'correctness' of the
> hypothetical use of the `state`?

Yes. It also prevents us from skipping newlines and other ignored 
characters at the end of input.

> So we should imagine how the state would be used if that were
> possible. From which input position would the processing be resumed
> if the code provided such an ability? How is a caller supposed to
> distinguish the end of the output buffer from successful processing
> of the whole input?
> 
> Aside from this, skipping unrecognized input bytes looks like another
> incorrectness. We should be ready to work with untrusted input,
> perform appropriate validation and report errors.

There is no point, because "legal" characters can also be corrupted.
CRCs / digital signatures are outside of the base64 decoder's scope.

> We can skip newlines, but I guess there is no sense in doing this in
> a loop: only one newline in a row is allowed (but it is worth
> verifying the guess against the standard).

This would make the code complex and slow (we would have to track the
line position, and a line can be split between chunks). It would also
break compatibility. A lot of real-life code breaks standards.

> It is hard to discuss an optimization until we define how the result
> should look in corner cases like 'short output buffer', 'unexpected
> end of an input buffer' and so on. The current code does not report
> errors: it silently skips invalid input,

As it should, in my opinion.

> accepts trimmed input,

There is no way to determine if the input is trimmed or not. Our caller
can determine whether enough bytes were decoded.

> and silently trims output at the buffer boundary.

Which in our case (no chunks) means we have received more input than
necessary. Just look at the base64_decode() call sites: even the
returned value is not always checked. Should we fix this right now?

> I don't understand how it is supposed to
> work with untrusted input.
> 
> I propose to stop attempting 'quick fixups' and either work on the
> code or don't.

I haven't attempted to fix anything in this particular patch; it is
about performance only. We can drop it entirely.

> If we defined the behaviour first, I would spend MUCH less time
> trying to deduce it.

This patch is about "keeping existing behavior" to avoid breaking 
something (right now extra "legal" characters are silently ignored).

> It is really hard to discuss the correctness of dead code. To be
> honest, it looks easier to me to finish the chained processing
> implementation, test it and remove it, rather than trying to
> interpret the code (both existing and imaginary) in my mind.
> 
> BTW, it seems we are able to entirely eliminate the output buffer
> length checks for the first (out_len * 3 / 4) input bytes and then
> process the tail with full checks. (After we're done with a strict
> and correct API.)

This means processing the input in two chunks, so we need state and
"final" input position reporting (and a lot of tests for corner cases).
Extra logic would hurt performance for small buffers (our typical use
case); even the simple check in base64_decode() from patch v4 does, I
have tested this. The benchmark code with a "barely enough output
buffer" and a "definitely large enough output buffer" (it accounted for
newlines added by the encoder) got lost somewhere, but I still have the
numbers from patch v4:

Without the length check at the beginning of base64_decode() (the
output buffer size is checked on all iterations):
It took 6170225078 ns to decode 7087960000 bytes, speed is 1148736052 bps
It took 6141109481 ns to decode 7087960000 bytes, speed is 1154182321 bps
It took 6142448824 ns to decode 7087960000 bytes, speed is 1153930655 bps
It took 6165578853 ns to decode 7087960000 bytes, speed is 1149601711 bps
It took 6126750494 ns to decode 7087960000 bytes, speed is 1156887326 bps
It took 6146041244 ns to decode 7087960000 bytes, speed is 1153256172 bps

With the length check at the beginning of base64_decode() (the output
buffer size is checked on all iterations because its size is
"suspicious", i.e. may not be large enough):
It took 6310650551 ns to decode 7087960000 bytes, speed is 1123174218 bps
It took 6243847147 ns to decode 7087960000 bytes, speed is 1135191146 bps
It took 6263236313 ns to decode 7087960000 bytes, speed is 1131676923 bps
It took 6260555388 ns to decode 7087960000 bytes, speed is 1132161535 bps

With the length check at the beginning of base64_decode() (the output
buffer size is definitely large enough, so it is not checked within the
loops):
It took 6037055181 ns to decode 7087960000 bytes, speed is 1174075735 bps
It took 5993200086 ns to decode 7087960000 bytes, speed is 1182667005 bps
It took 5976677666 ns to decode 7087960000 bytes, speed is 1185936467 bps
It took 5990214242 ns to decode 7087960000 bytes, speed is 1183256510 bps

Or should we do as patch v4 does and eliminate chunk support
completely? Then at least in_len should be made size_t instead of int
(the easiest way to avoid integer overflow), and tests for corner cases
(in_len/out_len close to 4/3) should probably be added. The loop
invariant can be moved up in that case.
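
In that case the signature could become something like this (a
hypothetical sketch, not what is merged):

/* Chunk-free variant: returns the number of decoded bytes. */
size_t
base64_decode(const char *in_base64, size_t in_len,
	      char *out_bin, size_t out_len);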

Vladislav Shpilevoy is clearly voting for the patch v4 approach (quote:
"As I said - please, move on to another 2 reviewers, if you don't want
to finish the version with state removed.").

Leonid Vasiliev is happy even with v5.

Please make up your mind.


* Re: [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads)
  2021-01-26 16:37     ` Sergey Nikiforov via Tarantool-patches
@ 2021-01-26 16:48       ` Sergey Nikiforov via Tarantool-patches
  2021-02-20 11:30         ` Alexander Turenko via Tarantool-patches
  2021-02-20 12:49       ` Alexander Turenko via Tarantool-patches
  1 sibling, 1 reply; 14+ messages in thread
From: Sergey Nikiforov via Tarantool-patches @ 2021-01-26 16:48 UTC (permalink / raw)
  To: Alexander Turenko; +Cc: tarantool-patches, Vladislav Shpilevoy

On 26.01.2021 19:37, Sergey Nikiforov wrote:
> On 21.01.2021 5:16, Alexander Turenko wrote:
>> On Mon, Jan 11, 2021 at 12:45:00PM +0300, Sergey Nikiforov wrote:
>>> Was caught by the base64 test with ASAN enabled.
>>
>> It seems we have a problem in CI, otherwise it would have been
>> detected. At least, I don't see an explicit suppression relevant to
>> the base64 code or the test being disabled under this CI job.
>>
>> Can you please investigate how to enable it in CI, so we'll catch a
>> similar problem next time if it appears?
> 
> ASAN is not used in CI now, which is clearly wrong.
> Right now a lot of tests fail if LeakSanitizer is enabled (the default
> for ASAN), but only 1 test (unit/guard.test) fails if LeakSanitizer is
> disabled. So it is quite straightforward:
> 
> CC=clang CXX=clang++ cmake . -DENABLE_ASAN=ON && make -j
> ASAN_OPTIONS=detect_leaks=0 test/test-run.py
> 
> (test-run.py is launched from several Makefiles)
> 
> I propose creating tasks to make unit/guard.test "ASAN-tolerant" (ASAN
> prints a warning which causes a .result mismatch) and to add ASAN
> targets to CI. Should it be GitLab or GitHub Actions?
> 
> We should probably also look at LeakSanitizer issues; some of them
> are probably real bugs and not just test sloppiness.

ASAN with LeakSanitizer enabled (the default):
Statistics:
* disabled: 127
* fail: 755
* pass: 414

Fixing this is a BIG task.


* Re: [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads)
  2021-01-26 16:48       ` Sergey Nikiforov via Tarantool-patches
@ 2021-02-20 11:30         ` Alexander Turenko via Tarantool-patches
  0 siblings, 0 replies; 14+ messages in thread
From: Alexander Turenko via Tarantool-patches @ 2021-02-20 11:30 UTC (permalink / raw)
  To: Sergey Nikiforov; +Cc: tarantool-patches, Vladislav Shpilevoy

On Tue, Jan 26, 2021 at 07:48:48PM +0300, Sergey Nikiforov wrote:
> On 26.01.2021 19:37, Sergey Nikiforov wrote:
> > On 21.01.2021 5:16, Alexander Turenko wrote:
> > > On Mon, Jan 11, 2021 at 12:45:00PM +0300, Sergey Nikiforov wrote:
> > > > Was caught by the base64 test with ASAN enabled.
> > > 
> > > It seems we have a problem in CI, otherwise it would have been
> > > detected. At least, I don't see an explicit suppression relevant
> > > to the base64 code or the test being disabled under this CI job.
> > > 
> > > Can you please investigate how to enable it in CI, so we'll catch
> > > a similar problem next time if it appears?
> > 
> > ASAN is not used in CI now, which is clearly wrong.

There is something called 'ASAN' in CI, see [1]. And it seems to be
problematic. That's why I asked.

[1]: https://github.com/tarantool/tarantool/blob/6610bce9bf43a139043518cd76d3c0c81c981ce2/.github/workflows/release_asan_clang11.yml#L36

> > Right now a lot of tests fail if LeakSanitizer is enabled (the
> > default for ASAN), but only 1 test (unit/guard.test) fails if
> > LeakSanitizer is disabled. So it is quite straightforward:
> > 
> > CC=clang CXX=clang++ cmake . -DENABLE_ASAN=ON && make -j
> > ASAN_OPTIONS=detect_leaks=0 test/test-run.py
> > 
> > (test-run.py is launched from several Makefiles)
> > 
> > I propose creating tasks to make unit/guard.test "ASAN-tolerant"
> > (ASAN prints a warning which causes a .result mismatch) and to add
> > ASAN targets to CI. Should it be GitLab or GitHub Actions?

This is a quite specific 'hacky' test and it is okay to exclude it when
we run with ASAN.

> > 
> > We should probably also look at LeakSanitizer issues; some of them
> > are probably real bugs and not just test sloppiness.
> 
> ASAN with LeakSanitizer enabled (the default):
> Statistics:
> * disabled: 127
> * fail: 755
> * pass: 414
> 
> Fixing this is a BIG task.

We can enable ASAN test by test in regular CI runs.

I've added a note to [2].

[2]: https://github.com/tarantool/tarantool-qa/issues/18


* Re: [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads)
  2021-01-26 16:37     ` Sergey Nikiforov via Tarantool-patches
  2021-01-26 16:48       ` Sergey Nikiforov via Tarantool-patches
@ 2021-02-20 12:49       ` Alexander Turenko via Tarantool-patches
  2021-02-25  8:25         ` Sergey Nikiforov via Tarantool-patches
  1 sibling, 1 reply; 14+ messages in thread
From: Alexander Turenko via Tarantool-patches @ 2021-02-20 12:49 UTC (permalink / raw)
  To: Sergey Nikiforov; +Cc: tarantool-patches, Vladislav Shpilevoy

> > > It also caused data corruption: garbage instead of "extra bits" was
> > > saved into state->result if there was no space in the output buffer.
> > 
> > We have dead code and it appears to be broken. Why not remove it?
> > (AFAIS, the rest of the code does not read off the buffer.)
> > 
> > Is it due to the small probability that we'll decide to support chunked
> > decoding and that we'll decide to implement it in exactly this way (and
> > not just leave undecoded bytes in, say, an ibuf)?
> > 
> > The other side of this small probability is:
> > 
> > * The code complexity, and so wasted time for anyone who needs to
> >    dive into it.
> > * Untested code in the code base that may give us more surprises.
> > * Extra def-use dependencies that may hide optimization opportunities
> >    and increase register pressure.
> 
> And yet you are yourself proposing to improve performance:
> "entirely eliminate the output buffer length checks for the first
> (out_len * 3 / 4) input bytes" (quoted from your e-mail about 2/2).
> This means saving state and reporting the input buffer stop position.
> So: do we want complexity (and performance) or simplicity?

Nope, it does not. Just a kind of loop unrolling optimization:

 | for (int i = 0; i < count; ++i) {
 |         <..processing..>
 | }

->

 | for (int i = 0; i + 4 <= count; i += 4) {
 |         <..processing by 4 bytes..>
 | }
 |
 | /* Remainder. */
 | for (int i = count - count % 4; i < count; ++i) {
 |         <..processing..>
 | }

(I didn't verify the arithmetic, just for the idea.)

(Sure, it is meaningful only for large inputs.)

The remainder always starts from 'step_a'.

But here we are not discussing optimizations. You completely ignored
my main point. I'll repeat:

 | We have dead code and it appears to be broken. Why not remove it?
 | (AFAIS, the rest of the code does not read off the buffer.)

(I verified it locally with ASAN before I wrote to you the last time.)

Please confirm that it is true or show that it is wrong.

It is not a dialogue if you respond only to those parts that look
interesting to you.

> > This is not a question regarding the patch, but this code looks
> > broken in several other ways. At least:
> > 
> > * It skips unrecognized symbols and does not report a decoding error.
> 
> Some of these symbols should "legally" be skipped, like newlines in
> the e-mail use case. And there are probably other use cases which use
> some other symbols. We could break something and gain nothing.

Sure, backward compatibility is the question.

However, 'some symbols' and 'all unrecognized symbols' are not the same
thing. I may want to skip newlines, but catch ill-formed incoming data.
For example, it may be part of the functional requirements for an API
of my service that receives or reads base64 encoded data. Say, when an
external service sends me text in ASCII instead of base64, I want to
report an error and decline the request rather than accept a request
that is known to be ill-formed.

> Nothing prevents somebody from corrupting "legal" base64 characters;
> this is not detected, nor should it be. These issues are outside the
> base64 decoder's scope: CRCs or digital signatures should be used when
> data can be accidentally or intentionally corrupted.

There are cases when a user wants to follow the robustness principle
and cases when (s)he wants the opposite. There is no silver bullet
here; different usage scenarios are different.

> Summary: I propose committing this particular patch (1/2) "as is" (it
> was posted unmodified several times already) and discussing the
> performance patch (2/2) a little further.

First I need a response to the question above regarding the unused code.


* Re: [Tarantool-patches] [PATCH v6 2/2] base64: improve decoder performance
  2021-01-26 16:37     ` Sergey Nikiforov via Tarantool-patches
@ 2021-02-20 12:51       ` Alexander Turenko via Tarantool-patches
  0 siblings, 0 replies; 14+ messages in thread
From: Alexander Turenko via Tarantool-patches @ 2021-02-20 12:51 UTC (permalink / raw)
  To: Sergey Nikiforov; +Cc: tarantool-patches, Vladislav Shpilevoy

So we have four topics:

1. Why not remove the unused (state storing) code?
2. Internal API changes that we definitely should make: error reporting
   for a too short output buffer.
3. Internal API changes that are subject to discussion: input
   validation.
4. Optimizations.

I very much want to define the internal API precisely first (have
agreement on 1, 2, 3 at least, but better have it implemented) and only
then discuss optimizations. Otherwise any proposed optimization may
become impossible (or just unneeded).

Sorry, but I'll postpone the discussion regarding optimizations because
of this.

WBR, Alexander Turenko.


* Re: [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads)
  2021-02-20 12:49       ` Alexander Turenko via Tarantool-patches
@ 2021-02-25  8:25         ` Sergey Nikiforov via Tarantool-patches
  2021-02-27 18:47           ` Alexander Turenko via Tarantool-patches
  0 siblings, 1 reply; 14+ messages in thread
From: Sergey Nikiforov via Tarantool-patches @ 2021-02-25  8:25 UTC (permalink / raw)
  To: Alexander Turenko; +Cc: tarantool-patches, Vladislav Shpilevoy

On 20.02.2021 15:49, Alexander Turenko wrote:
>>>> It also caused data corruption: garbage instead of "extra bits" was
>>>> saved into state->result if there was no space in the output buffer.
>>>
>>> We have dead code and it appears to be broken. Why not remove it?
>>> (AFAIS, the rest of the code does not read off the buffer.)
>>>
>>> Is it due to the small probability that we'll decide to support chunked
>>> decoding and that we'll decide to implement it in exactly this way (and
>>> not just leave undecoded bytes in, say, an ibuf)?
>>>
>>> The other side of this small probability is:
>>>
>>> * The code complexity, and so wasted time for anyone who needs to
>>>     dive into it.
>>> * Untested code in the code base that may give us more surprises.
>>> * Extra def-use dependencies that may hide optimization opportunities
>>>     and increase register pressure.
>>
>> And yet you are yourself proposing to improve performance:
>> "entirely eliminate the output buffer length checks for the first
>> (out_len * 3 / 4) input bytes" (quoted from your e-mail about 2/2).
>> This means saving state and reporting the input buffer stop position.
>> So: do we want complexity (and performance) or simplicity?
> 
> Nope, it does not. Just a kind of loop unrolling optimization:
> 
>   | for (int i = 0; i < count; ++i) {
>   |         <..processing..>
>   | }
> 
> ->
> 
>   | for (int i = 0; i + 4 <= count; i += 4) {
>   |         <..processing by 4 bytes..>
>   | }
>   |
>   | /* Remainder. */
>   | for (int i = count - count % 4; i < count; ++i) {
>   |         <..processing..>
>   | }
> 
> (I didn't verify the arithmetic, just for the idea.)
> 
> (Sure, it is meaningful only for large inputs.)
> 
> The remainder always starts from 'step_a'.

Alas, it is not applicable: there could be skipped characters in the
input, so we get an unknown intermediate state. The alternative (an
unknown processed-input stop position) does not look good, because
without extra checks within the loop we would overrun the input.

> But here we don't discuss optimizations. You completely ignored my
> main point. I'll repeat:
> 
>   | We have the dead code and it appears to be broken. Why don't remove it?
>   | (AFAIS, the rest of the code does not read off the buffer.)
> 
> (I verified it locally with ASAN before I wrote to you the last time.)
> 
> Please, confirm that it is true or show that it is wrong.

Confirm.

> It is not a dialogue if you respond only to those parts that look
> interesting to you.

There are two parallel e-mail threads; I am trying to avoid duplicating
my responses.
There was no request to verify your findings in the earlier e-mails.

>>> This is not a question regarding the patch, but this code looks broken
>>> in several other ways. At least:
>>>
>>> * It skips unrecognized symbols and does not report a decoding error.
>>
>> Some of these symbols should "legally" be skipped - like newlines in the
>> e-mail use case. And there are probably other use cases which rely on
>> skipping some other symbols. We could break something and gain nothing.
> 
> Sure, backward compatibility is a concern.
> 
> However 'some symbols' and 'all unrecognized symbols' are not the same
> thing. I may want to skip newlines, but catch ill-formed incoming data.
> For example, it may be part of the functional requirements for an API of
> my service that receives or reads base64-encoded data. Say, when an
> external service sends me text in plain ASCII instead of base64, I want
> to report an error and decline the request rather than accept a request
> that is known to be ill-formed.
> 
>> Nothing prevents somebody from corrupting "legal" base64 characters;
>> this is not detected, nor should it be. These issues are outside the
>> base64 decoder's scope: CRCs or digital signatures should be used when
>> data can be accidentally or intentionally corrupted.
> 
> There are cases when a user wants to follow the robustness principle
> and cases when (s)he wants the opposite. There is no silver bullet
> here; different usage scenarios are different.

I still think CRC/signatures should be used when input is untrusted. 
Extra validation is hardly useful (even taking compatibility issues out 
of the picture) but increases code complexity and degrades performance.

>> Summary: I propose committing this particular patch (1/2) "as is" (it
>> was posted unmodified several times already) and discussing the
>> performance patch (2/2) a little further.
> 
> First I need a response to the question above regarding the unused code.

I can easily create a patch which fixes the overrun bug and gets rid of
the state in one stroke. Should I do that?
We may need the state to implement optimizations later.
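
For clarity, the stateless contract I have in mind is roughly the
current public prototype, just with no base64_decodestate surviving
between calls (a sketch, not a final proposal):

	/*
	 * Decode the whole input in one call; return the number of
	 * bytes written to out_bin. No decoder state is kept between
	 * calls, so a short out_bin simply stops the decoding.
	 */
	int
	base64_decode(const char *in_base64, int in_len,
		      char *out_bin, int out_len);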

I will postpone responding in the other e-mail thread (API changes for
robustness; optimizations).

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads)
  2021-02-25  8:25         ` Sergey Nikiforov via Tarantool-patches
@ 2021-02-27 18:47           ` Alexander Turenko via Tarantool-patches
  0 siblings, 0 replies; 14+ messages in thread
From: Alexander Turenko via Tarantool-patches @ 2021-02-27 18:47 UTC (permalink / raw)
  To: Sergey Nikiforov; +Cc: tarantool-patches, Vladislav Shpilevoy

On Thu, Feb 25, 2021 at 11:25:21AM +0300, Sergey Nikiforov wrote:
> On 20.02.2021 15:49, Alexander Turenko wrote:
> > > > > It also caused data corruption - garbage instead of "extra bits" was
> > > > > saved into state->result if there was no space in output buffer.
> > > > 
> > > > We have dead code and it appears to be broken. Why not remove it?
> > > > (AFAIS, the rest of the code does not read off the buffer.)
> > > > 
> > > > Is it due to the small probability that we'll decide to support chunked
> > > > decoding and that we'll decide to implement it in exactly this way (and
> > > > not just leave undecoded bytes in, say, ibuf)?
> > > > 
> > > > The other side of this small probability is:
> > > > 
> > > > * The code complexity and thus the waste of time for anyone who needs
> > > >     to dive into it.
> > > > * Having untested code in the code base that may give us more surprises.
> > > > * Extra def-use dependencies may hide optimization opportunities and
> > > >     increase register pressure.
> > > 
> > > And yet you are yourself proposing to improve performance:
> > > "entirely eliminate the output buffer length checks for the first (out_len *
> > > 3 / 4) input bytes" (quoted from your e-mail about 2/2). This means saving
> > > state and reporting the input buffer stop position. So: do we want
> > > complexity (and performance) or simplicity?
> > 
> > Nope, it does not. Just a kind of loop unrolling optimization:
> > 
> >   | for (int i = 0; i < count; ++i) {
> >   |         <..processing..>
> >   | }
> > 
> > ->
> > 
> >   | for (int i = 0; i + 4 <= count; i += 4) {
> >   |     <..processing by 4 bytes..>
> >   | }
> >   |
> >   | /* Remainder. */
> >   | for (int i = count - count % 4; i < count; ++i) {
> >   |         <..processing..>
> >   | }
> > 
> > (I didn't verify the arithmetic, just for the idea.)
> > 
> > (Sure, it is meaningful only for large inputs.)
> > 
> > The remainder always starts from 'state_a'.
> 
> Alas, it is not applicable - there could be skipped characters in the
> input, so we reach an unknown intermediate state. The alternative - an
> unknown processed-input stop position - does not look good either,
> because without extra checks within the loop we would overrun the input.

Okay, some optimization approaches may require the state (but locally,
not as part of the function contract). Some do not. I have some thoughts
in mind, but I'll hold off on discussing optimizations here.

> 
> > But here we don't discuss optimizations. You completely ignored my
> > main point. I'll repeat:
> > 
> >   | We have dead code and it appears to be broken. Why not remove it?
> >   | (AFAIS, the rest of the code does not read off the buffer.)
> > 
> > (I verified it locally with ASAN before I wrote to you the last time.)
> > 
> > Please, confirm that it is true or show that it is wrong.
> 
> Confirm.

So I would remove it.

There is no confidence that we'll need it in the future.

> > > > This is not a question regarding the patch, but this code looks broken
> > > > in several other ways. At least:
> > > > 
> > > > * It skips unrecognized symbols and does not report a decoding error.
> > > 
> > > Some of these symbols should "legally" be skipped - like newlines in the
> > > e-mail use case. And there are probably other use cases which rely on
> > > skipping some other symbols. We could break something and gain nothing.
> > 
> > Sure, backward compatibility is a concern.
> > 
> > However 'some symbols' and 'all unrecognized symbols' are not the same
> > thing. I may want to skip newlines, but catch ill-formed incoming data.
> > For example, it may be part of the functional requirements for an API of
> > my service that receives or reads base64-encoded data. Say, when an
> > external service sends me text in plain ASCII instead of base64, I want
> > to report an error and decline the request rather than accept a request
> > that is known to be ill-formed.
> > 
> > > Nothing prevents somebody from corrupting "legal" base64 characters;
> > > this is not detected, nor should it be. These issues are outside the
> > > base64 decoder's scope: CRCs or digital signatures should be used when
> > > data can be accidentally or intentionally corrupted.
> > 
> > There are cases when a user wants to follow the robustness principle
> > and cases when (s)he wants the opposite. There is no silver bullet
> > here; different usage scenarios are different.
> 
> I still think CRC/signatures should be used when input is untrusted. Extra
> validation is hardly useful (even taking compatibility issues out of the
> picture) but increases code complexity and degrades performance.

CRC is not a silver bullet for validation. An external service may
simply send data in a wrong format (not base64) because of a developer
mistake (say, a forgotten encode call). There is no question whether
validation is useful or not when it is part of the requirements for an
application that you develop.

An implementation of a particular strict format (no skipping of unknown
symbols, no skipping of newlines, no more than one newline in a row, a
certain number of symbols per line) may be simpler and more performant
than the common lenient implementation.
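
For example, a strict pre-check could be as simple as the following
sketch (my assumption of one possible rule set: alphabet characters,
'=' padding and single newlines only; the helper name is hypothetical):

	#include <stdbool.h>
	#include <stddef.h>

	static bool
	base64_is_strictly_formed(const char *in, size_t in_len)
	{
		bool prev_was_newline = false;
		for (size_t i = 0; i < in_len; i++) {
			char c = in[i];
			if (c == '\n') {
				if (prev_was_newline)
					return false; /* Newlines in a row. */
				prev_was_newline = true;
				continue;
			}
			prev_was_newline = false;
			bool ok = (c >= 'A' && c <= 'Z') ||
				  (c >= 'a' && c <= 'z') ||
				  (c >= '0' && c <= '9') ||
				  c == '+' || c == '/' || c == '=';
			if (!ok)
				return false; /* Unrecognized symbol. */
		}
		return true;
	}

Such a check (or the equivalent folded into the decoding loop) rejects
plain ASCII input up front without touching the decoder itself.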

> 
> > > Summary: I propose committing this particular patch (1/2) "as is" (it
> > > was posted unmodified several times already) and discussing the
> > > performance patch (2/2) a little further.
> > 
> > First I need a response to the question above regarding the unused code.
> 
> I can easily create a patch which fixes the overrun bug and gets rid of
> the state in one stroke. Should I do that?
> We may need the state to implement optimizations later.
> 
> I will postpone responding in the other e-mail thread (API changes for
> robustness; optimizations).

I'm sure we should remove the unused code now. We may need some way to
store state in the future, but there is no confidence in that now.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-02-27 18:47 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-11  9:44 [Tarantool-patches] [PATCH v6 0/2] base64: Fix decoder, improve its performance Sergey Nikiforov via Tarantool-patches
2021-01-11  9:45 ` [Tarantool-patches] [PATCH v6 1/2] base64: fix decoder output buffer overrun (reads) Sergey Nikiforov via Tarantool-patches
2021-01-21  2:16   ` Alexander Turenko via Tarantool-patches
2021-01-26 16:37     ` Sergey Nikiforov via Tarantool-patches
2021-01-26 16:48       ` Sergey Nikiforov via Tarantool-patches
2021-02-20 11:30         ` Alexander Turenko via Tarantool-patches
2021-02-20 12:49       ` Alexander Turenko via Tarantool-patches
2021-02-25  8:25         ` Sergey Nikiforov via Tarantool-patches
2021-02-27 18:47           ` Alexander Turenko via Tarantool-patches
2021-01-11  9:45 ` [Tarantool-patches] [PATCH v6 2/2] base64: improve decoder performance Sergey Nikiforov via Tarantool-patches
2021-01-21 15:31   ` Alexander Turenko via Tarantool-patches
2021-01-26 16:37     ` Sergey Nikiforov via Tarantool-patches
2021-02-20 12:51       ` Alexander Turenko via Tarantool-patches
2021-01-12 19:39 ` [Tarantool-patches] [PATCH v6 0/2] base64: Fix decoder, improve its performance Alexander Turenko via Tarantool-patches

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox