[Tarantool-patches] [PATCH v6 0/2] base64: Fix decoder, improve its performance

Alexander Turenko alexander.turenko at tarantool.org
Tue Jan 12 22:39:28 MSK 2021


On Mon, Jan 11, 2021 at 12:44:59PM +0300, Sergey Nikiforov wrote:
> First patch fixes #3069 and adds test for zero-sized output buffer,
> second one improves base64 decoder performance.
> 
> Second patch is optional to merge.
> 
> v6 is v5 rebased - there was merge conflict in base64 test,
> base64_no_space_test() is now implemented as subtest per review.
> No more changes.
> 
> Branch: https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v6
> Issue: https://github.com/tarantool/tarantool/issues/3069
> 
> Sergey Nikiforov (2):
>   base64: fix decoder output buffer overrun (reads)
>   base64: improve decoder performance
> 
>  test/unit/base64.c      | 16 +++++++++++-
>  test/unit/base64.result |  5 +++-
>  third_party/base64.c    | 57 +++++++++++++++++++++++++++++------------
>  3 files changed, 59 insertions(+), 19 deletions(-)
> 
> -- 
> 2.25.1
> 

I was surprised by the benchmarking results [1] and asked Sergey for the
workload to reproduce them. I placed the workload at the end of the email.

For now I just blindly ran the benchmark and looked at the results.

In short, the simplified code is faster on my system. So maybe it is
worth looking in this direction?

[1]: https://lists.tarantool.org/tarantool-patches/661ec8ed-ff23-b67d-9177-20eb55ba8a92@tarantool.org/

Hardware
--------

Laptop: SMP, 8 logical cores (HT is enabled, so in fact 4 physical ones),
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz.

Gentoo Linux, GCC 10.1.0, glibc 2.31-r3.

Method
------

I prepared three builds:

* baseline:
  - master (2.8.0-0-gefc30ccf8)
  - perf test
* optimized:
  - master (2.8.0-0-gefc30ccf8)
  - two commits from v6: 7f5ce3321cda6d4305282af486ed2f493b920d1f and
    b3f6a37aef7bc59668e9f1e6b9638920a8f52846
  - perf test
* simplified:
  - master (2.8.0-0-gefc30ccf8)
  - top commit from v4 (d759647f9f68644d031498ae6a540756dcc4228e)
  - perf test
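
These trees can be assembled, for example, like this (just a sketch, not
a verbatim transcript of my commands; it assumes the commits listed
above are reachable locally, e.g. after fetching the branch from the
cover letter), plus the workload patch (see 'Workload' below) applied in
each tree:

 | $ git worktree add ../baseline 2.8.0-0-gefc30ccf8
 | $ git worktree add ../optimized 2.8.0-0-gefc30ccf8
 | $ (cd ../optimized && git cherry-pick \
 |       7f5ce3321cda6d4305282af486ed2f493b920d1f \
 |       b3f6a37aef7bc59668e9f1e6b9638920a8f52846)
 | $ git worktree add ../simplified 2.8.0-0-gefc30ccf8
 | $ (cd ../simplified && git cherry-pick d759647f9f68644d031498ae6a540756dcc4228e)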

And I built them like in the RPM / Deb packages (the RelWithDebInfo cmake
build type); I disabled the libcurl bundling, because it does not matter
here:

$ cmake . -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_BACKTRACE=ON \
      -DENABLE_DIST=ON -DENABLE_BUNDLED_LIBCURL=OFF && make -j

I prepared my laptop for benchmarking:

* Ensure that intel_pstate is disabled:
  | $ grep -o 'intel_pstate=[^ ]*' /proc/cmdline 
  | intel_pstate=disable
* Set a fixed CPU frequency, mainly to ensure that Turbo Boost will not
  kick in (a quick sanity check is sketched right after this list):
  | # cpupower -c all frequency-set -g userspace
  | # cpupower -c all frequency-set -f 2.9GHz
* Close all applications that may significantly use the CPU (firefox and
  so on).
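
The sanity check mentioned above: inspect the cpufreq sysfs state of the
CPU the benchmark is pinned to below (taskset 0x2 means CPU 1):

 | $ cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
 | $ cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq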

Then I ran the following steps three times:

 | # sync && echo 3 > /proc/sys/vm/drop_caches  # from root
 | $ (cd baseline && for i in $(seq 1 10); do taskset 0x2 ./test/unit/perf-base64.test; done >>res.txt 2>&1)
 | # sync && echo 3 > /proc/sys/vm/drop_caches  # from root
 | $ (cd optimized && for i in $(seq 1 10); do taskset 0x2 ./test/unit/perf-base64.test; done >>res.txt 2>&1)
 | # sync && echo 3 > /proc/sys/vm/drop_caches  # from root
 | $ (cd simplified && for i in $(seq 1 10); do taskset 0x2 ./test/unit/perf-base64.test; done >>res.txt 2>&1)

Raw results are at the end of the email.

Scripts to accumulate the results:

 | # It is in ~/.bashrc on my machine.
 |
 | function s() {
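 |     # Read one number per line from stdin and print the count, min,
 |     # max, average and the spread ("var" here is max - min, absolute
 |     # and in % of the average).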
 |     awk '
 |         BEGIN {
 |             min = "NaN"
 |             max = "NaN"
 |             cnt = 0
 |             var = "NaN"
 |             sum = "NaN"
 |             avg = "NaN"
 |         }
 |         {
 |             min = (NR==1 || $1<min ? $1 : min)
 |             max = (NR==1 || $1>max ? $1 : max)
 |             cnt += 1
 |             sum += $1
 |         }
 |         END {
 |             avg = sum / cnt
 |             var = max - min
 |             var_rel = 100 * var / avg
 |             print "cnt:", cnt
 |             print "min:", min
 |             print "max:", max
 |             printf "avg: %f\n", avg
 |             print "var:", var, "("var_rel"%)"
 |             print ""
 |         }
 |     ';
 | }

 | (cd baseline; grep ^It res.txt | cut -d' ' -f11 | s >stat.txt)
 | (cd optimized; grep ^It res.txt | cut -d' ' -f11 | s >stat.txt)
 | (cd simplified; grep ^It res.txt | cut -d' ' -f11 | s >stat.txt)
 | (for metric in min max avg; do                                                                                              \
 |     baseline="$(grep "${metric}:" baseline/stat.txt | cut -d' ' -f2)";                                                      \
 |     optimized="$(grep "${metric}:" optimized/stat.txt | cut -d' ' -f2)";                                                    \
 |     simplified="$(grep "${metric}:" simplified/stat.txt | cut -d' ' -f2)";                                                  \
 |     echo "${metric}: baseline: ${baseline}";                                                                                \
 |     echo "${metric}: optimized: ${optimized} (+$(echo -e "scale=2\n(${optimized}-${baseline})*100/${baseline}" | bc)%)";    \
 |     echo "${metric}: simplified: ${simplified} (+$(echo -e "scale=2\n(${simplified}-${baseline})*100/${baseline}" | bc)%)"; \
 | done) | column -t

Results
-------

min:  baseline:    883238644          
min:  optimized:   936618311          (+6.04%)
min:  simplified:  1029621785         (+16.57%)
max:  baseline:    932895993          
max:  optimized:   974532472          (+4.46%)
max:  simplified:  1075078359         (+15.24%)
avg:  baseline:    898423016.633333   
avg:  optimized:   948646528.166667   (+5.59%)
avg:  simplified:  1044391403.466667  (+16.24%)

Interpretation
--------------

On the given hardware and toolchain the given performance test shows
definitely better results for the 'simplified' case (Sergey's patch v4).

What I am curious about: the results are better for the first several
iterations of each batch (when they run one right after another): 1-2,
11-12, 21-22. I saw this effect in different benchmarking results on
this machine (not only for this workload), but I have no good
interpretation for it. So I looked not only at the average, but also at
the minimal and maximal results.
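
By the way, the per-batch-position effect can be seen by averaging the
speed over the position within a batch of 10 runs (a sketch over the raw
results format below):

 | $ grep ^It res.txt | cut -d' ' -f11 | awk '
 |       { pos = (NR - 1) % 10 + 1; sum[pos] += $1; n[pos] += 1 }
 |       END { for (i = 1; i <= 10; i++) print i, sum[i] / n[i] }'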

Workload
--------

Made by Sergey N.

diff --git a/test/unit/CMakeLists.txt b/test/unit/CMakeLists.txt
index eaa97e63c..8a4e6e9be 100644
--- a/test/unit/CMakeLists.txt
+++ b/test/unit/CMakeLists.txt
@@ -50,6 +50,8 @@ add_executable(bitset_index.test bitset_index.c)
 target_link_libraries(bitset_index.test bitset)
 add_executable(base64.test base64.c)
 target_link_libraries(base64.test misc unit)
+add_executable(perf-base64.test perf-base64.c)
+target_link_libraries(perf-base64.test misc unit)
 add_executable(uuid.test uuid.c core_test_utils.c)
 target_link_libraries(uuid.test uuid unit)
 add_executable(random.test random.c core_test_utils.c)
diff --git a/test/unit/perf-base64.c b/test/unit/perf-base64.c
new file mode 100644
index 000000000..d8bab4951
--- /dev/null
+++ b/test/unit/perf-base64.c
@@ -0,0 +1,70 @@
+#include <third_party/base64.h>
+#include "unit.h"
+#include "trivia/util.h"
+#include <string.h>
+
+/* Not too large, it should fit into CPU cache(s) */
+#define RAW_BUFFER_BYTES 1048576
+#define TEST_ITERATIONS 5000
+
+static void
+bench_base64_decoder(void)
+{
+	plan(1);
+
+	char *const raw_buffer = (char *)malloc(RAW_BUFFER_BYTES);
+	/* It is more realistic to encode/decode random data than zeroes */
+	for (size_t pos = 0; pos < RAW_BUFFER_BYTES; pos += 2) {
+		*(uint16_t *)((char *)raw_buffer + pos) = rand();
+	}
+
+	const size_t base64_buffer_bytes = 1 +
+		((BASE64_CHARS_PER_LINE + 1) *
+		(((RAW_BUFFER_BYTES << 2) / 3) + (BASE64_CHARS_PER_LINE - 1)) /
+		BASE64_CHARS_PER_LINE);
+	char *const base64_buffer =
+		(char *)malloc(base64_buffer_bytes);
+	const size_t base64_bytes = base64_encode(raw_buffer, RAW_BUFFER_BYTES,
+		       base64_buffer, base64_buffer_bytes, 0);
+
+	/* Warm up caches */
+	for (int i = 0; i < 3; ++i) {
+		base64_decode(base64_buffer, base64_bytes,
+			raw_buffer, RAW_BUFFER_BYTES);
+	}
+
+	/* Actual test */
+	struct timespec start_time, end_time;
+
+	clock_gettime(CLOCK_MONOTONIC, &start_time);
+	for (int i = 0; i < TEST_ITERATIONS; ++i) {
+		base64_decode(base64_buffer, base64_bytes,
+			raw_buffer, RAW_BUFFER_BYTES);
+	}
+	clock_gettime(CLOCK_MONOTONIC, &end_time);
+
+	const uint64_t passed_ns =
+		(end_time.tv_sec - start_time.tv_sec) * 1000ULL * 1000 * 1000 +
+		end_time.tv_nsec - start_time.tv_nsec;
+	const unsigned long long total_decoded_bytes =
+		(unsigned long long)base64_buffer_bytes * TEST_ITERATIONS;
+	fprintf(stderr, "It took %llu ns to decode %llu bytes, speed is "
+			"%llu bps\n",
+			(unsigned long long)passed_ns,
+			total_decoded_bytes,
+			1000 * 1000 * 1000 * total_decoded_bytes / passed_ns);
+	ok(1, "base64 decode benchmark");
+
+	check_plan();
+}
+
+int main(int argc, char *argv[])
+{
+	plan(1);
+	header();
+
+	bench_base64_decoder();
+
+	footer();
+	return check_plan();
+}

Raw results
-----------

$ (cd baseline; grep ^It res.txt | cat -n)
     1	It took 7597803020 ns to decode 7087960000 bytes, speed is 932895993 bps
     2	It took 7748740885 ns to decode 7087960000 bytes, speed is 914724095 bps
     3	It took 7777420988 ns to decode 7087960000 bytes, speed is 911350949 bps
     4	It took 7806469047 ns to decode 7087960000 bytes, speed is 907959790 bps
     5	It took 7849334625 ns to decode 7087960000 bytes, speed is 903001380 bps
     6	It took 7888755755 ns to decode 7087960000 bytes, speed is 898488965 bps
     7	It took 7913204578 ns to decode 7087960000 bytes, speed is 895712973 bps
     8	It took 7934173972 ns to decode 7087960000 bytes, speed is 893345674 bps
     9	It took 7941474974 ns to decode 7087960000 bytes, speed is 892524376 bps
    10	It took 7972236306 ns to decode 7087960000 bytes, speed is 889080519 bps
    11	It took 7682013524 ns to decode 7087960000 bytes, speed is 922669555 bps
    12	It took 7774033320 ns to decode 7087960000 bytes, speed is 911748085 bps
    13	It took 7860418423 ns to decode 7087960000 bytes, speed is 901728078 bps
    14	It took 7886080928 ns to decode 7087960000 bytes, speed is 898793718 bps
    15	It took 7952493586 ns to decode 7087960000 bytes, speed is 891287735 bps
    16	It took 7974309710 ns to decode 7087960000 bytes, speed is 888849349 bps
    17	It took 7979781814 ns to decode 7087960000 bytes, speed is 888239824 bps
    18	It took 8003638574 ns to decode 7087960000 bytes, speed is 885592213 bps
    19	It took 8014734769 ns to decode 7087960000 bytes, speed is 884366133 bps
    20	It took 8010969247 ns to decode 7087960000 bytes, speed is 884781826 bps
    21	It took 7728000646 ns to decode 7087960000 bytes, speed is 917179012 bps
    22	It took 7758351072 ns to decode 7087960000 bytes, speed is 913591036 bps
    23	It took 7829038607 ns to decode 7087960000 bytes, speed is 905342323 bps
    24	It took 7897805166 ns to decode 7087960000 bytes, speed is 897459465 bps
    25	It took 7915229547 ns to decode 7087960000 bytes, speed is 895483821 bps
    26	It took 7978044397 ns to decode 7087960000 bytes, speed is 888433260 bps
    27	It took 7991806664 ns to decode 7087960000 bytes, speed is 886903337 bps
    28	It took 8024965897 ns to decode 7087960000 bytes, speed is 883238644 bps
    29	It took 8020449161 ns to decode 7087960000 bytes, speed is 883736042 bps
    30	It took 8016400879 ns to decode 7087960000 bytes, speed is 884182329 bps

$ (cd optimized; grep ^It res.txt | cat -n)
     1	It took 7273190169 ns to decode 7087960000 bytes, speed is 974532472 bps
     2	It took 7335598812 ns to decode 7087960000 bytes, speed is 966241500 bps
     3	It took 7418727074 ns to decode 7087960000 bytes, speed is 955414578 bps
     4	It took 7477115808 ns to decode 7087960000 bytes, speed is 947953754 bps
     5	It took 7484032590 ns to decode 7087960000 bytes, speed is 947077650 bps
     6	It took 7498435941 ns to decode 7087960000 bytes, speed is 945258458 bps
     7	It took 7491035013 ns to decode 7087960000 bytes, speed is 946192346 bps
     8	It took 7506550074 ns to decode 7087960000 bytes, speed is 944236690 bps
     9	It took 7509484245 ns to decode 7087960000 bytes, speed is 943867750 bps
    10	It took 7520255639 ns to decode 7087960000 bytes, speed is 942515831 bps
    11	It took 7369634873 ns to decode 7087960000 bytes, speed is 961778992 bps
    12	It took 7420389508 ns to decode 7087960000 bytes, speed is 955200531 bps
    13	It took 7479596162 ns to decode 7087960000 bytes, speed is 947639397 bps
    14	It took 7479660125 ns to decode 7087960000 bytes, speed is 947631293 bps
    15	It took 7495067598 ns to decode 7087960000 bytes, speed is 945683265 bps
    16	It took 7514497457 ns to decode 7087960000 bytes, speed is 943238059 bps
    17	It took 7502681984 ns to decode 7087960000 bytes, speed is 944723502 bps
    18	It took 7544322070 ns to decode 7087960000 bytes, speed is 939509200 bps
    19	It took 7545605619 ns to decode 7087960000 bytes, speed is 939349385 bps
    20	It took 7567607757 ns to decode 7087960000 bytes, speed is 936618311 bps
    21	It took 7375056849 ns to decode 7087960000 bytes, speed is 961071913 bps
    22	It took 7430044250 ns to decode 7087960000 bytes, speed is 953959325 bps
    23	It took 7474667860 ns to decode 7087960000 bytes, speed is 948264208 bps
    24	It took 7465094193 ns to decode 7087960000 bytes, speed is 949480316 bps
    25	It took 7479047135 ns to decode 7087960000 bytes, speed is 947708962 bps
    26	It took 7486999663 ns to decode 7087960000 bytes, speed is 946702326 bps
    27	It took 7488959403 ns to decode 7087960000 bytes, speed is 946454589 bps
    28	It took 7502416145 ns to decode 7087960000 bytes, speed is 944756977 bps
    29	It took 7514547630 ns to decode 7087960000 bytes, speed is 943231761 bps
    30	It took 7515577542 ns to decode 7087960000 bytes, speed is 943102504 bps

$ (cd simplified; grep ^It res.txt | cat -n)
     1	It took 6592970584 ns to decode 7087960000 bytes, speed is 1075078359 bps
     2	It took 6707664613 ns to decode 7087960000 bytes, speed is 1056695647 bps
     3	It took 6764450628 ns to decode 7087960000 bytes, speed is 1047824929 bps
     4	It took 6791082089 ns to decode 7087960000 bytes, speed is 1043715847 bps
     5	It took 6784496437 ns to decode 7087960000 bytes, speed is 1044728973 bps
     6	It took 6784578988 ns to decode 7087960000 bytes, speed is 1044716262 bps
     7	It took 6801729770 ns to decode 7087960000 bytes, speed is 1042081976 bps
     8	It took 6804953495 ns to decode 7087960000 bytes, speed is 1041588308 bps
     9	It took 6848572538 ns to decode 7087960000 bytes, speed is 1034954358 bps
    10	It took 6841976407 ns to decode 7087960000 bytes, speed is 1035952125 bps
    11	It took 6670393290 ns to decode 7087960000 bytes, speed is 1062600013 bps
    12	It took 6734840978 ns to decode 7087960000 bytes, speed is 1052431679 bps
    13	It took 6764724852 ns to decode 7087960000 bytes, speed is 1047782453 bps
    14	It took 6783498439 ns to decode 7087960000 bytes, speed is 1044882675 bps
    15	It took 6775755346 ns to decode 7087960000 bytes, speed is 1046076730 bps
    16	It took 6783469382 ns to decode 7087960000 bytes, speed is 1044887151 bps
    17	It took 6809182585 ns to decode 7087960000 bytes, speed is 1040941392 bps
    18	It took 6834388578 ns to decode 7087960000 bytes, speed is 1037102283 bps
    19	It took 6845617417 ns to decode 7087960000 bytes, speed is 1035401128 bps
    20	It took 6843915566 ns to decode 7087960000 bytes, speed is 1035658598 bps
    21	It took 6707754592 ns to decode 7087960000 bytes, speed is 1056681472 bps
    22	It took 6754254986 ns to decode 7087960000 bytes, speed is 1049406635 bps
    23	It took 6787549608 ns to decode 7087960000 bytes, speed is 1044259034 bps
    24	It took 6779319691 ns to decode 7087960000 bytes, speed is 1045526737 bps
    25	It took 6769543930 ns to decode 7087960000 bytes, speed is 1047036561 bps
    26	It took 6808031086 ns to decode 7087960000 bytes, speed is 1041117455 bps
    27	It took 6849752166 ns to decode 7087960000 bytes, speed is 1034776124 bps
    28	It took 6859630114 ns to decode 7087960000 bytes, speed is 1033286034 bps
    29	It took 6884042373 ns to decode 7087960000 bytes, speed is 1029621785 bps
    30	It took 6848737822 ns to decode 7087960000 bytes, speed is 1034929381 bps

