From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from [87.239.111.99] (localhost [127.0.0.1]) by dev.tarantool.org (Postfix) with ESMTP id 126124C885; Tue, 12 Jan 2021 22:39:09 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org 126124C885 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tarantool.org; s=dev; t=1610480349; bh=sl+jHFec82iuQAnsywhC+fW23SYbxGi3ocawAZiEck8=; h=Date:To:References:In-Reply-To:Subject:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=MWKNJx656KVlKmsWV0F6o0PTfk+wR4K9jMWr34S8zy2JrxOrSC0ar1DFluKVEhUJi mObH8xhvN+Gcbx3LS62slnR/rm5E6aEho6N+msO4WnoxbLWazq49UejaGjcaGB2s2x ipZXbqg1ky2cnVefmZJWY7k7AY6m2F/PwqWk5vgU= Received: from smtp43.i.mail.ru (smtp43.i.mail.ru [94.100.177.103]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id DEFAA4C885 for ; Tue, 12 Jan 2021 22:39:06 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org DEFAA4C885 Received: by smtp43.i.mail.ru with esmtpa (envelope-from ) id 1kzPVR-0005q4-IA; Tue, 12 Jan 2021 22:39:06 +0300 Date: Tue, 12 Jan 2021 22:39:28 +0300 To: Sergey Nikiforov Message-ID: <20210112193928.5sicek7uh6psdzpd@tkn_work_nb> References: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: X-7564579A: 646B95376F6C166E X-77F55803: 4F1203BC0FB41BD9D0E79FBC973162CD9807AE62CAC2B33C765C844EAC374FFF00894C459B0CD1B9CAB5F185A46CC1AD1F080870792E4E448782265C442DB8A7556528066C3D4F40 X-7FA49CB5: FF5795518A3D127A4AD6D5ED66289B5278DA827A17800CE736C75B72B9FDC350B287FD4696A6DC2FA8DF7F3B2552694A4E2F5AFA99E116B42401471946AA11AF7680F9384605B9034B51810D54EC17F08F08D7030A58E5AD6BA297DBC24807EAA9D420A4CFB5DD3E5D3F91DF56C05A1EE4E50FC6FEB3109D69A9A00A1FDB421A8941B15DA834481FA18204E546F3947CC2B5EEE3591E0D35F6B57BC7E64490618DEB871D839B7333395957E7521B51C2DFABB839C843B9C08941B15DA834481F8AA50765F790063783E00425F71A4181389733CBF5DBD5E9B5C8C57E37DE458BD9DD9810294C998ED8FC6C240DEA76428AA50765F790063794DE1119350BF947D81D268191BDAD3DBD4B6F7A4D31EC0BEA7A3FFF5B025636D81D268191BDAD3D78DA827A17800CE70072543889BF767CEC76A7562686271EEC990983EF5C032935872C767BF85DA29E625A9149C048EE0A3850AC1BE2E7358734EF69C36C4A4D4AD6D5ED66289B524E70A05D1297E1BB35872C767BF85DA227C277FBC8AE2E8BECD345639C7F4A9E75ECD9A6C639B01B4E70A05D1297E1BBC6867C52282FAC85D9B7C4F32B44FF57F41B907980AE8E5900306258E7E6ABB4E4A6367B16DE6309 X-C1DE0DAB: 0D63561A33F958A55D3F91DF56C05A1EE4E50FC6FEB3109D60D285E714CEEA95D59269BC5F550898D99A6476B3ADF6B47008B74DF8BB9EF7333BD3B22AA88B938A852937E12ACA75448CF9D3A7B2C848410CA545F18667F91A7EA1CDA0B5A7A0 X-C8649E89: 4E36BF7865823D7055A7F0CF078B5EC49A30900B95165D34A541DB4061E93832663500CF2E3F23B549FA329090F96DD95407A5EA4DFE72051F6CEA1DAF205B1E1D7E09C32AA3244C28AAAD52D0592FA0A9EADEE355E7648BA90944CA99CF22E3927AC6DF5659F194 X-D57D3AED: 3ZO7eAau8CL7WIMRKs4sN3D3tLDjz0dLbV79QFUyzQ2Ujvy7cMT6pYYqY16iZVKkSc3dCLJ7zSJH7+u4VD18S7Vl4ZUrpaVfd2+vE6kuoey4m4VkSEu530nj6fImhcD4MUrOEAnl0W826KZ9Q+tr5ycPtXkTV4k65bRjmOUUP8cvGozZ33TWg5HZplvhhXbhDGzqmQDTd6OAevLeAnq3Ra9uf7zvY2zzsIhlcp/Y7m53TZgf2aB4JOg4gkr2biojKc4NKhtl7uHfv4uVmSnZLg== X-Mailru-Sender: FFAA8E4AEE17E37C3731A083A1A85ADECAB5F185A46CC1AD1F080870792E4E44B7EA9FE7735C3DBFC664A44C781FCEA7C77752E0C033A69EDF9F2CE1E9CF805D8CD356D4F938FF726C18EFA0BB12DBB0 X-Mras: Ok Subject: Re: [Tarantool-patches] [PATCH v6 0/2] base64: Fix decoder, improve its performance X-BeenThere: tarantool-patches@dev.tarantool.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Alexander Turenko via Tarantool-patches Reply-To: Alexander Turenko Cc: tarantool-patches@dev.tarantool.org, Vladislav Shpilevoy Errors-To: tarantool-patches-bounces@dev.tarantool.org Sender: "Tarantool-patches" On Mon, Jan 11, 2021 at 12:44:59PM +0300, Sergey Nikiforov wrote: > First patch fixes #3069 and adds test for zero-sized output buffer, > second one improves base64 decoder performance. > > Second patch is optional to merge. > > v6 is v5 rebased - there was merge conflict in base64 test, > base64_no_space_test() is now implemented as subtest per review. > No more changes. > > Branch: https://github.com/tarantool/tarantool/tree/void234/gh-3069-fix-base64-memory-overrun-v6 > Issue: https://github.com/tarantool/tarantool/issues/3069 > > Sergey Nikiforov (2): > base64: fix decoder output buffer overrun (reads) > base64: improve decoder performance > > test/unit/base64.c | 16 +++++++++++- > test/unit/base64.result | 5 +++- > third_party/base64.c | 57 +++++++++++++++++++++++++++++------------ > 3 files changed, 59 insertions(+), 19 deletions(-) > > -- > 2.25.1 > I was surprised by the benchmarking results [1] and asked Sergey for the workload to reproduce them. I placed the workload at end of the email. For now I just blindly run the benchmark and look at results. In short, the simplified code is faster on my system. So maybe it worth to look at this direction? [1]: https://lists.tarantool.org/tarantool-patches/661ec8ed-ff23-b67d-9177-20eb55ba8a92@tarantool.org/ Hardware -------- Laptop: SMP, 8 cores (with enabled HT, so 4 in fact), Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz. Gentoo Linux, GCC 10.1.0, glibc 2.31-r3. Method ------ I prepared three builds: * baseline: - master (2.8.0-0-gefc30ccf8) - perf test * optimized: - master (2.8.0-0-gefc30ccf8) - two commits from v6: 7f5ce3321cda6d4305282af486ed2f493b920d1f and b3f6a37aef7bc59668e9f1e6b9638920a8f52846 - perf test * simplified: - master (2.8.0-0-gefc30ccf8) - top commit from v4 (d759647f9f68644d031498ae6a540756dcc4228e) - perf test And build them like in RPM / Deb packages (RelWithDebInfo cmake build type); disabled the libcurl bundling, because it does not matter here: $ cmake . -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_BACKTRACE=ON \ -DENABLE_DIST=ON -DENABLE_BUNDLED_LIBCURL=OFF && make -j I prepared my laptop for benchmarking: * Ensure that intel_pstate is disabled: | $ grep -o 'intel_pstate=[^ ]*' /proc/cmdline | intel_pstate=disable * Set fixed CPU frequency (mainly to ensure that Turbo Boost will not be in action): | # cpupower -c all frequency-set -g userspace | # cpupower -c all frequency-set -f 2.9GHz * Closed all applications that may significantly use CPU (firefox and so on). Then I run the following steps three times: | # sync && echo 3 > /proc/sys/vm/drop_caches` # from root | $ (cd baseline && for i in $(seq 1 10); do taskset 0x2 ./test/unit/perf-base64.test; done >>res.txt 2>&1) | # sync && echo 3 > /proc/sys/vm/drop_caches` # from root | $ (cd optimized && for i in $(seq 1 10); do taskset 0x2 ./test/unit/perf-base64.test; done >>res.txt 2>&1) | # sync && echo 3 > /proc/sys/vm/drop_caches` # from root | $ (cd simplified && for i in $(seq 1 10); do taskset 0x2 ./test/unit/perf-base64.test; done >>res.txt 2>&1) Raw results are at end of the email. Scripts to accumulate the results: | # It is in ~/.bashrc on my machine. | | function s() { | awk ' | BEGIN { | min = "NaN" | max = "NaN" | cnt = 0 | var = "NaN" | sum = "NaN" | avg = "NaN" | } | { | min = (NR==1 || $1max ? $1 : max) | cnt += 1 | sum += $1 | } | END { | avg = sum / cnt | var = max - min | var_rel = 100 * var / avg | print "cnt:", cnt | print "min:", min | print "max:", max | printf "avg: %f\n", avg | print "var:", var, "("var_rel"%)" | print "" | } | '; | } | (cd baseline; grep ^It res.txt | cut -d' ' -f11 | s >stat.txt) | (cd optimized; grep ^It res.txt | cut -d' ' -f11 | s >stat.txt) | (cd simplified; grep ^It res.txt | cut -d' ' -f11 | s >stat.txt) | (for metric in min max avg; do \ | baseline="$(grep "${metric}:" baseline/stat.txt | cut -d' ' -f2)"; \ | optimized="$(grep "${metric}:" optimized/stat.txt | cut -d' ' -f2)"; \ | simplified="$(grep "${metric}:" simplified/stat.txt | cut -d' ' -f2)"; \ | echo "${metric}: baseline: ${baseline}"; \ | echo "${metric}: optimized: ${optimized} (+$(echo -e "scale=2\n(${optimized}-${baseline})*100/${baseline}" | bc)%)"; \ | echo "${metric}: simplified: ${simplified} (+$(echo -e "scale=2\n(${simplified}-${baseline})*100/${baseline}" | bc)%)"; \ | done) | column -t Results ------- min: baseline: 883238644 min: optimized: 936618311 (+6.04%) min: simplified: 1029621785 (+16.57%) max: baseline: 932895993 max: optimized: 974532472 (+4.46%) max: simplified: 1075078359 (+15.24%) avg: baseline: 898423016.633333 avg: optimized: 948646528.166667 (+5.59%) avg: simplified: 1044391403.466667 (+16.24%) Interpretation -------------- On given hardware and toolchain given performance test shows definitely better results for the 'simplified' case (Sergey's patch v4). What I curious about: results are better for the first several iterations (when they're one right after another): 1-2, 11-12, 21-22. I saw this effect on different benchmarking results on this machine (not only for this workload), but I have no good interpretation for it. So I looked not only on average, but also on minimal and maximal results. Workload -------- Made by Sergey N. diff --git a/test/unit/CMakeLists.txt b/test/unit/CMakeLists.txt index eaa97e63c..8a4e6e9be 100644 --- a/test/unit/CMakeLists.txt +++ b/test/unit/CMakeLists.txt @@ -50,6 +50,8 @@ add_executable(bitset_index.test bitset_index.c) target_link_libraries(bitset_index.test bitset) add_executable(base64.test base64.c) target_link_libraries(base64.test misc unit) +add_executable(perf-base64.test perf-base64.c) +target_link_libraries(perf-base64.test misc unit) add_executable(uuid.test uuid.c core_test_utils.c) target_link_libraries(uuid.test uuid unit) add_executable(random.test random.c core_test_utils.c) diff --git a/test/unit/perf-base64.c b/test/unit/perf-base64.c new file mode 100644 index 000000000..d8bab4951 --- /dev/null +++ b/test/unit/perf-base64.c @@ -0,0 +1,70 @@ +#include +#include "unit.h" +#include "trivia/util.h" +#include + +/* Not too large, it should fit into CPU cache(s) */ +#define RAW_BUFFER_BYTES 1048576 +#define TEST_ITERATIONS 5000 + +static void +bench_base64_decoder(void) +{ + plan(1); + + char *const raw_buffer = (char *)malloc(RAW_BUFFER_BYTES); + /* It is more realistic to encode/decode random data than zeroes */ + for (size_t pos = 0; pos < RAW_BUFFER_BYTES; pos += 2) { + *(uint16_t *)((char *)raw_buffer + pos) = rand(); + } + + const size_t base64_buffer_bytes = 1 + + ((BASE64_CHARS_PER_LINE + 1) * + (((RAW_BUFFER_BYTES << 2) / 3) + (BASE64_CHARS_PER_LINE - 1)) / + BASE64_CHARS_PER_LINE); + char *const base64_buffer = + (char *)malloc(base64_buffer_bytes); + const size_t base64_bytes = base64_encode(raw_buffer, RAW_BUFFER_BYTES, + base64_buffer, base64_buffer_bytes, 0); + + /* Warm up caches */ + for (int i = 0; i < 3; ++i) { + base64_decode(base64_buffer, base64_bytes, + raw_buffer, RAW_BUFFER_BYTES); + } + + /* Actual test */ + struct timespec start_time, end_time; + + clock_gettime(CLOCK_MONOTONIC, &start_time); + for (int i = 0; i < TEST_ITERATIONS; ++i) { + base64_decode(base64_buffer, base64_bytes, + raw_buffer, RAW_BUFFER_BYTES); + } + clock_gettime(CLOCK_MONOTONIC, &end_time); + + const uint64_t passed_ns = + (end_time.tv_sec - start_time.tv_sec) * 1000ULL * 1000 * 1000 + + end_time.tv_nsec - start_time.tv_nsec; + const unsigned long long total_decoded_bytes = + (unsigned long long)base64_buffer_bytes * TEST_ITERATIONS; + fprintf(stderr, "It took %llu ns to decode %llu bytes, speed is " + "%llu bps\n", + (unsigned long long)passed_ns, + total_decoded_bytes, + 1000 * 1000 * 1000 * total_decoded_bytes / passed_ns); + ok(1, "base64 decode benchmark"); + + check_plan(); +} + +int main(int argc, char *argv[]) +{ + plan(1); + header(); + + bench_base64_decoder(); + + footer(); + return check_plan(); +} Raw results ----------- $ (cd baseline; grep ^It res.txt | can -n) 1 It took 7597803020 ns to decode 7087960000 bytes, speed is 932895993 bps 2 It took 7748740885 ns to decode 7087960000 bytes, speed is 914724095 bps 3 It took 7777420988 ns to decode 7087960000 bytes, speed is 911350949 bps 4 It took 7806469047 ns to decode 7087960000 bytes, speed is 907959790 bps 5 It took 7849334625 ns to decode 7087960000 bytes, speed is 903001380 bps 6 It took 7888755755 ns to decode 7087960000 bytes, speed is 898488965 bps 7 It took 7913204578 ns to decode 7087960000 bytes, speed is 895712973 bps 8 It took 7934173972 ns to decode 7087960000 bytes, speed is 893345674 bps 9 It took 7941474974 ns to decode 7087960000 bytes, speed is 892524376 bps 10 It took 7972236306 ns to decode 7087960000 bytes, speed is 889080519 bps 11 It took 7682013524 ns to decode 7087960000 bytes, speed is 922669555 bps 12 It took 7774033320 ns to decode 7087960000 bytes, speed is 911748085 bps 13 It took 7860418423 ns to decode 7087960000 bytes, speed is 901728078 bps 14 It took 7886080928 ns to decode 7087960000 bytes, speed is 898793718 bps 15 It took 7952493586 ns to decode 7087960000 bytes, speed is 891287735 bps 16 It took 7974309710 ns to decode 7087960000 bytes, speed is 888849349 bps 17 It took 7979781814 ns to decode 7087960000 bytes, speed is 888239824 bps 18 It took 8003638574 ns to decode 7087960000 bytes, speed is 885592213 bps 19 It took 8014734769 ns to decode 7087960000 bytes, speed is 884366133 bps 20 It took 8010969247 ns to decode 7087960000 bytes, speed is 884781826 bps 21 It took 7728000646 ns to decode 7087960000 bytes, speed is 917179012 bps 22 It took 7758351072 ns to decode 7087960000 bytes, speed is 913591036 bps 23 It took 7829038607 ns to decode 7087960000 bytes, speed is 905342323 bps 24 It took 7897805166 ns to decode 7087960000 bytes, speed is 897459465 bps 25 It took 7915229547 ns to decode 7087960000 bytes, speed is 895483821 bps 26 It took 7978044397 ns to decode 7087960000 bytes, speed is 888433260 bps 27 It took 7991806664 ns to decode 7087960000 bytes, speed is 886903337 bps 28 It took 8024965897 ns to decode 7087960000 bytes, speed is 883238644 bps 29 It took 8020449161 ns to decode 7087960000 bytes, speed is 883736042 bps 30 It took 8016400879 ns to decode 7087960000 bytes, speed is 884182329 bps $ (cd optimized; grep ^It res.txt | can -n) 1 It took 7273190169 ns to decode 7087960000 bytes, speed is 974532472 bps 2 It took 7335598812 ns to decode 7087960000 bytes, speed is 966241500 bps 3 It took 7418727074 ns to decode 7087960000 bytes, speed is 955414578 bps 4 It took 7477115808 ns to decode 7087960000 bytes, speed is 947953754 bps 5 It took 7484032590 ns to decode 7087960000 bytes, speed is 947077650 bps 6 It took 7498435941 ns to decode 7087960000 bytes, speed is 945258458 bps 7 It took 7491035013 ns to decode 7087960000 bytes, speed is 946192346 bps 8 It took 7506550074 ns to decode 7087960000 bytes, speed is 944236690 bps 9 It took 7509484245 ns to decode 7087960000 bytes, speed is 943867750 bps 10 It took 7520255639 ns to decode 7087960000 bytes, speed is 942515831 bps 11 It took 7369634873 ns to decode 7087960000 bytes, speed is 961778992 bps 12 It took 7420389508 ns to decode 7087960000 bytes, speed is 955200531 bps 13 It took 7479596162 ns to decode 7087960000 bytes, speed is 947639397 bps 14 It took 7479660125 ns to decode 7087960000 bytes, speed is 947631293 bps 15 It took 7495067598 ns to decode 7087960000 bytes, speed is 945683265 bps 16 It took 7514497457 ns to decode 7087960000 bytes, speed is 943238059 bps 17 It took 7502681984 ns to decode 7087960000 bytes, speed is 944723502 bps 18 It took 7544322070 ns to decode 7087960000 bytes, speed is 939509200 bps 19 It took 7545605619 ns to decode 7087960000 bytes, speed is 939349385 bps 20 It took 7567607757 ns to decode 7087960000 bytes, speed is 936618311 bps 21 It took 7375056849 ns to decode 7087960000 bytes, speed is 961071913 bps 22 It took 7430044250 ns to decode 7087960000 bytes, speed is 953959325 bps 23 It took 7474667860 ns to decode 7087960000 bytes, speed is 948264208 bps 24 It took 7465094193 ns to decode 7087960000 bytes, speed is 949480316 bps 25 It took 7479047135 ns to decode 7087960000 bytes, speed is 947708962 bps 26 It took 7486999663 ns to decode 7087960000 bytes, speed is 946702326 bps 27 It took 7488959403 ns to decode 7087960000 bytes, speed is 946454589 bps 28 It took 7502416145 ns to decode 7087960000 bytes, speed is 944756977 bps 29 It took 7514547630 ns to decode 7087960000 bytes, speed is 943231761 bps 30 It took 7515577542 ns to decode 7087960000 bytes, speed is 943102504 bps $ (cd simplified; grep ^It res.txt | can -n) 1 It took 6592970584 ns to decode 7087960000 bytes, speed is 1075078359 bps 2 It took 6707664613 ns to decode 7087960000 bytes, speed is 1056695647 bps 3 It took 6764450628 ns to decode 7087960000 bytes, speed is 1047824929 bps 4 It took 6791082089 ns to decode 7087960000 bytes, speed is 1043715847 bps 5 It took 6784496437 ns to decode 7087960000 bytes, speed is 1044728973 bps 6 It took 6784578988 ns to decode 7087960000 bytes, speed is 1044716262 bps 7 It took 6801729770 ns to decode 7087960000 bytes, speed is 1042081976 bps 8 It took 6804953495 ns to decode 7087960000 bytes, speed is 1041588308 bps 9 It took 6848572538 ns to decode 7087960000 bytes, speed is 1034954358 bps 10 It took 6841976407 ns to decode 7087960000 bytes, speed is 1035952125 bps 11 It took 6670393290 ns to decode 7087960000 bytes, speed is 1062600013 bps 12 It took 6734840978 ns to decode 7087960000 bytes, speed is 1052431679 bps 13 It took 6764724852 ns to decode 7087960000 bytes, speed is 1047782453 bps 14 It took 6783498439 ns to decode 7087960000 bytes, speed is 1044882675 bps 15 It took 6775755346 ns to decode 7087960000 bytes, speed is 1046076730 bps 16 It took 6783469382 ns to decode 7087960000 bytes, speed is 1044887151 bps 17 It took 6809182585 ns to decode 7087960000 bytes, speed is 1040941392 bps 18 It took 6834388578 ns to decode 7087960000 bytes, speed is 1037102283 bps 19 It took 6845617417 ns to decode 7087960000 bytes, speed is 1035401128 bps 20 It took 6843915566 ns to decode 7087960000 bytes, speed is 1035658598 bps 21 It took 6707754592 ns to decode 7087960000 bytes, speed is 1056681472 bps 22 It took 6754254986 ns to decode 7087960000 bytes, speed is 1049406635 bps 23 It took 6787549608 ns to decode 7087960000 bytes, speed is 1044259034 bps 24 It took 6779319691 ns to decode 7087960000 bytes, speed is 1045526737 bps 25 It took 6769543930 ns to decode 7087960000 bytes, speed is 1047036561 bps 26 It took 6808031086 ns to decode 7087960000 bytes, speed is 1041117455 bps 27 It took 6849752166 ns to decode 7087960000 bytes, speed is 1034776124 bps 28 It took 6859630114 ns to decode 7087960000 bytes, speed is 1033286034 bps 29 It took 6884042373 ns to decode 7087960000 bytes, speed is 1029621785 bps 30 It took 6848737822 ns to decode 7087960000 bytes, speed is 1034929381 bps