From: Vladislav Shpilevoy <v.shpilevoy@tarantool.org> To: tarantool-patches@dev.tarantool.org, tsafin@tarantool.org, alyapunov@tarantool.org Subject: [Tarantool-patches] [PATCH 08/11] digest: eliminate UBs from guava() Date: Fri, 5 Jun 2020 01:43:15 +0200 [thread overview] Message-ID: <3bfbea4a8e49952de1062971027f065db21850b1.1591313754.git.v.shpilevoy@tarantool.org> (raw) In-Reply-To: <cover.1591313754.git.v.shpilevoy@tarantool.org> Guava hash function follows the algorithm described in the paper "A Fast, Minimal Memory, Consistent Hash Algorithm", John Lamping, Eric Veach. But the implementation somewhy is ported from Java instead of taking a ready to use C function right from the paper. Java version, ported to C as is, leads to undefined behaviour: - signed integer overflow; - double value outside of integer range assigned to the integer. Here is the full old function: static const int64_t K = 2862933555777941757; static const double D = 0x1.0p31; static inline double lcg(int64_t *state) { return (double)((int32_t)(((uint64_t)*state >> 33) + 1)) / D; } int32_t guava(int64_t state, int32_t buckets) { int32_t candidate = 0; int32_t next; while (1) { state = K * state + 1; next = (int32_t)((candidate + 1) / lcg(&state)); if (next >= 0 && next < buckets) candidate = next; else return candidate; } } Signed integer overflow happened in this line: state = K * state + 1; This UB is fixed by changing state type to uint64_t. This is ok and does not change behaviour, because overflow for signed integers in reality behaves the same as for unsigned integers, in terms of binary representation. Change to uint64_t didn't lead to change of any single line of the result assembly code. Double -> int32_t truncation overflow happened in the next line: next = (int32_t)((candidate + 1) / lcg(&state)); Right expression can become something out of int32_t range. This is UB, but in reality the double value on the right is truncated to either INT32_MIN, if it is < INT32_MIN, or INT32_MAX, if it is > INT32_MAX. This is fixed by making 'next' and 'candidate' variables int64_t. This fixed the UB, because the right expression can't become < INT64_MIN and > INT64_MAX. Indeed, max possible value of 'next' from the expression above is: (candidate + 1) / lcg(&state) < ((2^31 - 1) + 1) / 2^-31 = 2^62 < INT64_MAX Min possible value of 'next' is: (-2^31 + 1) / 2^-31 > -2^62 > INT64_MAX 'Candidate' is assumed to be inside int32_t range, because it is initialized from the previous 'next' value. If it would be out of int32_t, it would become either bigger than 'buckets', which is int32_t, or < 0, and the cycle would stop on the previous iteration. The patch fixes the UBs, but guava() function is still broken. Because it returns values different than the function in the original paper from John Lamping and Eric Veach. Here is the correct function: int32_t correct_guava(uint64_t key, int32_t num_buckets) { int64_t b = -1, j = 0; while (j < num_buckets) { b=j; key = key * 2862933555777941757ULL + 1; j = (b + 1) * ((double)(1LL << 31) / (double)((key >> 33) + 1)); } return b; } correct_guava(6356101026326471242, 813154869) = 362866355, guava(<the same>) = 2. correct_guava(112571758688054605, 1355446793) = 907865430, guava(<the same>) = 907865431. And there are more examples. It means, guava() returns something with an unknown distrubution, consistency, and not described in the paper. That is a subject for a separate patch to deprecate this function and add a correct one. Part of #4609 --- src/lib/salad/guava.c | 18 +++++++++++------- src/lib/salad/guava.h | 2 +- src/lua/digest.lua | 2 +- test/app/digest.result | 2 +- 4 files changed, 14 insertions(+), 10 deletions(-) diff --git a/src/lib/salad/guava.c b/src/lib/salad/guava.c index 46bf1ac15..16c1e8cc4 100644 --- a/src/lib/salad/guava.c +++ b/src/lib/salad/guava.c @@ -39,22 +39,26 @@ * John Lamping, Eric Veach */ -static const int64_t K = 2862933555777941757; +static const uint64_t K = 2862933555777941757ULL; static const double D = 0x1.0p31; -static inline double lcg(int64_t *state) +static inline double lcg(uint64_t *state) { - return (double )((int32_t)(((uint64_t )*state >> 33) + 1)) / D; + return (double)((int32_t)((*state >> 33) + 1)) / D; } +/** + * The function does not conform to the original paper, and + * therefore has incorrect behaviour. It should be deprecated. + */ int32_t -guava(int64_t state, int32_t buckets) +guava(uint64_t state, int32_t buckets) { - int32_t candidate = 0; - int32_t next; + int64_t candidate = 0; + int64_t next; while (1) { state = K * state + 1; - next = (int32_t)((candidate + 1) / lcg(&state)); + next = (candidate + 1) / lcg(&state); if (next >= 0 && next < buckets) candidate = next; else diff --git a/src/lib/salad/guava.h b/src/lib/salad/guava.h index 9ed93d641..bf0181f1e 100644 --- a/src/lib/salad/guava.h +++ b/src/lib/salad/guava.h @@ -39,7 +39,7 @@ extern "C" { #endif int32_t -guava(int64_t state, int32_t buckets); +guava(uint64_t state, int32_t buckets); #if defined(__cplusplus) } /* extern C */ diff --git a/src/lua/digest.lua b/src/lua/digest.lua index 6ed91cfa2..6beaf562a 100644 --- a/src/lua/digest.lua +++ b/src/lua/digest.lua @@ -14,7 +14,7 @@ ffi.cdef[[ typedef uint32_t (*crc32_func)(uint32_t crc, const unsigned char *buf, unsigned int len); - extern int32_t guava(int64_t state, int32_t buckets); + extern int32_t guava(uint64_t state, int32_t buckets); extern crc32_func crc32_calc; /* base64 */ diff --git a/test/app/digest.result b/test/app/digest.result index ecfa518c9..d946c6a3b 100644 --- a/test/app/digest.result +++ b/test/app/digest.result @@ -357,7 +357,7 @@ digest.base64_decode(123) ... digest.guava('hello', 0) --- -- error: 'bad argument #1 to ''?'' (cannot convert ''string'' to ''int64_t'')' +- error: 'bad argument #1 to ''?'' (cannot convert ''string'' to ''uint64_t'')' ... digest.guava(1, 'nope_') --- -- 2.21.1 (Apple Git-122.3)
next prev parent reply other threads:[~2020-06-04 23:43 UTC|newest] Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-06-04 23:43 [Tarantool-patches] [PATCH 00/11] Enable miscelaneous sanitations Vladislav Shpilevoy 2020-06-04 23:43 ` [Tarantool-patches] [PATCH 01/11] cmake: enable misc types of UB detection in clang Vladislav Shpilevoy 2020-06-04 23:43 ` [Tarantool-patches] [PATCH 10/11] sql: fix usage of not initialized index_stat Vladislav Shpilevoy 2020-06-04 23:43 ` [Tarantool-patches] [PATCH 11/11] sql: fix mem_apply_type double type truncation Vladislav Shpilevoy 2020-06-04 23:43 ` [Tarantool-patches] [PATCH 02/11] util: introduce double_compare_nint64() Vladislav Shpilevoy 2020-06-04 23:43 ` [Tarantool-patches] [PATCH 03/11] test: avoid usleep() usage for error injections Vladislav Shpilevoy 2020-06-04 23:43 ` [Tarantool-patches] [PATCH 04/11] vinyl: fix 0 division in case of canceled dump Vladislav Shpilevoy 2020-06-04 23:43 ` [Tarantool-patches] [PATCH 05/11] xrow: don't cast double to float unconditionally Vladislav Shpilevoy 2020-06-04 23:43 ` [Tarantool-patches] [PATCH 06/11] swim: fix zero division Vladislav Shpilevoy 2020-06-04 23:43 ` [Tarantool-patches] [PATCH 07/11] test: fix signed integer overflow in vclock test Vladislav Shpilevoy 2020-06-04 23:43 ` Vladislav Shpilevoy [this message] 2020-06-04 23:43 ` [Tarantool-patches] [PATCH 09/11] salad: fix UB pointer arithmetics in bps_tree Vladislav Shpilevoy 2020-06-05 22:09 ` [Tarantool-patches] [PATCH 00/11] Enable miscelaneous sanitations Timur Safin 2020-06-09 8:19 ` Cyrill Gorcunov 2020-06-09 8:28 ` Kirill Yukhin
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=3bfbea4a8e49952de1062971027f065db21850b1.1591313754.git.v.shpilevoy@tarantool.org \ --to=v.shpilevoy@tarantool.org \ --cc=alyapunov@tarantool.org \ --cc=tarantool-patches@dev.tarantool.org \ --cc=tsafin@tarantool.org \ --subject='Re: [Tarantool-patches] [PATCH 08/11] digest: eliminate UBs from guava()' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox