From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtpng3.m.smailru.net (smtpng3.m.smailru.net [94.100.177.149]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 4DA94469719 for ; Sat, 3 Oct 2020 01:43:03 +0300 (MSK) References: <20200930000652.10603-1-i.kosarev@tarantool.org> From: Vladislav Shpilevoy Message-ID: <784163c9-c68f-7e54-7884-f2a267660a5d@tarantool.org> Date: Sat, 3 Oct 2020 00:43:01 +0200 MIME-Version: 1.0 In-Reply-To: <20200930000652.10603-1-i.kosarev@tarantool.org> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Subject: Re: [Tarantool-patches] [PATCH v2] core: introduce evenly distributed int64 random in range List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Ilya Kosarev , alyapunov@tarantool.org Cc: tarantool-patches@dev.tarantool.org Hi! Thanks for the patch and for the good research, seriously! See 11 comments below. On 30.09.2020 02:06, Ilya Kosarev wrote: > Tarantool codebase had at least two functions to generate random > integer in given range and both of them had problems at least with > integer overflow. This patch brings nice function to generate random > int64_t in given range without overflow while preserving uniform random > generator distribution using unbiased bitmask with rejection method. > It is now possible to use xoshiro256++ PRNG or random bytes as a basis. > Most relevant replacements have been made. > > Closes #5075 > --- > Branch: https://github.com/tarantool/tarantool/tree/i.kosarev/gh-5075-fine-random-in-range > Issue: https://github.com/tarantool/tarantool/issues/5075 > > @ChangeLog: > * Introduce overflow resistant & evenly distributed random in range (gh-5075). 1. I wouldn't put it into the changelog. It is not a visible bugfix, nor a public feature. Rather some refactoring. > Changes in v2: > - implemented fine 64-bit random > - implemented fine range bounding > > src/lib/core/random.c | 57 +++++++++++++++++++++++++++++++++++++++++++ > src/lib/core/random.h | 38 +++++++++++++++++++++++++++++ > src/lib/swim/swim.c | 26 +++----------------- > test/unit/histogram.c | 12 +++------ 2. Please, add some unit tests checking that, for example, the randoms work fine even on the maximal range [INT64_MIN, INT64_MAX], with some edge cases like [INT64_MIN, INT64_MIN], [INT64_MAX, INT64_MAX], [1, 1], [0, 0], and some regular cases. Also the tests should check the borders are respected, and ASAN passes (originally the issue was discovered by ASAN which found signed integer overflows under ENABLE_UB_SANITIZER option). ASAN checks for signed int overflow UB are disabled for now because of SQL producing this UB a lot. But you can enable it by changing compiler.cmake, and running the unit tests. If you want to be sure it won't fail, when the overflow check will be enabled after SQL is fixed. > 4 files changed, 102 insertions(+), 31 deletions(-) > > diff --git a/src/lib/core/random.c b/src/lib/core/random.c > index f4fa75b1c..4b52408ea 100644 > --- a/src/lib/core/random.c > +++ b/src/lib/core/random.c> @@ -63,6 +65,8 @@ random_init(void) > srand: > srandom(seed); > srand(seed); > + > + random_bytes((char *)state, 32); 3. Never hardcode an array size except maybe in its declaration. sizeof(state) will return the same, but it is safe in case the state size ever changes. > } > > void > @@ -97,3 +101,56 @@ rand: > while (generated < size) > buf[generated++] = rand(); > } > + > +uint64_t > +real_random(void) > +{ > + uint64_t result; > + random_bytes((char *)(&result), 8); > + return result; > +} > + > +static inline > +uint64_t rotl(const uint64_t x, int k) 4. Sorry, we never put return type on the same line as function name. Also why did you declare 'x' const? Why 'k' is not const then? It seems you copy-pasted it from somewhere without changing. Please, convert it to our code style. 5. What is 'rotl'? > +{ > + return (x << k) | (x >> (64 - k)); > +} > + > +uint64_t > +pseudo_random(void) > +{ > + const uint64_t result = rotl(state[0] + state[3], 23) + state[0]; > + const uint64_t t = state[1] << 17; > + state[2] ^= state[0]; > + state[3] ^= state[1]; > + state[1] ^= state[2]; > + state[0] ^= state[3]; > + state[2] ^= t; > + state[3] = rotl(state[3], 45);> + return result; > +} > + > +static inline uint64_t > +random_in_range(uint64_t (*rnd)(void), uint64_t range) > +{ > + uint64_t mask = UINT64_MAX >> __builtin_clzll(range | 1); 6. Use bit_clz_u64(). You may not have __builtin_clzll. > + uint64_t result; > + do { > + result = rnd() & mask; 7. I guess we can hope the compiler will inline rnd() virtual call to a non-virtual call below, but I can't be 100% sure, and nobody can. Even if now you see it inlined in the assembly. Better duplicate these few lines or use a macro, than relying on the compiler. A single virtual call would ruin the perf of the algorithm you chose over mersen twister. > + } while (result > range); > + return result; > +} > + > +int64_t > +real_random_in_range(int64_t min, int64_t max) > +{ > + assert(max >= min); > + return min + random_in_range(real_random, (uint64_t) (max - min)); 8. I guess this will fail UB checks from ASAN, because max-min will be firstly stored into int64_t and only then cast to uint64_t. The first step can overflow int64_t. I remember there was a simple solution, but can't remember what it was. > +} > + > +int64_t > +pseudo_random_in_range(int64_t min, int64_t max) > +{ > + assert(max >= min); > + return min + random_in_range(pseudo_random, (uint64_t) (max - min)); > +} > diff --git a/src/lib/core/random.h b/src/lib/core/random.h > index 26109b14b..9228df72e 100644 > --- a/src/lib/core/random.h > +++ b/src/lib/core/random.h > @@ -45,6 +46,43 @@ random_free(void); > void > random_bytes(char *buf, size_t size); > > +/** > + * Just 8 random_bytes(). > + */ > +uint64_t > +real_random(void); > + > +/** > + * xoshiro256++ pseudo random generator. > + * http://prng.di.unimi.it/ > + * State is initialized with random_bytes(). > + */ > +uint64_t > +pseudo_random(void); > + > +/** > + * Return a random int64_t number within given boundaries. > + * > + * Instead of blindly calculating a modulo, this function uses > + * unbiased bitmask with rejection method to provide number in > + * given boundaries while preserving uniform distribution and > + * avoiding overflow. It uses random bytes as a basis. 9. Are the boundaries included or not? Both included, both excluded, first included and last excluded? > + */ > +int64_t > +real_random_in_range(int64_t min, int64_t max); > + > +/** > + * Return a pseudo random int64_t number within given boundaries. > + * > + * Instead of blindly calculating a modulo, this function uses > + * unbiased bitmask with rejection method to provide number in > + * given boundaries while preserving uniform distribution and > + * avoiding overflow. It uses xoshiro256++ as an uint64 pseudo > + * random generator. > + */ > +int64_t > +pseudo_random_in_range(int64_t min, int64_t max); > + > #if defined(__cplusplus) > } 10. When you changed SWIM to use the new generator, you made the tests harder to reproduce in case they fail. SWIM tests print random seed in the beginning, so as it could be used to repeat the test. Now you need to print the state of your pseudo generator too, so as it could be restored and reproduced. See swim_run_test(). 11. Would be nice to see motivation why the chosen generator was chosen in the comments in random.c.