[Tarantool-patches] [PATCH v2] core: introduce evenly distributed int64 random in range

Sat Oct 3 01:43:01 MSK 2020

Hi! Thanks for the patch and for the good research, seriously!

See 11 comments below.

On 30.09.2020 02:06, Ilya Kosarev wrote:
> Tarantool codebase had at least two functions to generate random
> integer in given range and both of them had problems at least with
> integer overflow. This patch brings nice function to generate random
> int64_t in given range without overflow while preserving uniform random
> generator distribution using unbiased bitmask with rejection method.
> It is now possible to use xoshiro256++ PRNG or random bytes as a basis.
> Most relevant replacements have been made.
> 
> Closes #5075
> ---
> Branch: https://github.com/tarantool/tarantool/tree/i.kosarev/gh-5075-fine-random-in-range
> Issue: https://github.com/tarantool/tarantool/issues/5075
> 
> @ChangeLog:
>  * Introduce overflow resistant & evenly distributed random in range (gh-5075).

1. I wouldn't put it into the changelog. It is not a visible bugfix,
nor a public feature. Rather some refactoring.

> Changes in v2:
>  - implemented fine 64-bit random
>  - implemented fine range bounding
> 
>  src/lib/core/random.c | 57 +++++++++++++++++++++++++++++++++++++++++++
>  src/lib/core/random.h | 38 +++++++++++++++++++++++++++++
>  src/lib/swim/swim.c   | 26 +++-----------------
>  test/unit/histogram.c | 12 +++------

2. Please, add some unit tests checking that, for example,
the randoms work fine even on the maximal range [INT64_MIN, INT64_MAX],
with some edge cases like [INT64_MIN, INT64_MIN], [INT64_MAX, INT64_MAX],
[1, 1], [0, 0], and some regular cases. Also the tests should check the
borders are respected, and ASAN passes (originally the issue was discovered
by ASAN which found signed integer overflows under ENABLE_UB_SANITIZER
option).

ASAN checks for signed int overflow UB are disabled for now because
of SQL producing this UB a lot. But you can enable it by changing
compiler.cmake, and running the unit tests. If you want to be sure
it won't fail, when the overflow check will be enabled after SQL is
fixed.

>  4 files changed, 102 insertions(+), 31 deletions(-)
> 
> diff --git a/src/lib/core/random.c b/src/lib/core/random.c
> index f4fa75b1c..4b52408ea 100644
> --- a/src/lib/core/random.c
> +++ b/src/lib/core/random.c> @@ -63,6 +65,8 @@ random_init(void)
>  srand:
>  	srandom(seed);
>  	srand(seed);
> +
> +	random_bytes((char *)state, 32);

3. Never hardcode an array size except maybe in its declaration.
sizeof(state) will return the same, but it is safe in case the
state size ever changes.

>  }
>  
>  void
> @@ -97,3 +101,56 @@ rand:
>  	while (generated < size)
>  		buf[generated++] = rand();
>  }
> +
> +uint64_t
> +real_random(void)
> +{
> +	uint64_t result;
> +	random_bytes((char *)(&result), 8);
> +	return result;
> +}
> +
> +static inline
> +uint64_t rotl(const uint64_t x, int k)

4. Sorry, we never put return type on the same line as function
name. Also why did you declare 'x' const? Why 'k' is not const
then? It seems you copy-pasted it from somewhere without
changing. Please, convert it to our code style.

5. What is 'rotl'?

> +{
> +	return (x << k) | (x >> (64 - k));
> +}
> +
> +uint64_t
> +pseudo_random(void)
> +{
> +	const uint64_t result = rotl(state[0] + state[3], 23) + state[0];
> +	const uint64_t t = state[1] << 17;
> +	state[2] ^= state[0];
> +	state[3] ^= state[1];
> +	state[1] ^= state[2];
> +	state[0] ^= state[3];
> +	state[2] ^= t;
> +	state[3] = rotl(state[3], 45);> +	return result;
> +}
> +
> +static inline uint64_t
> +random_in_range(uint64_t (*rnd)(void), uint64_t range)
> +{
> +	uint64_t mask = UINT64_MAX >> __builtin_clzll(range | 1);

6. Use bit_clz_u64(). You may not have __builtin_clzll.

> +	uint64_t result;
> +	do {
> +		result = rnd() & mask;

7. I guess we can hope the compiler will inline rnd() virtual
call to a non-virtual call below, but I can't be 100% sure, and
nobody can. Even if now you see it inlined in the assembly. Better
duplicate these few lines or use a macro, than relying on the
compiler. A single virtual call would ruin the perf of the
algorithm you chose over mersen twister.

> +	} while (result > range);
> +	return result;
> +}
> +
> +int64_t
> +real_random_in_range(int64_t min, int64_t max)
> +{
> +	assert(max >= min);
> +	return min + random_in_range(real_random, (uint64_t) (max - min));

8. I guess this will fail UB checks from ASAN, because max-min will
be firstly stored into int64_t and only then cast to uint64_t. The
first step can overflow int64_t. I remember there was a simple solution,
but can't remember what it was.

> +}
> +
> +int64_t
> +pseudo_random_in_range(int64_t min, int64_t max)
> +{
> +	assert(max >= min);
> +	return min + random_in_range(pseudo_random, (uint64_t) (max - min));
> +}
> diff --git a/src/lib/core/random.h b/src/lib/core/random.h
> index 26109b14b..9228df72e 100644
> --- a/src/lib/core/random.h
> +++ b/src/lib/core/random.h
> @@ -45,6 +46,43 @@ random_free(void);
>  void
>  random_bytes(char *buf, size_t size);
>  
> +/**
> + * Just 8 random_bytes().
> + */
> +uint64_t
> +real_random(void);
> +
> +/**
> + * xoshiro256++ pseudo random generator.
> + * http://prng.di.unimi.it/
> + * State is initialized with random_bytes().
> + */
> +uint64_t
> +pseudo_random(void);
> +
> +/**
> + * Return a random int64_t number within given boundaries.
> + *
> + * Instead of blindly calculating a modulo, this function uses
> + * unbiased bitmask with rejection method to provide number in
> + * given boundaries while preserving uniform distribution and
> + * avoiding overflow. It uses random bytes as a basis.

9. Are the boundaries included or not? Both included, both
excluded, first included and last excluded?

> + */
> +int64_t
> +real_random_in_range(int64_t min, int64_t max);
> +
> +/**
> + * Return a pseudo random int64_t number within given boundaries.
> + *
> + * Instead of blindly calculating a modulo, this function uses
> + * unbiased bitmask with rejection method to provide number in
> + * given boundaries while preserving uniform distribution and
> + * avoiding overflow. It uses xoshiro256++ as an uint64 pseudo
> + * random generator.
> + */
> +int64_t
> +pseudo_random_in_range(int64_t min, int64_t max);
> +
>  #if defined(__cplusplus)
>  }

10. When you changed SWIM to use the new generator, you made the
tests harder to reproduce in case they fail. SWIM tests print random
seed in the beginning, so as it could be used to repeat the test.
Now you need to print the state of your pseudo generator too, so as
it could be restored and reproduced. See swim_run_test().

11. Would be nice to see motivation why the chosen generator was
chosen in the comments in random.c.