[Tarantool-patches] [PATCH] core: introduce evenly distributed int64 random in range

Sat Sep 5 01:39:10 MSK 2020

Hi! Thanks for the patch!

See 2 comments below.

On 04.09.2020 15:51, Ilya Kosarev wrote:
> Tarantool codebase had at least two functions to generate random
> integer in given range and both of them had problems at least with
> integer overflow. This patch brings nice function to generate random
> int64_t in given range without overflow while preserving uniform random
> generator distribution. Most relevant replacements have been made.
> 
> Closes #5075
> ---
> Branch: https://github.com/tarantool/tarantool/tree/i.kosarev/gh-5075-fine-random-in-range
> Issue: https://github.com/tarantool/tarantool/issues/5075
> 
> @ChangeLog:
>  * Introduce overflow resistand & evenly distributed random in range (gh-5075).

1. resistand -> resistant

> 
>  src/lib/core/random.c |  8 ++++++++
>  src/lib/core/random.h | 12 ++++++++++++
>  src/lib/swim/swim.c   | 26 ++++----------------------
>  test/unit/histogram.c | 12 +++---------
>  4 files changed, 27 insertions(+), 31 deletions(-)
> 
> diff --git a/src/lib/core/random.c b/src/lib/core/random.c
> index f4fa75b1c1..2229c5fbbc 100644
> --- a/src/lib/core/random.c
> +++ b/src/lib/core/random.c
> @@ -97,3 +97,11 @@ rand:
>  	while (generated < size)
>  		buf[generated++] = rand();
>  }
> +
> +int64_t
> +random_in_range(int64_t min, int64_t max)
> +{
> +	assert(max >= min);
> +	double drand = drand48();
> +	return (int64_t)(drand * max + (1 - drand) * min);
> +}

2. The problem here is that you won't get an overflow, but you won't get a
completely random number either. This function is not able to return
anything > 48 bits without precision loss. So for example if I will give it
a range [INT64_MIN, INT64_MAX], it will never return lots of integers > 2^48.
Especially the ones > 2^53, since starting from this point double will loose
integer precision.

I expect a random function be able to return any value from the given range.
Not only certain ones.

In other words, this solution is not much better than rand() %. It also is not
able to return the complete range.

I was thinking about that a lot, and this is why I filed a ticket. It is not
a trivial task.

The function should be able to return any value from 0 to 2^64-1 to cover the
entire [INT64_MIN, INT64_MAX], or [0, UINT64_MAX].

I was thinking we could use random_bytes() to get a value [0, UINT64_MAX], and
then scale it down to the given range. For example, if the range is [A, B], then
its length L = B - A. And the random value is R.

Then a scaled down value would be

    S = R / UINT64_MAX * L

The problem here is the precision loss again. When you will divide R / UINT64_MAX
to get a value from [0, 1] range, the result is likely to be garbage. Double
can't handle that precision. This is where I stopped and filed the ticket.

Another issue - random_bytes() is easy to use, but it is not pseudo-random. It is
completely random. This is actually good usually, but not for the tests. In swim
unit tests knowing random seed is the only way to reproduce a fail. That makes me
think we need two random functions: for pseudo-random values and for real random.

Pseudo-random could be generated by concatenation of several rand() calls using
<< and |.

Another option - implement Mersenne twister algorithm of pseudo-random numbers of
uint64_t type.

Also I would look into what std:: has in C++. Its random functionality is much
richer than rand().