Re: [Tarantool-patches] [PATCH] core: introduce evenly distributed int64 random in range

Tarantool development patches archive
 help / color / mirror / Atom feed

From: "Ilya Kosarev" <i.kosarev@tarantool.org>
To: "Vladislav Shpilevoy" <v.shpilevoy@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH] core: introduce evenly distributed int64 random in range
Date: Sun, 06 Sep 2020 16:19:58 +0300	[thread overview]
Message-ID: <1599398398.275563907@f169.i.mail.ru> (raw)
In-Reply-To: <67586d6e-46aa-e4f4-1ee7-3fb733bd80db@tarantool.org>

[-- Attachment #1: Type: text/plain, Size: 5273 bytes --]


Hi!
 
Thanks for your review.
 
See my answers & questions below.
  
>Суббота, 5 сентября 2020, 1:39 +03:00 от Vladislav Shpilevoy <v.shpilevoy@tarantool.org>:
> 
>Hi! Thanks for the patch!
>
>See 2 comments below.
>
>On 04.09.2020 15:51, Ilya Kosarev wrote:
>> Tarantool codebase had at least two functions to generate random
>> integer in given range and both of them had problems at least with
>> integer overflow. This patch brings nice function to generate random
>> int64_t in given range without overflow while preserving uniform random
>> generator distribution. Most relevant replacements have been made.
>>
>> Closes #5075
>> ---
>> Branch:  https://github.com/tarantool/tarantool/tree/i.kosarev/gh-5075-fine-random-in-range
>> Issue:  https://github.com/tarantool/tarantool/issues/5075
>>
>> @ChangeLog:
>> * Introduce overflow resistand & evenly distributed random in range (gh-5075).
>
>1. resistand -> resistant
>
>>
>> src/lib/core/random.c | 8 ++++++++
>> src/lib/core/random.h | 12 ++++++++++++
>> src/lib/swim/swim.c | 26 ++++----------------------
>> test/unit/histogram.c | 12 +++---------
>> 4 files changed, 27 insertions(+), 31 deletions(-)
>>
>> diff --git a/src/lib/core/random.c b/src/lib/core/random.c
>> index f4fa75b1c1..2229c5fbbc 100644
>> --- a/src/lib/core/random.c
>> +++ b/src/lib/core/random.c
>> @@ -97,3 +97,11 @@ rand:
>> while (generated < size)
>> buf[generated++] = rand();
>> }
>> +
>> +int64_t
>> +random_in_range(int64_t min, int64_t max)
>> +{
>> + assert(max >= min);
>> + double drand = drand48();
>> + return (int64_t)(drand * max + (1 - drand) * min);
>> +}
>
>2. The problem here is that you won't get an overflow, but you won't get a
>completely random number either. This function is not able to return
>anything > 48 bits without precision loss. So for example if I will give it
>a range [INT64_MIN, INT64_MAX], it will never return lots of integers > 2^48.
>Especially the ones > 2^53, since starting from this point double will loose
>integer precision.
*  
Ok, right, i see the problem here. Didn’t think it through enough.
>
>I expect a random function be able to return any value from the given range.
>Not only certain ones.
>
>In other words, this solution is not much better than rand() %. It also is not
>able to return the complete range.
>
>I was thinking about that a lot, and this is why I filed a ticket. It is not
>a trivial task.
>
>The function should be able to return any value from 0 to 2^64-1 to cover the
>entire [INT64_MIN, INT64_MAX], or [0, UINT64_MAX].
>
>I was thinking we could use random_bytes() to get a value [0, UINT64_MAX], and
>then scale it down to the given range. For example, if the range is [A, B], then
>its length L = B - A. And the random value is R.
>
>Then a scaled down value would be
>
>    S = R / UINT64_MAX * L
>
>The problem here is the precision loss again. When you will divide R / UINT64_MAX
>to get a value from [0, 1] range, the result is likely to be garbage. Double
>can't handle that precision. This is where I stopped and filed the ticket.
*  
Well, yes, scaling doesn’t really seem to work as good as we want to.
I looked through this page:  https://www.pcg-random.org/posts/bounded-rands.html
on generation in range and found out a bounding method which i think is
the most suitable for us. It works for int64_t too and can be used both
for «complete random» and «pseudo-random»:
```
uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    uint32_t mask = ~uint32_t(0);
    --range;
    mask >>= __builtin_clz(range|1);
    uint32_t x;
    do {
        x = rng() & mask;
    } while (x > range);
    return x;
}
```
>Another issue - random_bytes() is easy to use, but it is not pseudo-random. It is
>completely random. This is actually good usually, but not for the tests. In swim
>unit tests knowing random seed is the only way to reproduce a fail. That makes me
>think we need two random functions: for pseudo-random values and for real random.
>
>Pseudo-random could be generated by concatenation of several rand() calls using
><< and |.
>
>Another option - implement Mersenne twister algorithm of pseudo-random numbers of
>uint64_t type.
>
>Also I would look into what std:: has in C++. Its random functionality is much
>richer than rand().
*  
Right, C++ has mersenne_twister int64_t implementation and some solution
to get int64_t in range as following:
 
 
```
std::random_device rd;
std::mt19937_64 generator(rd());
std::uniform_int_distribution<int64_t> range(min, max);
return range(generator);
```
However, C++ implementation seems to be overcomplicated.
I think it is better idea to implement something simple & fast.
 
For the mersenne_twister the good idea is to adopt this implementation,
as far as i see it:
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt64.html
 
But what i think might be even better is to take something less classic
while more relevant, according to this paper:  https://arxiv.org/pdf/1910.06437.pdf
 
I think the best option for us now is xoshiro256++:  http://prng.di.unimi.it/
It seems to be much faster and doesn’t fail any known statistical test as far as i see.
The implementation to adapt:  http://prng.di.unimi.it/xoshiro256plusplus.c
 
--
Ilya Kosarev
 

[-- Attachment #2: Type: text/html, Size: 7483 bytes --]

next prev parent reply	other threads:[~2020-09-06 13:19 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-04 13:51 Ilya Kosarev
2020-09-04 22:39 ` Vladislav Shpilevoy
2020-09-06 13:19   ` Ilya Kosarev [this message]
2020-09-17 13:56     ` Vladislav Shpilevoy
  -- strict thread matches above, loose matches on Subject: below --
2020-09-04 13:20 Ilya Kosarev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1599398398.275563907@f169.i.mail.ru \
    --to=i.kosarev@tarantool.org \
    --cc=tarantool-patches@dev.tarantool.org \
    --cc=v.shpilevoy@tarantool.org \
    --subject='Re: [Tarantool-patches] [PATCH] core: introduce evenly distributed int64 random in range' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox