Re: [Tarantool-patches] [PATCH v2] core: introduce evenly distributed int64 random in range

Tarantool development patches archive
 help / color / mirror / Atom feed

From: "Ilya Kosarev" <i.kosarev@tarantool.org>
To: "Vladislav Shpilevoy" <v.shpilevoy@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH v2] core: introduce evenly distributed int64 random in range
Date: Sat, 24 Oct 2020 00:12:30 +0300	[thread overview]
Message-ID: <1603487550.604682360@f196.i.mail.ru> (raw)
In-Reply-To: <784163c9-c68f-7e54-7884-f2a267660a5d@tarantool.org>

[-- Attachment #1: Type: text/plain, Size: 7633 bytes --]


Hi,
 
Thanks for your review!
 
Sent v3 of the patch considering your comments. 
>Суббота, 3 октября 2020, 1:43 +03:00 от Vladislav Shpilevoy <v.shpilevoy@tarantool.org>:
> 
>Hi! Thanks for the patch and for the good research, seriously!
>
>See 11 comments below.
>
>On 30.09.2020 02:06, Ilya Kosarev wrote:
>> Tarantool codebase had at least two functions to generate random
>> integer in given range and both of them had problems at least with
>> integer overflow. This patch brings nice function to generate random
>> int64_t in given range without overflow while preserving uniform random
>> generator distribution using unbiased bitmask with rejection method.
>> It is now possible to use xoshiro256++ PRNG or random bytes as a basis.
>> Most relevant replacements have been made.
>>
>> Closes #5075
>> ---
>> Branch:  https://github.com/tarantool/tarantool/tree/i.kosarev/gh-5075-fine-random-in-range
>> Issue:  https://github.com/tarantool/tarantool/issues/5075
>>
>> @ChangeLog:
>> * Introduce overflow resistant & evenly distributed random in range (gh-5075).
>
>1. I wouldn't put it into the changelog. It is not a visible bugfix,
>nor a public feature. Rather some refactoring.
Ok, i see, removed in v3 of the patch.
>
>> Changes in v2:
>> - implemented fine 64-bit random
>> - implemented fine range bounding
>>
>> src/lib/core/random.c | 57 +++++++++++++++++++++++++++++++++++++++++++
>> src/lib/core/random.h | 38 +++++++++++++++++++++++++++++
>> src/lib/swim/swim.c | 26 +++-----------------
>> test/unit/histogram.c | 12 +++------
>
>2. Please, add some unit tests checking that, for example,
>the randoms work fine even on the maximal range [INT64_MIN, INT64_MAX],
>with some edge cases like [INT64_MIN, INT64_MIN], [INT64_MAX, INT64_MAX],
>[1, 1], [0, 0], and some regular cases. Also the tests should check the
>borders are respected, and ASAN passes (originally the issue was discovered
>by ASAN which found signed integer overflows under ENABLE_UB_SANITIZER
>option).
All right, done in v3.
>
>ASAN checks for signed int overflow UB are disabled for now because
>of SQL producing this UB a lot. But you can enable it by changing
>compiler.cmake, and running the unit tests. If you want to be sure
>it won't fail, when the overflow check will be enabled after SQL is
>fixed.
Yes, thanks for the instruction, checked locally.
>
>> 4 files changed, 102 insertions(+), 31 deletions(-)
>>
>> diff --git a/src/lib/core/random.c b/src/lib/core/random.c
>> index f4fa75b1c..4b52408ea 100644
>> --- a/src/lib/core/random.c
>> +++ b/src/lib/core/random.c> @@ -63,6 +65,8 @@ random_init(void)
>> srand:
>> srandom(seed);
>> srand(seed);
>> +
>> + random_bytes((char *)state, 32);
>
>3. Never hardcode an array size except maybe in its declaration.
>sizeof(state) will return the same, but it is safe in case the
>state size ever changes.
Right, fixed in v3.
>
>> }
>>
>> void
>> @@ -97,3 +101,56 @@ rand:
>> while (generated < size)
>> buf[generated++] = rand();
>> }
>> +
>> +uint64_t
>> +real_random(void)
>> +{
>> + uint64_t result;
>> + random_bytes((char *)(&result), 8);
>> + return result;
>> +}
>> +
>> +static inline
>> +uint64_t rotl(const uint64_t x, int k)
>
>4. Sorry, we never put return type on the same line as function
>name. Also why did you declare 'x' const? Why 'k' is not const
>then? It seems you copy-pasted it from somewhere without
>changing. Please, convert it to our code style.
Yes, right, didn’t think about the difference. Fixed in v3.
Well, yes, it was borrowed from here originally, as mentioned:
http://prng.di.unimi.it/xoshiro256plusplus.c
>
>5. What is 'rotl'?
Comment is added in v3: «rotate left».
>
>> +{
>> + return (x << k) | (x >> (64 - k));
>> +}
>> +
>> +uint64_t
>> +pseudo_random(void)
>> +{
>> + const uint64_t result = rotl(state[0] + state[3], 23) + state[0];
>> + const uint64_t t = state[1] << 17;
>> + state[2] ^= state[0];
>> + state[3] ^= state[1];
>> + state[1] ^= state[2];
>> + state[0] ^= state[3];
>> + state[2] ^= t;
>> + state[3] = rotl(state[3], 45);> + return result;
>> +}
>> +
>> +static inline uint64_t
>> +random_in_range(uint64_t (*rnd)(void), uint64_t range)
>> +{
>> + uint64_t mask = UINT64_MAX >> __builtin_clzll(range | 1);
>
>6. Use bit_clz_u64(). You may not have __builtin_clzll.
All right, fixed in v3.
>
>> + uint64_t result;
>> + do {
>> + result = rnd() & mask;
>
>7. I guess we can hope the compiler will inline rnd() virtual
>call to a non-virtual call below, but I can't be 100% sure, and
>nobody can. Even if now you see it inlined in the assembly. Better
>duplicate these few lines or use a macro, than relying on the
>compiler. A single virtual call would ruin the perf of the
>algorithm you chose over mersen twister.
I was quite sure it is reliable, but couldn’t find the proof. Changed in v3.
>
>> + } while (result > range);
>> + return result;
>> +}
>> +
>> +int64_t
>> +real_random_in_range(int64_t min, int64_t max)
>> +{
>> + assert(max >= min);
>> + return min + random_in_range(real_random, (uint64_t) (max - min));
>
>8. I guess this will fail UB checks from ASAN, because max-min will
>be firstly stored into int64_t and only then cast to uint64_t. The
>first step can overflow int64_t. I remember there was a simple solution,
>but can't remember what it was.
Yes, ASAN is fine with  ( uint64_t )max - min.  Fixed in v3.
>
>> +}
>> +
>> +int64_t
>> +pseudo_random_in_range(int64_t min, int64_t max)
>> +{
>> + assert(max >= min);
>> + return min + random_in_range(pseudo_random, (uint64_t) (max - min));
>> +}
>> diff --git a/src/lib/core/random.h b/src/lib/core/random.h
>> index 26109b14b..9228df72e 100644
>> --- a/src/lib/core/random.h
>> +++ b/src/lib/core/random.h
>> @@ -45,6 +46,43 @@ random_free(void);
>> void
>> random_bytes(char *buf, size_t size);
>>
>> +/**
>> + * Just 8 random_bytes().
>> + */
>> +uint64_t
>> +real_random(void);
>> +
>> +/**
>> + * xoshiro256++ pseudo random generator.
>> + *  http://prng.di.unimi.it/
>> + * State is initialized with random_bytes().
>> + */
>> +uint64_t
>> +pseudo_random(void);
>> +
>> +/**
>> + * Return a random int64_t number within given boundaries.
>> + *
>> + * Instead of blindly calculating a modulo, this function uses
>> + * unbiased bitmask with rejection method to provide number in
>> + * given boundaries while preserving uniform distribution and
>> + * avoiding overflow. It uses random bytes as a basis.
>
>9. Are the boundaries included or not? Both included, both
>excluded, first included and last excluded?
Added to the comment in v3.
>
>> + */
>> +int64_t
>> +real_random_in_range(int64_t min, int64_t max);
>> +
>> +/**
>> + * Return a pseudo random int64_t number within given boundaries.
>> + *
>> + * Instead of blindly calculating a modulo, this function uses
>> + * unbiased bitmask with rejection method to provide number in
>> + * given boundaries while preserving uniform distribution and
>> + * avoiding overflow. It uses xoshiro256++ as an uint64 pseudo
>> + * random generator.
>> + */
>> +int64_t
>> +pseudo_random_in_range(int64_t min, int64_t max);
>> +
>> #if defined(__cplusplus)
>> }
>
>10. When you changed SWIM to use the new generator, you made the
>tests harder to reproduce in case they fail. SWIM tests print random
>seed in the beginning, so as it could be used to repeat the test.
>Now you need to print the state of your pseudo generator too, so as
>it could be restored and reproduced. See swim_run_test().
All right, done in v3.
>
>11. Would be nice to see motivation why the chosen generator was
>chosen in the comments in random.c.
Right, added some arguments in comment.
 
--
Ilya Kosarev

[-- Attachment #2: Type: text/html, Size: 11680 bytes --]

     prev parent reply	other threads:[~2020-10-23 21:12 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-30  0:06 Ilya Kosarev
2020-10-02 22:43 ` Vladislav Shpilevoy
2020-10-23 21:12   ` Ilya Kosarev [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1603487550.604682360@f196.i.mail.ru \
    --to=i.kosarev@tarantool.org \
    --cc=tarantool-patches@dev.tarantool.org \
    --cc=v.shpilevoy@tarantool.org \
    --subject='Re: [Tarantool-patches] [PATCH v2] core: introduce evenly distributed int64 random in range' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox