<HTML><BODY><div><div>Hi,</div><div> </div><div>Thanks for your review!</div><div> </div><div>Sent v3 of the patch considering your comments.</div><blockquote style="border-left:1px solid #0857A6; margin:10px; padding:0 0 0 10px;">Суббота, 3 октября 2020, 1:43 +03:00 от Vladislav Shpilevoy <v.shpilevoy@tarantool.org>:<br> <div id=""><div class="js-helper js-readmsg-msg"><div><div id="style_16016785822096867628_BODY">Hi! Thanks for the patch and for the good research, seriously!<br><br>See 11 comments below.<br><br>On 30.09.2020 02:06, Ilya Kosarev wrote:<br>> Tarantool codebase had at least two functions to generate random<br>> integer in given range and both of them had problems at least with<br>> integer overflow. This patch brings nice function to generate random<br>> int64_t in given range without overflow while preserving uniform random<br>> generator distribution using unbiased bitmask with rejection method.<br>> It is now possible to use xoshiro256++ PRNG or random bytes as a basis.<br>> Most relevant replacements have been made.<br>><br>> Closes #5075<br>> ---<br>> Branch: <a href="https://github.com/tarantool/tarantool/tree/i.kosarev/gh-5075-fine-random-in-range" target="_blank">https://github.com/tarantool/tarantool/tree/i.kosarev/gh-5075-fine-random-in-range</a><br>> Issue: <a href="https://github.com/tarantool/tarantool/issues/5075" target="_blank">https://github.com/tarantool/tarantool/issues/5075</a><br>><br>> @ChangeLog:<br>> * Introduce overflow resistant & evenly distributed random in range (gh-5075).<br><br>1. I wouldn't put it into the changelog. It is not a visible bugfix,<br>nor a public feature. Rather some refactoring.</div></div></div></div></blockquote></div><div>Ok, i see, removed in v3 of the patch.</div><div><blockquote style="border-left:1px solid #0857A6; margin:10px; padding:0 0 0 10px;"><div><div class="js-helper js-readmsg-msg"><div><div><br>> Changes in v2:<br>> - implemented fine 64-bit random<br>> - implemented fine range bounding<br>><br>> src/lib/core/random.c | 57 +++++++++++++++++++++++++++++++++++++++++++<br>> src/lib/core/random.h | 38 +++++++++++++++++++++++++++++<br>> src/lib/swim/swim.c | 26 +++-----------------<br>> test/unit/histogram.c | 12 +++------<br><br>2. Please, add some unit tests checking that, for example,<br>the randoms work fine even on the maximal range [INT64_MIN, INT64_MAX],<br>with some edge cases like [INT64_MIN, INT64_MIN], [INT64_MAX, INT64_MAX],<br>[1, 1], [0, 0], and some regular cases. Also the tests should check the<br>borders are respected, and ASAN passes (originally the issue was discovered<br>by ASAN which found signed integer overflows under ENABLE_UB_SANITIZER<br>option).</div></div></div></div></blockquote></div><div>All right, done in v3.</div><div><blockquote style="border-left:1px solid #0857A6; margin:10px; padding:0 0 0 10px;"><div><div class="js-helper js-readmsg-msg"><div><div><br>ASAN checks for signed int overflow UB are disabled for now because<br>of SQL producing this UB a lot. But you can enable it by changing<br>compiler.cmake, and running the unit tests. If you want to be sure<br>it won't fail, when the overflow check will be enabled after SQL is<br>fixed.</div></div></div></div></blockquote></div><div>Yes, thanks for the instruction, checked locally.</div><div><blockquote style="border-left:1px solid #0857A6; margin:10px; padding:0 0 0 10px;"><div><div class="js-helper js-readmsg-msg"><div><div><br>> 4 files changed, 102 insertions(+), 31 deletions(-)<br>><br>> diff --git a/src/lib/core/random.c b/src/lib/core/random.c<br>> index f4fa75b1c..4b52408ea 100644<br>> --- a/src/lib/core/random.c<br>> +++ b/src/lib/core/random.c> @@ -63,6 +65,8 @@ random_init(void)<br>> srand:<br>> srandom(seed);<br>> srand(seed);<br>> +<br>> + random_bytes((char *)state, 32);<br><br>3. Never hardcode an array size except maybe in its declaration.<br>sizeof(state) will return the same, but it is safe in case the<br>state size ever changes.</div></div></div></div></blockquote></div><div>Right, fixed in v3.</div><div><blockquote style="border-left:1px solid #0857A6; margin:10px; padding:0 0 0 10px;"><div><div class="js-helper js-readmsg-msg"><div><div><br>> }<br>><br>> void<br>> @@ -97,3 +101,56 @@ rand:<br>> while (generated < size)<br>> buf[generated++] = rand();<br>> }<br>> +<br>> +uint64_t<br>> +real_random(void)<br>> +{<br>> + uint64_t result;<br>> + random_bytes((char *)(&result), 8);<br>> + return result;<br>> +}<br>> +<br>> +static inline<br>> +uint64_t rotl(const uint64_t x, int k)<br><br>4. Sorry, we never put return type on the same line as function<br>name. Also why did you declare 'x' const? Why 'k' is not const<br>then? It seems you copy-pasted it from somewhere without<br>changing. Please, convert it to our code style.</div></div></div></div></blockquote></div><div>Yes, right, didn’t think about the difference. Fixed in v3.</div><div>Well, yes, it was borrowed from here originally, as mentioned:</div><div>http://prng.di.unimi.it/xoshiro256plusplus.c</div><div><blockquote style="border-left:1px solid #0857A6; margin:10px; padding:0 0 0 10px;"><div><div class="js-helper js-readmsg-msg"><div><div><br>5. What is 'rotl'?</div></div></div></div></blockquote></div><div>Comment is added in v3: «rotate left».</div><div><blockquote style="border-left:1px solid #0857A6; margin:10px; padding:0 0 0 10px;"><div><div class="js-helper js-readmsg-msg"><div><div><br>> +{<br>> + return (x << k) | (x >> (64 - k));<br>> +}<br>> +<br>> +uint64_t<br>> +pseudo_random(void)<br>> +{<br>> + const uint64_t result = rotl(state[0] + state[3], 23) + state[0];<br>> + const uint64_t t = state[1] << 17;<br>> + state[2] ^= state[0];<br>> + state[3] ^= state[1];<br>> + state[1] ^= state[2];<br>> + state[0] ^= state[3];<br>> + state[2] ^= t;<br>> + state[3] = rotl(state[3], 45);> + return result;<br>> +}<br>> +<br>> +static inline uint64_t<br>> +random_in_range(uint64_t (*rnd)(void), uint64_t range)<br>> +{<br>> + uint64_t mask = UINT64_MAX >> __builtin_clzll(range | 1);<br><br>6. Use bit_clz_u64(). You may not have __builtin_clzll.</div></div></div></div></blockquote></div><div>All right, fixed in v3.</div><div><blockquote style="border-left:1px solid #0857A6; margin:10px; padding:0 0 0 10px;"><div><div class="js-helper js-readmsg-msg"><div><div><br>> + uint64_t result;<br>> + do {<br>> + result = rnd() & mask;<br><br>7. I guess we can hope the compiler will inline rnd() virtual<br>call to a non-virtual call below, but I can't be 100% sure, and<br>nobody can. Even if now you see it inlined in the assembly. Better<br>duplicate these few lines or use a macro, than relying on the<br>compiler. A single virtual call would ruin the perf of the<br>algorithm you chose over mersen twister.</div></div></div></div></blockquote></div><div>I was quite sure it is reliable, but couldn’t find the proof. Changed in v3.</div><div><blockquote style="border-left:1px solid #0857A6; margin:10px; padding:0 0 0 10px;"><div><div class="js-helper js-readmsg-msg"><div><div><br>> + } while (result > range);<br>> + return result;<br>> +}<br>> +<br>> +int64_t<br>> +real_random_in_range(int64_t min, int64_t max)<br>> +{<br>> + assert(max >= min);<br>> + return min + random_in_range(real_random, (uint64_t) (max - min));<br><br>8. I guess this will fail UB checks from ASAN, because max-min will<br>be firstly stored into int64_t and only then cast to uint64_t. The<br>first step can overflow int64_t. I remember there was a simple solution,<br>but can't remember what it was.</div></div></div></div></blockquote></div><div>Yes, ASAN is fine with <span style="color: rgb(8, 8, 8); font-family: "JetBrains Mono", monospace; font-size: 9.8pt;">(</span><span style="font-family: "JetBrains Mono", monospace; font-size: 9.8pt; color: rgb(55, 31, 128);">uint64_t</span><span style="color: rgb(8, 8, 8); font-family: "JetBrains Mono", monospace; font-size: 9.8pt;">)max - min. </span>Fixed in v3.</div><div><blockquote style="border-left:1px solid #0857A6; margin:10px; padding:0 0 0 10px;"><div><div class="js-helper js-readmsg-msg"><div><div><br>> +}<br>> +<br>> +int64_t<br>> +pseudo_random_in_range(int64_t min, int64_t max)<br>> +{<br>> + assert(max >= min);<br>> + return min + random_in_range(pseudo_random, (uint64_t) (max - min));<br>> +}<br>> diff --git a/src/lib/core/random.h b/src/lib/core/random.h<br>> index 26109b14b..9228df72e 100644<br>> --- a/src/lib/core/random.h<br>> +++ b/src/lib/core/random.h<br>> @@ -45,6 +46,43 @@ random_free(void);<br>> void<br>> random_bytes(char *buf, size_t size);<br>><br>> +/**<br>> + * Just 8 random_bytes().<br>> + */<br>> +uint64_t<br>> +real_random(void);<br>> +<br>> +/**<br>> + * xoshiro256++ pseudo random generator.<br>> + * <a href="http://prng.di.unimi.it/" target="_blank">http://prng.di.unimi.it/</a><br>> + * State is initialized with random_bytes().<br>> + */<br>> +uint64_t<br>> +pseudo_random(void);<br>> +<br>> +/**<br>> + * Return a random int64_t number within given boundaries.<br>> + *<br>> + * Instead of blindly calculating a modulo, this function uses<br>> + * unbiased bitmask with rejection method to provide number in<br>> + * given boundaries while preserving uniform distribution and<br>> + * avoiding overflow. It uses random bytes as a basis.<br><br>9. Are the boundaries included or not? Both included, both<br>excluded, first included and last excluded?</div></div></div></div></blockquote></div><div>Added to the comment in v3.</div><div><blockquote style="border-left:1px solid #0857A6; margin:10px; padding:0 0 0 10px;"><div><div class="js-helper js-readmsg-msg"><div><div><br>> + */<br>> +int64_t<br>> +real_random_in_range(int64_t min, int64_t max);<br>> +<br>> +/**<br>> + * Return a pseudo random int64_t number within given boundaries.<br>> + *<br>> + * Instead of blindly calculating a modulo, this function uses<br>> + * unbiased bitmask with rejection method to provide number in<br>> + * given boundaries while preserving uniform distribution and<br>> + * avoiding overflow. It uses xoshiro256++ as an uint64 pseudo<br>> + * random generator.<br>> + */<br>> +int64_t<br>> +pseudo_random_in_range(int64_t min, int64_t max);<br>> +<br>> #if defined(__cplusplus)<br>> }<br><br>10. When you changed SWIM to use the new generator, you made the<br>tests harder to reproduce in case they fail. SWIM tests print random<br>seed in the beginning, so as it could be used to repeat the test.<br>Now you need to print the state of your pseudo generator too, so as<br>it could be restored and reproduced. See swim_run_test().</div></div></div></div></blockquote></div><div>All right, done in v3.</div><div><blockquote style="border-left:1px solid #0857A6; margin:10px; padding:0 0 0 10px;"><div><div class="js-helper js-readmsg-msg"><div><div><br>11. Would be nice to see motivation why the chosen generator was<br>chosen in the comments in random.c.</div></div></div></div></blockquote></div><div>Right, added some arguments in comment.</div><div> <div data-signature-widget="container"><div data-signature-widget="content"><div>--<br>Ilya Kosarev</div></div></div></div></BODY></HTML>