Hi!   Thanks for your review.   See my answers & questions below.   >Суббота, 5 сентября 2020, 1:39 +03:00 от Vladislav Shpilevoy : >  >Hi! Thanks for the patch! > >See 2 comments below. > >On 04.09.2020 15:51, Ilya Kosarev wrote: >> Tarantool codebase had at least two functions to generate random >> integer in given range and both of them had problems at least with >> integer overflow. This patch brings nice function to generate random >> int64_t in given range without overflow while preserving uniform random >> generator distribution. Most relevant replacements have been made. >> >> Closes #5075 >> --- >> Branch: https://github.com/tarantool/tarantool/tree/i.kosarev/gh-5075-fine-random-in-range >> Issue: https://github.com/tarantool/tarantool/issues/5075 >> >> @ChangeLog: >> * Introduce overflow resistand & evenly distributed random in range (gh-5075). > >1. resistand -> resistant > >> >> src/lib/core/random.c | 8 ++++++++ >> src/lib/core/random.h | 12 ++++++++++++ >> src/lib/swim/swim.c | 26 ++++---------------------- >> test/unit/histogram.c | 12 +++--------- >> 4 files changed, 27 insertions(+), 31 deletions(-) >> >> diff --git a/src/lib/core/random.c b/src/lib/core/random.c >> index f4fa75b1c1..2229c5fbbc 100644 >> --- a/src/lib/core/random.c >> +++ b/src/lib/core/random.c >> @@ -97,3 +97,11 @@ rand: >> while (generated < size) >> buf[generated++] = rand(); >> } >> + >> +int64_t >> +random_in_range(int64_t min, int64_t max) >> +{ >> + assert(max >= min); >> + double drand = drand48(); >> + return (int64_t)(drand * max + (1 - drand) * min); >> +} > >2. The problem here is that you won't get an overflow, but you won't get a >completely random number either. This function is not able to return >anything > 48 bits without precision loss. So for example if I will give it >a range [INT64_MIN, INT64_MAX], it will never return lots of integers > 2^48. >Especially the ones > 2^53, since starting from this point double will loose >integer precision. * Ok, right, i see the problem here. Didn’t think it through enough. > >I expect a random function be able to return any value from the given range. >Not only certain ones. > >In other words, this solution is not much better than rand() %. It also is not >able to return the complete range. > >I was thinking about that a lot, and this is why I filed a ticket. It is not >a trivial task. > >The function should be able to return any value from 0 to 2^64-1 to cover the >entire [INT64_MIN, INT64_MAX], or [0, UINT64_MAX]. > >I was thinking we could use random_bytes() to get a value [0, UINT64_MAX], and >then scale it down to the given range. For example, if the range is [A, B], then >its length L = B - A. And the random value is R. > >Then a scaled down value would be > >    S = R / UINT64_MAX * L > >The problem here is the precision loss again. When you will divide R / UINT64_MAX >to get a value from [0, 1] range, the result is likely to be garbage. Double >can't handle that precision. This is where I stopped and filed the ticket. * Well, yes, scaling doesn’t really seem to work as good as we want to. I looked through this page:  https://www.pcg-random.org/posts/bounded-rands.html on generation in range and found out a bounding method which i think is the most suitable for us. It works for int64_t too and can be used both for «complete random» and «pseudo-random»: ``` uint32_t bounded_rand(rng_t& rng, uint32_t range) { uint32_t mask = ~uint32_t(0); --range; mask >>= __builtin_clz(range|1); uint32_t x; do { x = rng() & mask; } while (x > range); return x; } ``` >Another issue - random_bytes() is easy to use, but it is not pseudo-random. It is >completely random. This is actually good usually, but not for the tests. In swim >unit tests knowing random seed is the only way to reproduce a fail. That makes me >think we need two random functions: for pseudo-random values and for real random. > >Pseudo-random could be generated by concatenation of several rand() calls using ><< and |. > >Another option - implement Mersenne twister algorithm of pseudo-random numbers of >uint64_t type. > >Also I would look into what std:: has in C++. Its random functionality is much >richer than rand(). * Right, C++ has mersenne_twister int64_t implementation and some solution to get int64_t in range as following:     ``` std::random_device rd; std::mt19937_64 generator(rd()); std::uniform_int_distribution range(min, max); return range(generator); ``` However, C++ implementation seems to be overcomplicated. I think it is better idea to implement something simple & fast.   For the mersenne_twister the good idea is to adopt this implementation, as far as i see it: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt64.html   But what i think might be even better is to take something less classic while more relevant, according to this paper:  https://arxiv.org/pdf/1910.06437.pdf   I think the best option for us now is xoshiro256++: http://prng.di.unimi.it/ It seems to be much faster and doesn’t fail any known statistical test as far as i see. The implementation to adapt:  http://prng.di.unimi.it/xoshiro256plusplus.c   -- Ilya Kosarev