Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH vshard 1/1] router: bucket_id_strcrc32 and bucket_id_mpcrc32
@ 2020-02-25 23:52 Vladislav Shpilevoy
  2020-02-26  7:15 ` Oleg Babin
  2020-02-26  8:20 ` Konstantin Osipov
  0 siblings, 2 replies; 12+ messages in thread
From: Vladislav Shpilevoy @ 2020-02-25 23:52 UTC (permalink / raw)
  To: tarantool-patches, olegrok

Closes #207

@TarantoolBot document
Title: vshard.router.bucket_id_strcrc32() and .bucket_id_mpcrc32()
vshard.router.bucket_id() is deprecated, each its usage logs a
warning. It still works, but will be deleted in future.

Behaviour of the old bucket_id() function is now available as
vshard.router.bucket_id_strcrc32(). It works exactly like the old
function, but does not log a warning.

The reason why there is a new function bucket_id_mpcrc32() is that
the old bucket_id() and the new bucket_id_strcrc32() are not
consistent for cdata numbers. In particular, they return 3
different values for normal Lua numbers like 123, for unsigned
long long cdata (like 123ULL, or ffi.cast('unsigned long long',
123)), and for signed long long cdata (like 123LL, or
ffi.cast('long long', 123)). Note, this is important!

    vshard.router.bucket_id(123)
    vshard.router.bucket_id(123LL)
    vshard.router.bucket_id(123ULL)

    Return 3 different values!!!

For float and double cdata (ffi.cast('float', number),
ffi.cast('double', number)) these functions return different
values even for the same numbers of the same floating point type.
This is because tostring() on a floating point cdata number
returns not the number, but a pointer at it. Different on each
call.

vshard.router.bucket_id_strcrc32() behaves exactly the same.

vshard.router.bucket_id_mpcrc32() is safer. It takes a CRC32 from
MessagePack encoded value. That is, bucket_id of integers does not
depend on their Lua type. However it still may return different
values for not equal floating point types. That is,
ffi.cast('float', number) may be reflected onto a bucket id not
equal to ffi.cast('double', number). This can't be fixed, because
a float value, even being casted to double, may have a garbage
tail in its fraction.

Floating point keys should not be used to calculate a bucket id,
usually.

A final note - bucket_id_mpcrc32() in case of a string key does
not encode it into MessagePack, but takes hash right from the
string. This does not affect consistency of the function, but
makes it as fast as bucket_id_strcrc32().
---
Branch: http://github.com/tarantool/vshard/tree/gerold103/gh-207-bucket_id-cdata-numbers
Issue: https://github.com/tarantool/vshard/issues/207

 test/router/router.result   |  27 -------
 test/router/router.test.lua |  11 ---
 test/unit/router.result     | 149 ++++++++++++++++++++++++++++++++++++
 test/unit/router.test.lua   |  49 ++++++++++++
 vshard/hash.lua             |  41 +++++++++-
 vshard/router/init.lua      |  23 +++++-
 6 files changed, 257 insertions(+), 43 deletions(-)
 create mode 100644 test/unit/router.result
 create mode 100644 test/unit/router.test.lua

diff --git a/test/router/router.result b/test/router/router.result
index 47d3278..31a351f 100644
--- a/test/router/router.result
+++ b/test/router/router.result
@@ -139,33 +139,6 @@ _ = test_run:cmd("setopt delimiter ''");
 ---
 ...
 --
--- bucket_id and bucket_count
---
-util.check_error(vshard.router.bucket_id) -- invalid arguments
----
-- 'Usage: vshard.router.bucket_id(key)'
-...
-vshard.router.bucket_id(1)
----
-- 477
-...
-vshard.router.bucket_id(2)
----
-- 401
-...
-vshard.router.bucket_id({2})
----
-- 401
-...
-vshard.router.bucket_id('2')
----
-- 401
-...
-vshard.router.bucket_count()
----
-- 3000
-...
---
 -- Initial distribution
 --
 util.check_error(vshard.router.call, 1, 'read', 'echo', {123})
diff --git a/test/router/router.test.lua b/test/router/router.test.lua
index d659515..571759a 100644
--- a/test/router/router.test.lua
+++ b/test/router/router.test.lua
@@ -59,17 +59,6 @@ for _, new_rs in pairs(new_replicasets) do
 end;
 _ = test_run:cmd("setopt delimiter ''");
 
---
--- bucket_id and bucket_count
---
-
-util.check_error(vshard.router.bucket_id) -- invalid arguments
-vshard.router.bucket_id(1)
-vshard.router.bucket_id(2)
-vshard.router.bucket_id({2})
-vshard.router.bucket_id('2')
-vshard.router.bucket_count()
-
 --
 -- Initial distribution
 --
diff --git a/test/unit/router.result b/test/unit/router.result
new file mode 100644
index 0000000..36366ac
--- /dev/null
+++ b/test/unit/router.result
@@ -0,0 +1,149 @@
+-- test-run result file version 2
+ffi = require('ffi')
+ | ---
+ | ...
+util = require('util')
+ | ---
+ | ...
+vshard = require('vshard')
+ | ---
+ | ...
+vshard.router.cfg({sharding = {}})
+ | ---
+ | ...
+
+--
+-- gh-207: vshard.router.bucket_id() was not consistent when
+-- values were cdata numbers.
+--
+
+-- Decimal is not available in 1.10, but vshard and its tests
+-- should work on 1.10. Decimal test is optional.
+has_decimal, decimal = pcall(require, 'decimal')
+ | ---
+ | ...
+
+function check_values(bid, values)                                              \
+    local result = {}                                                           \
+    for _, v in pairs(values) do                                                \
+        local v1 = bid(v)                                                       \
+        local v2 = bid(v)                                                       \
+        local t = type(v)                                                       \
+        if t == 'cdata' then                                                    \
+            t = ffi.typeof(v)                                                   \
+        end                                                                     \
+        if v1 ~= v2 then                                                        \
+            table.insert(result, {'not stable', {t, v, v1, v2}})                \
+        else                                                                    \
+            table.insert(result, {t, v, v1})                                    \
+        end                                                                     \
+    end                                                                         \
+    return result                                                               \
+end
+ | ---
+ | ...
+
+util.check_error(vshard.router.bucket_id)
+ | ---
+ | - 'Usage: vshard.router.bucket_id(key)'
+ | ...
+
+values = {                                                                      \
+    1, 1LL, 1ULL, ffi.cast('uint64_t', 1), ffi.cast('int64_t', 1),  '1',        \
+    has_decimal and decimal.new(1) or 1                                         \
+}
+ | ---
+ | ...
+
+check_values(vshard.router.bucket_id, values)
+ | ---
+ | - - - number
+ |     - 1
+ |     - 477
+ |   - - ctype<int64_t>
+ |     - 1
+ |     - 730
+ |   - - ctype<uint64_t>
+ |     - 1
+ |     - 982
+ |   - - ctype<uint64_t>
+ |     - 1
+ |     - 982
+ |   - - ctype<int64_t>
+ |     - 1
+ |     - 730
+ |   - - string
+ |     - '1'
+ |     - 477
+ |   - - ctype<struct 106>
+ |     - 1
+ |     - 477
+ | ...
+check_values(vshard.router.bucket_id_strcrc32, values)
+ | ---
+ | - - - number
+ |     - 1
+ |     - 477
+ |   - - ctype<int64_t>
+ |     - 1
+ |     - 730
+ |   - - ctype<uint64_t>
+ |     - 1
+ |     - 982
+ |   - - ctype<uint64_t>
+ |     - 1
+ |     - 982
+ |   - - ctype<int64_t>
+ |     - 1
+ |     - 730
+ |   - - string
+ |     - '1'
+ |     - 477
+ |   - - ctype<struct 106>
+ |     - 1
+ |     - 477
+ | ...
+
+-- Floating point cdata is not tested in strcrc32, because
+-- tostring() is not stable on them, and returns a pointer string.
+table.insert(values, ffi.cast('double', 1))
+ | ---
+ | ...
+table.insert(values, ffi.cast('float', 1))
+ | ---
+ | ...
+check_values(vshard.router.bucket_id_mpcrc32, values)
+ | ---
+ | - - - number
+ |     - 1
+ |     - 1614
+ |   - - ctype<int64_t>
+ |     - 1
+ |     - 1614
+ |   - - ctype<uint64_t>
+ |     - 1
+ |     - 1614
+ |   - - ctype<uint64_t>
+ |     - 1
+ |     - 1614
+ |   - - ctype<int64_t>
+ |     - 1
+ |     - 1614
+ |   - - string
+ |     - '1'
+ |     - 477
+ |   - - ctype<struct 106>
+ |     - 1
+ |     - 1696
+ |   - - ctype<float>
+ |     - 1
+ |     - 193
+ |   - - ctype<double>
+ |     - 1
+ |     - 1411
+ | ...
+
+vshard.router.bucket_count()
+ | ---
+ | - 3000
+ | ...
diff --git a/test/unit/router.test.lua b/test/unit/router.test.lua
new file mode 100644
index 0000000..06c92f1
--- /dev/null
+++ b/test/unit/router.test.lua
@@ -0,0 +1,49 @@
+ffi = require('ffi')
+util = require('util')
+vshard = require('vshard')
+vshard.router.cfg({sharding = {}})
+
+--
+-- gh-207: vshard.router.bucket_id() was not consistent when
+-- values were cdata numbers.
+--
+
+-- Decimal is not available in 1.10, but vshard and its tests
+-- should work on 1.10. Decimal test is optional.
+has_decimal, decimal = pcall(require, 'decimal')
+
+function check_values(bid, values)                                              \
+    local result = {}                                                           \
+    for _, v in pairs(values) do                                                \
+        local v1 = bid(v)                                                       \
+        local v2 = bid(v)                                                       \
+        local t = type(v)                                                       \
+        if t == 'cdata' then                                                    \
+            t = ffi.typeof(v)                                                   \
+        end                                                                     \
+        if v1 ~= v2 then                                                        \
+            table.insert(result, {'not stable', {t, v, v1, v2}})                \
+        else                                                                    \
+            table.insert(result, {t, v, v1})                                    \
+        end                                                                     \
+    end                                                                         \
+    return result                                                               \
+end
+
+util.check_error(vshard.router.bucket_id)
+
+values = {                                                                      \
+    1, 1LL, 1ULL, ffi.cast('uint64_t', 1), ffi.cast('int64_t', 1),  '1',        \
+    has_decimal and decimal.new(1) or 1                                         \
+}
+
+check_values(vshard.router.bucket_id, values)
+check_values(vshard.router.bucket_id_strcrc32, values)
+
+-- Floating point cdata is not tested in strcrc32, because
+-- tostring() is not stable on them, and returns a pointer string.
+table.insert(values, ffi.cast('double', 1))
+table.insert(values, ffi.cast('float', 1))
+check_values(vshard.router.bucket_id_mpcrc32, values)
+
+vshard.router.bucket_count()
diff --git a/vshard/hash.lua b/vshard/hash.lua
index b972eae..477b312 100644
--- a/vshard/hash.lua
+++ b/vshard/hash.lua
@@ -1,10 +1,13 @@
 -- hash.lua
 local ldigest = require('digest')
+local mpencode = require('msgpackffi').encode
 
 --
--- Simple hash function based on crc32
+-- Fast and simple hash. However it works incorrectly with
+-- floating point cdata values. Also hash of an integer value
+-- depends on its type: Lua number, cdata int64, cdata uint64.
 --
-local function key_hash(shard_key)
+local function strcrc32(shard_key)
     if type(shard_key) ~= 'table' then
         return ldigest.crc32(tostring(shard_key))
     else
@@ -16,6 +19,38 @@ local function key_hash(shard_key)
     end
 end
 
+local function mpcrc32_one(value)
+    if type(value) ~= 'string' then
+        return mpencode(value)
+    else
+        -- Despite the function called 'mp', strings are not
+        -- encoded. This is because it does not make much sense to
+        -- copy the whole string onto a temporary buffer just to
+        -- add a small MessagePack header. Such 'hack' makes
+        -- hashing of strings several magnitudes of order faster.
+        return value
+    end
+end
+
+--
+-- Stable hash providing the correct values for integers not
+-- depending on their size. However may return different hashes
+-- for the same floating point value if it is cdata float or cdata
+-- double.
+--
+local function mpcrc32(shard_key)
+    if type(shard_key) ~= 'table' then
+        return ldigest.crc32(mpcrc32_one(shard_key))
+    else
+        local crc32 = ldigest.crc32.new()
+        for _, v in ipairs(shard_key) do
+            crc32:update(mpcrc32_one(v))
+        end
+        return crc32:result()
+    end
+end
+
 return {
-    key_hash = key_hash;
+    strcrc32 = strcrc32,
+    mpcrc32 = mpcrc32,
 }
diff --git a/vshard/router/init.lua b/vshard/router/init.lua
index 91ad7e6..370398b 100644
--- a/vshard/router/init.lua
+++ b/vshard/router/init.lua
@@ -1067,7 +1067,24 @@ local function router_bucket_id(router, key)
     if key == nil then
         error("Usage: vshard.router.bucket_id(key)")
     end
-    return lhash.key_hash(key) % router.total_bucket_count + 1
+    log.warn('vshard.router.bucket_id() is deprecated, use '..
+             'vshard.router.bucket_id_strcrc32() or '..
+             'vshard.router.bucket_id_mpcrc32()')
+    return lhash.strcrc32(key) % router.total_bucket_count + 1
+end
+
+local function router_bucket_id_strcrc32(router, key)
+    if key == nil then
+        error("Usage: vshard.router.bucket_id_strcrc32(key)")
+    end
+    return lhash.strcrc32(key) % router.total_bucket_count + 1
+end
+
+local function router_bucket_id_mpcrc32(router, key)
+    if key == nil then
+        error("Usage: vshard.router.bucket_id_mpcrc32(key)")
+    end
+    return lhash.mpcrc32(key) % router.total_bucket_count + 1
 end
 
 local function router_bucket_count(router)
@@ -1124,7 +1141,9 @@ local router_mt = {
         callbre = router_callbre;
         route = router_route;
         routeall = router_routeall;
-        bucket_id = router_bucket_id;
+        bucket_id = router_bucket_id,
+        bucket_id_strcrc32 = router_bucket_id_strcrc32,
+        bucket_id_mpcrc32 = router_bucket_id_mpcrc32,
         bucket_count = router_bucket_count;
         sync = router_sync;
         bootstrap = cluster_bootstrap;
-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 1/1] router: bucket_id_strcrc32 and bucket_id_mpcrc32
  2020-02-25 23:52 [Tarantool-patches] [PATCH vshard 1/1] router: bucket_id_strcrc32 and bucket_id_mpcrc32 Vladislav Shpilevoy
@ 2020-02-26  7:15 ` Oleg Babin
  2020-02-27  0:22   ` Vladislav Shpilevoy
  2020-02-29 17:09   ` Vladislav Shpilevoy
  2020-02-26  8:20 ` Konstantin Osipov
  1 sibling, 2 replies; 12+ messages in thread
From: Oleg Babin @ 2020-02-26  7:15 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches

Hi! Thanks for your patch!

On 26/02/2020 02:52, Vladislav Shpilevoy wrote:
> Closes #207
> 
> @TarantoolBot document
> Title: vshard.router.bucket_id_strcrc32() and .bucket_id_mpcrc32()
> vshard.router.bucket_id() is deprecated, each its usage logs a
> warning. It still works, but will be deleted in future.

I think it's quite agressive to print message in logs for each time. 
Moreover it could lead to unexpected behaviour for some people.
IMO the most proper way print such message only once. Consider using 
something like that - 
https://github.com/tarantool/errors/blob/master/errors/deprecate.lua#L21


> Behaviour of the old bucket_id() function is now available as
> vshard.router.bucket_id_strcrc32(). It works exactly like the old
> function, but does not log a warning.
> 
> The reason why there is a new function bucket_id_mpcrc32() is that
> the old bucket_id() and the new bucket_id_strcrc32() are not
> consistent for cdata numbers. In particular, they return 3
> different values for normal Lua numbers like 123, for unsigned
> long long cdata (like 123ULL, or ffi.cast('unsigned long long',
> 123)), and for signed long long cdata (like 123LL, or
> ffi.cast('long long', 123)). Note, this is important!
> 
>      vshard.router.bucket_id(123)
>      vshard.router.bucket_id(123LL)
>      vshard.router.bucket_id(123ULL)
> 
>      Return 3 different values!!!
> 
> For float and double cdata (ffi.cast('float', number),
> ffi.cast('double', number)) these functions return different
> values even for the same numbers of the same floating point type.
> This is because tostring() on a floating point cdata number
> returns not the number, but a pointer at it. Different on each
> call.
> 
> vshard.router.bucket_id_strcrc32() behaves exactly the same.
> 
> vshard.router.bucket_id_mpcrc32() is safer. 

Looks like it's not enough safe for new "double" type:

| tarantool> box.space.test:create_index('double', {type='tree', parts = 
{1, 'double'}})
| tarantool> box.space.test:insert({ffi.cast('double', 1)}) -- cdata
| ---
| - [1]
| ...
| tarantool> type(box.space.test:select()[1][1])
| ---
| - number
| ...

And results of msgpack.encode are different.

I think you can return to your idea 
(https://github.com/tarantool/vshard/issues/207#issuecomment-589976204) 
with "tonumber":
|tarantool> tonumber(ffi.cast('double', 100))
| ---
| - 100
| ...

Such approach will lead to precision lost for huge unsigned/signed 
integers. But it will be predictable and determined for all cases.

---
Oleg Babin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 1/1] router: bucket_id_strcrc32 and bucket_id_mpcrc32
  2020-02-25 23:52 [Tarantool-patches] [PATCH vshard 1/1] router: bucket_id_strcrc32 and bucket_id_mpcrc32 Vladislav Shpilevoy
  2020-02-26  7:15 ` Oleg Babin
@ 2020-02-26  8:20 ` Konstantin Osipov
  2020-02-27  0:25   ` Vladislav Shpilevoy
  1 sibling, 1 reply; 12+ messages in thread
From: Konstantin Osipov @ 2020-02-26  8:20 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/02/26 09:31]:
> Closes #207

While you're at it, is this hardware crc32?

Did you consider implementing it in C as a server built-in to
avoid using messagepack as an interim representation?


-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 1/1] router: bucket_id_strcrc32 and bucket_id_mpcrc32
  2020-02-26  7:15 ` Oleg Babin
@ 2020-02-27  0:22   ` Vladislav Shpilevoy
  2020-02-27 10:12     ` Oleg Babin
  2020-02-29 17:09   ` Vladislav Shpilevoy
  1 sibling, 1 reply; 12+ messages in thread
From: Vladislav Shpilevoy @ 2020-02-27  0:22 UTC (permalink / raw)
  To: Oleg Babin, tarantool-patches

Thanks for the review!

> On 26/02/2020 02:52, Vladislav Shpilevoy wrote:
>> Closes #207
>>
>> @TarantoolBot document
>> Title: vshard.router.bucket_id_strcrc32() and .bucket_id_mpcrc32()
>> vshard.router.bucket_id() is deprecated, each its usage logs a
>> warning. It still works, but will be deleted in future.
> 
> I think it's quite agressive to print message in logs for each time. Moreover it could lead to unexpected behaviour for some people.
> IMO the most proper way print such message only once. Consider using something like that - https://github.com/tarantool/errors/blob/master/errors/deprecate.lua#L21

Yeah, ok, will print once.

>> Behaviour of the old bucket_id() function is now available as
>> vshard.router.bucket_id_strcrc32(). It works exactly like the old
>> function, but does not log a warning.
>>
>> The reason why there is a new function bucket_id_mpcrc32() is that
>> the old bucket_id() and the new bucket_id_strcrc32() are not
>> consistent for cdata numbers. In particular, they return 3
>> different values for normal Lua numbers like 123, for unsigned
>> long long cdata (like 123ULL, or ffi.cast('unsigned long long',
>> 123)), and for signed long long cdata (like 123LL, or
>> ffi.cast('long long', 123)). Note, this is important!
>>
>>      vshard.router.bucket_id(123)
>>      vshard.router.bucket_id(123LL)
>>      vshard.router.bucket_id(123ULL)
>>
>>      Return 3 different values!!!
>>
>> For float and double cdata (ffi.cast('float', number),
>> ffi.cast('double', number)) these functions return different
>> values even for the same numbers of the same floating point type.
>> This is because tostring() on a floating point cdata number
>> returns not the number, but a pointer at it. Different on each
>> call.
>>
>> vshard.router.bucket_id_strcrc32() behaves exactly the same.
>>
>> vshard.router.bucket_id_mpcrc32() is safer. 
> 
> Looks like it's not enough safe for new "double" type:
> 
> | tarantool> box.space.test:create_index('double', {type='tree', parts = {1, 'double'}})
> | tarantool> box.space.test:insert({ffi.cast('double', 1)}) -- cdata
> | ---
> | - [1]
> | ...
> | tarantool> type(box.space.test:select()[1][1])
> | ---
> | - number
> | ...
> 
> And results of msgpack.encode are different.

Yep, I knew that. Precision loss is why I discarded my first draft
patch, using tonumber() for all cdata numbers. The point is that
if you insert ffi.cast(double), then you should pass ffi.cast(double)
to bucket_id function as well. You calculate hash from what is
inserted. Not from what is returned. From how I imagine this.

> 
> I think you can return to your idea (https://github.com/tarantool/vshard/issues/207#issuecomment-589976204) with "tonumber":
> |tarantool> tonumber(ffi.cast('double', 100))
> | ---
> | - 100
> | ...
> 
> Such approach will lead to precision lost for huge unsigned/signed integers. But it will be predictable and determined for all cases.

This also makes MessagePack encoding useless, because if we
have only Lua numbers and strings, then crc32(tostring()) will
work fine too.

I can add a third function bucket_id_nstrcrc32. Stands for
Number String Crc32 - n str crc32. This will call tonumber()
for all cdata numbers before calling crc32(tostring(...)) on
them.

Sounds good?

> 
> ---
> Oleg Babin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 1/1] router: bucket_id_strcrc32 and bucket_id_mpcrc32
  2020-02-26  8:20 ` Konstantin Osipov
@ 2020-02-27  0:25   ` Vladislav Shpilevoy
  2020-02-27  6:53     ` Konstantin Osipov
  0 siblings, 1 reply; 12+ messages in thread
From: Vladislav Shpilevoy @ 2020-02-27  0:25 UTC (permalink / raw)
  To: Konstantin Osipov, tarantool-patches, olegrok



On 26/02/2020 09:20, Konstantin Osipov wrote:
> * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/02/26 09:31]:
>> Closes #207
> 
> While you're at it, is this hardware crc32?

Yes, this is hardware, when possible. And it is in C. See crc32.c
files (there are two of them).

> 
> Did you consider implementing it in C as a server built-in to
> avoid using messagepack as an interim representation?

MessagePack is chosen not because of speed (although it is the
same as with tostring(), according to my benches, and even a little
faster on numbers), but because of bad representation of Lua cdata
numbers in strings.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 1/1] router: bucket_id_strcrc32 and bucket_id_mpcrc32
  2020-02-27  0:25   ` Vladislav Shpilevoy
@ 2020-02-27  6:53     ` Konstantin Osipov
  2020-02-29 17:10       ` Vladislav Shpilevoy
  0 siblings, 1 reply; 12+ messages in thread
From: Konstantin Osipov @ 2020-02-27  6:53 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/02/27 09:42]:
> > * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/02/26 09:31]:
> >> Closes #207
> > 
> > While you're at it, is this hardware crc32?
> 
> Yes, this is hardware, when possible. And it is in C. See crc32.c
> files (there are two of them).

 I mean the entire method could be in C, including building a
 msgpack string from cdata.


-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 1/1] router: bucket_id_strcrc32 and bucket_id_mpcrc32
  2020-02-27  0:22   ` Vladislav Shpilevoy
@ 2020-02-27 10:12     ` Oleg Babin
  0 siblings, 0 replies; 12+ messages in thread
From: Oleg Babin @ 2020-02-27 10:12 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches

On 27/02/2020 03:22, Vladislav Shpilevoy wrote:
> Yep, I knew that. Precision loss is why I discarded my first draft
> patch, using tonumber() for all cdata numbers. The point is that
> if you insert ffi.cast(double), then you should pass ffi.cast(double)
> to bucket_id function as well. You calculate hash from what is
> inserted. Not from what is returned. From how I imagine this.

My main consern that user can modify returned value and then put it 
again. But ok, I see that we can't cover all possible cases - I hope 
this issue or some note in documentation makes developers to think about 
such problem.

> 
> This also makes MessagePack encoding useless, because if we
> have only Lua numbers and strings, then crc32(tostring()) will
> work fine too.
> 
> I can add a third function bucket_id_nstrcrc32. Stands for
> Number String Crc32 - n str crc32. This will call tonumber()
> for all cdata numbers before calling crc32(tostring(...)) on
> them.
> 
> Sounds good?
> 

Let's stay as is and should not overcomplcate API.

--
Oleg Babin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 1/1] router: bucket_id_strcrc32 and bucket_id_mpcrc32
  2020-02-26  7:15 ` Oleg Babin
  2020-02-27  0:22   ` Vladislav Shpilevoy
@ 2020-02-29 17:09   ` Vladislav Shpilevoy
  2020-03-02  8:10     ` Oleg Babin
  1 sibling, 1 reply; 12+ messages in thread
From: Vladislav Shpilevoy @ 2020-02-29 17:09 UTC (permalink / raw)
  To: Oleg Babin, tarantool-patches

Thanks for the comments!

> On 26/02/2020 02:52, Vladislav Shpilevoy wrote:
>> Closes #207
>>
>> @TarantoolBot document
>> Title: vshard.router.bucket_id_strcrc32() and .bucket_id_mpcrc32()
>> vshard.router.bucket_id() is deprecated, each its usage logs a
>> warning. It still works, but will be deleted in future.
> 
> I think it's quite agressive to print message in logs for each time. Moreover it could lead to unexpected behaviour for some people.
> IMO the most proper way print such message only once. Consider using something like that - https://github.com/tarantool/errors/blob/master/errors/deprecate.lua#L21

The aforementioned module won't print a warning only once. It will
print all the errors again once a code reload happens, because
'local errors_issued = {}' will be called again on reload, and
will clear 'errors_issued' table. But perhaps it was done
intentionally.

I made it the same way. Intentionally. In that case it will help
user to replace all bucket_id() calls during reloads, and after each
reload he will be able to validate whether the message is printed
again anywhere.

====================
+local router_bucket_id_deprecated_warn = true
 local function router_bucket_id(router, key)
     if key == nil then
         error("Usage: vshard.router.bucket_id(key)")
     end
-    log.warn('vshard.router.bucket_id() is deprecated, use '..
-             'vshard.router.bucket_id_strcrc32() or '..
-             'vshard.router.bucket_id_mpcrc32()')
+    if router_bucket_id_deprecated_warn then
+        router_bucket_id_deprecated_warn = false
+        log.warn('vshard.router.bucket_id() is deprecated, use '..
+                 'vshard.router.bucket_id_strcrc32() or '..
+                 'vshard.router.bucket_id_mpcrc32()')
+    end
     return lhash.strcrc32(key) % router.total_bucket_count + 1
 end
====================

>> Behaviour of the old bucket_id() function is now available as
>> vshard.router.bucket_id_strcrc32(). It works exactly like the old
>> function, but does not log a warning.
>>
>> The reason why there is a new function bucket_id_mpcrc32() is that
>> the old bucket_id() and the new bucket_id_strcrc32() are not
>> consistent for cdata numbers. In particular, they return 3
>> different values for normal Lua numbers like 123, for unsigned
>> long long cdata (like 123ULL, or ffi.cast('unsigned long long',
>> 123)), and for signed long long cdata (like 123LL, or
>> ffi.cast('long long', 123)). Note, this is important!
>>
>>      vshard.router.bucket_id(123)
>>      vshard.router.bucket_id(123LL)
>>      vshard.router.bucket_id(123ULL)
>>
>>      Return 3 different values!!!
>>
>> For float and double cdata (ffi.cast('float', number),
>> ffi.cast('double', number)) these functions return different
>> values even for the same numbers of the same floating point type.
>> This is because tostring() on a floating point cdata number
>> returns not the number, but a pointer at it. Different on each
>> call.
>>
>> vshard.router.bucket_id_strcrc32() behaves exactly the same.
>>
>> vshard.router.bucket_id_mpcrc32() is safer. 
> 
> Looks like it's not enough safe for new "double" type:
> 
> | tarantool> box.space.test:create_index('double', {type='tree', parts = {1, 'double'}})
> | tarantool> box.space.test:insert({ffi.cast('double', 1)}) -- cdata
> | ---
> | - [1]
> | ...
> | tarantool> type(box.space.test:select()[1][1])
> | ---
> | - number
> | ...
> 
> And results of msgpack.encode are different.
> 
> I think you can return to your idea (https://github.com/tarantool/vshard/issues/207#issuecomment-589976204) with "tonumber":
> |tarantool> tonumber(ffi.cast('double', 100))
> | ---
> | - 100
> | ...
> 
> Such approach will lead to precision lost for huge unsigned/signed integers. But it will be predictable and determined for all cases.

I added that example to docbot request, slightly modified.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 1/1] router: bucket_id_strcrc32 and bucket_id_mpcrc32
  2020-02-27  6:53     ` Konstantin Osipov
@ 2020-02-29 17:10       ` Vladislav Shpilevoy
  2020-03-01 16:52         ` Konstantin Osipov
  0 siblings, 1 reply; 12+ messages in thread
From: Vladislav Shpilevoy @ 2020-02-29 17:10 UTC (permalink / raw)
  To: Konstantin Osipov, tarantool-patches, olegrok

On 27/02/2020 07:53, Konstantin Osipov wrote:
> * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/02/27 09:42]:
>>> * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/02/26 09:31]:
>>>> Closes #207
>>>
>>> While you're at it, is this hardware crc32?
>>
>> Yes, this is hardware, when possible. And it is in C. See crc32.c
>> files (there are two of them).
> 
>  I mean the entire method could be in C, including building a
>  msgpack string from cdata.

But how building MessagePack is related to crc32? We will need to
do that in C anyway. Now I am doing that via msgpackffi, which is
fast enough.

So in both cases we have MessagePack building, 1 Lua C call, and
optimal crc32 calculation in C.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 1/1] router: bucket_id_strcrc32 and bucket_id_mpcrc32
  2020-02-29 17:10       ` Vladislav Shpilevoy
@ 2020-03-01 16:52         ` Konstantin Osipov
  0 siblings, 0 replies; 12+ messages in thread
From: Konstantin Osipov @ 2020-03-01 16:52 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/02/29 20:14]:
> On 27/02/2020 07:53, Konstantin Osipov wrote:
> > * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/02/27 09:42]:
> >>> * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/02/26 09:31]:
> >>>> Closes #207
> >>>
> >>> While you're at it, is this hardware crc32?
> >>
> >> Yes, this is hardware, when possible. And it is in C. See crc32.c
> >> files (there are two of them).
> > 
> >  I mean the entire method could be in C, including building a
> >  msgpack string from cdata.
> 
> But how building MessagePack is related to crc32? We will need to
> do that in C anyway. Now I am doing that via msgpackffi, which is
> fast enough.

Calling msgpackffi from Lua is slow unless the method is JITed. 

Most of the overhead is coming from invoking functions, not from
what they do.

If you had a single Lua-C function which would take all arguments
from Lua stack and return crc32 it may be faster.


-- 
Konstantin Osipov, Moscow, Russia
https://scylladb.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 1/1] router: bucket_id_strcrc32 and bucket_id_mpcrc32
  2020-02-29 17:09   ` Vladislav Shpilevoy
@ 2020-03-02  8:10     ` Oleg Babin
  2020-03-02 21:06       ` Vladislav Shpilevoy
  0 siblings, 1 reply; 12+ messages in thread
From: Oleg Babin @ 2020-03-02  8:10 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches

Hi! Your tests fail on Tarantool 1.10:

[001] Test failed! Result content mismatch:
[001] --- unit/router.result	Mon Mar  2 10:54:48 2020
[001] +++ unit/router.reject	Mon Mar  2 11:02:21 2020
[001] @@ -75,7 +75,7 @@
[001]   |   - - string
[001]   |     - '1'
[001]   |     - 477
[001] - |   - - ctype<struct 106>
[001] + |   - - number
[001]   |     - 1
[001]   |     - 477
[001]   | ...


Without taking this into account - patch LGTM.


On 29/02/2020 20:09, Vladislav Shpilevoy wrote:
> The aforementioned module won't print a warning only once. It will
> print all the errors again once a code reload happens, because
> 'local errors_issued = {}' will be called again on reload, and
> will clear 'errors_issued' table. But perhaps it was done
> intentionally.
> 
> I made it the same way. Intentionally. In that case it will help
> user to replace all bucket_id() calls during reloads, and after each
> reload he will be able to validate whether the message is printed
> again anywhere.

I see nothing bad to print deprecation message after code reload again. 
It doesn't lead to flood into logs as it's a manual action.

---
Oleg Babin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 1/1] router: bucket_id_strcrc32 and bucket_id_mpcrc32
  2020-03-02  8:10     ` Oleg Babin
@ 2020-03-02 21:06       ` Vladislav Shpilevoy
  0 siblings, 0 replies; 12+ messages in thread
From: Vladislav Shpilevoy @ 2020-03-02 21:06 UTC (permalink / raw)
  To: Oleg Babin, tarantool-patches

Thanks for the review!

On 02/03/2020 09:10, Oleg Babin wrote:
> Hi! Your tests fail on Tarantool 1.10:
> 
> [001] Test failed! Result content mismatch:
> [001] --- unit/router.result    Mon Mar  2 10:54:48 2020
> [001] +++ unit/router.reject    Mon Mar  2 11:02:21 2020
> [001] @@ -75,7 +75,7 @@
> [001]   |   - - string
> [001]   |     - '1'
> [001]   |     - 477
> [001] - |   - - ctype<struct 106>
> [001] + |   - - number
> [001]   |     - 1
> [001]   |     - 477
> [001]   | ...
> 
> 
> Without taking this into account - patch LGTM.

I fixed that and pushed to master.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-03-02 21:06 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-25 23:52 [Tarantool-patches] [PATCH vshard 1/1] router: bucket_id_strcrc32 and bucket_id_mpcrc32 Vladislav Shpilevoy
2020-02-26  7:15 ` Oleg Babin
2020-02-27  0:22   ` Vladislav Shpilevoy
2020-02-27 10:12     ` Oleg Babin
2020-02-29 17:09   ` Vladislav Shpilevoy
2020-03-02  8:10     ` Oleg Babin
2020-03-02 21:06       ` Vladislav Shpilevoy
2020-02-26  8:20 ` Konstantin Osipov
2020-02-27  0:25   ` Vladislav Shpilevoy
2020-02-27  6:53     ` Konstantin Osipov
2020-02-29 17:10       ` Vladislav Shpilevoy
2020-03-01 16:52         ` Konstantin Osipov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox