Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH vshard 1/1] rebalancer: give more info at bucket_recv() fail
@ 2021-05-25 20:42 Vladislav Shpilevoy via Tarantool-patches
  2021-05-26  9:03 ` Oleg Babin via Tarantool-patches
  2021-05-27 19:20 ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 2 replies; 5+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-05-25 20:42 UTC (permalink / raw)
  To: tarantool-patches, yaroslav.dynnikov, olegrok

vshard.storage.bucket_recv() used to raise the natural
space:insert(...) error without any additional info when it
failed. For instance, due to incorrect format, or a duplicate key.

When such an error happens, it is very useful to know what was the
problematic space and what was the failed tuple. This patch
enriches the space insertion error with that information.

The new detailed error object and its message should help to fix
the rebalancing issues, which quite often are about schema
mismatch in different replicasets. Especially hard to debug when
number of spaces is tens of even hundreds.

The old way was fine for errors like "duplicate key" because on
the newest version of Tarantool it contains the space name, the
old and the new tuples. But errors like tuple format mismatch
still are not very informative. VShard now tries to enrich all the
possible errors.

Closes #275
---
Branch: http://github.com/tarantool/vshard/tree/gerold103/gh-275-bucket_recv-detailed-error
Issue: https://github.com/tarantool/vshard/issues/275

 test/storage/storage.result   | 47 +++++++++++++++++++++++++++++++++++
 test/storage/storage.test.lua | 17 +++++++++++++
 vshard/error.lua              |  5 ++++
 vshard/storage/init.lua       |  8 +++++-
 4 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/test/storage/storage.result b/test/storage/storage.result
index 570d9c6..5372059 100644
--- a/test/storage/storage.result
+++ b/test/storage/storage.result
@@ -521,6 +521,53 @@ while box.space._bucket:get{4} do vshard.storage.recovery_wakeup() fiber.sleep(0
 ---
 ...
 --
+-- gh-275: detailed info when couldn't insert into a space.
+--
+res, err = vshard.storage.bucket_recv(                                          \
+    4, util.replicasets[2], {{box.space.test.id, {{9, 4}, {10, 4}, {1, 4}}}})
+---
+...
+assert(not res)
+---
+- true
+...
+assert(err.space == 'test')
+---
+- true
+...
+assert(err.bucket_id == 4)
+---
+- true
+...
+assert(tostring(err.tuple) == '[1, 4]')
+---
+- true
+...
+assert(err.reason:match('Duplicate key exists') ~= nil)
+---
+- true
+...
+err = err.message
+---
+...
+assert(err:match('bucket 4 data in space "test" at tuple %[1, 4%]') ~= nil)
+---
+- true
+...
+assert(err:match('Duplicate key exists') ~= nil)
+---
+- true
+...
+while box.space._bucket:get{4} do                                               \
+    vshard.storage.recovery_wakeup() fiber.sleep(0.01)                          \
+end
+---
+...
+assert(box.space.test:get{9} == nil and box.space.test:get{10} == nil)
+---
+- true
+...
+--
 -- Bucket transfer
 --
 -- Transfer to unknown replicaset.
diff --git a/test/storage/storage.test.lua b/test/storage/storage.test.lua
index 494e2e8..d1f3f50 100644
--- a/test/storage/storage.test.lua
+++ b/test/storage/storage.test.lua
@@ -125,6 +125,23 @@ vshard.storage.bucket_recv(100, 'from_uuid', {{1000, {{1}}}})
 res, err = vshard.storage.bucket_recv(4, util.replicasets[2], {{1000, {{1}}}})
 util.portable_error(err)
 while box.space._bucket:get{4} do vshard.storage.recovery_wakeup() fiber.sleep(0.01) end
+--
+-- gh-275: detailed info when couldn't insert into a space.
+--
+res, err = vshard.storage.bucket_recv(                                          \
+    4, util.replicasets[2], {{box.space.test.id, {{9, 4}, {10, 4}, {1, 4}}}})
+assert(not res)
+assert(err.space == 'test')
+assert(err.bucket_id == 4)
+assert(tostring(err.tuple) == '[1, 4]')
+assert(err.reason:match('Duplicate key exists') ~= nil)
+err = err.message
+assert(err:match('bucket 4 data in space "test" at tuple %[1, 4%]') ~= nil)
+assert(err:match('Duplicate key exists') ~= nil)
+while box.space._bucket:get{4} do                                               \
+    vshard.storage.recovery_wakeup() fiber.sleep(0.01)                          \
+end
+assert(box.space.test:get{9} == nil and box.space.test:get{10} == nil)
 
 --
 -- Bucket transfer
diff --git a/vshard/error.lua b/vshard/error.lua
index b02bfe9..bcbcd71 100644
--- a/vshard/error.lua
+++ b/vshard/error.lua
@@ -149,6 +149,11 @@ local error_message_template = {
         msg = 'Can not delete a storage ref: %s',
         args = {'reason'},
     },
+    [30] = {
+        name = 'BUCKET_RECV_DATA_ERROR',
+        msg = 'Can not receive the bucket %s data in space "%s" at tuple %s: %s',
+        args = {'bucket_id', 'space', 'tuple', 'reason'},
+    }
 }
 
 --
diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
index 63e0398..7045d91 100644
--- a/vshard/storage/init.lua
+++ b/vshard/storage/init.lua
@@ -1254,7 +1254,13 @@ local function bucket_recv_xc(bucket_id, from, data, opts)
         end
         box.begin()
         for _, tuple in ipairs(space_data) do
-            space:insert(tuple)
+            local ok, err = pcall(space.insert, space, tuple)
+            if not ok then
+                box.rollback()
+                return nil, lerror.vshard(lerror.code.BUCKET_RECV_DATA_ERROR,
+                                          bucket_id, space.name,
+                                          box.tuple.new(tuple), err)
+            end
             limit = limit - 1
             if limit == 0 then
                 box.commit()
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 1/1] rebalancer: give more info at bucket_recv() fail
  2021-05-25 20:42 [Tarantool-patches] [PATCH vshard 1/1] rebalancer: give more info at bucket_recv() fail Vladislav Shpilevoy via Tarantool-patches
@ 2021-05-26  9:03 ` Oleg Babin via Tarantool-patches
  2021-05-26 18:44   ` Vladislav Shpilevoy via Tarantool-patches
  2021-05-27 19:20 ` Vladislav Shpilevoy via Tarantool-patches
  1 sibling, 1 reply; 5+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-05-26  9:03 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Hi! Thanks for your patch. Two minor comments below.

On 25.05.2021 23:42, Vladislav Shpilevoy wrote:
> +--
>   -- Bucket transfer
>   --
>   -- Transfer to unknown replicaset.
> diff --git a/test/storage/storage.test.lua b/test/storage/storage.test.lua
> index 494e2e8..d1f3f50 100644
> --- a/test/storage/storage.test.lua
> +++ b/test/storage/storage.test.lua
> @@ -125,6 +125,23 @@ vshard.storage.bucket_recv(100, 'from_uuid', {{1000, {{1}}}})
>   res, err = vshard.storage.bucket_recv(4, util.replicasets[2], {{1000, {{1}}}})
>   util.portable_error(err)
>   while box.space._bucket:get{4} do vshard.storage.recovery_wakeup() fiber.sleep(0.01) end
> +--
> +-- gh-275: detailed info when couldn't insert into a space.
> +--
> +res, err = vshard.storage.bucket_recv(                                          \
> +    4, util.replicasets[2], {{box.space.test.id, {{9, 4}, {10, 4}, {1, 4}}}})
> +assert(not res)
> +assert(err.space == 'test')
> +assert(err.bucket_id == 4)
> +assert(tostring(err.tuple) == '[1, 4]')
> +assert(err.reason:match('Duplicate key exists') ~= nil)
> +err = err.message
> +assert(err:match('bucket 4 data in space "test" at tuple %[1, 4%]') ~= nil)
> +assert(err:match('Duplicate key exists') ~= nil)
> +while box.space._bucket:get{4} do                                               \
> +    vshard.storage.recovery_wakeup() fiber.sleep(0.01)                          \
> +end
> +assert(box.space.test:get{9} == nil and box.space.test:get{10} == nil)
>   
>   --
>   -- Bucket transfer
> diff --git a/vshard/error.lua b/vshard/error.lua
> index b02bfe9..bcbcd71 100644
> --- a/vshard/error.lua
> +++ b/vshard/error.lua
> @@ -149,6 +149,11 @@ local error_message_template = {
>           msg = 'Can not delete a storage ref: %s',
>           args = {'reason'},
>       },
> +    [30] = {
> +        name = 'BUCKET_RECV_DATA_ERROR',
> +        msg = 'Can not receive the bucket %s data in space "%s" at tuple %s: %s',
> +        args = {'bucket_id', 'space', 'tuple', 'reason'},
> +    }
>   }
>   
>   --
> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
> index 63e0398..7045d91 100644
> --- a/vshard/storage/init.lua
> +++ b/vshard/storage/init.lua
> @@ -1254,7 +1254,13 @@ local function bucket_recv_xc(bucket_id, from, data, opts)
>           end
>           box.begin()
>           for _, tuple in ipairs(space_data) do
> -            space:insert(tuple)
> +            local ok, err = pcall(space.insert, space, tuple)
> +            if not ok then
> +                box.rollback()

Am I right that before a patch nobody rolled back transaction is case of 
error?

How did it work?

> +                return nil, lerror.vshard(lerror.code.BUCKET_RECV_DATA_ERROR,
> +                                          bucket_id, space.name,
> +                                          box.tuple.new(tuple), err)
> +            end

Do you really need `box.tuple.new` here. Why just `tuple` is not enough?

AFAIU box.tuple.new doesn't just increment tuple ref-counter and 
construct new tuple.

Rebalancing is quite CPU-intensive operation so I'm not sure that such 
behaviour doesn't

make error case worse.


>               limit = limit - 1
>               if limit == 0 then
>                   box.commit()

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 1/1] rebalancer: give more info at bucket_recv() fail
  2021-05-26  9:03 ` Oleg Babin via Tarantool-patches
@ 2021-05-26 18:44   ` Vladislav Shpilevoy via Tarantool-patches
  2021-05-27  8:24     ` Oleg Babin via Tarantool-patches
  0 siblings, 1 reply; 5+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-05-26 18:44 UTC (permalink / raw)
  To: Oleg Babin, tarantool-patches, yaroslav.dynnikov

Hi! Thanks for the review!

>> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
>> index 63e0398..7045d91 100644
>> --- a/vshard/storage/init.lua
>> +++ b/vshard/storage/init.lua
>> @@ -1254,7 +1254,13 @@ local function bucket_recv_xc(bucket_id, from, data, opts)
>>           end
>>           box.begin()
>>           for _, tuple in ipairs(space_data) do
>> -            space:insert(tuple)
>> +            local ok, err = pcall(space.insert, space, tuple)
>> +            if not ok then
>> +                box.rollback()
> 
> Am I right that before a patch nobody rolled back transaction is case of error?
> 
> How did it work?

bucket_recv_xc() is called only from bucket_recv() via pcall.
Bucket_recv() does the rollback. I have the _xc() version so
as not to wrap into pcalls everything, and as a protection
against potential OOM. For instance, when I create a table
in there like `{bucket_id, recvg, from}` - it might fail too,
AFAIU.

>> +                return nil, lerror.vshard(lerror.code.BUCKET_RECV_DATA_ERROR,
>> +                                          bucket_id, space.name,
>> +                                          box.tuple.new(tuple), err)
>> +            end
> 
> Do you really need `box.tuple.new` here. Why just `tuple` is not enough?

Because `tuple` is a Lua table. When formatted into %s in the error
message, it turns into 'table 0x......' instead of showing the
content, while tuple objects have a nice serializer.

> AFAIU box.tuple.new doesn't just increment tuple ref-counter and construct new tuple.

It does exactly this.

> Rebalancing is quite CPU-intensive operation so I'm not sure that such behaviour doesn't
> 
> make error case worse.

I thought about it, but decided that it is not worth optimizing the
error case. It is better to provide a good error message. I also
thought about using json.encode() to avoid 'table 0x.....' problem,
but decided I don't want to introduce a dependency on the entire json
module just for this.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 1/1] rebalancer: give more info at bucket_recv() fail
  2021-05-26 18:44   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-05-27  8:24     ` Oleg Babin via Tarantool-patches
  0 siblings, 0 replies; 5+ messages in thread
From: Oleg Babin via Tarantool-patches @ 2021-05-27  8:24 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, yaroslav.dynnikov

Thanks for your answers! LGTM.


On 26.05.2021 21:44, Vladislav Shpilevoy wrote:
> Hi! Thanks for the review!
>
>>> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
>>> index 63e0398..7045d91 100644
>>> --- a/vshard/storage/init.lua
>>> +++ b/vshard/storage/init.lua
>>> @@ -1254,7 +1254,13 @@ local function bucket_recv_xc(bucket_id, from, data, opts)
>>>            end
>>>            box.begin()
>>>            for _, tuple in ipairs(space_data) do
>>> -            space:insert(tuple)
>>> +            local ok, err = pcall(space.insert, space, tuple)
>>> +            if not ok then
>>> +                box.rollback()
>> Am I right that before a patch nobody rolled back transaction is case of error?
>>
>> How did it work?
> bucket_recv_xc() is called only from bucket_recv() via pcall.
> Bucket_recv() does the rollback. I have the _xc() version so
> as not to wrap into pcalls everything, and as a protection
> against potential OOM. For instance, when I create a table
> in there like `{bucket_id, recvg, from}` - it might fail too,
> AFAIU.
Thanks for explanation.


>>> +                return nil, lerror.vshard(lerror.code.BUCKET_RECV_DATA_ERROR,
>>> +                                          bucket_id, space.name,
>>> +                                          box.tuple.new(tuple), err)
>>> +            end
>> Do you really need `box.tuple.new` here. Why just `tuple` is not enough?
> Because `tuple` is a Lua table. When formatted into %s in the error
> message, it turns into 'table 0x......' instead of showing the
> content, while tuple objects have a nice serializer.
>
Yes, I missed it. But if it's lua table (not box.tuple) it's OK.


>> AFAIU box.tuple.new doesn't just increment tuple ref-counter and construct new tuple.
> It does exactly this.
>
>> Rebalancing is quite CPU-intensive operation so I'm not sure that such behaviour doesn't
>>
>> make error case worse.
> I thought about it, but decided that it is not worth optimizing the
> error case. It is better to provide a good error message. I also
> thought about using json.encode() to avoid 'table 0x.....' problem,
> but decided I don't want to introduce a dependency on the entire json
> module just for this.
Ok, agree.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Tarantool-patches] [PATCH vshard 1/1] rebalancer: give more info at bucket_recv() fail
  2021-05-25 20:42 [Tarantool-patches] [PATCH vshard 1/1] rebalancer: give more info at bucket_recv() fail Vladislav Shpilevoy via Tarantool-patches
  2021-05-26  9:03 ` Oleg Babin via Tarantool-patches
@ 2021-05-27 19:20 ` Vladislav Shpilevoy via Tarantool-patches
  1 sibling, 0 replies; 5+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-05-27 19:20 UTC (permalink / raw)
  To: tarantool-patches, yaroslav.dynnikov, olegrok

Pushed to master.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-05-27 19:20 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-25 20:42 [Tarantool-patches] [PATCH vshard 1/1] rebalancer: give more info at bucket_recv() fail Vladislav Shpilevoy via Tarantool-patches
2021-05-26  9:03 ` Oleg Babin via Tarantool-patches
2021-05-26 18:44   ` Vladislav Shpilevoy via Tarantool-patches
2021-05-27  8:24     ` Oleg Babin via Tarantool-patches
2021-05-27 19:20 ` Vladislav Shpilevoy via Tarantool-patches

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox