[Tarantool-patches] [PATCH vshard 0/5] Router backoff, storage disable

2021-12-17T00:25:36Z

The patchset introduces vshard.storage auto and manual enable/disable. Disabled
storage blocks most of public API calls. The state is handled by the router in
a special way so it transparently retries requests failed due to the storage
being disabled.

Branch: http://github.com/tarantool/vshard/tree/gerold103/gh-298-replica-backoff-part-2
Issue: https://github.com/tarantool/vshard/issues/298

Vladislav Shpilevoy (5):
  router: backoff on some box errors
  storage: auto enable/disable
  storage: manual enable/disable
  error: introduce from_string
  router: backoff on storage being disabled

 example/localcfg.lua          |   1 -
 test/router/router2.result    | 325 ++++++++++++++++++++++++++++++++++
 test/router/router2.test.lua  | 132 ++++++++++++++
 test/storage/storage.result   | 128 +++++++++++++
 test/storage/storage.test.lua |  55 ++++++
 test/unit/error.result        |  18 ++
 test/unit/error.test.lua      |   6 +
 test/unit/garbage.result      |  17 +-
 test/unit/garbage.test.lua    |  15 +-
 test/unit/rebalancer.result   |   2 +-
 test/unit/rebalancer.test.lua |   2 +-
 vshard/consts.lua             |   1 +
 vshard/error.lua              |  32 +++-
 vshard/replicaset.lua         | 119 +++++++++++--
 vshard/router/init.lua        |   3 +-
 vshard/storage/init.lua       | 207 +++++++++++++++++-----
 16 files changed, 987 insertions(+), 76 deletions(-)

-- 
2.24.3 (Apple Git-128)

[Tarantool-patches] [PATCH vshard 1/5] router: backoff on some box errors

2021-12-17T00:26:08Z

Storage configuration takes time. Firstly, box.cfg{} which can be
called before vshard.storage.cfg(). Secondly, vshard.storage.cfg()
is not immediate as well.

During that time accessing the storage is not safe. Attempts to
call vshard.storage functions can return weird errors, or the
functions can even be not available yet. They need to be created
in _func and get access rights in _priv before becoming public.

Routers used to forward errors like 'access denied error' and
'no such function' to users as is, treating them as critical.

Not only it was confusing for users, but also could make an entire
replicaset not available for requests - the connection to it is
alive, so router would send all requests into it and they all
would fail. Even if the replicaset has another instance which is
perfectly functional.

This patch handles such specific errors inside of the router. The
faulty replicas are put into a 'backoff' state. They remain in it
for some fixed time (5 seconds for now), new requests won't be
sent to them until the time passes. Router will use other
instances.

Backoff is activated only for vshard.* functions. If the errors
are about some user's function, it is considered a regular error.
Because the router can't tell whether any side effects were done
on the remote instance before the error happened. Hence can't
retry to another node.

For example, if access was denied to 'vshard.storage.call', then
it is backoff. If inside of vshard.storage.call the access was
denied to 'user_test_func', then it is not backoff.

It all works for read-only requests exclusively of course. Because
for read-write requests the instance is just one - master. Router
does not have other options so backoff here wouldn't help.

Part of #298
---
 test/router/router2.result   | 237 +++++++++++++++++++++++++++++++++++
 test/router/router2.test.lua |  97 ++++++++++++++
 vshard/consts.lua            |   1 +
 vshard/error.lua             |   8 +-
 vshard/replicaset.lua        | 100 +++++++++++++--
 vshard/router/init.lua       |   3 +-
 6 files changed, 432 insertions(+), 14 deletions(-)

diff --git a/test/router/router2.result b/test/router/router2.result
index 492421b..a501dbf 100644
--- a/test/router/router2.result
+++ b/test/router/router2.result
@@ -311,6 +311,243 @@ vshard.router.info().bucket
  |   available_rw: 3000
  | ...
 
+--
+-- gh-298: backoff replicas when the storage raises errors meaning its
+-- configuration is not finished or even didn't start. Immediately failover to
+-- other instances then.
+--
+test_run:switch('storage_2_b')
+ | ---
+ | - true
+ | ...
+-- Turn off replication so as _func manipulations on the master wouldn't reach
+-- the replica.
+old_replication = box.cfg.replication
+ | ---
+ | ...
+box.cfg{replication = {}}
+ | ---
+ | ...
+
+test_run:switch('storage_2_a')
+ | ---
+ | - true
+ | ...
+box.schema.user.revoke('storage', 'execute', 'function', 'vshard.storage.call')
+ | ---
+ | ...
+
+test_run:switch('router_1')
+ | ---
+ | - true
+ | ...
+vshard.consts.REPLICA_BACKOFF_INTERVAL = 0.1
+ | ---
+ | ...
+
+-- Indeed fails when called directly via netbox.
+conn = vshard.router.route(1).master.conn
+ | ---
+ | ...
+ok, err = pcall(conn.call, conn, 'vshard.storage.call',                         \
+                {1, 'read', 'echo', {1}})
+ | ---
+ | ...
+assert(not ok and err.code == box.error.ACCESS_DENIED)
+ | ---
+ | - true
+ | ...
+
+-- Works when called via vshard - it goes to another replica transparently.
+long_timeout = {timeout = 1000000}
+ | ---
+ | ...
+res = vshard.router.callro(1, 'echo', {100}, long_timeout)
+ | ---
+ | ...
+assert(res == 100)
+ | ---
+ | - true
+ | ...
+
+--
+-- When all replicas are in backoff due to lack of access, raise an error.
+--
+test_run:switch('storage_2_b')
+ | ---
+ | - true
+ | ...
+assert(echo_count == 1)
+ | ---
+ | - true
+ | ...
+echo_count = 0
+ | ---
+ | ...
+-- Restore the replication so the replica gets the _func change from master.
+box.cfg{replication = old_replication}
+ | ---
+ | ...
+test_run:wait_vclock('storage_2_b', test_run:get_vclock('storage_1_a'))
+ | ---
+ | ...
+
+test_run:switch('router_1')
+ | ---
+ | - true
+ | ...
+ok, err = vshard.router.callro(1, 'echo', {100}, long_timeout)
+ | ---
+ | ...
+assert(not ok and err.code == vshard.error.code.REPLICASET_IN_BACKOFF)
+ | ---
+ | - true
+ | ...
+assert(err.error.code == box.error.ACCESS_DENIED)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('storage_2_a')
+ | ---
+ | - true
+ | ...
+assert(echo_count == 0)
+ | ---
+ | - true
+ | ...
+box.schema.user.grant('storage', 'execute', 'function', 'vshard.storage.call')
+ | ---
+ | ...
+test_run:wait_vclock('storage_2_b', test_run:get_vclock('storage_1_a'))
+ | ---
+ | ...
+
+--
+-- No vshard function = backoff.
+--
+test_run:switch('router_1')
+ | ---
+ | - true
+ | ...
+-- Drop all backoffs to check all works fine now.
+fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
+ | ---
+ | ...
+res = vshard.router.callrw(1, 'echo', {100}, long_timeout)
+ | ---
+ | ...
+assert(res == 100)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('storage_2_a')
+ | ---
+ | - true
+ | ...
+assert(echo_count == 1)
+ | ---
+ | - true
+ | ...
+echo_count = 0
+ | ---
+ | ...
+old_storage_call = vshard.storage.call
+ | ---
+ | ...
+vshard.storage.call = nil
+ | ---
+ | ...
+
+-- Indeed fails when called directly via netbox.
+test_run:switch('router_1')
+ | ---
+ | - true
+ | ...
+conn = vshard.router.route(1).master.conn
+ | ---
+ | ...
+ok, err = pcall(conn.call, conn, 'vshard.storage.call',                         \
+                {1, 'read', 'echo', {1}})
+ | ---
+ | ...
+assert(not ok and err.code == box.error.NO_SUCH_PROC)
+ | ---
+ | - true
+ | ...
+
+-- Works when called via vshard - it goes to another replica.
+res = vshard.router.callro(1, 'echo', {100}, long_timeout)
+ | ---
+ | ...
+assert(res == 100)
+ | ---
+ | - true
+ | ...
+
+--
+-- When all replicas are in backoff due to not having the function, raise
+-- an error.
+--
+test_run:switch('storage_2_b')
+ | ---
+ | - true
+ | ...
+assert(echo_count == 1)
+ | ---
+ | - true
+ | ...
+echo_count = 0
+ | ---
+ | ...
+old_storage_call = vshard.storage.call
+ | ---
+ | ...
+vshard.storage.call = nil
+ | ---
+ | ...
+
+test_run:switch('router_1')
+ | ---
+ | - true
+ | ...
+ok, err = vshard.router.callro(1, 'echo', {100}, long_timeout)
+ | ---
+ | ...
+assert(not ok and err.code == vshard.error.code.REPLICASET_IN_BACKOFF)
+ | ---
+ | - true
+ | ...
+assert(err.error.code == box.error.NO_SUCH_PROC)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('storage_2_b')
+ | ---
+ | - true
+ | ...
+assert(echo_count == 0)
+ | ---
+ | - true
+ | ...
+vshard.storage.call = old_storage_call
+ | ---
+ | ...
+
+test_run:switch('storage_2_a')
+ | ---
+ | - true
+ | ...
+assert(echo_count == 0)
+ | ---
+ | - true
+ | ...
+vshard.storage.call = old_storage_call
+ | ---
+ | ...
+
 _ = test_run:switch("default")
  | ---
  | ...
diff --git a/test/router/router2.test.lua b/test/router/router2.test.lua
index 72c99e6..fb0c3b2 100644
--- a/test/router/router2.test.lua
+++ b/test/router/router2.test.lua
@@ -119,6 +119,103 @@ end
 test_run:wait_cond(wait_all_rw)
 vshard.router.info().bucket
 
+--
+-- gh-298: backoff replicas when the storage raises errors meaning its
+-- configuration is not finished or even didn't start. Immediately failover to
+-- other instances then.
+--
+test_run:switch('storage_2_b')
+-- Turn off replication so as _func manipulations on the master wouldn't reach
+-- the replica.
+old_replication = box.cfg.replication
+box.cfg{replication = {}}
+
+test_run:switch('storage_2_a')
+box.schema.user.revoke('storage', 'execute', 'function', 'vshard.storage.call')
+
+test_run:switch('router_1')
+vshard.consts.REPLICA_BACKOFF_INTERVAL = 0.1
+
+-- Indeed fails when called directly via netbox.
+conn = vshard.router.route(1).master.conn
+ok, err = pcall(conn.call, conn, 'vshard.storage.call',                         \
+                {1, 'read', 'echo', {1}})
+assert(not ok and err.code == box.error.ACCESS_DENIED)
+
+-- Works when called via vshard - it goes to another replica transparently.
+long_timeout = {timeout = 1000000}
+res = vshard.router.callro(1, 'echo', {100}, long_timeout)
+assert(res == 100)
+
+--
+-- When all replicas are in backoff due to lack of access, raise an error.
+--
+test_run:switch('storage_2_b')
+assert(echo_count == 1)
+echo_count = 0
+-- Restore the replication so the replica gets the _func change from master.
+box.cfg{replication = old_replication}
+test_run:wait_vclock('storage_2_b', test_run:get_vclock('storage_1_a'))
+
+test_run:switch('router_1')
+ok, err = vshard.router.callro(1, 'echo', {100}, long_timeout)
+assert(not ok and err.code == vshard.error.code.REPLICASET_IN_BACKOFF)
+assert(err.error.code == box.error.ACCESS_DENIED)
+
+test_run:switch('storage_2_a')
+assert(echo_count == 0)
+box.schema.user.grant('storage', 'execute', 'function', 'vshard.storage.call')
+test_run:wait_vclock('storage_2_b', test_run:get_vclock('storage_1_a'))
+
+--
+-- No vshard function = backoff.
+--
+test_run:switch('router_1')
+-- Drop all backoffs to check all works fine now.
+fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
+res = vshard.router.callrw(1, 'echo', {100}, long_timeout)
+assert(res == 100)
+
+test_run:switch('storage_2_a')
+assert(echo_count == 1)
+echo_count = 0
+old_storage_call = vshard.storage.call
+vshard.storage.call = nil
+
+-- Indeed fails when called directly via netbox.
+test_run:switch('router_1')
+conn = vshard.router.route(1).master.conn
+ok, err = pcall(conn.call, conn, 'vshard.storage.call',                         \
+                {1, 'read', 'echo', {1}})
+assert(not ok and err.code == box.error.NO_SUCH_PROC)
+
+-- Works when called via vshard - it goes to another replica.
+res = vshard.router.callro(1, 'echo', {100}, long_timeout)
+assert(res == 100)
+
+--
+-- When all replicas are in backoff due to not having the function, raise
+-- an error.
+--
+test_run:switch('storage_2_b')
+assert(echo_count == 1)
+echo_count = 0
+old_storage_call = vshard.storage.call
+vshard.storage.call = nil
+
+test_run:switch('router_1')
+ok, err = vshard.router.callro(1, 'echo', {100}, long_timeout)
+assert(not ok and err.code == vshard.error.code.REPLICASET_IN_BACKOFF)
+assert(err.error.code == box.error.NO_SUCH_PROC)
+
+test_run:switch('storage_2_b')
+assert(echo_count == 0)
+vshard.storage.call = old_storage_call
+
+test_run:switch('storage_2_a')
+assert(echo_count == 0)
+vshard.storage.call = old_storage_call
+
 _ = test_run:switch("default")
 _ = test_run:cmd("stop server router_1")
 _ = test_run:cmd("cleanup server router_1")
diff --git a/vshard/consts.lua b/vshard/consts.lua
index 66a09ae..64cb0ae 100644
--- a/vshard/consts.lua
+++ b/vshard/consts.lua
@@ -40,6 +40,7 @@ return {
     RECONNECT_TIMEOUT = 0.5;
     GC_BACKOFF_INTERVAL = 5,
     RECOVERY_BACKOFF_INTERVAL = 5,
+    REPLICA_BACKOFF_INTERVAL = 5,
     COLLECT_LUA_GARBAGE_INTERVAL = 100;
     DEFAULT_BUCKET_SEND_TIMEOUT = 10,
     DEFAULT_BUCKET_RECV_TIMEOUT = 10,
diff --git a/vshard/error.lua b/vshard/error.lua
index fa7bdaa..0ce208a 100644
--- a/vshard/error.lua
+++ b/vshard/error.lua
@@ -158,7 +158,13 @@ local error_message_template = {
         name = 'MULTIPLE_MASTERS_FOUND',
         msg = 'Found more than one master in replicaset %s on nodes %s and %s',
         args = {'replicaset_uuid', 'master1', 'master2'},
-    }
+    },
+    [32] = {
+        name = 'REPLICASET_IN_BACKOFF',
+        msg = 'Replicaset %s is in backoff, can\'t take requests right now. '..
+              'Last error was %s',
+        args = {'replicaset_uuid', 'error'}
+    },
 }
 
 --
diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
index a47a7b9..573a555 100644
--- a/vshard/replicaset.lua
+++ b/vshard/replicaset.lua
@@ -13,6 +13,8 @@
 --             weight = number,
 --             down_ts = ,
+--             backoff_ts = ,
+--             backoff_err = ,
 --             net_timeout = @@ -142,6 +144,25 @@ local function netbox_wait_connected(conn, timeout)
     return timeout
 end
 
+--
+-- Check if the replica is not in backoff. It also serves as an update - if the
+-- replica still has an old backoff timestamp, it is cleared. This way of
+-- backoff update does not require any fibers to perform background updates.
+-- Hence works not only on the router.
+--
+local function replica_check_backoff(replica, now)
+    if not replica.backoff_ts then
+        return true
+    end
+    if replica.backoff_ts + consts.REPLICA_BACKOFF_INTERVAL > now then
+        return false
+    end
+    log.warn('Replica %s returns from backoff', replica.uuid)
+    replica.backoff_ts = nil
+    replica.backoff_err = nil
+    return true
+end
+
 --
 -- Connect to a specified replica and remember a new connection
 -- in the replica object. Note, that the function does not wait
@@ -421,6 +442,39 @@ local function can_retry_after_error(e)
     return e.type == 'ClientError' and e.code == box.error.TIMEOUT
 end
 
+--
+-- True if after the given error on call of the given function the connection
+-- must go into backoff.
+--
+local function can_backoff_after_error(e, func)
+    if not e then
+        return false
+    end
+    if type(e) ~= 'table' and
+       (type(e) ~= 'cdata' or not ffi.istype('struct error', e)) then
+        return false
+    end
+    -- So far it is enabled only for vshard's own functions. Including
+    -- vshard.storage.call(). Otherwise it is not possible to retry safely -
+    -- user's function could have side effects before raising that error.
+    -- For instance, 'access denied' could be raised by user's function
+    -- internally after it already changed some data on the storage.
+    if not func:startswith('vshard.') then
+        return false
+    end
+    -- ClientError is sent for all errors by old Tarantool versions which didn't
+    -- keep error type. New versions preserve the original error type.
+    if e.type == 'ClientError' or e.type == 'AccessDeniedError' then
+        if e.code == box.error.ACCESS_DENIED then
+            return e.message:startswith("Execute access to function 'vshard.")
+        end
+        if e.code == box.error.NO_SUCH_PROC then
+            return e.message:startswith("Procedure 'vshard.")
+        end
+    end
+    return false
+end
+
 --
 -- Pick a next replica according to round-robin load balancing
 -- policy.
@@ -443,7 +497,7 @@ end
 -- until failover fiber will repair the nearest connection.
 --
 local function replicaset_template_multicallro(prefer_replica, balance)
-    local function pick_next_replica(replicaset)
+    local function pick_next_replica(replicaset, now)
         local r
         local master = replicaset.master
         if balance then
@@ -451,7 +505,8 @@ local function replicaset_template_multicallro(prefer_replica, balance)
             while i > 0 do
                 r = replicaset_balance_replica(replicaset)
                 i = i - 1
-                if r:is_connected() and (not prefer_replica or r ~= master) then
+                if r:is_connected() and (not prefer_replica or r ~= master) and
+                   replica_check_backoff(r, now) then
                     return r
                 end
             end
@@ -459,7 +514,8 @@ local function replicaset_template_multicallro(prefer_replica, balance)
             local start_r = replicaset.replica
             r = start_r
             while r do
-                if r:is_connected() and (not prefer_replica or r ~= master) then
+                if r:is_connected() and (not prefer_replica or r ~= master) and
+                   replica_check_backoff(r, now) then
                     return r
                 end
                 r = r.next_by_priority
@@ -471,7 +527,8 @@ local function replicaset_template_multicallro(prefer_replica, balance)
                     -- Reached already checked part.
                     break
                 end
-                if r:is_connected() and (not prefer_replica or r ~= master) then
+                if r:is_connected() and (not prefer_replica or r ~= master) and
+                   replica_check_backoff(r, now) then
                     return r
                 end
             end
@@ -488,26 +545,43 @@ local function replicaset_template_multicallro(prefer_replica, balance)
         if timeout <= 0 then
             return nil, lerror.timeout()
         end
-        local end_time = fiber_clock() + timeout
+        local now = fiber_clock()
+        local end_time = now + timeout
         while not net_status and timeout > 0 do
-            replica = pick_next_replica(replicaset)
+            replica = pick_next_replica(replicaset, now)
             if not replica then
                 replica, timeout = replicaset_wait_master(replicaset, timeout)
                 if not replica then
                     return nil, timeout
                 end
                 replicaset_connect_to_replica(replicaset, replica)
+                if replica.backoff_ts then
+                    return nil, lerror.vshard(
+                        lerror.code.REPLICASET_IN_BACKOFF, replicaset.uuid,
+                        replica.backoff_err)
+                end
             end
             opts.timeout = timeout
             net_status, storage_status, retval, err =
                 replica_call(replica, func, args, opts)
-            timeout = end_time - fiber_clock()
+            now = fiber_clock()
+            timeout = end_time - now
             if not net_status and not storage_status and
                not can_retry_after_error(retval) then
-                -- There is no sense to retry LuaJit errors, such as
-                -- assetions, not defined variables etc.
-                net_status = true
-                break
+                if can_backoff_after_error(retval, func) then
+                    if not replica.backoff_ts then
+                        log.warn('Replica %s goes into backoff for %s sec '..
+                                 'after error %s', replica.uuid,
+                                 consts.REPLICA_BACKOFF_INTERVAL, retval)
+                        replica.backoff_ts = now
+                        replica.backoff_err = retval
+                    end
+                else
+                    -- There is no sense to retry LuaJit errors, such as
+                    -- assertions, undefined variables etc.
+                    net_status = true
+                    break
+                end
             end
         end
         if not net_status then
@@ -549,6 +623,8 @@ local function rebind_replicasets(replicasets, old_replicasets)
                 local conn = old_replica.conn
                 replica.conn = conn
                 replica.down_ts = old_replica.down_ts
+                replica.backoff_ts = old_replica.backoff_ts
+                replica.backoff_err = old_replica.backoff_err
                 replica.net_timeout = old_replica.net_timeout
                 replica.net_sequential_ok = old_replica.net_sequential_ok
                 replica.net_sequential_fail = old_replica.net_sequential_fail
@@ -986,7 +1062,7 @@ local function buildall(sharding_cfg)
                 uri = replica.uri, name = replica.name, uuid = replica_uuid,
                 zone = replica.zone, net_timeout = consts.CALL_TIMEOUT_MIN,
                 net_sequential_ok = 0, net_sequential_fail = 0,
-                down_ts = curr_ts,
+                down_ts = curr_ts, backoff_ts = nil, backoff_err = nil,
             }, replica_mt)
             new_replicaset.replicas[replica_uuid] = new_replica
             if replica.master then
diff --git a/vshard/router/init.lua b/vshard/router/init.lua
index 4748398..116f905 100644
--- a/vshard/router/init.lua
+++ b/vshard/router/init.lua
@@ -172,7 +172,8 @@ local function bucket_discovery(router, bucket_id)
             replicaset:callrw('vshard.storage.bucket_stat', {bucket_id})
         if err == nil then
             return bucket_set(router, bucket_id, replicaset.uuid)
-        elseif err.code ~= lerror.code.WRONG_BUCKET then
+        elseif err.code ~= lerror.code.WRONG_BUCKET and
+               err.code ~= lerror.code.REPLICASET_IN_BACKOFF then
             last_err = err
             unreachable_uuid = uuid
         end
-- 
2.24.3 (Apple Git-128)

[Tarantool-patches] [PATCH vshard 2/5] storage: auto enable/disable

2021-12-17T00:26:37Z

While vshard.storage.cfg() is not done, accessing vshard functions
is not safe. It will fail with low level errors like
'access denied' or 'no such function'.

However there can be even worse cases. The user can have universe
access rights. And vshard can be already in global namespace after
require(). So vshard.storage functions are already visible.

The previous patch fixed only the case when function access was
restricted properly. And still fixed it just partially.

New problems are:

- box.cfg{} is already called, but the instance is still
  'loading'. Then data is not fully recovered yet. Accessing is
  not safe from the data consistency perspective.

- vshard.storage.cfg() is not started, or is not finished yet. In
  the end it might be doing something on what the public functions
  depend.

This patch addresses these issues. Now all non-trivial
vshard.storage functions are disabled until vshard.storage.cfg()
is finished and the instance is fully recovered.

They raise an error with a special code. Returning it via
'nil, err' pair wouldn't work. Because firstly, some functions
return a boolean value and are not documented as ever failing.
People would miss this new error.

Second reason - vshard.storage.call() needs to signal the remote
caller that the storage is disabled and it was found before the
user's function was called. If it would be done via 'nil, err',
then the user's function could emulate the storage being disabled.
Or even worse, it could make some changes and then get that error
accidentally by going to another storage remotely which would be
disabled. Hence it is not allowed. Too easy to break something.

It was an option to change vshard.storage.call() signature to
return 'true, retvals...' when user's function was called and
'false, err' when it wasn't, but that would break backward
compatibility. Supporting it only for new routers does not seem
possible.

The patch also drops 'memtx_memory' setting from the config
because an attempt to apply it after calling box.cfg() (for
example, via boot_like_vshard()) raises an error - default memory
is bigger than this setting. It messed the new tests.

Part of #298
Closes #123
---
 example/localcfg.lua          |   1 -
 test/storage/storage.result   | 101 +++++++++++++++++++
 test/storage/storage.test.lua |  43 ++++++++
 test/unit/garbage.result      |  17 ++--
 test/unit/garbage.test.lua    |  15 +--
 test/unit/rebalancer.result   |   2 +-
 test/unit/rebalancer.test.lua |   2 +-
 vshard/error.lua              |   5 +
 vshard/storage/init.lua       | 179 ++++++++++++++++++++++++++--------
 9 files changed, 305 insertions(+), 60 deletions(-)

diff --git a/example/localcfg.lua b/example/localcfg.lua
index 71f9d6f..9235d45 100644
--- a/example/localcfg.lua
+++ b/example/localcfg.lua
@@ -1,5 +1,4 @@
 return {
-    memtx_memory = 100 * 1024 * 1024,
     sharding = {
         ['cbf06940-0790-498b-948d-042b62cf3d29'] = { -- replicaset #1
             replicas = {
diff --git a/test/storage/storage.result b/test/storage/storage.result
index af48a13..e83b34f 100644
--- a/test/storage/storage.result
+++ b/test/storage/storage.result
@@ -951,6 +951,107 @@ vshard.storage._call('info')
 ---
 - is_master: false
 ...
+--
+-- gh-123, gh-298: storage auto-enable/disable depending on instance state.
+--
+_ = test_run:cmd('stop server storage_1_a')
+---
+...
+_ = test_run:cmd('start server storage_1_a with wait=False, '..                 \
+                 'args="boot_before_cfg"')
+---
+...
+_ = test_run:switch('storage_1_a')
+---
+...
+-- Leaving box.cfg() not called won't work because at 1.10 test-run somewhy
+-- raises an error when try to start an instance without box.cfg(). It can only
+-- be emulated.
+old_info = box.info
+---
+...
+box.info = setmetatable({}, {__index = function() error('not configured') end})
+---
+...
+assert(not pcall(function() return box.info.status end))
+---
+- true
+...
+ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100})
+---
+...
+assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED)
+---
+- true
+...
+assert(err.message:match('box seem not to be configured') ~= nil)
+---
+- true
+...
+box.info = old_info
+---
+...
+-- Disabled until box is loaded.
+vshard.storage.internal.errinj.ERRINJ_CFG_DELAY = true
+---
+...
+f = fiber.create(vshard.storage.cfg, cfg, instance_uuid)
+---
+...
+f:set_joinable(true)
+---
+...
+old_info = box.info
+---
+...
+box.info = {status = 'loading'}
+---
+...
+ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100})
+---
+...
+assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED)
+---
+- true
+...
+assert(err.message:match('instance status is "loading"') ~= nil)
+---
+- true
+...
+box.info = old_info
+---
+...
+-- Disabled until storage is configured.
+test_run:wait_cond(function() return box.info.status == 'running' end)
+---
+- true
+...
+ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100})
+---
+...
+assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED)
+---
+- true
+...
+assert(err.message:match('storage is not configured') ~= nil)
+---
+- true
+...
+vshard.storage.internal.errinj.ERRINJ_CFG_DELAY = false
+---
+...
+f:join()
+---
+- true
+...
+-- Enabled when all criteria are finally satisfied.
+ok, res = vshard.storage.call(1, 'read', 'echo', {100})
+---
+...
+assert(ok and res == 100)
+---
+- true
+...
 _ = test_run:switch("default")
 ---
 ...
diff --git a/test/storage/storage.test.lua b/test/storage/storage.test.lua
index 99ef2c4..ff39f2f 100644
--- a/test/storage/storage.test.lua
+++ b/test/storage/storage.test.lua
@@ -299,6 +299,49 @@ vshard.storage._call('info')
 _ = test_run:switch('storage_1_b')
 vshard.storage._call('info')
 
+--
+-- gh-123, gh-298: storage auto-enable/disable depending on instance state.
+--
+_ = test_run:cmd('stop server storage_1_a')
+_ = test_run:cmd('start server storage_1_a with wait=False, '..                 \
+                 'args="boot_before_cfg"')
+_ = test_run:switch('storage_1_a')
+-- Leaving box.cfg() not called won't work because at 1.10 test-run somewhy
+-- raises an error when try to start an instance without box.cfg(). It can only
+-- be emulated.
+old_info = box.info
+box.info = setmetatable({}, {__index = function() error('not configured') end})
+assert(not pcall(function() return box.info.status end))
+
+ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100})
+assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED)
+assert(err.message:match('box seem not to be configured') ~= nil)
+box.info = old_info
+
+-- Disabled until box is loaded.
+vshard.storage.internal.errinj.ERRINJ_CFG_DELAY = true
+f = fiber.create(vshard.storage.cfg, cfg, instance_uuid)
+f:set_joinable(true)
+
+old_info = box.info
+box.info = {status = 'loading'}
+ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100})
+assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED)
+assert(err.message:match('instance status is "loading"') ~= nil)
+box.info = old_info
+
+-- Disabled until storage is configured.
+test_run:wait_cond(function() return box.info.status == 'running' end)
+ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100})
+assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED)
+assert(err.message:match('storage is not configured') ~= nil)
+vshard.storage.internal.errinj.ERRINJ_CFG_DELAY = false
+f:join()
+
+-- Enabled when all criteria are finally satisfied.
+ok, res = vshard.storage.call(1, 'read', 'echo', {100})
+assert(ok and res == 100)
+
 _ = test_run:switch("default")
 test_run:drop_cluster(REPLICASET_2)
 test_run:drop_cluster(REPLICASET_1)
diff --git a/test/unit/garbage.result b/test/unit/garbage.result
index a530496..c4b0fc3 100644
--- a/test/unit/garbage.result
+++ b/test/unit/garbage.result
@@ -16,7 +16,7 @@ test_run:cmd("setopt delimiter ';'")
 ...
 function show_sharded_spaces()
     local result = {}
-    for k, space in pairs(vshard.storage.sharded_spaces()) do
+    for k, space in pairs(vshard.storage._sharded_spaces()) do
         table.insert(result, space.name)
     end
     table.sort(result)
@@ -489,7 +489,10 @@ f:cancel()
 util = require('util')
 ---
 ...
-util.check_error(vshard.storage.bucket_delete_garbage)
+delete_garbage = vshard.storage._bucket_delete_garbage
+---
+...
+util.check_error(delete_garbage)
 ---
 - 'Usage: bucket_delete_garbage(bucket_id, opts)'
 ...
@@ -506,7 +509,7 @@ s:replace{6, 4}
 ---
 - [6, 4]
 ...
-vshard.storage.bucket_delete_garbage(4)
+delete_garbage(4)
 ---
 ...
 s:select{}
@@ -526,7 +529,7 @@ s:replace{6, 4}
 ---
 - [6, 4]
 ...
-vshard.storage.bucket_delete_garbage(4)
+delete_garbage(4)
 ---
 ...
 s:select{}
@@ -547,16 +550,16 @@ s:replace{6, 4}
 ---
 - [6, 4]
 ...
-util.check_error(vshard.storage.bucket_delete_garbage, 4)
+util.check_error(delete_garbage, 4)
 ---
 - Can not delete not garbage bucket. Use "{force=true}" to ignore this attention
 ...
-util.check_error(vshard.storage.bucket_delete_garbage, 4, 10000)
+util.check_error(delete_garbage, 4, 10000)
 ---
 - 'Usage: bucket_delete_garbage(bucket_id, opts)'
 ...
 -- 'Force' option ignores this error.
-vshard.storage.bucket_delete_garbage(4, {force = true})
+delete_garbage(4, {force = true})
 ---
 ...
 s:select{}
diff --git a/test/unit/garbage.test.lua b/test/unit/garbage.test.lua
index 250afb0..1416f58 100644
--- a/test/unit/garbage.test.lua
+++ b/test/unit/garbage.test.lua
@@ -6,7 +6,7 @@ engine = test_run:get_cfg('engine')
 test_run:cmd("setopt delimiter ';'")
 function show_sharded_spaces()
     local result = {}
-    for k, space in pairs(vshard.storage.sharded_spaces()) do
+    for k, space in pairs(vshard.storage._sharded_spaces()) do
         table.insert(result, space.name)
     end
     table.sort(result)
@@ -197,30 +197,31 @@ f:cancel()
 --
 util = require('util')
 
-util.check_error(vshard.storage.bucket_delete_garbage)
+delete_garbage = vshard.storage._bucket_delete_garbage
+util.check_error(delete_garbage)
 
 -- Delete an existing garbage bucket.
 _bucket:replace{4, vshard.consts.BUCKET.SENT}
 s:replace{5, 4}
 s:replace{6, 4}
-vshard.storage.bucket_delete_garbage(4)
+delete_garbage(4)
 s:select{}
 
 -- Delete a not existing garbage bucket.
 _ = _bucket:delete{4}
 s:replace{5, 4}
 s:replace{6, 4}
-vshard.storage.bucket_delete_garbage(4)
+delete_garbage(4)
 s:select{}
 
 -- Fail to delete a not garbage bucket.
 _bucket:replace{4, vshard.consts.BUCKET.ACTIVE}
 s:replace{5, 4}
 s:replace{6, 4}
-util.check_error(vshard.storage.bucket_delete_garbage, 4)
-util.check_error(vshard.storage.bucket_delete_garbage, 4, 10000)
+util.check_error(delete_garbage, 4)
+util.check_error(delete_garbage, 4, 10000)
 -- 'Force' option ignores this error.
-vshard.storage.bucket_delete_garbage(4, {force = true})
+delete_garbage(4, {force = true})
 s:select{}
 
 --
diff --git a/test/unit/rebalancer.result b/test/unit/rebalancer.result
index 19aa480..a931d63 100644
--- a/test/unit/rebalancer.result
+++ b/test/unit/rebalancer.result
@@ -287,7 +287,7 @@ build_routes(replicasets)
 --
 -- Test rebalancer local state.
 --
-get_state = vshard.storage.rebalancer_request_state
+get_state = vshard.storage._rebalancer_request_state
 ---
 ...
 _bucket = box.schema.create_space('_bucket')
diff --git a/test/unit/rebalancer.test.lua b/test/unit/rebalancer.test.lua
index 8087d42..1cdbbcb 100644
--- a/test/unit/rebalancer.test.lua
+++ b/test/unit/rebalancer.test.lua
@@ -75,7 +75,7 @@ build_routes(replicasets)
 --
 -- Test rebalancer local state.
 --
-get_state = vshard.storage.rebalancer_request_state
+get_state = vshard.storage._rebalancer_request_state
 _bucket = box.schema.create_space('_bucket')
 pk = _bucket:create_index('pk')
 status = _bucket:create_index('status', {parts = {{2, 'string'}}, unique = false})
diff --git a/vshard/error.lua b/vshard/error.lua
index 0ce208a..2b97eae 100644
--- a/vshard/error.lua
+++ b/vshard/error.lua
@@ -165,6 +165,11 @@ local error_message_template = {
               'Last error was %s',
         args = {'replicaset_uuid', 'error'}
     },
+    [33] = {
+        name = 'STORAGE_IS_DISABLED',
+        msg = 'Storage is disabled: %s',
+        args = {'reason'}
+    },
 }
 
 --
diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
index 3158f3c..d3c4e2a 100644
--- a/vshard/storage/init.lua
+++ b/vshard/storage/init.lua
@@ -134,6 +134,16 @@ if not M then
         -- help routers find a new location for sent and deleted buckets without
         -- whole cluster scan.
         route_map = {},
+        -- Flag whether vshard.storage.cfg() is finished.
+        is_configured = false,
+        -- Flag whether box.info.status is acceptable. For instance, 'loading'
+        -- is not.
+        is_loaded = false,
+        -- Reference to the function-proxy to most of the public functions. It
+        -- allows to avoid 'if's in each function by adding expensive
+        -- conditional checks in one rarely used version of the wrapper and no
+        -- checks into the other almost always used wrapper.
+        api_call_cache = nil,
 
         ------------------- Garbage collection -------------------
         -- Fiber to remove garbage buckets data.
@@ -670,19 +680,19 @@ local function this_is_master()
            M.this_replica == M.this_replicaset.master
 end
 
-local function on_master_disable(...)
-    M._on_master_disable(...)
+local function on_master_disable(new_func, old_func)
+    M._on_master_disable(new_func, old_func)
     -- If a trigger is set after storage.cfg(), then notify an
     -- user, that the current instance is not master.
-    if select('#', ...) == 1 and not this_is_master() then
+    if old_func == nil and not this_is_master() then
         M._on_master_disable:run()
     end
 end
 
-local function on_master_enable(...)
-    M._on_master_enable(...)
+local function on_master_enable(new_func, old_func)
+    M._on_master_enable(new_func, old_func)
     -- Same as above, but notify, that the instance is master.
-    if select('#', ...) == 1 and this_is_master() then
+    if old_func == nil and this_is_master() then
         M._on_master_enable:run()
     end
 end
@@ -1363,6 +1373,13 @@ local function find_sharded_spaces()
     return spaces
 end
 
+--
+-- Public wrapper for sharded spaces list getter.
+--
+local function storage_sharded_spaces()
+    return table.deepcopy(find_sharded_spaces())
+end
+
 if M.errinj.ERRINJ_RELOAD then
     error('Error injection: reload')
 end
@@ -2764,6 +2781,7 @@ local function storage_cfg(cfg, this_replica_uuid, is_reload)
         M.rebalancer_fiber = nil
     end
     lua_gc.set_state(M.collect_lua_garbage, consts.COLLECT_LUA_GARBAGE_INTERVAL)
+    M.is_configured = true
     -- Destroy connections, not used in a new configuration.
     collectgarbage()
 end
@@ -2915,6 +2933,58 @@ local function storage_info()
     return state
 end
 
+--------------------------------------------------------------------------------
+-- Public API protection
+--------------------------------------------------------------------------------
+
+--
+-- Arguments are listed explicitly instead of '...' because the latter does not
+-- jit.
+--
+local function storage_api_call_safe(func, arg1, arg2, arg3, arg4)
+    return func(arg1, arg2, arg3, arg4)
+end
+
+--
+-- Unsafe proxy is loaded with protections. But it is used rarely and only in
+-- the beginning of instance's lifetime.
+--
+local function storage_api_call_unsafe(func, arg1, arg2, arg3, arg4)
+    -- box.info is quite expensive. Avoid calling it again when the instance
+    -- is finally loaded.
+    if not M.is_loaded then
+        -- box.info raises an error until box.cfg() is started.
+        local ok, status = pcall(function()
+            return box.info.status
+        end)
+        if not ok then
+            local msg = 'box seem not to be configured'
+            return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg))
+        end
+        -- 'Orphan' is allowed because even if a replica is an orphan, it still
+        -- could be up to date. Just not all other replicas are connected.
+        if status ~= 'running' and status ~= 'orphan' then
+            local msg = ('instance status is "%s"'):format(status)
+            return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg))
+        end
+        M.is_loaded = true
+    end
+    if not M.is_configured then
+        local msg = 'storage is not configured'
+        return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg))
+    end
+    M.api_call_cache = storage_api_call_safe
+    return func(arg1, arg2, arg3, arg4)
+end
+
+M.api_call_cache = storage_api_call_unsafe
+
+local function storage_make_api(func)
+    return function(arg1, arg2, arg3, arg4)
+        return M.api_call_cache(func, arg1, arg2, arg3, arg4)
+    end
+end
+
 --------------------------------------------------------------------------------
 -- Module definition
 --------------------------------------------------------------------------------
@@ -3021,44 +3091,67 @@ M.bucket_are_all_rw = bucket_are_all_rw_public
 M.bucket_generation_wait = bucket_generation_wait
 lregistry.storage = M
 
+--
+-- Not all methods are public here. Private methods should not be exposed if
+-- possible. At least not without notable difference in naming.
+--
 return {
-    sync = sync,
-    bucket_force_create = bucket_force_create,
-    bucket_force_drop = bucket_force_drop,
-    bucket_collect = bucket_collect,
-    bucket_recv = bucket_recv,
-    bucket_send = bucket_send,
-    bucket_stat = bucket_stat,
-    bucket_pin = bucket_pin,
-    bucket_unpin = bucket_unpin,
-    bucket_ref = bucket_ref,
-    bucket_unref = bucket_unref,
-    bucket_refro = bucket_refro,
-    bucket_refrw = bucket_refrw,
-    bucket_unrefro = bucket_unrefro,
-    bucket_unrefrw = bucket_unrefrw,
-    bucket_delete_garbage = bucket_delete_garbage,
-    garbage_collector_wakeup = garbage_collector_wakeup,
-    rebalancer_wakeup = rebalancer_wakeup,
-    rebalancer_apply_routes = rebalancer_apply_routes,
-    rebalancer_disable = rebalancer_disable,
-    rebalancer_enable = rebalancer_enable,
-    is_locked = is_this_replicaset_locked,
-    rebalancing_is_in_progress = rebalancing_is_in_progress,
-    recovery_wakeup = recovery_wakeup,
-    call = storage_call,
-    _call = service_call,
+    --
+    -- Bucket methods.
+    --
+    bucket_force_create = storage_make_api(bucket_force_create),
+    bucket_force_drop = storage_make_api(bucket_force_drop),
+    bucket_collect = storage_make_api(bucket_collect),
+    bucket_recv = storage_make_api(bucket_recv),
+    bucket_send = storage_make_api(bucket_send),
+    bucket_stat = storage_make_api(bucket_stat),
+    bucket_pin = storage_make_api(bucket_pin),
+    bucket_unpin = storage_make_api(bucket_unpin),
+    bucket_ref = storage_make_api(bucket_ref),
+    bucket_unref = storage_make_api(bucket_unref),
+    bucket_refro = storage_make_api(bucket_refro),
+    bucket_refrw = storage_make_api(bucket_refrw),
+    bucket_unrefro = storage_make_api(bucket_unrefro),
+    bucket_unrefrw = storage_make_api(bucket_unrefrw),
+    bucket_delete_garbage = storage_make_api(bucket_delete_garbage),
+    _bucket_delete_garbage = bucket_delete_garbage,
+    buckets_info = storage_make_api(storage_buckets_info),
+    buckets_count = storage_make_api(bucket_count_public),
+    buckets_discovery = storage_make_api(buckets_discovery),
+    --
+    -- Garbage collector.
+    --
+    garbage_collector_wakeup = storage_make_api(garbage_collector_wakeup),
+    --
+    -- Rebalancer.
+    --
+    rebalancer_wakeup = storage_make_api(rebalancer_wakeup),
+    rebalancer_apply_routes = storage_make_api(rebalancer_apply_routes),
+    rebalancer_disable = storage_make_api(rebalancer_disable),
+    rebalancer_enable = storage_make_api(rebalancer_enable),
+    rebalancing_is_in_progress = storage_make_api(rebalancing_is_in_progress),
+    rebalancer_request_state = storage_make_api(rebalancer_request_state),
+    _rebalancer_request_state = rebalancer_request_state,
+    --
+    -- Recovery.
+    --
+    recovery_wakeup = storage_make_api(recovery_wakeup),
+    --
+    -- Instance info.
+    --
+    is_locked = storage_make_api(is_this_replicaset_locked),
+    info = storage_make_api(storage_info),
+    sharded_spaces = storage_make_api(storage_sharded_spaces),
+    _sharded_spaces = storage_sharded_spaces,
+    module_version = function() return M.module_version end,
+    --
+    -- Miscellaneous.
+    --
+    call = storage_make_api(storage_call),
+    _call = storage_make_api(service_call),
+    sync = storage_make_api(sync),
     cfg = function(cfg, uuid) return storage_cfg(cfg, uuid, false) end,
-    info = storage_info,
-    buckets_info = storage_buckets_info,
-    buckets_count = bucket_count_public,
-    buckets_discovery = buckets_discovery,
-    rebalancer_request_state = rebalancer_request_state,
+    on_master_enable = storage_make_api(on_master_enable),
+    on_master_disable = storage_make_api(on_master_disable),
     internal = M,
-    on_master_enable = on_master_enable,
-    on_master_disable = on_master_disable,
-    sharded_spaces = function()
-        return table.deepcopy(find_sharded_spaces())
-    end,
-    module_version = function() return M.module_version end,
 }
-- 
2.24.3 (Apple Git-128)

[Tarantool-patches] [PATCH vshard 3/5] storage: manual enable/disable

2021-12-17T00:27:15Z

The patch introduces functions vshard.storage.enable()/disable().

They allow to control manually whether the instance can accept
requests.

It solves the following problems which were not covered by
previous patches:

- Even if box.cfg() is done, status is 'running', and
  vshard.storage.cfg() is finished, still user's application can
  be not ready to accept requests. For instance, it needs to
  create more functions and users on top of vshard. Then it wants
  to disable public requests until all preliminary work is done.

- After all is enabled, fine, and dandy, still the instance might
  want to disable self in case of an emergency. Such as its config
  got broken or too outdated, desynced with a centric storage.

vshard.storage.enable()/disable() can be called any time, before,
during, and after vshard.storage.cfg() to solve these issues.

Part of #298
---
 test/storage/storage.result   | 27 +++++++++++++++++++++++++++
 test/storage/storage.test.lua | 12 ++++++++++++
 vshard/storage/init.lua       | 28 ++++++++++++++++++++++++++++
 3 files changed, 67 insertions(+)

diff --git a/test/storage/storage.result b/test/storage/storage.result
index e83b34f..e1b0d5d 100644
--- a/test/storage/storage.result
+++ b/test/storage/storage.result
@@ -1052,6 +1052,33 @@ assert(ok and res == 100)
 ---
 - true
 ...
+--
+-- Manual enable/disable.
+--
+vshard.storage.disable()
+---
+...
+ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100})
+---
+...
+assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED)
+---
+- true
+...
+assert(err.message:match('storage is disabled explicitly') ~= nil)
+---
+- true
+...
+vshard.storage.enable()
+---
+...
+ok, res = vshard.storage.call(1, 'read', 'echo', {100})
+---
+...
+assert(ok and res == 100)
+---
+- true
+...
 _ = test_run:switch("default")
 ---
 ...
diff --git a/test/storage/storage.test.lua b/test/storage/storage.test.lua
index ff39f2f..e1127d3 100644
--- a/test/storage/storage.test.lua
+++ b/test/storage/storage.test.lua
@@ -342,6 +342,18 @@ f:join()
 ok, res = vshard.storage.call(1, 'read', 'echo', {100})
 assert(ok and res == 100)
 
+--
+-- Manual enable/disable.
+--
+vshard.storage.disable()
+ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100})
+assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED)
+assert(err.message:match('storage is disabled explicitly') ~= nil)
+
+vshard.storage.enable()
+ok, res = vshard.storage.call(1, 'read', 'echo', {100})
+assert(ok and res == 100)
+
 _ = test_run:switch("default")
 test_run:drop_cluster(REPLICASET_2)
 test_run:drop_cluster(REPLICASET_1)
diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
index d3c4e2a..94e0c16 100644
--- a/vshard/storage/init.lua
+++ b/vshard/storage/init.lua
@@ -139,6 +139,9 @@ if not M then
         -- Flag whether box.info.status is acceptable. For instance, 'loading'
         -- is not.
         is_loaded = false,
+        -- Flag whether the instance is enabled manually. It is true by default
+        -- for backward compatibility with old vshard.
+        is_enabled = true,
         -- Reference to the function-proxy to most of the public functions. It
         -- allows to avoid 'if's in each function by adding expensive
         -- conditional checks in one rarely used version of the wrapper and no
@@ -201,6 +204,11 @@ if not M then
     }
 else
     bucket_ref_new = ffi.typeof("struct bucket_ref")
+
+    -- It is not set when reloaded from an old vshard version.
+    if M.is_enabled == nil then
+        M.is_enabled = true
+    end
 end
 
 --
@@ -2973,6 +2981,10 @@ local function storage_api_call_unsafe(func, arg1, arg2, arg3, arg4)
         local msg = 'storage is not configured'
         return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg))
     end
+    if not M.is_enabled then
+        local msg = 'storage is disabled explicitly'
+        return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg))
+    end
     M.api_call_cache = storage_api_call_safe
     return func(arg1, arg2, arg3, arg4)
 end
@@ -2985,6 +2997,20 @@ local function storage_make_api(func)
     end
 end
 
+local function storage_enable()
+    M.is_enabled = true
+end
+
+--
+-- Disable can be used in case the storage entered a critical state in which
+-- requests are not allowed. For instance, its config got broken or too old
+-- compared to a centric config somewhere.
+--
+local function storage_disable()
+    M.is_enabled = false
+    M.api_call_cache = storage_api_call_unsafe
+end
+
 --------------------------------------------------------------------------------
 -- Module definition
 --------------------------------------------------------------------------------
@@ -3153,5 +3179,7 @@ return {
     cfg = function(cfg, uuid) return storage_cfg(cfg, uuid, false) end,
     on_master_enable = storage_make_api(on_master_enable),
     on_master_disable = storage_make_api(on_master_disable),
+    enable = storage_enable,
+    disable = storage_disable,
     internal = M,
 }
-- 
2.24.3 (Apple Git-128)

[Tarantool-patches] [PATCH vshard 4/5] error: introduce from_string

2021-12-17T00:27:41Z

vshard.storage.call() and most of the other vshard.storage.*
functions now raise an exception STORAGE_IS_DISABLED when the
storage is disabled.

The router wants to catch it to handle in a special way. But
unfortunately,

- error(obj) in a Lua function is wrapped into LuajitError. 'obj'
  is saved into 'message' using its __tostring meta-method.

- It is not possible to create your own error type in a sane way.

These 2 facts mean that the router needs to be able to extract
the original error from LuajitError's message. In vshard errors
are serialized into json, so a valid vshard error, such as
STORAGE_IS_DISABLED, can be extracted from LuajitError's message
if it wasn't truncated due to being too long. For this particular
error it won't happen.

The patch introduces new method vshard.error.from_string() to
perform this extraction for its further usage in router.

Part of #298
---
 test/unit/error.result   | 18 ++++++++++++++++++
 test/unit/error.test.lua |  6 ++++++
 vshard/error.lua         | 19 +++++++++++++++++++
 3 files changed, 43 insertions(+)

diff --git a/test/unit/error.result b/test/unit/error.result
index bb4e0cc..258b14f 100644
--- a/test/unit/error.result
+++ b/test/unit/error.result
@@ -48,6 +48,24 @@ test_run:grep_log('default', '"reason":"reason","code":11,"type":"ShardingError"
 ---
 - '"reason":"reason","code":11,"type":"ShardingError"'
 ...
+e = lerror.vshard(lerror.code.STORAGE_IS_DISABLED, 'any reason')
+---
+...
+e = lerror.from_string(tostring(e))
+---
+...
+assert(e.code == lerror.code.STORAGE_IS_DISABLED)
+---
+- true
+...
+assert(e.type == 'ShardingError')
+---
+- true
+...
+assert(e.message == 'Storage is disabled: any reason')
+---
+- true
+...
 --
 -- Part of gh-100: check `error.vshard`.
 --
diff --git a/test/unit/error.test.lua b/test/unit/error.test.lua
index 0a51d33..5e669b6 100644
--- a/test/unit/error.test.lua
+++ b/test/unit/error.test.lua
@@ -19,6 +19,12 @@ log = require('log')
 log.info('Log error: %s', vshard_error)
 test_run:grep_log('default', '"reason":"reason","code":11,"type":"ShardingError"')
 
+e = lerror.vshard(lerror.code.STORAGE_IS_DISABLED, 'any reason')
+e = lerror.from_string(tostring(e))
+assert(e.code == lerror.code.STORAGE_IS_DISABLED)
+assert(e.type == 'ShardingError')
+assert(e.message == 'Storage is disabled: any reason')
+
 --
 -- Part of gh-100: check `error.vshard`.
 --
diff --git a/vshard/error.lua b/vshard/error.lua
index 2b97eae..96c4bdd 100644
--- a/vshard/error.lua
+++ b/vshard/error.lua
@@ -244,6 +244,24 @@ local function make_error(e)
     end
 end
 
+--
+-- Restore an error object from its string serialization.
+--
+local function from_string(str)
+    -- Error objects in VShard are stringified into json. Hence can restore also
+    -- as json. The only problem is that the json might be truncated if it was
+    -- stored in an error message of a real error object. It is not very
+    -- reliable.
+    local ok, res = pcall(json.decode, str)
+    if not ok then
+        return nil
+    end
+    if type(res) ~= 'table' or type(res.type) ~= 'string' or
+       type(res.code) ~= 'number' or type(res.message) ~= 'string' then
+        return nil
+    end
+    return make_error(res)
+end
 
 local function make_alert(code, ...)
     local format = error_message_template[code]
@@ -266,6 +284,7 @@ return {
     box = box_error,
     vshard = vshard_error,
     make = make_error,
+    from_string = from_string,
     alert = make_alert,
     timeout = make_timeout,
 }
-- 
2.24.3 (Apple Git-128)

[Tarantool-patches] [PATCH vshard 5/5] router: backoff on storage being disabled

2021-12-17T00:28:11Z

If a storage reports it is disabled, then it probably will take
some time before it can accept new requests.

This patch makes STORAGE_IS_DISABLED error cause the connection's
backoff. In line with 'access denied' and 'no such function'
errors. Because the reason for all 3 is the same - the storage is
not ready to accept requests yet.

Such requests are transparently retried now.

Closes #298

@TarantoolBot document
Title: vshard.storage.enable/disable()
`vshard.storage.disable()` makes most of the `vshard.storage`
functions throw an error. As Lua exception, not via `nil, err`
pattern.

`vshard.storage.enable()` reverts the disable.

By default the storage is enabled.

Additionally, the storage is forcefully disabled automatically
until `vshard.storage.cfg()` is finished and the instance finished
recovery (its `box.info.status` is `'running'`, for example).

Auto-disable protects from usage of vshard functions before the
storage's global state is fully created.

Manual `vshard.storage.disable()` helps to achieve the same for
user's application. For instance, a user might want to do some
preparatory work after `vshard.storage.cfg` before the application
is ready for requests. Then the flow would be:
```Lua
vshard.storage.disable()
vshard.storage.cfg(...)
-- Do your preparatory work here ...
vshard.storage.enable()
```

The routers handle the errors signaling about the storage being disabled in a
special way. They put connections to such instances into a backoff state for
some time and will try to use other replicas. For example, assume a replicaset
has replicas 'replica_1' and 'replica_2'. Assume 'replica_1' is disabled due to
any reason. If a router will try to talk to 'replica_1', it will get a special
error and will transparently retry to 'replica_2'.

When 'replica_1' is enabled again, the router will notice it too and will send
requests to it again.

It all works exclusively for read-only requests. Read-write requests can only be
sent to a master, which is one per replicaset. They are not retried.
---
 test/router/router2.result   | 88 ++++++++++++++++++++++++++++++++++++
 test/router/router2.test.lua | 35 ++++++++++++++
 vshard/replicaset.lua        | 19 +++++++-
 3 files changed, 140 insertions(+), 2 deletions(-)

diff --git a/test/router/router2.result b/test/router/router2.result
index a501dbf..ebf0b3f 100644
--- a/test/router/router2.result
+++ b/test/router/router2.result
@@ -548,6 +548,94 @@ vshard.storage.call = old_storage_call
  | ---
  | ...
 
+--
+-- Storage is disabled = backoff.
+--
+vshard.storage.disable()
+ | ---
+ | ...
+
+test_run:switch('router_1')
+ | ---
+ | - true
+ | ...
+-- Drop old backoffs.
+fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
+ | ---
+ | ...
+-- Success, but internally the request was retried.
+res, err = vshard.router.callro(1, 'echo', {100}, long_timeout)
+ | ---
+ | ...
+assert(res == 100)
+ | ---
+ | - true
+ | ...
+-- The best replica entered backoff state.
+util = require('util')
+ | ---
+ | ...
+storage_2 = vshard.router.static.replicasets[replicasets[2]]
+ | ---
+ | ...
+storage_2_a = storage_2.replicas[util.name_to_uuid.storage_2_a]
+ | ---
+ | ...
+assert(storage_2_a.backoff_ts ~= nil)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('storage_2_b')
+ | ---
+ | - true
+ | ...
+assert(echo_count == 1)
+ | ---
+ | - true
+ | ...
+echo_count = 0
+ | ---
+ | ...
+
+test_run:switch('storage_2_a')
+ | ---
+ | - true
+ | ...
+assert(echo_count == 0)
+ | ---
+ | - true
+ | ...
+vshard.storage.enable()
+ | ---
+ | ...
+
+test_run:switch('router_1')
+ | ---
+ | - true
+ | ...
+-- Drop the backoff.
+fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
+ | ---
+ | ...
+-- Now goes to the best replica - it is enabled again.
+res, err = vshard.router.callro(1, 'echo', {100}, long_timeout)
+ | ---
+ | ...
+assert(res == 100)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('storage_2_a')
+ | ---
+ | - true
+ | ...
+assert(echo_count == 1)
+ | ---
+ | - true
+ | ...
+
 _ = test_run:switch("default")
  | ---
  | ...
diff --git a/test/router/router2.test.lua b/test/router/router2.test.lua
index fb0c3b2..1c21876 100644
--- a/test/router/router2.test.lua
+++ b/test/router/router2.test.lua
@@ -216,6 +216,41 @@ test_run:switch('storage_2_a')
 assert(echo_count == 0)
 vshard.storage.call = old_storage_call
 
+--
+-- Storage is disabled = backoff.
+--
+vshard.storage.disable()
+
+test_run:switch('router_1')
+-- Drop old backoffs.
+fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
+-- Success, but internally the request was retried.
+res, err = vshard.router.callro(1, 'echo', {100}, long_timeout)
+assert(res == 100)
+-- The best replica entered backoff state.
+util = require('util')
+storage_2 = vshard.router.static.replicasets[replicasets[2]]
+storage_2_a = storage_2.replicas[util.name_to_uuid.storage_2_a]
+assert(storage_2_a.backoff_ts ~= nil)
+
+test_run:switch('storage_2_b')
+assert(echo_count == 1)
+echo_count = 0
+
+test_run:switch('storage_2_a')
+assert(echo_count == 0)
+vshard.storage.enable()
+
+test_run:switch('router_1')
+-- Drop the backoff.
+fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
+-- Now goes to the best replica - it is enabled again.
+res, err = vshard.router.callro(1, 'echo', {100}, long_timeout)
+assert(res == 100)
+
+test_run:switch('storage_2_a')
+assert(echo_count == 1)
+
 _ = test_run:switch("default")
 _ = test_run:cmd("stop server router_1")
 _ = test_run:cmd("cleanup server router_1")
diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
index 573a555..623d24d 100644
--- a/vshard/replicaset.lua
+++ b/vshard/replicaset.lua
@@ -347,9 +347,21 @@ local function replica_call(replica, func, args, opts)
         if opts.timeout >= replica.net_timeout then
             replica_on_failed_request(replica)
         end
+        local err = storage_status
+        -- VShard functions can throw exceptions using error() function. When
+        -- it reaches the network layer, it is wrapped into LuajitError. Try to
+        -- extract the original error if this is the case. Not always is
+        -- possible - the string representation could be truncated.
+        --
+        -- In old Tarantool versions LuajitError turned into ClientError on the
+        -- client. Check both types.
+        if func:startswith('vshard.') and (err.type == 'LuajitError' or
+           err.type == 'ClientError') then
+            err = lerror.from_string(err.message) or err
+        end
         log.error("Exception during calling '%s' on '%s': %s", func, replica,
-                  storage_status)
-        return false, nil, lerror.make(storage_status)
+                  err)
+        return false, nil, lerror.make(err)
     else
         replica_on_success_request(replica)
     end
@@ -472,6 +484,9 @@ local function can_backoff_after_error(e, func)
             return e.message:startswith("Procedure 'vshard.")
         end
     end
+    if e.type == 'ShardingError' then
+        return e.code == vshard.error.code.STORAGE_IS_DISABLED
+    end
     return false
 end
 
-- 
2.24.3 (Apple Git-128)

Re: [Tarantool-patches] [PATCH vshard 1/5] router: backoff on some box errors

2021-12-17T11:09:30Z

Thanks for your patch! LGTM. But I think tests should be extended a bit.

I see that tests cover AccessDenied error but I don't see that there is 
simple error()/assert() error check.

On 17.12.2021 03:25, Vladislav Shpilevoy wrote:
> Storage configuration takes time. Firstly, box.cfg{} which can be
> called before vshard.storage.cfg(). Secondly, vshard.storage.cfg()
> is not immediate as well.
>
> During that time accessing the storage is not safe. Attempts to
> call vshard.storage functions can return weird errors, or the
> functions can even be not available yet. They need to be created
> in _func and get access rights in _priv before becoming public.
>
> Routers used to forward errors like 'access denied error' and
> 'no such function' to users as is, treating them as critical.
>
> Not only it was confusing for users, but also could make an entire
> replicaset not available for requests - the connection to it is
> alive, so router would send all requests into it and they all
> would fail. Even if the replicaset has another instance which is
> perfectly functional.
>
> This patch handles such specific errors inside of the router. The
> faulty replicas are put into a 'backoff' state. They remain in it
> for some fixed time (5 seconds for now), new requests won't be
> sent to them until the time passes. Router will use other
> instances.
>
> Backoff is activated only for vshard.* functions. If the errors
> are about some user's function, it is considered a regular error.
> Because the router can't tell whether any side effects were done
> on the remote instance before the error happened. Hence can't
> retry to another node.
>
> For example, if access was denied to 'vshard.storage.call', then
> it is backoff. If inside of vshard.storage.call the access was
> denied to 'user_test_func', then it is not backoff.
>
> It all works for read-only requests exclusively of course. Because
> for read-write requests the instance is just one - master. Router
> does not have other options so backoff here wouldn't help.
>
> Part of #298
> ---
>   test/router/router2.result   | 237 +++++++++++++++++++++++++++++++++++
>   test/router/router2.test.lua |  97 ++++++++++++++
>   vshard/consts.lua            |   1 +
>   vshard/error.lua             |   8 +-
>   vshard/replicaset.lua        | 100 +++++++++++++--
>   vshard/router/init.lua       |   3 +-
>   6 files changed, 432 insertions(+), 14 deletions(-)

Re: [Tarantool-patches] [PATCH vshard 2/5] storage: auto enable/disable

2021-12-17T11:10:01Z

Thanks for your patch. See my two nits below.

On 17.12.2021 03:25, Vladislav Shpilevoy wrote:
> +--------------------------------------------------------------------------------
> +-- Public API protection
> +--------------------------------------------------------------------------------
> +
> +--
> +-- Arguments are listed explicitly instead of '...' because the latter does not
> +-- jit.
> +--
> +local function storage_api_call_safe(func, arg1, arg2, arg3, arg4)
> +    return func(arg1, arg2, arg3, arg4)
> +end
> +
> +--
> +-- Unsafe proxy is loaded with protections. But it is used rarely and only in
> +-- the beginning of instance's lifetime.
> +--
> +local function storage_api_call_unsafe(func, arg1, arg2, arg3, arg4)
> +    -- box.info is quite expensive. Avoid calling it again when the instance
> +    -- is finally loaded.
> +    if not M.is_loaded then
> +        -- box.info raises an error until box.cfg() is started.
> +        local ok, status = pcall(function()
> +            return box.info.status
> +        end)

nit: It could be changed to type(box.cfg) == 'function'. I'd call it 
"common" pattern to check that box is not yet configured.

> +        if not ok then
> +            local msg = 'box seem not to be configured'

nit: seem -> seems?


> +            return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg))
> +        end
> +        -- 'Orphan' is allowed because even if a replica is an orphan, it still
> +        -- could be up to date. Just not all other replicas are connected.
> +        if status ~= 'running' and status ~= 'orphan' then
> +            local msg = ('instance status is "%s"'):format(status)
> +            return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg))
> +        end
> +        M.is_loaded = true
> +    end
> +    if not M.is_configured then
> +        local msg = 'storage is not configured'
> +        return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg))
> +    end
> +    M.api_call_cache = storage_api_call_safe
> +    return func(arg1, arg2, arg3, arg4)
> +end
> +
> +M.api_call_cache = storage_api_call_unsafe
> +
> +local function storage_make_api(func)
> +    return function(arg1, arg2, arg3, arg4)
> +        return M.api_call_cache(func, arg1, arg2, arg3, arg4)
> +    end
> +end
> +
>   --------------------------------------------------------------------------------
>   -- Module definition
>   --------------------------------------------------------------------------------
> @@ -3021,44 +3091,67 @@ M.bucket_are_all_rw = bucket_are_all_rw_public
>   M.bucket_generation_wait = bucket_generation_wait
>   lregistry.storage = M
>   
> +--
> +-- Not all methods are public here. Private methods should not be exposed if
> +-- possible. At least not without notable difference in naming.
> +--
>   return {
> -    sync = sync,
> -    bucket_force_create = bucket_force_create,
> -    bucket_force_drop = bucket_force_drop,
> -    bucket_collect = bucket_collect,
> -    bucket_recv = bucket_recv,
> -    bucket_send = bucket_send,
> -    bucket_stat = bucket_stat,
> -    bucket_pin = bucket_pin,
> -    bucket_unpin = bucket_unpin,
> -    bucket_ref = bucket_ref,
> -    bucket_unref = bucket_unref,
> -    bucket_refro = bucket_refro,
> -    bucket_refrw = bucket_refrw,
> -    bucket_unrefro = bucket_unrefro,
> -    bucket_unrefrw = bucket_unrefrw,
> -    bucket_delete_garbage = bucket_delete_garbage,
> -    garbage_collector_wakeup = garbage_collector_wakeup,
> -    rebalancer_wakeup = rebalancer_wakeup,
> -    rebalancer_apply_routes = rebalancer_apply_routes,
> -    rebalancer_disable = rebalancer_disable,
> -    rebalancer_enable = rebalancer_enable,
> -    is_locked = is_this_replicaset_locked,
> -    rebalancing_is_in_progress = rebalancing_is_in_progress,
> -    recovery_wakeup = recovery_wakeup,
> -    call = storage_call,
> -    _call = service_call,
> +    --
> +    -- Bucket methods.
> +    --
> +    bucket_force_create = storage_make_api(bucket_force_create),
> +    bucket_force_drop = storage_make_api(bucket_force_drop),
> +    bucket_collect = storage_make_api(bucket_collect),
> +    bucket_recv = storage_make_api(bucket_recv),
> +    bucket_send = storage_make_api(bucket_send),
> +    bucket_stat = storage_make_api(bucket_stat),
> +    bucket_pin = storage_make_api(bucket_pin),
> +    bucket_unpin = storage_make_api(bucket_unpin),
> +    bucket_ref = storage_make_api(bucket_ref),
> +    bucket_unref = storage_make_api(bucket_unref),
> +    bucket_refro = storage_make_api(bucket_refro),
> +    bucket_refrw = storage_make_api(bucket_refrw),
> +    bucket_unrefro = storage_make_api(bucket_unrefro),
> +    bucket_unrefrw = storage_make_api(bucket_unrefrw),
> +    bucket_delete_garbage = storage_make_api(bucket_delete_garbage),
> +    _bucket_delete_garbage = bucket_delete_garbage,
> +    buckets_info = storage_make_api(storage_buckets_info),
> +    buckets_count = storage_make_api(bucket_count_public),
> +    buckets_discovery = storage_make_api(buckets_discovery),
> +    --
> +    -- Garbage collector.
> +    --
> +    garbage_collector_wakeup = storage_make_api(garbage_collector_wakeup),
> +    --
> +    -- Rebalancer.
> +    --
> +    rebalancer_wakeup = storage_make_api(rebalancer_wakeup),
> +    rebalancer_apply_routes = storage_make_api(rebalancer_apply_routes),
> +    rebalancer_disable = storage_make_api(rebalancer_disable),
> +    rebalancer_enable = storage_make_api(rebalancer_enable),
> +    rebalancing_is_in_progress = storage_make_api(rebalancing_is_in_progress),
> +    rebalancer_request_state = storage_make_api(rebalancer_request_state),
> +    _rebalancer_request_state = rebalancer_request_state,
> +    --
> +    -- Recovery.
> +    --
> +    recovery_wakeup = storage_make_api(recovery_wakeup),
> +    --
> +    -- Instance info.
> +    --
> +    is_locked = storage_make_api(is_this_replicaset_locked),
> +    info = storage_make_api(storage_info),
> +    sharded_spaces = storage_make_api(storage_sharded_spaces),
> +    _sharded_spaces = storage_sharded_spaces,
> +    module_version = function() return M.module_version end,
> +    --
> +    -- Miscellaneous.
> +    --
> +    call = storage_make_api(storage_call),
> +    _call = storage_make_api(service_call),
> +    sync = storage_make_api(sync),
>       cfg = function(cfg, uuid) return storage_cfg(cfg, uuid, false) end,
> -    info = storage_info,
> -    buckets_info = storage_buckets_info,
> -    buckets_count = bucket_count_public,
> -    buckets_discovery = buckets_discovery,
> -    rebalancer_request_state = rebalancer_request_state,
> +    on_master_enable = storage_make_api(on_master_enable),
> +    on_master_disable = storage_make_api(on_master_disable),
>       internal = M,
> -    on_master_enable = on_master_enable,
> -    on_master_disable = on_master_disable,
> -    sharded_spaces = function()
> -        return table.deepcopy(find_sharded_spaces())
> -    end,
> -    module_version = function() return M.module_version end,
>   }

Re: [Tarantool-patches] [PATCH vshard 3/5] storage: manual enable/disable

2021-12-17T11:10:31Z

Thanks for your patch. LGTM.

On 17.12.2021 03:25, Vladislav Shpilevoy wrote:
> The patch introduces functions vshard.storage.enable()/disable().
>

Re: [Tarantool-patches] [PATCH vshard 4/5] error: introduce from_string

2021-12-17T11:11:02Z

Thanks for your patch. LGTM.

However we should probably use more neutral name that doesn't touch 
implementation details.

Maybe in future we will use new native custom errors that are available 
in new Tarantool versions.

So probably it's better to call this method e.g. deserialize. Feel free 
to ignore.

On 17.12.2021 03:25, Vladislav Shpilevoy wrote:
> vshard.storage.call() and most of the other vshard.storage.*
> functions now raise an exception STORAGE_IS_DISABLED when the
> storage is disabled.

Re: [Tarantool-patches] [PATCH vshard 5/5] router: backoff on storage being disabled

2021-12-17T11:11:31Z

Thanks for your patch. LGTM.

On 17.12.2021 03:25, Vladislav Shpilevoy wrote:
> If a storage reports it is disabled, then it probably will take
> some time before it can accept new requests.
>
> This patch makes STORAGE_IS_DISABLED error cause the connection's
> backoff. In line with 'access denied' and 'no such function'
> errors. Because the reason for all 3 is the same - the storage is
> not ready to accept requests yet.
>
> Such requests are transparently retried now.
>
> Closes #298

Re: [Tarantool-patches] [PATCH vshard 1/5] router: backoff on some box errors

2021-12-17T23:10:22Z

Hi! Thanks for the review!

On 17.12.2021 12:09, Oleg Babin wrote:
> Thanks for your patch! LGTM. But I think tests should be extended a bit.
> 
> I see that tests cover AccessDenied error but I don't see that there is simple error()/assert() error check.

I hope you meant this:

====================
diff --git a/test/router/router2.result b/test/router/router2.result
index a501dbf..85d077a 100644
--- a/test/router/router2.result
+++ b/test/router/router2.result
@@ -548,6 +548,28 @@ vshard.storage.call = old_storage_call
  | ---
  | ...
 
+--
+-- Fails without backoff for other errors.
+--
+test_run:switch('router_1')
+ | ---
+ | - true
+ | ...
+fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
+ | ---
+ | ...
+rs = vshard.router.route(1)
+ | ---
+ | ...
+ok, err = rs:callro('vshard.storage.call', {1, 'badmode', 'echo', {100}},       \
+                    long_timeout)
+ | ---
+ | ...
+assert(not ok and err.message:match('Unknown mode') ~= nil)
+ | ---
+ | - true
+ | ...
+
 _ = test_run:switch("default")
  | ---
  | ...
diff --git a/test/router/router2.test.lua b/test/router/router2.test.lua
index fb0c3b2..a2171fa 100644
--- a/test/router/router2.test.lua
+++ b/test/router/router2.test.lua
@@ -216,6 +216,16 @@ test_run:switch('storage_2_a')
 assert(echo_count == 0)
 vshard.storage.call = old_storage_call
 
+--
+-- Fails without backoff for other errors.
+--
+test_run:switch('router_1')
+fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
+rs = vshard.router.route(1)
+ok, err = rs:callro('vshard.storage.call', {1, 'badmode', 'echo', {100}},       \
+                    long_timeout)
+assert(not ok and err.message:match('Unknown mode') ~= nil)
+
 _ = test_run:switch("default")
 _ = test_run:cmd("stop server router_1")
 _ = test_run:cmd("cleanup server router_1")

Re: [Tarantool-patches] [PATCH vshard 2/5] storage: auto enable/disable

2021-12-17T23:10:51Z

Thanks for the review!

On 17.12.2021 12:09, Oleg Babin via Tarantool-patches wrote:
> Thanks for your patch. See my two nits below.
> 
> On 17.12.2021 03:25, Vladislav Shpilevoy wrote:
>> +--------------------------------------------------------------------------------
>> +-- Public API protection
>> +--------------------------------------------------------------------------------
>> +
>> +--
>> +-- Arguments are listed explicitly instead of '...' because the latter does not
>> +-- jit.
>> +--
>> +local function storage_api_call_safe(func, arg1, arg2, arg3, arg4)
>> +    return func(arg1, arg2, arg3, arg4)
>> +end
>> +
>> +--
>> +-- Unsafe proxy is loaded with protections. But it is used rarely and only in
>> +-- the beginning of instance's lifetime.
>> +--
>> +local function storage_api_call_unsafe(func, arg1, arg2, arg3, arg4)
>> +    -- box.info is quite expensive. Avoid calling it again when the instance
>> +    -- is finally loaded.
>> +    if not M.is_loaded then
>> +        -- box.info raises an error until box.cfg() is started.
>> +        local ok, status = pcall(function()
>> +            return box.info.status
>> +        end)
> 
> nit: It could be changed to type(box.cfg) == 'function'. I'd call it "common" pattern to check that box is not yet configured.
> 
>> +        if not ok then
>> +            local msg = 'box seem not to be configured'
> 
> nit: seem -> seems?

Thanks, all fixed:

====================
diff --git a/test/storage/storage.result b/test/storage/storage.result
index e83b34f..790ba11 100644
--- a/test/storage/storage.result
+++ b/test/storage/storage.result
@@ -967,15 +967,15 @@ _ = test_run:switch('storage_1_a')
 -- Leaving box.cfg() not called won't work because at 1.10 test-run somewhy
 -- raises an error when try to start an instance without box.cfg(). It can only
 -- be emulated.
-old_info = box.info
+old_cfg = box.cfg
 ---
 ...
-box.info = setmetatable({}, {__index = function() error('not configured') end})
+assert(type(old_cfg) == 'table')
 ---
+- true
 ...
-assert(not pcall(function() return box.info.status end))
+box.cfg = function(...) return old_cfg(...) end
 ---
-- true
 ...
 ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100})
 ---
@@ -984,11 +984,11 @@ assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED)
 ---
 - true
 ...
-assert(err.message:match('box seem not to be configured') ~= nil)
+assert(err.message:match('box seems not to be configured') ~= nil)
 ---
 - true
 ...
-box.info = old_info
+box.cfg = old_cfg
 ---
 ...
 -- Disabled until box is loaded.
diff --git a/test/storage/storage.test.lua b/test/storage/storage.test.lua
index ff39f2f..8695636 100644
--- a/test/storage/storage.test.lua
+++ b/test/storage/storage.test.lua
@@ -309,14 +309,14 @@ _ = test_run:switch('storage_1_a')
 -- Leaving box.cfg() not called won't work because at 1.10 test-run somewhy
 -- raises an error when try to start an instance without box.cfg(). It can only
 -- be emulated.
-old_info = box.info
-box.info = setmetatable({}, {__index = function() error('not configured') end})
-assert(not pcall(function() return box.info.status end))
+old_cfg = box.cfg
+assert(type(old_cfg) == 'table')
+box.cfg = function(...) return old_cfg(...) end
 
 ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100})
 assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED)
-assert(err.message:match('box seem not to be configured') ~= nil)
-box.info = old_info
+assert(err.message:match('box seems not to be configured') ~= nil)
+box.cfg = old_cfg
 
 -- Disabled until box is loaded.
 vshard.storage.internal.errinj.ERRINJ_CFG_DELAY = true
diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
index d3c4e2a..77da663 100644
--- a/vshard/storage/init.lua
+++ b/vshard/storage/init.lua
@@ -2953,14 +2953,11 @@ local function storage_api_call_unsafe(func, arg1, arg2, arg3, arg4)
     -- box.info is quite expensive. Avoid calling it again when the instance
     -- is finally loaded.
     if not M.is_loaded then
-        -- box.info raises an error until box.cfg() is started.
-        local ok, status = pcall(function()
-            return box.info.status
-        end)
-        if not ok then
-            local msg = 'box seem not to be configured'
+        if type(box.cfg) == 'function' then
+            local msg = 'box seems not to be configured'
             return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg))
         end
+        local status = box.info.status
         -- 'Orphan' is allowed because even if a replica is an orphan, it still
         -- could be up to date. Just not all other replicas are connected.
         if status ~= 'running' and status ~= 'orphan' then

Re: [Tarantool-patches] [PATCH vshard 4/5] error: introduce from_string

2021-12-17T23:11:24Z

Thanks for the review!

On 17.12.2021 12:09, Oleg Babin wrote:
> Thanks for your patch. LGTM.
> 
> However we should probably use more neutral name that doesn't touch implementation details.
> 
> Maybe in future we will use new native custom errors that are available in new Tarantool versions.
> 
> So probably it's better to call this method e.g. deserialize. Feel free to ignore.

I used from_string() to be consistent with tostring(). Deserialize() would
assume there is serialize(), but there isn't. Also it won't work if the error is
packed into MessagePack. It works for strings only. Hence mentioning the exact
type 'string' seems valid.

As for native errors - I agree. The existence of from_string() is a huge
crutch. I was lucky that the errors were encoded as json in vshard from
the beginning. So I have at least sometimes working way of getting the original
error from an exception.

We have 'custom' error types in latest Tarantools for about a year now,
but it is still hardly usable, because you can't attach any payload to it. Which
is often very important even if it has just 1-2 fields. I wish the product
managers would see usefulness of good errors API someday.

Re: [Tarantool-patches] [PATCH vshard 1/5] router: backoff on some box errors

2021-12-18T13:58:00Z

Thanks for your fixes. Yes, it is that I meant. LGTM.

On 18.12.2021 02:10, Vladislav Shpilevoy wrote:
> Hi! Thanks for the review!
>
> On 17.12.2021 12:09, Oleg Babin wrote:
>> Thanks for your patch! LGTM. But I think tests should be extended a bit.
>>
>> I see that tests cover AccessDenied error but I don't see that there is simple error()/assert() error check.
> I hope you meant this:
>
> ====================
> diff --git a/test/router/router2.result b/test/router/router2.result
> index a501dbf..85d077a 100644
> --- a/test/router/router2.result
> +++ b/test/router/router2.result
> @@ -548,6 +548,28 @@ vshard.storage.call = old_storage_call
>    | ---
>    | ...
>   
> +--
> +-- Fails without backoff for other errors.
> +--
> +test_run:switch('router_1')
> + | ---
> + | - true
> + | ...
> +fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
> + | ---
> + | ...
> +rs = vshard.router.route(1)
> + | ---
> + | ...
> +ok, err = rs:callro('vshard.storage.call', {1, 'badmode', 'echo', {100}},       \
> +                    long_timeout)
> + | ---
> + | ...
> +assert(not ok and err.message:match('Unknown mode') ~= nil)
> + | ---
> + | - true
> + | ...
> +
>   _ = test_run:switch("default")
>    | ---
>    | ...
> diff --git a/test/router/router2.test.lua b/test/router/router2.test.lua
> index fb0c3b2..a2171fa 100644
> --- a/test/router/router2.test.lua
> +++ b/test/router/router2.test.lua
> @@ -216,6 +216,16 @@ test_run:switch('storage_2_a')
>   assert(echo_count == 0)
>   vshard.storage.call = old_storage_call
>   
> +--
> +-- Fails without backoff for other errors.
> +--
> +test_run:switch('router_1')
> +fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
> +rs = vshard.router.route(1)
> +ok, err = rs:callro('vshard.storage.call', {1, 'badmode', 'echo', {100}},       \
> +                    long_timeout)
> +assert(not ok and err.message:match('Unknown mode') ~= nil)
> +
>   _ = test_run:switch("default")
>   _ = test_run:cmd("stop server router_1")
>   _ = test_run:cmd("cleanup server router_1")

Re: [Tarantool-patches] [PATCH vshard 2/5] storage: auto enable/disable

2021-12-18T13:58:35Z

Thanks for your fixes. LGTM.

On 18.12.2021 02:10, Vladislav Shpilevoy wrote:
> Thanks for the review!
>
> On 17.12.2021 12:09, Oleg Babin via Tarantool-patches wrote:
>> Thanks for your patch. See my two nits below.
>>
>> On 17.12.2021 03:25, Vladislav Shpilevoy wrote:
>>> +--------------------------------------------------------------------------------
>>> +-- Public API protection
>>> +--------------------------------------------------------------------------------
>>> +
>>> +--
>>> +-- Arguments are listed explicitly instead of '...' because the latter does not
>>> +-- jit.
>>> +--
>>> +local function storage_api_call_safe(func, arg1, arg2, arg3, arg4)
>>> +    return func(arg1, arg2, arg3, arg4)
>>> +end
>>> +
>>> +--
>>> +-- Unsafe proxy is loaded with protections. But it is used rarely and only in
>>> +-- the beginning of instance's lifetime.
>>> +--
>>> +local function storage_api_call_unsafe(func, arg1, arg2, arg3, arg4)
>>> +    -- box.info is quite expensive. Avoid calling it again when the instance
>>> +    -- is finally loaded.
>>> +    if not M.is_loaded then
>>> +        -- box.info raises an error until box.cfg() is started.
>>> +        local ok, status = pcall(function()
>>> +            return box.info.status
>>> +        end)
>> nit: It could be changed to type(box.cfg) == 'function'. I'd call it "common" pattern to check that box is not yet configured.
>>
>>> +        if not ok then
>>> +            local msg = 'box seem not to be configured'
>> nit: seem -> seems?
> Thanks, all fixed:
>
> ====================
> diff --git a/test/storage/storage.result b/test/storage/storage.result
> index e83b34f..790ba11 100644
> --- a/test/storage/storage.result
> +++ b/test/storage/storage.result
> @@ -967,15 +967,15 @@ _ = test_run:switch('storage_1_a')
>   -- Leaving box.cfg() not called won't work because at 1.10 test-run somewhy
>   -- raises an error when try to start an instance without box.cfg(). It can only
>   -- be emulated.
> -old_info = box.info
> +old_cfg = box.cfg
>   ---
>   ...
> -box.info = setmetatable({}, {__index = function() error('not configured') end})
> +assert(type(old_cfg) == 'table')
>   ---
> +- true
>   ...
> -assert(not pcall(function() return box.info.status end))
> +box.cfg = function(...) return old_cfg(...) end
>   ---
> -- true
>   ...
>   ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100})
>   ---
> @@ -984,11 +984,11 @@ assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED)
>   ---
>   - true
>   ...
> -assert(err.message:match('box seem not to be configured') ~= nil)
> +assert(err.message:match('box seems not to be configured') ~= nil)
>   ---
>   - true
>   ...
> -box.info = old_info
> +box.cfg = old_cfg
>   ---
>   ...
>   -- Disabled until box is loaded.
> diff --git a/test/storage/storage.test.lua b/test/storage/storage.test.lua
> index ff39f2f..8695636 100644
> --- a/test/storage/storage.test.lua
> +++ b/test/storage/storage.test.lua
> @@ -309,14 +309,14 @@ _ = test_run:switch('storage_1_a')
>   -- Leaving box.cfg() not called won't work because at 1.10 test-run somewhy
>   -- raises an error when try to start an instance without box.cfg(). It can only
>   -- be emulated.
> -old_info = box.info
> -box.info = setmetatable({}, {__index = function() error('not configured') end})
> -assert(not pcall(function() return box.info.status end))
> +old_cfg = box.cfg
> +assert(type(old_cfg) == 'table')
> +box.cfg = function(...) return old_cfg(...) end
>   
>   ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100})
>   assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED)
> -assert(err.message:match('box seem not to be configured') ~= nil)
> -box.info = old_info
> +assert(err.message:match('box seems not to be configured') ~= nil)
> +box.cfg = old_cfg
>   
>   -- Disabled until box is loaded.
>   vshard.storage.internal.errinj.ERRINJ_CFG_DELAY = true
> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
> index d3c4e2a..77da663 100644
> --- a/vshard/storage/init.lua
> +++ b/vshard/storage/init.lua
> @@ -2953,14 +2953,11 @@ local function storage_api_call_unsafe(func, arg1, arg2, arg3, arg4)
>       -- box.info is quite expensive. Avoid calling it again when the instance
>       -- is finally loaded.
>       if not M.is_loaded then
> -        -- box.info raises an error until box.cfg() is started.
> -        local ok, status = pcall(function()
> -            return box.info.status
> -        end)
> -        if not ok then
> -            local msg = 'box seem not to be configured'
> +        if type(box.cfg) == 'function' then
> +            local msg = 'box seems not to be configured'
>               return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg))
>           end
> +        local status = box.info.status
>           -- 'Orphan' is allowed because even if a replica is an orphan, it still
>           -- could be up to date. Just not all other replicas are connected.
>           if status ~= 'running' and status ~= 'orphan' then

Re: [Tarantool-patches] [PATCH vshard 0/5] Router backoff, storage disable

2021-12-18T13:59:06Z

Thanks for your patchset. LGTM.

On 17.12.2021 03:25, Vladislav Shpilevoy wrote:
> The patchset introduces vshard.storage auto and manual enable/disable. Disabled
> storage blocks most of public API calls. The state is handled by the router in
> a special way so it transparently retries requests failed due to the storage
> being disabled.
>
> Branch:http://github.com/tarantool/vshard/tree/gerold103/gh-298-replica-backoff-part-2
> Issue:https://github.com/tarantool/vshard/issues/298
>
> Vladislav Shpilevoy (5):
>    router: backoff on some box errors
>    storage: auto enable/disable
>    storage: manual enable/disable
>    error: introduce from_string
>    router: backoff on storage being disabled
>
>   example/localcfg.lua          |   1 -
>   test/router/router2.result    | 325 ++++++++++++++++++++++++++++++++++
>   test/router/router2.test.lua  | 132 ++++++++++++++
>   test/storage/storage.result   | 128 +++++++++++++
>   test/storage/storage.test.lua |  55 ++++++
>   test/unit/error.result        |  18 ++
>   test/unit/error.test.lua      |   6 +
>   test/unit/garbage.result      |  17 +-
>   test/unit/garbage.test.lua    |  15 +-
>   test/unit/rebalancer.result   |   2 +-
>   test/unit/rebalancer.test.lua |   2 +-
>   vshard/consts.lua             |   1 +
>   vshard/error.lua              |  32 +++-
>   vshard/replicaset.lua         | 119 +++++++++++--
>   vshard/router/init.lua        |   3 +-
>   vshard/storage/init.lua       | 207 +++++++++++++++++-----
>   16 files changed, 987 insertions(+), 76 deletions(-)
>

Re: [Tarantool-patches] [PATCH vshard 0/5] Router backoff, storage disable

2021-12-20T23:52:04Z

Pushed to master.