The patchset introduces vshard.storage auto and manual enable/disable. Disabled storage blocks most of public API calls. The state is handled by the router in a special way so it transparently retries requests failed due to the storage being disabled. Branch: http://github.com/tarantool/vshard/tree/gerold103/gh-298-replica-backoff-part-2 Issue: https://github.com/tarantool/vshard/issues/298 Vladislav Shpilevoy (5): router: backoff on some box errors storage: auto enable/disable storage: manual enable/disable error: introduce from_string router: backoff on storage being disabled example/localcfg.lua | 1 - test/router/router2.result | 325 ++++++++++++++++++++++++++++++++++ test/router/router2.test.lua | 132 ++++++++++++++ test/storage/storage.result | 128 +++++++++++++ test/storage/storage.test.lua | 55 ++++++ test/unit/error.result | 18 ++ test/unit/error.test.lua | 6 + test/unit/garbage.result | 17 +- test/unit/garbage.test.lua | 15 +- test/unit/rebalancer.result | 2 +- test/unit/rebalancer.test.lua | 2 +- vshard/consts.lua | 1 + vshard/error.lua | 32 +++- vshard/replicaset.lua | 119 +++++++++++-- vshard/router/init.lua | 3 +- vshard/storage/init.lua | 207 +++++++++++++++++----- 16 files changed, 987 insertions(+), 76 deletions(-) -- 2.24.3 (Apple Git-128)
Storage configuration takes time. Firstly, box.cfg{} which can be called before vshard.storage.cfg(). Secondly, vshard.storage.cfg() is not immediate as well. During that time accessing the storage is not safe. Attempts to call vshard.storage functions can return weird errors, or the functions can even be not available yet. They need to be created in _func and get access rights in _priv before becoming public. Routers used to forward errors like 'access denied error' and 'no such function' to users as is, treating them as critical. Not only it was confusing for users, but also could make an entire replicaset not available for requests - the connection to it is alive, so router would send all requests into it and they all would fail. Even if the replicaset has another instance which is perfectly functional. This patch handles such specific errors inside of the router. The faulty replicas are put into a 'backoff' state. They remain in it for some fixed time (5 seconds for now), new requests won't be sent to them until the time passes. Router will use other instances. Backoff is activated only for vshard.* functions. If the errors are about some user's function, it is considered a regular error. Because the router can't tell whether any side effects were done on the remote instance before the error happened. Hence can't retry to another node. For example, if access was denied to 'vshard.storage.call', then it is backoff. If inside of vshard.storage.call the access was denied to 'user_test_func', then it is not backoff. It all works for read-only requests exclusively of course. Because for read-write requests the instance is just one - master. Router does not have other options so backoff here wouldn't help. Part of #298 --- test/router/router2.result | 237 +++++++++++++++++++++++++++++++++++ test/router/router2.test.lua | 97 ++++++++++++++ vshard/consts.lua | 1 + vshard/error.lua | 8 +- vshard/replicaset.lua | 100 +++++++++++++-- vshard/router/init.lua | 3 +- 6 files changed, 432 insertions(+), 14 deletions(-) diff --git a/test/router/router2.result b/test/router/router2.result index 492421b..a501dbf 100644 --- a/test/router/router2.result +++ b/test/router/router2.result @@ -311,6 +311,243 @@ vshard.router.info().bucket | available_rw: 3000 | ... +-- +-- gh-298: backoff replicas when the storage raises errors meaning its +-- configuration is not finished or even didn't start. Immediately failover to +-- other instances then. +-- +test_run:switch('storage_2_b') + | --- + | - true + | ... +-- Turn off replication so as _func manipulations on the master wouldn't reach +-- the replica. +old_replication = box.cfg.replication + | --- + | ... +box.cfg{replication = {}} + | --- + | ... + +test_run:switch('storage_2_a') + | --- + | - true + | ... +box.schema.user.revoke('storage', 'execute', 'function', 'vshard.storage.call') + | --- + | ... + +test_run:switch('router_1') + | --- + | - true + | ... +vshard.consts.REPLICA_BACKOFF_INTERVAL = 0.1 + | --- + | ... + +-- Indeed fails when called directly via netbox. +conn = vshard.router.route(1).master.conn + | --- + | ... +ok, err = pcall(conn.call, conn, 'vshard.storage.call', \ + {1, 'read', 'echo', {1}}) + | --- + | ... +assert(not ok and err.code == box.error.ACCESS_DENIED) + | --- + | - true + | ... + +-- Works when called via vshard - it goes to another replica transparently. +long_timeout = {timeout = 1000000} + | --- + | ... +res = vshard.router.callro(1, 'echo', {100}, long_timeout) + | --- + | ... +assert(res == 100) + | --- + | - true + | ... + +-- +-- When all replicas are in backoff due to lack of access, raise an error. +-- +test_run:switch('storage_2_b') + | --- + | - true + | ... +assert(echo_count == 1) + | --- + | - true + | ... +echo_count = 0 + | --- + | ... +-- Restore the replication so the replica gets the _func change from master. +box.cfg{replication = old_replication} + | --- + | ... +test_run:wait_vclock('storage_2_b', test_run:get_vclock('storage_1_a')) + | --- + | ... + +test_run:switch('router_1') + | --- + | - true + | ... +ok, err = vshard.router.callro(1, 'echo', {100}, long_timeout) + | --- + | ... +assert(not ok and err.code == vshard.error.code.REPLICASET_IN_BACKOFF) + | --- + | - true + | ... +assert(err.error.code == box.error.ACCESS_DENIED) + | --- + | - true + | ... + +test_run:switch('storage_2_a') + | --- + | - true + | ... +assert(echo_count == 0) + | --- + | - true + | ... +box.schema.user.grant('storage', 'execute', 'function', 'vshard.storage.call') + | --- + | ... +test_run:wait_vclock('storage_2_b', test_run:get_vclock('storage_1_a')) + | --- + | ... + +-- +-- No vshard function = backoff. +-- +test_run:switch('router_1') + | --- + | - true + | ... +-- Drop all backoffs to check all works fine now. +fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL) + | --- + | ... +res = vshard.router.callrw(1, 'echo', {100}, long_timeout) + | --- + | ... +assert(res == 100) + | --- + | - true + | ... + +test_run:switch('storage_2_a') + | --- + | - true + | ... +assert(echo_count == 1) + | --- + | - true + | ... +echo_count = 0 + | --- + | ... +old_storage_call = vshard.storage.call + | --- + | ... +vshard.storage.call = nil + | --- + | ... + +-- Indeed fails when called directly via netbox. +test_run:switch('router_1') + | --- + | - true + | ... +conn = vshard.router.route(1).master.conn + | --- + | ... +ok, err = pcall(conn.call, conn, 'vshard.storage.call', \ + {1, 'read', 'echo', {1}}) + | --- + | ... +assert(not ok and err.code == box.error.NO_SUCH_PROC) + | --- + | - true + | ... + +-- Works when called via vshard - it goes to another replica. +res = vshard.router.callro(1, 'echo', {100}, long_timeout) + | --- + | ... +assert(res == 100) + | --- + | - true + | ... + +-- +-- When all replicas are in backoff due to not having the function, raise +-- an error. +-- +test_run:switch('storage_2_b') + | --- + | - true + | ... +assert(echo_count == 1) + | --- + | - true + | ... +echo_count = 0 + | --- + | ... +old_storage_call = vshard.storage.call + | --- + | ... +vshard.storage.call = nil + | --- + | ... + +test_run:switch('router_1') + | --- + | - true + | ... +ok, err = vshard.router.callro(1, 'echo', {100}, long_timeout) + | --- + | ... +assert(not ok and err.code == vshard.error.code.REPLICASET_IN_BACKOFF) + | --- + | - true + | ... +assert(err.error.code == box.error.NO_SUCH_PROC) + | --- + | - true + | ... + +test_run:switch('storage_2_b') + | --- + | - true + | ... +assert(echo_count == 0) + | --- + | - true + | ... +vshard.storage.call = old_storage_call + | --- + | ... + +test_run:switch('storage_2_a') + | --- + | - true + | ... +assert(echo_count == 0) + | --- + | - true + | ... +vshard.storage.call = old_storage_call + | --- + | ... + _ = test_run:switch("default") | --- | ... diff --git a/test/router/router2.test.lua b/test/router/router2.test.lua index 72c99e6..fb0c3b2 100644 --- a/test/router/router2.test.lua +++ b/test/router/router2.test.lua @@ -119,6 +119,103 @@ end test_run:wait_cond(wait_all_rw) vshard.router.info().bucket +-- +-- gh-298: backoff replicas when the storage raises errors meaning its +-- configuration is not finished or even didn't start. Immediately failover to +-- other instances then. +-- +test_run:switch('storage_2_b') +-- Turn off replication so as _func manipulations on the master wouldn't reach +-- the replica. +old_replication = box.cfg.replication +box.cfg{replication = {}} + +test_run:switch('storage_2_a') +box.schema.user.revoke('storage', 'execute', 'function', 'vshard.storage.call') + +test_run:switch('router_1') +vshard.consts.REPLICA_BACKOFF_INTERVAL = 0.1 + +-- Indeed fails when called directly via netbox. +conn = vshard.router.route(1).master.conn +ok, err = pcall(conn.call, conn, 'vshard.storage.call', \ + {1, 'read', 'echo', {1}}) +assert(not ok and err.code == box.error.ACCESS_DENIED) + +-- Works when called via vshard - it goes to another replica transparently. +long_timeout = {timeout = 1000000} +res = vshard.router.callro(1, 'echo', {100}, long_timeout) +assert(res == 100) + +-- +-- When all replicas are in backoff due to lack of access, raise an error. +-- +test_run:switch('storage_2_b') +assert(echo_count == 1) +echo_count = 0 +-- Restore the replication so the replica gets the _func change from master. +box.cfg{replication = old_replication} +test_run:wait_vclock('storage_2_b', test_run:get_vclock('storage_1_a')) + +test_run:switch('router_1') +ok, err = vshard.router.callro(1, 'echo', {100}, long_timeout) +assert(not ok and err.code == vshard.error.code.REPLICASET_IN_BACKOFF) +assert(err.error.code == box.error.ACCESS_DENIED) + +test_run:switch('storage_2_a') +assert(echo_count == 0) +box.schema.user.grant('storage', 'execute', 'function', 'vshard.storage.call') +test_run:wait_vclock('storage_2_b', test_run:get_vclock('storage_1_a')) + +-- +-- No vshard function = backoff. +-- +test_run:switch('router_1') +-- Drop all backoffs to check all works fine now. +fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL) +res = vshard.router.callrw(1, 'echo', {100}, long_timeout) +assert(res == 100) + +test_run:switch('storage_2_a') +assert(echo_count == 1) +echo_count = 0 +old_storage_call = vshard.storage.call +vshard.storage.call = nil + +-- Indeed fails when called directly via netbox. +test_run:switch('router_1') +conn = vshard.router.route(1).master.conn +ok, err = pcall(conn.call, conn, 'vshard.storage.call', \ + {1, 'read', 'echo', {1}}) +assert(not ok and err.code == box.error.NO_SUCH_PROC) + +-- Works when called via vshard - it goes to another replica. +res = vshard.router.callro(1, 'echo', {100}, long_timeout) +assert(res == 100) + +-- +-- When all replicas are in backoff due to not having the function, raise +-- an error. +-- +test_run:switch('storage_2_b') +assert(echo_count == 1) +echo_count = 0 +old_storage_call = vshard.storage.call +vshard.storage.call = nil + +test_run:switch('router_1') +ok, err = vshard.router.callro(1, 'echo', {100}, long_timeout) +assert(not ok and err.code == vshard.error.code.REPLICASET_IN_BACKOFF) +assert(err.error.code == box.error.NO_SUCH_PROC) + +test_run:switch('storage_2_b') +assert(echo_count == 0) +vshard.storage.call = old_storage_call + +test_run:switch('storage_2_a') +assert(echo_count == 0) +vshard.storage.call = old_storage_call + _ = test_run:switch("default") _ = test_run:cmd("stop server router_1") _ = test_run:cmd("cleanup server router_1") diff --git a/vshard/consts.lua b/vshard/consts.lua index 66a09ae..64cb0ae 100644 --- a/vshard/consts.lua +++ b/vshard/consts.lua @@ -40,6 +40,7 @@ return { RECONNECT_TIMEOUT = 0.5; GC_BACKOFF_INTERVAL = 5, RECOVERY_BACKOFF_INTERVAL = 5, + REPLICA_BACKOFF_INTERVAL = 5, COLLECT_LUA_GARBAGE_INTERVAL = 100; DEFAULT_BUCKET_SEND_TIMEOUT = 10, DEFAULT_BUCKET_RECV_TIMEOUT = 10, diff --git a/vshard/error.lua b/vshard/error.lua index fa7bdaa..0ce208a 100644 --- a/vshard/error.lua +++ b/vshard/error.lua @@ -158,7 +158,13 @@ local error_message_template = { name = 'MULTIPLE_MASTERS_FOUND', msg = 'Found more than one master in replicaset %s on nodes %s and %s', args = {'replicaset_uuid', 'master1', 'master2'}, - } + }, + [32] = { + name = 'REPLICASET_IN_BACKOFF', + msg = 'Replicaset %s is in backoff, can\'t take requests right now. '.. + 'Last error was %s', + args = {'replicaset_uuid', 'error'} + }, } -- diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua index a47a7b9..573a555 100644 --- a/vshard/replicaset.lua +++ b/vshard/replicaset.lua @@ -13,6 +13,8 @@ -- weight = number, -- down_ts = <timestamp of disconnect from the -- replica>, +-- backoff_ts = <timestamp when was sent into backoff state>, +-- backoff_err = <error object caused the backoff>, -- net_timeout = <current network timeout for calls, -- doubled on each network fail until -- max value, and reset to minimal @@ -142,6 +144,25 @@ local function netbox_wait_connected(conn, timeout) return timeout end +-- +-- Check if the replica is not in backoff. It also serves as an update - if the +-- replica still has an old backoff timestamp, it is cleared. This way of +-- backoff update does not require any fibers to perform background updates. +-- Hence works not only on the router. +-- +local function replica_check_backoff(replica, now) + if not replica.backoff_ts then + return true + end + if replica.backoff_ts + consts.REPLICA_BACKOFF_INTERVAL > now then + return false + end + log.warn('Replica %s returns from backoff', replica.uuid) + replica.backoff_ts = nil + replica.backoff_err = nil + return true +end + -- -- Connect to a specified replica and remember a new connection -- in the replica object. Note, that the function does not wait @@ -421,6 +442,39 @@ local function can_retry_after_error(e) return e.type == 'ClientError' and e.code == box.error.TIMEOUT end +-- +-- True if after the given error on call of the given function the connection +-- must go into backoff. +-- +local function can_backoff_after_error(e, func) + if not e then + return false + end + if type(e) ~= 'table' and + (type(e) ~= 'cdata' or not ffi.istype('struct error', e)) then + return false + end + -- So far it is enabled only for vshard's own functions. Including + -- vshard.storage.call(). Otherwise it is not possible to retry safely - + -- user's function could have side effects before raising that error. + -- For instance, 'access denied' could be raised by user's function + -- internally after it already changed some data on the storage. + if not func:startswith('vshard.') then + return false + end + -- ClientError is sent for all errors by old Tarantool versions which didn't + -- keep error type. New versions preserve the original error type. + if e.type == 'ClientError' or e.type == 'AccessDeniedError' then + if e.code == box.error.ACCESS_DENIED then + return e.message:startswith("Execute access to function 'vshard.") + end + if e.code == box.error.NO_SUCH_PROC then + return e.message:startswith("Procedure 'vshard.") + end + end + return false +end + -- -- Pick a next replica according to round-robin load balancing -- policy. @@ -443,7 +497,7 @@ end -- until failover fiber will repair the nearest connection. -- local function replicaset_template_multicallro(prefer_replica, balance) - local function pick_next_replica(replicaset) + local function pick_next_replica(replicaset, now) local r local master = replicaset.master if balance then @@ -451,7 +505,8 @@ local function replicaset_template_multicallro(prefer_replica, balance) while i > 0 do r = replicaset_balance_replica(replicaset) i = i - 1 - if r:is_connected() and (not prefer_replica or r ~= master) then + if r:is_connected() and (not prefer_replica or r ~= master) and + replica_check_backoff(r, now) then return r end end @@ -459,7 +514,8 @@ local function replicaset_template_multicallro(prefer_replica, balance) local start_r = replicaset.replica r = start_r while r do - if r:is_connected() and (not prefer_replica or r ~= master) then + if r:is_connected() and (not prefer_replica or r ~= master) and + replica_check_backoff(r, now) then return r end r = r.next_by_priority @@ -471,7 +527,8 @@ local function replicaset_template_multicallro(prefer_replica, balance) -- Reached already checked part. break end - if r:is_connected() and (not prefer_replica or r ~= master) then + if r:is_connected() and (not prefer_replica or r ~= master) and + replica_check_backoff(r, now) then return r end end @@ -488,26 +545,43 @@ local function replicaset_template_multicallro(prefer_replica, balance) if timeout <= 0 then return nil, lerror.timeout() end - local end_time = fiber_clock() + timeout + local now = fiber_clock() + local end_time = now + timeout while not net_status and timeout > 0 do - replica = pick_next_replica(replicaset) + replica = pick_next_replica(replicaset, now) if not replica then replica, timeout = replicaset_wait_master(replicaset, timeout) if not replica then return nil, timeout end replicaset_connect_to_replica(replicaset, replica) + if replica.backoff_ts then + return nil, lerror.vshard( + lerror.code.REPLICASET_IN_BACKOFF, replicaset.uuid, + replica.backoff_err) + end end opts.timeout = timeout net_status, storage_status, retval, err = replica_call(replica, func, args, opts) - timeout = end_time - fiber_clock() + now = fiber_clock() + timeout = end_time - now if not net_status and not storage_status and not can_retry_after_error(retval) then - -- There is no sense to retry LuaJit errors, such as - -- assetions, not defined variables etc. - net_status = true - break + if can_backoff_after_error(retval, func) then + if not replica.backoff_ts then + log.warn('Replica %s goes into backoff for %s sec '.. + 'after error %s', replica.uuid, + consts.REPLICA_BACKOFF_INTERVAL, retval) + replica.backoff_ts = now + replica.backoff_err = retval + end + else + -- There is no sense to retry LuaJit errors, such as + -- assertions, undefined variables etc. + net_status = true + break + end end end if not net_status then @@ -549,6 +623,8 @@ local function rebind_replicasets(replicasets, old_replicasets) local conn = old_replica.conn replica.conn = conn replica.down_ts = old_replica.down_ts + replica.backoff_ts = old_replica.backoff_ts + replica.backoff_err = old_replica.backoff_err replica.net_timeout = old_replica.net_timeout replica.net_sequential_ok = old_replica.net_sequential_ok replica.net_sequential_fail = old_replica.net_sequential_fail @@ -986,7 +1062,7 @@ local function buildall(sharding_cfg) uri = replica.uri, name = replica.name, uuid = replica_uuid, zone = replica.zone, net_timeout = consts.CALL_TIMEOUT_MIN, net_sequential_ok = 0, net_sequential_fail = 0, - down_ts = curr_ts, + down_ts = curr_ts, backoff_ts = nil, backoff_err = nil, }, replica_mt) new_replicaset.replicas[replica_uuid] = new_replica if replica.master then diff --git a/vshard/router/init.lua b/vshard/router/init.lua index 4748398..116f905 100644 --- a/vshard/router/init.lua +++ b/vshard/router/init.lua @@ -172,7 +172,8 @@ local function bucket_discovery(router, bucket_id) replicaset:callrw('vshard.storage.bucket_stat', {bucket_id}) if err == nil then return bucket_set(router, bucket_id, replicaset.uuid) - elseif err.code ~= lerror.code.WRONG_BUCKET then + elseif err.code ~= lerror.code.WRONG_BUCKET and + err.code ~= lerror.code.REPLICASET_IN_BACKOFF then last_err = err unreachable_uuid = uuid end -- 2.24.3 (Apple Git-128)
While vshard.storage.cfg() is not done, accessing vshard functions is not safe. It will fail with low level errors like 'access denied' or 'no such function'. However there can be even worse cases. The user can have universe access rights. And vshard can be already in global namespace after require(). So vshard.storage functions are already visible. The previous patch fixed only the case when function access was restricted properly. And still fixed it just partially. New problems are: - box.cfg{} is already called, but the instance is still 'loading'. Then data is not fully recovered yet. Accessing is not safe from the data consistency perspective. - vshard.storage.cfg() is not started, or is not finished yet. In the end it might be doing something on what the public functions depend. This patch addresses these issues. Now all non-trivial vshard.storage functions are disabled until vshard.storage.cfg() is finished and the instance is fully recovered. They raise an error with a special code. Returning it via 'nil, err' pair wouldn't work. Because firstly, some functions return a boolean value and are not documented as ever failing. People would miss this new error. Second reason - vshard.storage.call() needs to signal the remote caller that the storage is disabled and it was found before the user's function was called. If it would be done via 'nil, err', then the user's function could emulate the storage being disabled. Or even worse, it could make some changes and then get that error accidentally by going to another storage remotely which would be disabled. Hence it is not allowed. Too easy to break something. It was an option to change vshard.storage.call() signature to return 'true, retvals...' when user's function was called and 'false, err' when it wasn't, but that would break backward compatibility. Supporting it only for new routers does not seem possible. The patch also drops 'memtx_memory' setting from the config because an attempt to apply it after calling box.cfg() (for example, via boot_like_vshard()) raises an error - default memory is bigger than this setting. It messed the new tests. Part of #298 Closes #123 --- example/localcfg.lua | 1 - test/storage/storage.result | 101 +++++++++++++++++++ test/storage/storage.test.lua | 43 ++++++++ test/unit/garbage.result | 17 ++-- test/unit/garbage.test.lua | 15 +-- test/unit/rebalancer.result | 2 +- test/unit/rebalancer.test.lua | 2 +- vshard/error.lua | 5 + vshard/storage/init.lua | 179 ++++++++++++++++++++++++++-------- 9 files changed, 305 insertions(+), 60 deletions(-) diff --git a/example/localcfg.lua b/example/localcfg.lua index 71f9d6f..9235d45 100644 --- a/example/localcfg.lua +++ b/example/localcfg.lua @@ -1,5 +1,4 @@ return { - memtx_memory = 100 * 1024 * 1024, sharding = { ['cbf06940-0790-498b-948d-042b62cf3d29'] = { -- replicaset #1 replicas = { diff --git a/test/storage/storage.result b/test/storage/storage.result index af48a13..e83b34f 100644 --- a/test/storage/storage.result +++ b/test/storage/storage.result @@ -951,6 +951,107 @@ vshard.storage._call('info') --- - is_master: false ... +-- +-- gh-123, gh-298: storage auto-enable/disable depending on instance state. +-- +_ = test_run:cmd('stop server storage_1_a') +--- +... +_ = test_run:cmd('start server storage_1_a with wait=False, '.. \ + 'args="boot_before_cfg"') +--- +... +_ = test_run:switch('storage_1_a') +--- +... +-- Leaving box.cfg() not called won't work because at 1.10 test-run somewhy +-- raises an error when try to start an instance without box.cfg(). It can only +-- be emulated. +old_info = box.info +--- +... +box.info = setmetatable({}, {__index = function() error('not configured') end}) +--- +... +assert(not pcall(function() return box.info.status end)) +--- +- true +... +ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100}) +--- +... +assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED) +--- +- true +... +assert(err.message:match('box seem not to be configured') ~= nil) +--- +- true +... +box.info = old_info +--- +... +-- Disabled until box is loaded. +vshard.storage.internal.errinj.ERRINJ_CFG_DELAY = true +--- +... +f = fiber.create(vshard.storage.cfg, cfg, instance_uuid) +--- +... +f:set_joinable(true) +--- +... +old_info = box.info +--- +... +box.info = {status = 'loading'} +--- +... +ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100}) +--- +... +assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED) +--- +- true +... +assert(err.message:match('instance status is "loading"') ~= nil) +--- +- true +... +box.info = old_info +--- +... +-- Disabled until storage is configured. +test_run:wait_cond(function() return box.info.status == 'running' end) +--- +- true +... +ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100}) +--- +... +assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED) +--- +- true +... +assert(err.message:match('storage is not configured') ~= nil) +--- +- true +... +vshard.storage.internal.errinj.ERRINJ_CFG_DELAY = false +--- +... +f:join() +--- +- true +... +-- Enabled when all criteria are finally satisfied. +ok, res = vshard.storage.call(1, 'read', 'echo', {100}) +--- +... +assert(ok and res == 100) +--- +- true +... _ = test_run:switch("default") --- ... diff --git a/test/storage/storage.test.lua b/test/storage/storage.test.lua index 99ef2c4..ff39f2f 100644 --- a/test/storage/storage.test.lua +++ b/test/storage/storage.test.lua @@ -299,6 +299,49 @@ vshard.storage._call('info') _ = test_run:switch('storage_1_b') vshard.storage._call('info') +-- +-- gh-123, gh-298: storage auto-enable/disable depending on instance state. +-- +_ = test_run:cmd('stop server storage_1_a') +_ = test_run:cmd('start server storage_1_a with wait=False, '.. \ + 'args="boot_before_cfg"') +_ = test_run:switch('storage_1_a') +-- Leaving box.cfg() not called won't work because at 1.10 test-run somewhy +-- raises an error when try to start an instance without box.cfg(). It can only +-- be emulated. +old_info = box.info +box.info = setmetatable({}, {__index = function() error('not configured') end}) +assert(not pcall(function() return box.info.status end)) + +ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100}) +assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED) +assert(err.message:match('box seem not to be configured') ~= nil) +box.info = old_info + +-- Disabled until box is loaded. +vshard.storage.internal.errinj.ERRINJ_CFG_DELAY = true +f = fiber.create(vshard.storage.cfg, cfg, instance_uuid) +f:set_joinable(true) + +old_info = box.info +box.info = {status = 'loading'} +ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100}) +assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED) +assert(err.message:match('instance status is "loading"') ~= nil) +box.info = old_info + +-- Disabled until storage is configured. +test_run:wait_cond(function() return box.info.status == 'running' end) +ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100}) +assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED) +assert(err.message:match('storage is not configured') ~= nil) +vshard.storage.internal.errinj.ERRINJ_CFG_DELAY = false +f:join() + +-- Enabled when all criteria are finally satisfied. +ok, res = vshard.storage.call(1, 'read', 'echo', {100}) +assert(ok and res == 100) + _ = test_run:switch("default") test_run:drop_cluster(REPLICASET_2) test_run:drop_cluster(REPLICASET_1) diff --git a/test/unit/garbage.result b/test/unit/garbage.result index a530496..c4b0fc3 100644 --- a/test/unit/garbage.result +++ b/test/unit/garbage.result @@ -16,7 +16,7 @@ test_run:cmd("setopt delimiter ';'") ... function show_sharded_spaces() local result = {} - for k, space in pairs(vshard.storage.sharded_spaces()) do + for k, space in pairs(vshard.storage._sharded_spaces()) do table.insert(result, space.name) end table.sort(result) @@ -489,7 +489,10 @@ f:cancel() util = require('util') --- ... -util.check_error(vshard.storage.bucket_delete_garbage) +delete_garbage = vshard.storage._bucket_delete_garbage +--- +... +util.check_error(delete_garbage) --- - 'Usage: bucket_delete_garbage(bucket_id, opts)' ... @@ -506,7 +509,7 @@ s:replace{6, 4} --- - [6, 4] ... -vshard.storage.bucket_delete_garbage(4) +delete_garbage(4) --- ... s:select{} @@ -526,7 +529,7 @@ s:replace{6, 4} --- - [6, 4] ... -vshard.storage.bucket_delete_garbage(4) +delete_garbage(4) --- ... s:select{} @@ -547,16 +550,16 @@ s:replace{6, 4} --- - [6, 4] ... -util.check_error(vshard.storage.bucket_delete_garbage, 4) +util.check_error(delete_garbage, 4) --- - Can not delete not garbage bucket. Use "{force=true}" to ignore this attention ... -util.check_error(vshard.storage.bucket_delete_garbage, 4, 10000) +util.check_error(delete_garbage, 4, 10000) --- - 'Usage: bucket_delete_garbage(bucket_id, opts)' ... -- 'Force' option ignores this error. -vshard.storage.bucket_delete_garbage(4, {force = true}) +delete_garbage(4, {force = true}) --- ... s:select{} diff --git a/test/unit/garbage.test.lua b/test/unit/garbage.test.lua index 250afb0..1416f58 100644 --- a/test/unit/garbage.test.lua +++ b/test/unit/garbage.test.lua @@ -6,7 +6,7 @@ engine = test_run:get_cfg('engine') test_run:cmd("setopt delimiter ';'") function show_sharded_spaces() local result = {} - for k, space in pairs(vshard.storage.sharded_spaces()) do + for k, space in pairs(vshard.storage._sharded_spaces()) do table.insert(result, space.name) end table.sort(result) @@ -197,30 +197,31 @@ f:cancel() -- util = require('util') -util.check_error(vshard.storage.bucket_delete_garbage) +delete_garbage = vshard.storage._bucket_delete_garbage +util.check_error(delete_garbage) -- Delete an existing garbage bucket. _bucket:replace{4, vshard.consts.BUCKET.SENT} s:replace{5, 4} s:replace{6, 4} -vshard.storage.bucket_delete_garbage(4) +delete_garbage(4) s:select{} -- Delete a not existing garbage bucket. _ = _bucket:delete{4} s:replace{5, 4} s:replace{6, 4} -vshard.storage.bucket_delete_garbage(4) +delete_garbage(4) s:select{} -- Fail to delete a not garbage bucket. _bucket:replace{4, vshard.consts.BUCKET.ACTIVE} s:replace{5, 4} s:replace{6, 4} -util.check_error(vshard.storage.bucket_delete_garbage, 4) -util.check_error(vshard.storage.bucket_delete_garbage, 4, 10000) +util.check_error(delete_garbage, 4) +util.check_error(delete_garbage, 4, 10000) -- 'Force' option ignores this error. -vshard.storage.bucket_delete_garbage(4, {force = true}) +delete_garbage(4, {force = true}) s:select{} -- diff --git a/test/unit/rebalancer.result b/test/unit/rebalancer.result index 19aa480..a931d63 100644 --- a/test/unit/rebalancer.result +++ b/test/unit/rebalancer.result @@ -287,7 +287,7 @@ build_routes(replicasets) -- -- Test rebalancer local state. -- -get_state = vshard.storage.rebalancer_request_state +get_state = vshard.storage._rebalancer_request_state --- ... _bucket = box.schema.create_space('_bucket') diff --git a/test/unit/rebalancer.test.lua b/test/unit/rebalancer.test.lua index 8087d42..1cdbbcb 100644 --- a/test/unit/rebalancer.test.lua +++ b/test/unit/rebalancer.test.lua @@ -75,7 +75,7 @@ build_routes(replicasets) -- -- Test rebalancer local state. -- -get_state = vshard.storage.rebalancer_request_state +get_state = vshard.storage._rebalancer_request_state _bucket = box.schema.create_space('_bucket') pk = _bucket:create_index('pk') status = _bucket:create_index('status', {parts = {{2, 'string'}}, unique = false}) diff --git a/vshard/error.lua b/vshard/error.lua index 0ce208a..2b97eae 100644 --- a/vshard/error.lua +++ b/vshard/error.lua @@ -165,6 +165,11 @@ local error_message_template = { 'Last error was %s', args = {'replicaset_uuid', 'error'} }, + [33] = { + name = 'STORAGE_IS_DISABLED', + msg = 'Storage is disabled: %s', + args = {'reason'} + }, } -- diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua index 3158f3c..d3c4e2a 100644 --- a/vshard/storage/init.lua +++ b/vshard/storage/init.lua @@ -134,6 +134,16 @@ if not M then -- help routers find a new location for sent and deleted buckets without -- whole cluster scan. route_map = {}, + -- Flag whether vshard.storage.cfg() is finished. + is_configured = false, + -- Flag whether box.info.status is acceptable. For instance, 'loading' + -- is not. + is_loaded = false, + -- Reference to the function-proxy to most of the public functions. It + -- allows to avoid 'if's in each function by adding expensive + -- conditional checks in one rarely used version of the wrapper and no + -- checks into the other almost always used wrapper. + api_call_cache = nil, ------------------- Garbage collection ------------------- -- Fiber to remove garbage buckets data. @@ -670,19 +680,19 @@ local function this_is_master() M.this_replica == M.this_replicaset.master end -local function on_master_disable(...) - M._on_master_disable(...) +local function on_master_disable(new_func, old_func) + M._on_master_disable(new_func, old_func) -- If a trigger is set after storage.cfg(), then notify an -- user, that the current instance is not master. - if select('#', ...) == 1 and not this_is_master() then + if old_func == nil and not this_is_master() then M._on_master_disable:run() end end -local function on_master_enable(...) - M._on_master_enable(...) +local function on_master_enable(new_func, old_func) + M._on_master_enable(new_func, old_func) -- Same as above, but notify, that the instance is master. - if select('#', ...) == 1 and this_is_master() then + if old_func == nil and this_is_master() then M._on_master_enable:run() end end @@ -1363,6 +1373,13 @@ local function find_sharded_spaces() return spaces end +-- +-- Public wrapper for sharded spaces list getter. +-- +local function storage_sharded_spaces() + return table.deepcopy(find_sharded_spaces()) +end + if M.errinj.ERRINJ_RELOAD then error('Error injection: reload') end @@ -2764,6 +2781,7 @@ local function storage_cfg(cfg, this_replica_uuid, is_reload) M.rebalancer_fiber = nil end lua_gc.set_state(M.collect_lua_garbage, consts.COLLECT_LUA_GARBAGE_INTERVAL) + M.is_configured = true -- Destroy connections, not used in a new configuration. collectgarbage() end @@ -2915,6 +2933,58 @@ local function storage_info() return state end +-------------------------------------------------------------------------------- +-- Public API protection +-------------------------------------------------------------------------------- + +-- +-- Arguments are listed explicitly instead of '...' because the latter does not +-- jit. +-- +local function storage_api_call_safe(func, arg1, arg2, arg3, arg4) + return func(arg1, arg2, arg3, arg4) +end + +-- +-- Unsafe proxy is loaded with protections. But it is used rarely and only in +-- the beginning of instance's lifetime. +-- +local function storage_api_call_unsafe(func, arg1, arg2, arg3, arg4) + -- box.info is quite expensive. Avoid calling it again when the instance + -- is finally loaded. + if not M.is_loaded then + -- box.info raises an error until box.cfg() is started. + local ok, status = pcall(function() + return box.info.status + end) + if not ok then + local msg = 'box seem not to be configured' + return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg)) + end + -- 'Orphan' is allowed because even if a replica is an orphan, it still + -- could be up to date. Just not all other replicas are connected. + if status ~= 'running' and status ~= 'orphan' then + local msg = ('instance status is "%s"'):format(status) + return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg)) + end + M.is_loaded = true + end + if not M.is_configured then + local msg = 'storage is not configured' + return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg)) + end + M.api_call_cache = storage_api_call_safe + return func(arg1, arg2, arg3, arg4) +end + +M.api_call_cache = storage_api_call_unsafe + +local function storage_make_api(func) + return function(arg1, arg2, arg3, arg4) + return M.api_call_cache(func, arg1, arg2, arg3, arg4) + end +end + -------------------------------------------------------------------------------- -- Module definition -------------------------------------------------------------------------------- @@ -3021,44 +3091,67 @@ M.bucket_are_all_rw = bucket_are_all_rw_public M.bucket_generation_wait = bucket_generation_wait lregistry.storage = M +-- +-- Not all methods are public here. Private methods should not be exposed if +-- possible. At least not without notable difference in naming. +-- return { - sync = sync, - bucket_force_create = bucket_force_create, - bucket_force_drop = bucket_force_drop, - bucket_collect = bucket_collect, - bucket_recv = bucket_recv, - bucket_send = bucket_send, - bucket_stat = bucket_stat, - bucket_pin = bucket_pin, - bucket_unpin = bucket_unpin, - bucket_ref = bucket_ref, - bucket_unref = bucket_unref, - bucket_refro = bucket_refro, - bucket_refrw = bucket_refrw, - bucket_unrefro = bucket_unrefro, - bucket_unrefrw = bucket_unrefrw, - bucket_delete_garbage = bucket_delete_garbage, - garbage_collector_wakeup = garbage_collector_wakeup, - rebalancer_wakeup = rebalancer_wakeup, - rebalancer_apply_routes = rebalancer_apply_routes, - rebalancer_disable = rebalancer_disable, - rebalancer_enable = rebalancer_enable, - is_locked = is_this_replicaset_locked, - rebalancing_is_in_progress = rebalancing_is_in_progress, - recovery_wakeup = recovery_wakeup, - call = storage_call, - _call = service_call, + -- + -- Bucket methods. + -- + bucket_force_create = storage_make_api(bucket_force_create), + bucket_force_drop = storage_make_api(bucket_force_drop), + bucket_collect = storage_make_api(bucket_collect), + bucket_recv = storage_make_api(bucket_recv), + bucket_send = storage_make_api(bucket_send), + bucket_stat = storage_make_api(bucket_stat), + bucket_pin = storage_make_api(bucket_pin), + bucket_unpin = storage_make_api(bucket_unpin), + bucket_ref = storage_make_api(bucket_ref), + bucket_unref = storage_make_api(bucket_unref), + bucket_refro = storage_make_api(bucket_refro), + bucket_refrw = storage_make_api(bucket_refrw), + bucket_unrefro = storage_make_api(bucket_unrefro), + bucket_unrefrw = storage_make_api(bucket_unrefrw), + bucket_delete_garbage = storage_make_api(bucket_delete_garbage), + _bucket_delete_garbage = bucket_delete_garbage, + buckets_info = storage_make_api(storage_buckets_info), + buckets_count = storage_make_api(bucket_count_public), + buckets_discovery = storage_make_api(buckets_discovery), + -- + -- Garbage collector. + -- + garbage_collector_wakeup = storage_make_api(garbage_collector_wakeup), + -- + -- Rebalancer. + -- + rebalancer_wakeup = storage_make_api(rebalancer_wakeup), + rebalancer_apply_routes = storage_make_api(rebalancer_apply_routes), + rebalancer_disable = storage_make_api(rebalancer_disable), + rebalancer_enable = storage_make_api(rebalancer_enable), + rebalancing_is_in_progress = storage_make_api(rebalancing_is_in_progress), + rebalancer_request_state = storage_make_api(rebalancer_request_state), + _rebalancer_request_state = rebalancer_request_state, + -- + -- Recovery. + -- + recovery_wakeup = storage_make_api(recovery_wakeup), + -- + -- Instance info. + -- + is_locked = storage_make_api(is_this_replicaset_locked), + info = storage_make_api(storage_info), + sharded_spaces = storage_make_api(storage_sharded_spaces), + _sharded_spaces = storage_sharded_spaces, + module_version = function() return M.module_version end, + -- + -- Miscellaneous. + -- + call = storage_make_api(storage_call), + _call = storage_make_api(service_call), + sync = storage_make_api(sync), cfg = function(cfg, uuid) return storage_cfg(cfg, uuid, false) end, - info = storage_info, - buckets_info = storage_buckets_info, - buckets_count = bucket_count_public, - buckets_discovery = buckets_discovery, - rebalancer_request_state = rebalancer_request_state, + on_master_enable = storage_make_api(on_master_enable), + on_master_disable = storage_make_api(on_master_disable), internal = M, - on_master_enable = on_master_enable, - on_master_disable = on_master_disable, - sharded_spaces = function() - return table.deepcopy(find_sharded_spaces()) - end, - module_version = function() return M.module_version end, } -- 2.24.3 (Apple Git-128)
The patch introduces functions vshard.storage.enable()/disable(). They allow to control manually whether the instance can accept requests. It solves the following problems which were not covered by previous patches: - Even if box.cfg() is done, status is 'running', and vshard.storage.cfg() is finished, still user's application can be not ready to accept requests. For instance, it needs to create more functions and users on top of vshard. Then it wants to disable public requests until all preliminary work is done. - After all is enabled, fine, and dandy, still the instance might want to disable self in case of an emergency. Such as its config got broken or too outdated, desynced with a centric storage. vshard.storage.enable()/disable() can be called any time, before, during, and after vshard.storage.cfg() to solve these issues. Part of #298 --- test/storage/storage.result | 27 +++++++++++++++++++++++++++ test/storage/storage.test.lua | 12 ++++++++++++ vshard/storage/init.lua | 28 ++++++++++++++++++++++++++++ 3 files changed, 67 insertions(+) diff --git a/test/storage/storage.result b/test/storage/storage.result index e83b34f..e1b0d5d 100644 --- a/test/storage/storage.result +++ b/test/storage/storage.result @@ -1052,6 +1052,33 @@ assert(ok and res == 100) --- - true ... +-- +-- Manual enable/disable. +-- +vshard.storage.disable() +--- +... +ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100}) +--- +... +assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED) +--- +- true +... +assert(err.message:match('storage is disabled explicitly') ~= nil) +--- +- true +... +vshard.storage.enable() +--- +... +ok, res = vshard.storage.call(1, 'read', 'echo', {100}) +--- +... +assert(ok and res == 100) +--- +- true +... _ = test_run:switch("default") --- ... diff --git a/test/storage/storage.test.lua b/test/storage/storage.test.lua index ff39f2f..e1127d3 100644 --- a/test/storage/storage.test.lua +++ b/test/storage/storage.test.lua @@ -342,6 +342,18 @@ f:join() ok, res = vshard.storage.call(1, 'read', 'echo', {100}) assert(ok and res == 100) +-- +-- Manual enable/disable. +-- +vshard.storage.disable() +ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100}) +assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED) +assert(err.message:match('storage is disabled explicitly') ~= nil) + +vshard.storage.enable() +ok, res = vshard.storage.call(1, 'read', 'echo', {100}) +assert(ok and res == 100) + _ = test_run:switch("default") test_run:drop_cluster(REPLICASET_2) test_run:drop_cluster(REPLICASET_1) diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua index d3c4e2a..94e0c16 100644 --- a/vshard/storage/init.lua +++ b/vshard/storage/init.lua @@ -139,6 +139,9 @@ if not M then -- Flag whether box.info.status is acceptable. For instance, 'loading' -- is not. is_loaded = false, + -- Flag whether the instance is enabled manually. It is true by default + -- for backward compatibility with old vshard. + is_enabled = true, -- Reference to the function-proxy to most of the public functions. It -- allows to avoid 'if's in each function by adding expensive -- conditional checks in one rarely used version of the wrapper and no @@ -201,6 +204,11 @@ if not M then } else bucket_ref_new = ffi.typeof("struct bucket_ref") + + -- It is not set when reloaded from an old vshard version. + if M.is_enabled == nil then + M.is_enabled = true + end end -- @@ -2973,6 +2981,10 @@ local function storage_api_call_unsafe(func, arg1, arg2, arg3, arg4) local msg = 'storage is not configured' return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg)) end + if not M.is_enabled then + local msg = 'storage is disabled explicitly' + return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg)) + end M.api_call_cache = storage_api_call_safe return func(arg1, arg2, arg3, arg4) end @@ -2985,6 +2997,20 @@ local function storage_make_api(func) end end +local function storage_enable() + M.is_enabled = true +end + +-- +-- Disable can be used in case the storage entered a critical state in which +-- requests are not allowed. For instance, its config got broken or too old +-- compared to a centric config somewhere. +-- +local function storage_disable() + M.is_enabled = false + M.api_call_cache = storage_api_call_unsafe +end + -------------------------------------------------------------------------------- -- Module definition -------------------------------------------------------------------------------- @@ -3153,5 +3179,7 @@ return { cfg = function(cfg, uuid) return storage_cfg(cfg, uuid, false) end, on_master_enable = storage_make_api(on_master_enable), on_master_disable = storage_make_api(on_master_disable), + enable = storage_enable, + disable = storage_disable, internal = M, } -- 2.24.3 (Apple Git-128)
vshard.storage.call() and most of the other vshard.storage.* functions now raise an exception STORAGE_IS_DISABLED when the storage is disabled. The router wants to catch it to handle in a special way. But unfortunately, - error(obj) in a Lua function is wrapped into LuajitError. 'obj' is saved into 'message' using its __tostring meta-method. - It is not possible to create your own error type in a sane way. These 2 facts mean that the router needs to be able to extract the original error from LuajitError's message. In vshard errors are serialized into json, so a valid vshard error, such as STORAGE_IS_DISABLED, can be extracted from LuajitError's message if it wasn't truncated due to being too long. For this particular error it won't happen. The patch introduces new method vshard.error.from_string() to perform this extraction for its further usage in router. Part of #298 --- test/unit/error.result | 18 ++++++++++++++++++ test/unit/error.test.lua | 6 ++++++ vshard/error.lua | 19 +++++++++++++++++++ 3 files changed, 43 insertions(+) diff --git a/test/unit/error.result b/test/unit/error.result index bb4e0cc..258b14f 100644 --- a/test/unit/error.result +++ b/test/unit/error.result @@ -48,6 +48,24 @@ test_run:grep_log('default', '"reason":"reason","code":11,"type":"ShardingError" --- - '"reason":"reason","code":11,"type":"ShardingError"' ... +e = lerror.vshard(lerror.code.STORAGE_IS_DISABLED, 'any reason') +--- +... +e = lerror.from_string(tostring(e)) +--- +... +assert(e.code == lerror.code.STORAGE_IS_DISABLED) +--- +- true +... +assert(e.type == 'ShardingError') +--- +- true +... +assert(e.message == 'Storage is disabled: any reason') +--- +- true +... -- -- Part of gh-100: check `error.vshard`. -- diff --git a/test/unit/error.test.lua b/test/unit/error.test.lua index 0a51d33..5e669b6 100644 --- a/test/unit/error.test.lua +++ b/test/unit/error.test.lua @@ -19,6 +19,12 @@ log = require('log') log.info('Log error: %s', vshard_error) test_run:grep_log('default', '"reason":"reason","code":11,"type":"ShardingError"') +e = lerror.vshard(lerror.code.STORAGE_IS_DISABLED, 'any reason') +e = lerror.from_string(tostring(e)) +assert(e.code == lerror.code.STORAGE_IS_DISABLED) +assert(e.type == 'ShardingError') +assert(e.message == 'Storage is disabled: any reason') + -- -- Part of gh-100: check `error.vshard`. -- diff --git a/vshard/error.lua b/vshard/error.lua index 2b97eae..96c4bdd 100644 --- a/vshard/error.lua +++ b/vshard/error.lua @@ -244,6 +244,24 @@ local function make_error(e) end end +-- +-- Restore an error object from its string serialization. +-- +local function from_string(str) + -- Error objects in VShard are stringified into json. Hence can restore also + -- as json. The only problem is that the json might be truncated if it was + -- stored in an error message of a real error object. It is not very + -- reliable. + local ok, res = pcall(json.decode, str) + if not ok then + return nil + end + if type(res) ~= 'table' or type(res.type) ~= 'string' or + type(res.code) ~= 'number' or type(res.message) ~= 'string' then + return nil + end + return make_error(res) +end local function make_alert(code, ...) local format = error_message_template[code] @@ -266,6 +284,7 @@ return { box = box_error, vshard = vshard_error, make = make_error, + from_string = from_string, alert = make_alert, timeout = make_timeout, } -- 2.24.3 (Apple Git-128)
If a storage reports it is disabled, then it probably will take some time before it can accept new requests. This patch makes STORAGE_IS_DISABLED error cause the connection's backoff. In line with 'access denied' and 'no such function' errors. Because the reason for all 3 is the same - the storage is not ready to accept requests yet. Such requests are transparently retried now. Closes #298 @TarantoolBot document Title: vshard.storage.enable/disable() `vshard.storage.disable()` makes most of the `vshard.storage` functions throw an error. As Lua exception, not via `nil, err` pattern. `vshard.storage.enable()` reverts the disable. By default the storage is enabled. Additionally, the storage is forcefully disabled automatically until `vshard.storage.cfg()` is finished and the instance finished recovery (its `box.info.status` is `'running'`, for example). Auto-disable protects from usage of vshard functions before the storage's global state is fully created. Manual `vshard.storage.disable()` helps to achieve the same for user's application. For instance, a user might want to do some preparatory work after `vshard.storage.cfg` before the application is ready for requests. Then the flow would be: ```Lua vshard.storage.disable() vshard.storage.cfg(...) -- Do your preparatory work here ... vshard.storage.enable() ``` The routers handle the errors signaling about the storage being disabled in a special way. They put connections to such instances into a backoff state for some time and will try to use other replicas. For example, assume a replicaset has replicas 'replica_1' and 'replica_2'. Assume 'replica_1' is disabled due to any reason. If a router will try to talk to 'replica_1', it will get a special error and will transparently retry to 'replica_2'. When 'replica_1' is enabled again, the router will notice it too and will send requests to it again. It all works exclusively for read-only requests. Read-write requests can only be sent to a master, which is one per replicaset. They are not retried. --- test/router/router2.result | 88 ++++++++++++++++++++++++++++++++++++ test/router/router2.test.lua | 35 ++++++++++++++ vshard/replicaset.lua | 19 +++++++- 3 files changed, 140 insertions(+), 2 deletions(-) diff --git a/test/router/router2.result b/test/router/router2.result index a501dbf..ebf0b3f 100644 --- a/test/router/router2.result +++ b/test/router/router2.result @@ -548,6 +548,94 @@ vshard.storage.call = old_storage_call | --- | ... +-- +-- Storage is disabled = backoff. +-- +vshard.storage.disable() + | --- + | ... + +test_run:switch('router_1') + | --- + | - true + | ... +-- Drop old backoffs. +fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL) + | --- + | ... +-- Success, but internally the request was retried. +res, err = vshard.router.callro(1, 'echo', {100}, long_timeout) + | --- + | ... +assert(res == 100) + | --- + | - true + | ... +-- The best replica entered backoff state. +util = require('util') + | --- + | ... +storage_2 = vshard.router.static.replicasets[replicasets[2]] + | --- + | ... +storage_2_a = storage_2.replicas[util.name_to_uuid.storage_2_a] + | --- + | ... +assert(storage_2_a.backoff_ts ~= nil) + | --- + | - true + | ... + +test_run:switch('storage_2_b') + | --- + | - true + | ... +assert(echo_count == 1) + | --- + | - true + | ... +echo_count = 0 + | --- + | ... + +test_run:switch('storage_2_a') + | --- + | - true + | ... +assert(echo_count == 0) + | --- + | - true + | ... +vshard.storage.enable() + | --- + | ... + +test_run:switch('router_1') + | --- + | - true + | ... +-- Drop the backoff. +fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL) + | --- + | ... +-- Now goes to the best replica - it is enabled again. +res, err = vshard.router.callro(1, 'echo', {100}, long_timeout) + | --- + | ... +assert(res == 100) + | --- + | - true + | ... + +test_run:switch('storage_2_a') + | --- + | - true + | ... +assert(echo_count == 1) + | --- + | - true + | ... + _ = test_run:switch("default") | --- | ... diff --git a/test/router/router2.test.lua b/test/router/router2.test.lua index fb0c3b2..1c21876 100644 --- a/test/router/router2.test.lua +++ b/test/router/router2.test.lua @@ -216,6 +216,41 @@ test_run:switch('storage_2_a') assert(echo_count == 0) vshard.storage.call = old_storage_call +-- +-- Storage is disabled = backoff. +-- +vshard.storage.disable() + +test_run:switch('router_1') +-- Drop old backoffs. +fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL) +-- Success, but internally the request was retried. +res, err = vshard.router.callro(1, 'echo', {100}, long_timeout) +assert(res == 100) +-- The best replica entered backoff state. +util = require('util') +storage_2 = vshard.router.static.replicasets[replicasets[2]] +storage_2_a = storage_2.replicas[util.name_to_uuid.storage_2_a] +assert(storage_2_a.backoff_ts ~= nil) + +test_run:switch('storage_2_b') +assert(echo_count == 1) +echo_count = 0 + +test_run:switch('storage_2_a') +assert(echo_count == 0) +vshard.storage.enable() + +test_run:switch('router_1') +-- Drop the backoff. +fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL) +-- Now goes to the best replica - it is enabled again. +res, err = vshard.router.callro(1, 'echo', {100}, long_timeout) +assert(res == 100) + +test_run:switch('storage_2_a') +assert(echo_count == 1) + _ = test_run:switch("default") _ = test_run:cmd("stop server router_1") _ = test_run:cmd("cleanup server router_1") diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua index 573a555..623d24d 100644 --- a/vshard/replicaset.lua +++ b/vshard/replicaset.lua @@ -347,9 +347,21 @@ local function replica_call(replica, func, args, opts) if opts.timeout >= replica.net_timeout then replica_on_failed_request(replica) end + local err = storage_status + -- VShard functions can throw exceptions using error() function. When + -- it reaches the network layer, it is wrapped into LuajitError. Try to + -- extract the original error if this is the case. Not always is + -- possible - the string representation could be truncated. + -- + -- In old Tarantool versions LuajitError turned into ClientError on the + -- client. Check both types. + if func:startswith('vshard.') and (err.type == 'LuajitError' or + err.type == 'ClientError') then + err = lerror.from_string(err.message) or err + end log.error("Exception during calling '%s' on '%s': %s", func, replica, - storage_status) - return false, nil, lerror.make(storage_status) + err) + return false, nil, lerror.make(err) else replica_on_success_request(replica) end @@ -472,6 +484,9 @@ local function can_backoff_after_error(e, func) return e.message:startswith("Procedure 'vshard.") end end + if e.type == 'ShardingError' then + return e.code == vshard.error.code.STORAGE_IS_DISABLED + end return false end -- 2.24.3 (Apple Git-128)
Thanks for your patch! LGTM. But I think tests should be extended a bit.
I see that tests cover AccessDenied error but I don't see that there is
simple error()/assert() error check.
On 17.12.2021 03:25, Vladislav Shpilevoy wrote:
> Storage configuration takes time. Firstly, box.cfg{} which can be
> called before vshard.storage.cfg(). Secondly, vshard.storage.cfg()
> is not immediate as well.
>
> During that time accessing the storage is not safe. Attempts to
> call vshard.storage functions can return weird errors, or the
> functions can even be not available yet. They need to be created
> in _func and get access rights in _priv before becoming public.
>
> Routers used to forward errors like 'access denied error' and
> 'no such function' to users as is, treating them as critical.
>
> Not only it was confusing for users, but also could make an entire
> replicaset not available for requests - the connection to it is
> alive, so router would send all requests into it and they all
> would fail. Even if the replicaset has another instance which is
> perfectly functional.
>
> This patch handles such specific errors inside of the router. The
> faulty replicas are put into a 'backoff' state. They remain in it
> for some fixed time (5 seconds for now), new requests won't be
> sent to them until the time passes. Router will use other
> instances.
>
> Backoff is activated only for vshard.* functions. If the errors
> are about some user's function, it is considered a regular error.
> Because the router can't tell whether any side effects were done
> on the remote instance before the error happened. Hence can't
> retry to another node.
>
> For example, if access was denied to 'vshard.storage.call', then
> it is backoff. If inside of vshard.storage.call the access was
> denied to 'user_test_func', then it is not backoff.
>
> It all works for read-only requests exclusively of course. Because
> for read-write requests the instance is just one - master. Router
> does not have other options so backoff here wouldn't help.
>
> Part of #298
> ---
> test/router/router2.result | 237 +++++++++++++++++++++++++++++++++++
> test/router/router2.test.lua | 97 ++++++++++++++
> vshard/consts.lua | 1 +
> vshard/error.lua | 8 +-
> vshard/replicaset.lua | 100 +++++++++++++--
> vshard/router/init.lua | 3 +-
> 6 files changed, 432 insertions(+), 14 deletions(-)
Thanks for your patch. See my two nits below. On 17.12.2021 03:25, Vladislav Shpilevoy wrote: > +-------------------------------------------------------------------------------- > +-- Public API protection > +-------------------------------------------------------------------------------- > + > +-- > +-- Arguments are listed explicitly instead of '...' because the latter does not > +-- jit. > +-- > +local function storage_api_call_safe(func, arg1, arg2, arg3, arg4) > + return func(arg1, arg2, arg3, arg4) > +end > + > +-- > +-- Unsafe proxy is loaded with protections. But it is used rarely and only in > +-- the beginning of instance's lifetime. > +-- > +local function storage_api_call_unsafe(func, arg1, arg2, arg3, arg4) > + -- box.info is quite expensive. Avoid calling it again when the instance > + -- is finally loaded. > + if not M.is_loaded then > + -- box.info raises an error until box.cfg() is started. > + local ok, status = pcall(function() > + return box.info.status > + end) nit: It could be changed to type(box.cfg) == 'function'. I'd call it "common" pattern to check that box is not yet configured. > + if not ok then > + local msg = 'box seem not to be configured' nit: seem -> seems? > + return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg)) > + end > + -- 'Orphan' is allowed because even if a replica is an orphan, it still > + -- could be up to date. Just not all other replicas are connected. > + if status ~= 'running' and status ~= 'orphan' then > + local msg = ('instance status is "%s"'):format(status) > + return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg)) > + end > + M.is_loaded = true > + end > + if not M.is_configured then > + local msg = 'storage is not configured' > + return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg)) > + end > + M.api_call_cache = storage_api_call_safe > + return func(arg1, arg2, arg3, arg4) > +end > + > +M.api_call_cache = storage_api_call_unsafe > + > +local function storage_make_api(func) > + return function(arg1, arg2, arg3, arg4) > + return M.api_call_cache(func, arg1, arg2, arg3, arg4) > + end > +end > + > -------------------------------------------------------------------------------- > -- Module definition > -------------------------------------------------------------------------------- > @@ -3021,44 +3091,67 @@ M.bucket_are_all_rw = bucket_are_all_rw_public > M.bucket_generation_wait = bucket_generation_wait > lregistry.storage = M > > +-- > +-- Not all methods are public here. Private methods should not be exposed if > +-- possible. At least not without notable difference in naming. > +-- > return { > - sync = sync, > - bucket_force_create = bucket_force_create, > - bucket_force_drop = bucket_force_drop, > - bucket_collect = bucket_collect, > - bucket_recv = bucket_recv, > - bucket_send = bucket_send, > - bucket_stat = bucket_stat, > - bucket_pin = bucket_pin, > - bucket_unpin = bucket_unpin, > - bucket_ref = bucket_ref, > - bucket_unref = bucket_unref, > - bucket_refro = bucket_refro, > - bucket_refrw = bucket_refrw, > - bucket_unrefro = bucket_unrefro, > - bucket_unrefrw = bucket_unrefrw, > - bucket_delete_garbage = bucket_delete_garbage, > - garbage_collector_wakeup = garbage_collector_wakeup, > - rebalancer_wakeup = rebalancer_wakeup, > - rebalancer_apply_routes = rebalancer_apply_routes, > - rebalancer_disable = rebalancer_disable, > - rebalancer_enable = rebalancer_enable, > - is_locked = is_this_replicaset_locked, > - rebalancing_is_in_progress = rebalancing_is_in_progress, > - recovery_wakeup = recovery_wakeup, > - call = storage_call, > - _call = service_call, > + -- > + -- Bucket methods. > + -- > + bucket_force_create = storage_make_api(bucket_force_create), > + bucket_force_drop = storage_make_api(bucket_force_drop), > + bucket_collect = storage_make_api(bucket_collect), > + bucket_recv = storage_make_api(bucket_recv), > + bucket_send = storage_make_api(bucket_send), > + bucket_stat = storage_make_api(bucket_stat), > + bucket_pin = storage_make_api(bucket_pin), > + bucket_unpin = storage_make_api(bucket_unpin), > + bucket_ref = storage_make_api(bucket_ref), > + bucket_unref = storage_make_api(bucket_unref), > + bucket_refro = storage_make_api(bucket_refro), > + bucket_refrw = storage_make_api(bucket_refrw), > + bucket_unrefro = storage_make_api(bucket_unrefro), > + bucket_unrefrw = storage_make_api(bucket_unrefrw), > + bucket_delete_garbage = storage_make_api(bucket_delete_garbage), > + _bucket_delete_garbage = bucket_delete_garbage, > + buckets_info = storage_make_api(storage_buckets_info), > + buckets_count = storage_make_api(bucket_count_public), > + buckets_discovery = storage_make_api(buckets_discovery), > + -- > + -- Garbage collector. > + -- > + garbage_collector_wakeup = storage_make_api(garbage_collector_wakeup), > + -- > + -- Rebalancer. > + -- > + rebalancer_wakeup = storage_make_api(rebalancer_wakeup), > + rebalancer_apply_routes = storage_make_api(rebalancer_apply_routes), > + rebalancer_disable = storage_make_api(rebalancer_disable), > + rebalancer_enable = storage_make_api(rebalancer_enable), > + rebalancing_is_in_progress = storage_make_api(rebalancing_is_in_progress), > + rebalancer_request_state = storage_make_api(rebalancer_request_state), > + _rebalancer_request_state = rebalancer_request_state, > + -- > + -- Recovery. > + -- > + recovery_wakeup = storage_make_api(recovery_wakeup), > + -- > + -- Instance info. > + -- > + is_locked = storage_make_api(is_this_replicaset_locked), > + info = storage_make_api(storage_info), > + sharded_spaces = storage_make_api(storage_sharded_spaces), > + _sharded_spaces = storage_sharded_spaces, > + module_version = function() return M.module_version end, > + -- > + -- Miscellaneous. > + -- > + call = storage_make_api(storage_call), > + _call = storage_make_api(service_call), > + sync = storage_make_api(sync), > cfg = function(cfg, uuid) return storage_cfg(cfg, uuid, false) end, > - info = storage_info, > - buckets_info = storage_buckets_info, > - buckets_count = bucket_count_public, > - buckets_discovery = buckets_discovery, > - rebalancer_request_state = rebalancer_request_state, > + on_master_enable = storage_make_api(on_master_enable), > + on_master_disable = storage_make_api(on_master_disable), > internal = M, > - on_master_enable = on_master_enable, > - on_master_disable = on_master_disable, > - sharded_spaces = function() > - return table.deepcopy(find_sharded_spaces()) > - end, > - module_version = function() return M.module_version end, > }
Thanks for your patch. LGTM.
On 17.12.2021 03:25, Vladislav Shpilevoy wrote:
> The patch introduces functions vshard.storage.enable()/disable().
>
Thanks for your patch. LGTM.
However we should probably use more neutral name that doesn't touch
implementation details.
Maybe in future we will use new native custom errors that are available
in new Tarantool versions.
So probably it's better to call this method e.g. deserialize. Feel free
to ignore.
On 17.12.2021 03:25, Vladislav Shpilevoy wrote:
> vshard.storage.call() and most of the other vshard.storage.*
> functions now raise an exception STORAGE_IS_DISABLED when the
> storage is disabled.
Thanks for your patch. LGTM.
On 17.12.2021 03:25, Vladislav Shpilevoy wrote:
> If a storage reports it is disabled, then it probably will take
> some time before it can accept new requests.
>
> This patch makes STORAGE_IS_DISABLED error cause the connection's
> backoff. In line with 'access denied' and 'no such function'
> errors. Because the reason for all 3 is the same - the storage is
> not ready to accept requests yet.
>
> Such requests are transparently retried now.
>
> Closes #298
Hi! Thanks for the review! On 17.12.2021 12:09, Oleg Babin wrote: > Thanks for your patch! LGTM. But I think tests should be extended a bit. > > I see that tests cover AccessDenied error but I don't see that there is simple error()/assert() error check. I hope you meant this: ==================== diff --git a/test/router/router2.result b/test/router/router2.result index a501dbf..85d077a 100644 --- a/test/router/router2.result +++ b/test/router/router2.result @@ -548,6 +548,28 @@ vshard.storage.call = old_storage_call | --- | ... +-- +-- Fails without backoff for other errors. +-- +test_run:switch('router_1') + | --- + | - true + | ... +fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL) + | --- + | ... +rs = vshard.router.route(1) + | --- + | ... +ok, err = rs:callro('vshard.storage.call', {1, 'badmode', 'echo', {100}}, \ + long_timeout) + | --- + | ... +assert(not ok and err.message:match('Unknown mode') ~= nil) + | --- + | - true + | ... + _ = test_run:switch("default") | --- | ... diff --git a/test/router/router2.test.lua b/test/router/router2.test.lua index fb0c3b2..a2171fa 100644 --- a/test/router/router2.test.lua +++ b/test/router/router2.test.lua @@ -216,6 +216,16 @@ test_run:switch('storage_2_a') assert(echo_count == 0) vshard.storage.call = old_storage_call +-- +-- Fails without backoff for other errors. +-- +test_run:switch('router_1') +fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL) +rs = vshard.router.route(1) +ok, err = rs:callro('vshard.storage.call', {1, 'badmode', 'echo', {100}}, \ + long_timeout) +assert(not ok and err.message:match('Unknown mode') ~= nil) + _ = test_run:switch("default") _ = test_run:cmd("stop server router_1") _ = test_run:cmd("cleanup server router_1")
Thanks for the review! On 17.12.2021 12:09, Oleg Babin via Tarantool-patches wrote: > Thanks for your patch. See my two nits below. > > On 17.12.2021 03:25, Vladislav Shpilevoy wrote: >> +-------------------------------------------------------------------------------- >> +-- Public API protection >> +-------------------------------------------------------------------------------- >> + >> +-- >> +-- Arguments are listed explicitly instead of '...' because the latter does not >> +-- jit. >> +-- >> +local function storage_api_call_safe(func, arg1, arg2, arg3, arg4) >> + return func(arg1, arg2, arg3, arg4) >> +end >> + >> +-- >> +-- Unsafe proxy is loaded with protections. But it is used rarely and only in >> +-- the beginning of instance's lifetime. >> +-- >> +local function storage_api_call_unsafe(func, arg1, arg2, arg3, arg4) >> + -- box.info is quite expensive. Avoid calling it again when the instance >> + -- is finally loaded. >> + if not M.is_loaded then >> + -- box.info raises an error until box.cfg() is started. >> + local ok, status = pcall(function() >> + return box.info.status >> + end) > > nit: It could be changed to type(box.cfg) == 'function'. I'd call it "common" pattern to check that box is not yet configured. > >> + if not ok then >> + local msg = 'box seem not to be configured' > > nit: seem -> seems? Thanks, all fixed: ==================== diff --git a/test/storage/storage.result b/test/storage/storage.result index e83b34f..790ba11 100644 --- a/test/storage/storage.result +++ b/test/storage/storage.result @@ -967,15 +967,15 @@ _ = test_run:switch('storage_1_a') -- Leaving box.cfg() not called won't work because at 1.10 test-run somewhy -- raises an error when try to start an instance without box.cfg(). It can only -- be emulated. -old_info = box.info +old_cfg = box.cfg --- ... -box.info = setmetatable({}, {__index = function() error('not configured') end}) +assert(type(old_cfg) == 'table') --- +- true ... -assert(not pcall(function() return box.info.status end)) +box.cfg = function(...) return old_cfg(...) end --- -- true ... ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100}) --- @@ -984,11 +984,11 @@ assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED) --- - true ... -assert(err.message:match('box seem not to be configured') ~= nil) +assert(err.message:match('box seems not to be configured') ~= nil) --- - true ... -box.info = old_info +box.cfg = old_cfg --- ... -- Disabled until box is loaded. diff --git a/test/storage/storage.test.lua b/test/storage/storage.test.lua index ff39f2f..8695636 100644 --- a/test/storage/storage.test.lua +++ b/test/storage/storage.test.lua @@ -309,14 +309,14 @@ _ = test_run:switch('storage_1_a') -- Leaving box.cfg() not called won't work because at 1.10 test-run somewhy -- raises an error when try to start an instance without box.cfg(). It can only -- be emulated. -old_info = box.info -box.info = setmetatable({}, {__index = function() error('not configured') end}) -assert(not pcall(function() return box.info.status end)) +old_cfg = box.cfg +assert(type(old_cfg) == 'table') +box.cfg = function(...) return old_cfg(...) end ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100}) assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED) -assert(err.message:match('box seem not to be configured') ~= nil) -box.info = old_info +assert(err.message:match('box seems not to be configured') ~= nil) +box.cfg = old_cfg -- Disabled until box is loaded. vshard.storage.internal.errinj.ERRINJ_CFG_DELAY = true diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua index d3c4e2a..77da663 100644 --- a/vshard/storage/init.lua +++ b/vshard/storage/init.lua @@ -2953,14 +2953,11 @@ local function storage_api_call_unsafe(func, arg1, arg2, arg3, arg4) -- box.info is quite expensive. Avoid calling it again when the instance -- is finally loaded. if not M.is_loaded then - -- box.info raises an error until box.cfg() is started. - local ok, status = pcall(function() - return box.info.status - end) - if not ok then - local msg = 'box seem not to be configured' + if type(box.cfg) == 'function' then + local msg = 'box seems not to be configured' return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg)) end + local status = box.info.status -- 'Orphan' is allowed because even if a replica is an orphan, it still -- could be up to date. Just not all other replicas are connected. if status ~= 'running' and status ~= 'orphan' then
Thanks for the review!
On 17.12.2021 12:09, Oleg Babin wrote:
> Thanks for your patch. LGTM.
>
> However we should probably use more neutral name that doesn't touch implementation details.
>
> Maybe in future we will use new native custom errors that are available in new Tarantool versions.
>
> So probably it's better to call this method e.g. deserialize. Feel free to ignore.
I used from_string() to be consistent with tostring(). Deserialize() would
assume there is serialize(), but there isn't. Also it won't work if the error is
packed into MessagePack. It works for strings only. Hence mentioning the exact
type 'string' seems valid.
As for native errors - I agree. The existence of from_string() is a huge
crutch. I was lucky that the errors were encoded as json in vshard from
the beginning. So I have at least sometimes working way of getting the original
error from an exception.
We have 'custom' error types in latest Tarantools for about a year now,
but it is still hardly usable, because you can't attach any payload to it. Which
is often very important even if it has just 1-2 fields. I wish the product
managers would see usefulness of good errors API someday.
Thanks for your fixes. Yes, it is that I meant. LGTM.
On 18.12.2021 02:10, Vladislav Shpilevoy wrote:
> Hi! Thanks for the review!
>
> On 17.12.2021 12:09, Oleg Babin wrote:
>> Thanks for your patch! LGTM. But I think tests should be extended a bit.
>>
>> I see that tests cover AccessDenied error but I don't see that there is simple error()/assert() error check.
> I hope you meant this:
>
> ====================
> diff --git a/test/router/router2.result b/test/router/router2.result
> index a501dbf..85d077a 100644
> --- a/test/router/router2.result
> +++ b/test/router/router2.result
> @@ -548,6 +548,28 @@ vshard.storage.call = old_storage_call
> | ---
> | ...
>
> +--
> +-- Fails without backoff for other errors.
> +--
> +test_run:switch('router_1')
> + | ---
> + | - true
> + | ...
> +fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
> + | ---
> + | ...
> +rs = vshard.router.route(1)
> + | ---
> + | ...
> +ok, err = rs:callro('vshard.storage.call', {1, 'badmode', 'echo', {100}}, \
> + long_timeout)
> + | ---
> + | ...
> +assert(not ok and err.message:match('Unknown mode') ~= nil)
> + | ---
> + | - true
> + | ...
> +
> _ = test_run:switch("default")
> | ---
> | ...
> diff --git a/test/router/router2.test.lua b/test/router/router2.test.lua
> index fb0c3b2..a2171fa 100644
> --- a/test/router/router2.test.lua
> +++ b/test/router/router2.test.lua
> @@ -216,6 +216,16 @@ test_run:switch('storage_2_a')
> assert(echo_count == 0)
> vshard.storage.call = old_storage_call
>
> +--
> +-- Fails without backoff for other errors.
> +--
> +test_run:switch('router_1')
> +fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
> +rs = vshard.router.route(1)
> +ok, err = rs:callro('vshard.storage.call', {1, 'badmode', 'echo', {100}}, \
> + long_timeout)
> +assert(not ok and err.message:match('Unknown mode') ~= nil)
> +
> _ = test_run:switch("default")
> _ = test_run:cmd("stop server router_1")
> _ = test_run:cmd("cleanup server router_1")
Thanks for your fixes. LGTM.
On 18.12.2021 02:10, Vladislav Shpilevoy wrote:
> Thanks for the review!
>
> On 17.12.2021 12:09, Oleg Babin via Tarantool-patches wrote:
>> Thanks for your patch. See my two nits below.
>>
>> On 17.12.2021 03:25, Vladislav Shpilevoy wrote:
>>> +--------------------------------------------------------------------------------
>>> +-- Public API protection
>>> +--------------------------------------------------------------------------------
>>> +
>>> +--
>>> +-- Arguments are listed explicitly instead of '...' because the latter does not
>>> +-- jit.
>>> +--
>>> +local function storage_api_call_safe(func, arg1, arg2, arg3, arg4)
>>> + return func(arg1, arg2, arg3, arg4)
>>> +end
>>> +
>>> +--
>>> +-- Unsafe proxy is loaded with protections. But it is used rarely and only in
>>> +-- the beginning of instance's lifetime.
>>> +--
>>> +local function storage_api_call_unsafe(func, arg1, arg2, arg3, arg4)
>>> + -- box.info is quite expensive. Avoid calling it again when the instance
>>> + -- is finally loaded.
>>> + if not M.is_loaded then
>>> + -- box.info raises an error until box.cfg() is started.
>>> + local ok, status = pcall(function()
>>> + return box.info.status
>>> + end)
>> nit: It could be changed to type(box.cfg) == 'function'. I'd call it "common" pattern to check that box is not yet configured.
>>
>>> + if not ok then
>>> + local msg = 'box seem not to be configured'
>> nit: seem -> seems?
> Thanks, all fixed:
>
> ====================
> diff --git a/test/storage/storage.result b/test/storage/storage.result
> index e83b34f..790ba11 100644
> --- a/test/storage/storage.result
> +++ b/test/storage/storage.result
> @@ -967,15 +967,15 @@ _ = test_run:switch('storage_1_a')
> -- Leaving box.cfg() not called won't work because at 1.10 test-run somewhy
> -- raises an error when try to start an instance without box.cfg(). It can only
> -- be emulated.
> -old_info = box.info
> +old_cfg = box.cfg
> ---
> ...
> -box.info = setmetatable({}, {__index = function() error('not configured') end})
> +assert(type(old_cfg) == 'table')
> ---
> +- true
> ...
> -assert(not pcall(function() return box.info.status end))
> +box.cfg = function(...) return old_cfg(...) end
> ---
> -- true
> ...
> ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100})
> ---
> @@ -984,11 +984,11 @@ assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED)
> ---
> - true
> ...
> -assert(err.message:match('box seem not to be configured') ~= nil)
> +assert(err.message:match('box seems not to be configured') ~= nil)
> ---
> - true
> ...
> -box.info = old_info
> +box.cfg = old_cfg
> ---
> ...
> -- Disabled until box is loaded.
> diff --git a/test/storage/storage.test.lua b/test/storage/storage.test.lua
> index ff39f2f..8695636 100644
> --- a/test/storage/storage.test.lua
> +++ b/test/storage/storage.test.lua
> @@ -309,14 +309,14 @@ _ = test_run:switch('storage_1_a')
> -- Leaving box.cfg() not called won't work because at 1.10 test-run somewhy
> -- raises an error when try to start an instance without box.cfg(). It can only
> -- be emulated.
> -old_info = box.info
> -box.info = setmetatable({}, {__index = function() error('not configured') end})
> -assert(not pcall(function() return box.info.status end))
> +old_cfg = box.cfg
> +assert(type(old_cfg) == 'table')
> +box.cfg = function(...) return old_cfg(...) end
>
> ok, err = pcall(vshard.storage.call, 1, 'read', 'echo', {100})
> assert(not ok and err.code == vshard.error.code.STORAGE_IS_DISABLED)
> -assert(err.message:match('box seem not to be configured') ~= nil)
> -box.info = old_info
> +assert(err.message:match('box seems not to be configured') ~= nil)
> +box.cfg = old_cfg
>
> -- Disabled until box is loaded.
> vshard.storage.internal.errinj.ERRINJ_CFG_DELAY = true
> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
> index d3c4e2a..77da663 100644
> --- a/vshard/storage/init.lua
> +++ b/vshard/storage/init.lua
> @@ -2953,14 +2953,11 @@ local function storage_api_call_unsafe(func, arg1, arg2, arg3, arg4)
> -- box.info is quite expensive. Avoid calling it again when the instance
> -- is finally loaded.
> if not M.is_loaded then
> - -- box.info raises an error until box.cfg() is started.
> - local ok, status = pcall(function()
> - return box.info.status
> - end)
> - if not ok then
> - local msg = 'box seem not to be configured'
> + if type(box.cfg) == 'function' then
> + local msg = 'box seems not to be configured'
> return error(lerror.vshard(lerror.code.STORAGE_IS_DISABLED, msg))
> end
> + local status = box.info.status
> -- 'Orphan' is allowed because even if a replica is an orphan, it still
> -- could be up to date. Just not all other replicas are connected.
> if status ~= 'running' and status ~= 'orphan' then
Thanks for your patchset. LGTM.
On 17.12.2021 03:25, Vladislav Shpilevoy wrote:
> The patchset introduces vshard.storage auto and manual enable/disable. Disabled
> storage blocks most of public API calls. The state is handled by the router in
> a special way so it transparently retries requests failed due to the storage
> being disabled.
>
> Branch:http://github.com/tarantool/vshard/tree/gerold103/gh-298-replica-backoff-part-2
> Issue:https://github.com/tarantool/vshard/issues/298
>
> Vladislav Shpilevoy (5):
> router: backoff on some box errors
> storage: auto enable/disable
> storage: manual enable/disable
> error: introduce from_string
> router: backoff on storage being disabled
>
> example/localcfg.lua | 1 -
> test/router/router2.result | 325 ++++++++++++++++++++++++++++++++++
> test/router/router2.test.lua | 132 ++++++++++++++
> test/storage/storage.result | 128 +++++++++++++
> test/storage/storage.test.lua | 55 ++++++
> test/unit/error.result | 18 ++
> test/unit/error.test.lua | 6 +
> test/unit/garbage.result | 17 +-
> test/unit/garbage.test.lua | 15 +-
> test/unit/rebalancer.result | 2 +-
> test/unit/rebalancer.test.lua | 2 +-
> vshard/consts.lua | 1 +
> vshard/error.lua | 32 +++-
> vshard/replicaset.lua | 119 +++++++++++--
> vshard/router/init.lua | 3 +-
> vshard/storage/init.lua | 207 +++++++++++++++++-----
> 16 files changed, 987 insertions(+), 76 deletions(-)
>
Pushed to master.