From: Oleg Babin via Tarantool-patches <tarantool-patches@dev.tarantool.org>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>,
tarantool-patches@dev.tarantool.org,
yaroslav.dynnikov@tarantool.org
Subject: Re: [Tarantool-patches] [PATCH 8/9] recovery: introduce reactive recovery
Date: Wed, 10 Feb 2021 12:00:31 +0300 [thread overview]
Message-ID: <2deb1054-3bc0-2f97-aeb2-db9577fdd84a@tarantool.org> (raw)
In-Reply-To: <e148abb893fc1ca572346c409434e3df80173dce.1612914070.git.v.shpilevoy@tarantool.org>
Thanks for your patch. LGTM.
On 10/02/2021 02:46, Vladislav Shpilevoy wrote:
> Recovery is a fiber on a master node which tries to resolve
> SENDING/RECEIVING buckets into GARBAGE or ACTIVE, in case they are
> stuck. Usually it happens due to a conflict on the receiving side,
> or if a restart happens during bucket send.
>
> Recovery was proactive. It used to wakeup with a constant period
> to find and resolve the needed buckets.
>
> But this won't work with the future feature called 'map-reduce'.
> Map-reduce as a preparation stage will need to ensure that all
> buckets on a storage are readable and writable. With the current
> recovery algorithm if a bucket is broken, it won't be recovered
> for the next 5 seconds by default. During this time all new
> map-reduce requests can't execute.
>
> This is not acceptable. As well as too frequent wakeup of recovery
> fiber because it would waste TX thread time.
>
> The patch makes recovery fiber wakeup not by a timeout but by
> events happening with _bucket space. Recovery fiber sleeps on a
> condition variable which is signaled when _bucket is changed.
>
> This is very similar to the reactive GC feature in a previous
> commit.
>
> It is worth mentioning that the backoff happens not only when a
> bucket couldn't be recovered (its transfer is still in progress,
> for example), but also when a network error happened and recovery
> couldn't check state of the bucket on the other storage.
>
> It would be a useless busy loop to retry network errors
> immediately after their appearance. Recovery uses a backoff
> interval for them as well.
>
> Needed for #147
> ---
> test/router/router.result | 22 ++++++++---
> test/router/router.test.lua | 13 ++++++-
> test/storage/recovery.result | 8 ++++
> test/storage/recovery.test.lua | 5 +++
> test/storage/recovery_errinj.result | 16 +++++++-
> test/storage/recovery_errinj.test.lua | 9 ++++-
> vshard/consts.lua | 2 +-
> vshard/storage/init.lua | 54 +++++++++++++++++++++++----
> 8 files changed, 110 insertions(+), 19 deletions(-)
>
> diff --git a/test/router/router.result b/test/router/router.result
> index b2efd6d..3c1d073 100644
> --- a/test/router/router.result
> +++ b/test/router/router.result
> @@ -312,6 +312,11 @@ replicaset, err = vshard.router.bucket_discovery(2); return err == nil or err
> _ = test_run:switch('storage_2_a')
> ---
> ...
> +-- Pause recovery. It is too aggressive, and the test needs to see buckets in
> +-- their intermediate states.
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +---
> +...
> box.space._bucket:replace({1, vshard.consts.BUCKET.SENDING, util.replicasets[1]})
> ---
> - [1, 'sending', '<replicaset_1>']
> @@ -319,6 +324,9 @@ box.space._bucket:replace({1, vshard.consts.BUCKET.SENDING, util.replicasets[1]}
> _ = test_run:switch('storage_1_a')
> ---
> ...
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +---
> +...
> box.space._bucket:replace({1, vshard.consts.BUCKET.RECEIVING, util.replicasets[2]})
> ---
> - [1, 'receiving', '<replicaset_2>']
> @@ -342,19 +350,21 @@ util.check_error(vshard.router.call, 1, 'write', 'echo', {123})
> name: TRANSFER_IS_IN_PROGRESS
> message: Bucket 1 is transferring to replicaset <replicaset_1>
> ...
> -_ = test_run:switch('storage_2_a')
> +_ = test_run:switch('storage_1_a')
> +---
> +...
> +box.space._bucket:delete({1})
> ---
> +- [1, 'receiving', '<replicaset_2>']
> ...
> -box.space._bucket:replace({1, vshard.consts.BUCKET.ACTIVE})
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
> ---
> -- [1, 'active']
> ...
> -_ = test_run:switch('storage_1_a')
> +_ = test_run:switch('storage_2_a')
> ---
> ...
> -box.space._bucket:delete({1})
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
> ---
> -- [1, 'receiving', '<replicaset_2>']
> ...
> _ = test_run:switch('router_1')
> ---
> diff --git a/test/router/router.test.lua b/test/router/router.test.lua
> index 154310b..aa3eb3b 100644
> --- a/test/router/router.test.lua
> +++ b/test/router/router.test.lua
> @@ -114,19 +114,28 @@ replicaset, err = vshard.router.bucket_discovery(1); return err == nil or err
> replicaset, err = vshard.router.bucket_discovery(2); return err == nil or err
>
> _ = test_run:switch('storage_2_a')
> +-- Pause recovery. It is too aggressive, and the test needs to see buckets in
> +-- their intermediate states.
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> box.space._bucket:replace({1, vshard.consts.BUCKET.SENDING, util.replicasets[1]})
> +
> _ = test_run:switch('storage_1_a')
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> box.space._bucket:replace({1, vshard.consts.BUCKET.RECEIVING, util.replicasets[2]})
> +
> _ = test_run:switch('router_1')
> -- Ok to read sending bucket.
> vshard.router.call(1, 'read', 'echo', {123})
> -- Not ok to write sending bucket.
> util.check_error(vshard.router.call, 1, 'write', 'echo', {123})
>
> -_ = test_run:switch('storage_2_a')
> -box.space._bucket:replace({1, vshard.consts.BUCKET.ACTIVE})
> _ = test_run:switch('storage_1_a')
> box.space._bucket:delete({1})
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
> +
> +_ = test_run:switch('storage_2_a')
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
> +
> _ = test_run:switch('router_1')
>
> -- Check unavailability of master of a replicaset.
> diff --git a/test/storage/recovery.result b/test/storage/recovery.result
> index 8ccb0b9..fa92bca 100644
> --- a/test/storage/recovery.result
> +++ b/test/storage/recovery.result
> @@ -28,12 +28,20 @@ util.push_rs_filters(test_run)
> _ = test_run:switch("storage_2_a")
> ---
> ...
> +-- Pause until restart. Otherwise recovery does its job too fast and does not
> +-- allow to simulate the intermediate state.
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +---
> +...
> vshard.storage.rebalancer_disable()
> ---
> ...
> _ = test_run:switch("storage_1_a")
> ---
> ...
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +---
> +...
> -- Create buckets sending to rs2 and restart - recovery must
> -- garbage some of them and activate others. Receiving buckets
> -- must be garbaged on bootstrap.
> diff --git a/test/storage/recovery.test.lua b/test/storage/recovery.test.lua
> index a0651e8..93cec68 100644
> --- a/test/storage/recovery.test.lua
> +++ b/test/storage/recovery.test.lua
> @@ -10,8 +10,13 @@ util.wait_master(test_run, REPLICASET_2, 'storage_2_a')
> util.push_rs_filters(test_run)
>
> _ = test_run:switch("storage_2_a")
> +-- Pause until restart. Otherwise recovery does its job too fast and does not
> +-- allow to simulate the intermediate state.
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> vshard.storage.rebalancer_disable()
> +
> _ = test_run:switch("storage_1_a")
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
>
> -- Create buckets sending to rs2 and restart - recovery must
> -- garbage some of them and activate others. Receiving buckets
> diff --git a/test/storage/recovery_errinj.result b/test/storage/recovery_errinj.result
> index 3e9a9bf..8c178d5 100644
> --- a/test/storage/recovery_errinj.result
> +++ b/test/storage/recovery_errinj.result
> @@ -35,9 +35,17 @@ _ = test_run:switch('storage_2_a')
> vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = true
> ---
> ...
> +-- Pause recovery. Otherwise it does its job too fast and does not allow to
> +-- simulate the intermediate state.
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +---
> +...
> _ = test_run:switch('storage_1_a')
> ---
> ...
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +---
> +...
> _bucket = box.space._bucket
> ---
> ...
> @@ -76,10 +84,16 @@ _bucket:get{1}
> ---
> - [1, 'active']
> ...
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
> +---
> +...
> _ = test_run:switch('storage_1_a')
> ---
> ...
> -while _bucket:count() ~= 0 do vshard.storage.recovery_wakeup() fiber.sleep(0.1) end
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
> +---
> +...
> +wait_bucket_is_collected(1)
> ---
> ...
> _ = test_run:switch("default")
> diff --git a/test/storage/recovery_errinj.test.lua b/test/storage/recovery_errinj.test.lua
> index 8c1a9d2..c730560 100644
> --- a/test/storage/recovery_errinj.test.lua
> +++ b/test/storage/recovery_errinj.test.lua
> @@ -14,7 +14,12 @@ util.push_rs_filters(test_run)
> --
> _ = test_run:switch('storage_2_a')
> vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = true
> +-- Pause recovery. Otherwise it does its job too fast and does not allow to
> +-- simulate the intermediate state.
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +
> _ = test_run:switch('storage_1_a')
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> _bucket = box.space._bucket
> _bucket:replace{1, vshard.consts.BUCKET.ACTIVE, util.replicasets[2]}
> ret, err = vshard.storage.bucket_send(1, util.replicasets[2], {timeout = 0.1})
> @@ -27,9 +32,11 @@ vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = false
> _bucket = box.space._bucket
> while _bucket:get{1}.status ~= vshard.consts.BUCKET.ACTIVE do fiber.sleep(0.01) end
> _bucket:get{1}
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
>
> _ = test_run:switch('storage_1_a')
> -while _bucket:count() ~= 0 do vshard.storage.recovery_wakeup() fiber.sleep(0.1) end
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
> +wait_bucket_is_collected(1)
>
> _ = test_run:switch("default")
> test_run:drop_cluster(REPLICASET_2)
> diff --git a/vshard/consts.lua b/vshard/consts.lua
> index 3f1585a..cf3f422 100644
> --- a/vshard/consts.lua
> +++ b/vshard/consts.lua
> @@ -39,7 +39,7 @@ return {
> DEFAULT_SYNC_TIMEOUT = 1;
> RECONNECT_TIMEOUT = 0.5;
> GC_BACKOFF_INTERVAL = 5,
> - RECOVERY_INTERVAL = 5;
> + RECOVERY_BACKOFF_INTERVAL = 5,
> COLLECT_LUA_GARBAGE_INTERVAL = 100;
>
> DISCOVERY_IDLE_INTERVAL = 10,
> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
> index 31a6fc7..85f5024 100644
> --- a/vshard/storage/init.lua
> +++ b/vshard/storage/init.lua
> @@ -634,13 +634,16 @@ end
> -- Infinite function to resolve status of buckets, whose 'sending'
> -- has failed due to tarantool or network problems. Restarts on
> -- reload.
> --- @param module_version Module version, on which the current
> --- function had been started. If the actual module version
> --- appears to be changed, then stop recovery. It is
> --- restarted in reloadable_fiber.
> --
> local function recovery_f()
> local module_version = M.module_version
> + -- Changes of _bucket increments bucket generation. Recovery has its own
> + -- bucket generation which is <= actual. Recovery is finished, when its
> + -- generation == bucket generation. In such a case the fiber does nothing
> + -- until next _bucket change.
> + local bucket_generation_recovered = -1
> + local bucket_generation_current = M.bucket_generation
> + local ok, sleep_time, is_all_recovered, total, recovered
> -- Interrupt recovery if a module has been reloaded. Perhaps,
> -- there was found a bug, and reload fixes it.
> while module_version == M.module_version do
> @@ -648,22 +651,57 @@ local function recovery_f()
> lfiber.yield()
> goto continue
> end
> - local ok, total, recovered = pcall(recovery_step_by_type,
> - consts.BUCKET.SENDING)
> + is_all_recovered = true
> + if bucket_generation_recovered == bucket_generation_current then
> + goto sleep
> + end
> +
> + ok, total, recovered = pcall(recovery_step_by_type,
> + consts.BUCKET.SENDING)
> if not ok then
> + is_all_recovered = false
> log.error('Error during sending buckets recovery: %s', total)
> + elseif total ~= recovered then
> + is_all_recovered = false
> end
> +
> ok, total, recovered = pcall(recovery_step_by_type,
> consts.BUCKET.RECEIVING)
> if not ok then
> + is_all_recovered = false
> log.error('Error during receiving buckets recovery: %s', total)
> elseif total == 0 then
> bucket_receiving_quota_reset()
> else
> bucket_receiving_quota_add(recovered)
> + if total ~= recovered then
> + is_all_recovered = false
> + end
> + end
> +
> + ::sleep::
> + if not is_all_recovered then
> + bucket_generation_recovered = -1
> + else
> + bucket_generation_recovered = bucket_generation_current
> + end
> + bucket_generation_current = M.bucket_generation
> +
> + if not is_all_recovered then
> + -- One option - some buckets are not broken. Their transmission is
> + -- still in progress. Don't need to retry immediately. Another
> + -- option - network errors when tried to repair the buckets. Also no
> + -- need to retry often. It won't help.
> + sleep_time = consts.RECOVERY_BACKOFF_INTERVAL
> + elseif bucket_generation_recovered ~= bucket_generation_current then
> + sleep_time = 0
> + else
> + sleep_time = consts.TIMEOUT_INFINITY
> + end
> + if module_version == M.module_version then
> + M.bucket_generation_cond:wait(sleep_time)
> end
> - lfiber.sleep(consts.RECOVERY_INTERVAL)
> - ::continue::
> + ::continue::
> end
> end
>
next prev parent reply other threads:[~2021-02-10 9:01 UTC|newest]
Thread overview: 36+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-02-09 23:46 [Tarantool-patches] [PATCH 0/9] VShard Map-Reduce, part 1, preparations Vladislav Shpilevoy via Tarantool-patches
2021-02-09 23:46 ` [Tarantool-patches] [PATCH 1/9] rlist: move rlist to a new module Vladislav Shpilevoy via Tarantool-patches
2021-02-10 8:57 ` Oleg Babin via Tarantool-patches
2021-02-11 6:50 ` Oleg Babin via Tarantool-patches
2021-02-12 0:09 ` Vladislav Shpilevoy via Tarantool-patches
2021-02-09 23:46 ` [Tarantool-patches] [PATCH 2/9] Use fiber.clock() instead of .time() everywhere Vladislav Shpilevoy via Tarantool-patches
2021-02-10 8:57 ` Oleg Babin via Tarantool-patches
2021-02-10 22:33 ` Vladislav Shpilevoy via Tarantool-patches
2021-02-09 23:46 ` [Tarantool-patches] [PATCH 3/9] test: introduce a helper to wait for bucket GC Vladislav Shpilevoy via Tarantool-patches
2021-02-10 8:57 ` Oleg Babin via Tarantool-patches
2021-02-10 22:33 ` Vladislav Shpilevoy via Tarantool-patches
2021-02-11 6:50 ` Oleg Babin via Tarantool-patches
2021-02-09 23:46 ` [Tarantool-patches] [PATCH 4/9] storage: bucket_recv() should check rs lock Vladislav Shpilevoy via Tarantool-patches
2021-02-10 8:59 ` Oleg Babin via Tarantool-patches
2021-02-09 23:46 ` [Tarantool-patches] [PATCH 5/9] util: introduce yielding table functions Vladislav Shpilevoy via Tarantool-patches
2021-02-10 8:59 ` Oleg Babin via Tarantool-patches
2021-02-10 22:34 ` Vladislav Shpilevoy via Tarantool-patches
2021-02-11 6:50 ` Oleg Babin via Tarantool-patches
2021-02-09 23:46 ` [Tarantool-patches] [PATCH 6/9] cfg: introduce 'deprecated option' feature Vladislav Shpilevoy via Tarantool-patches
2021-02-10 8:59 ` Oleg Babin via Tarantool-patches
2021-02-10 22:34 ` Vladislav Shpilevoy via Tarantool-patches
2021-02-11 6:50 ` Oleg Babin via Tarantool-patches
2021-02-09 23:46 ` [Tarantool-patches] [PATCH 7/9] gc: introduce reactive garbage collector Vladislav Shpilevoy via Tarantool-patches
2021-02-10 9:00 ` Oleg Babin via Tarantool-patches
2021-02-10 22:35 ` Vladislav Shpilevoy via Tarantool-patches
2021-02-11 6:50 ` Oleg Babin via Tarantool-patches
2021-02-09 23:46 ` [Tarantool-patches] [PATCH 8/9] recovery: introduce reactive recovery Vladislav Shpilevoy via Tarantool-patches
2021-02-10 9:00 ` Oleg Babin via Tarantool-patches [this message]
2021-02-09 23:46 ` [Tarantool-patches] [PATCH 9/9] util: introduce binary heap data structure Vladislav Shpilevoy via Tarantool-patches
2021-02-10 9:01 ` Oleg Babin via Tarantool-patches
2021-02-10 22:36 ` Vladislav Shpilevoy via Tarantool-patches
2021-02-11 6:51 ` Oleg Babin via Tarantool-patches
2021-02-12 0:09 ` Vladislav Shpilevoy via Tarantool-patches
2021-03-05 22:03 ` Vladislav Shpilevoy via Tarantool-patches
2021-02-09 23:51 ` [Tarantool-patches] [PATCH 0/9] VShard Map-Reduce, part 1, preparations Vladislav Shpilevoy via Tarantool-patches
2021-02-12 11:02 ` Oleg Babin via Tarantool-patches
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2deb1054-3bc0-2f97-aeb2-db9577fdd84a@tarantool.org \
--to=tarantool-patches@dev.tarantool.org \
--cc=olegrok@tarantool.org \
--cc=v.shpilevoy@tarantool.org \
--cc=yaroslav.dynnikov@tarantool.org \
--subject='Re: [Tarantool-patches] [PATCH 8/9] recovery: introduce reactive recovery' \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox