Tarantool development patches archive
 help / color / mirror / Atom feed
From: Oleg Babin via Tarantool-patches <tarantool-patches@dev.tarantool.org>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>,
	tarantool-patches@dev.tarantool.org,
	yaroslav.dynnikov@tarantool.org
Subject: Re: [Tarantool-patches] [PATCH 8/9] recovery: introduce reactive recovery
Date: Wed, 10 Feb 2021 12:00:31 +0300
Message-ID: <2deb1054-3bc0-2f97-aeb2-db9577fdd84a@tarantool.org> (raw)
In-Reply-To: <e148abb893fc1ca572346c409434e3df80173dce.1612914070.git.v.shpilevoy@tarantool.org>

Thanks for your patch. LGTM.

On 10/02/2021 02:46, Vladislav Shpilevoy wrote:
> Recovery is a fiber on a master node which tries to resolve
> SENDING/RECEIVING buckets into GARBAGE or ACTIVE, in case they are
> stuck. Usually it happens due to a conflict on the receiving side,
> or if a restart happens during bucket send.
>
> Recovery was proactive. It used to wakeup with a constant period
> to find and resolve the needed buckets.
>
> But this won't work with the future feature called 'map-reduce'.
> Map-reduce as a preparation stage will need to ensure that all
> buckets on a storage are readable and writable. With the current
> recovery algorithm if a bucket is broken, it won't be recovered
> for the next 5 seconds by default. During this time all new
> map-reduce requests can't execute.
>
> This is not acceptable. As well as too frequent wakeup of recovery
> fiber because it would waste TX thread time.
>
> The patch makes recovery fiber wakeup not by a timeout but by
> events happening with _bucket space. Recovery fiber sleeps on a
> condition variable which is signaled when _bucket is changed.
>
> This is very similar to the reactive GC feature in a previous
> commit.
>
> It is worth mentioning that the backoff happens not only when a
> bucket couldn't be recovered (its transfer is still in progress,
> for example), but also when a network error happened and recovery
> couldn't check state of the bucket on the other storage.
>
> It would be a useless busy loop to retry network errors
> immediately after their appearance. Recovery uses a backoff
> interval for them as well.
>
> Needed for #147
> ---
>   test/router/router.result             | 22 ++++++++---
>   test/router/router.test.lua           | 13 ++++++-
>   test/storage/recovery.result          |  8 ++++
>   test/storage/recovery.test.lua        |  5 +++
>   test/storage/recovery_errinj.result   | 16 +++++++-
>   test/storage/recovery_errinj.test.lua |  9 ++++-
>   vshard/consts.lua                     |  2 +-
>   vshard/storage/init.lua               | 54 +++++++++++++++++++++++----
>   8 files changed, 110 insertions(+), 19 deletions(-)
>
> diff --git a/test/router/router.result b/test/router/router.result
> index b2efd6d..3c1d073 100644
> --- a/test/router/router.result
> +++ b/test/router/router.result
> @@ -312,6 +312,11 @@ replicaset, err = vshard.router.bucket_discovery(2); return err == nil or err
>   _ = test_run:switch('storage_2_a')
>   ---
>   ...
> +-- Pause recovery. It is too aggressive, and the test needs to see buckets in
> +-- their intermediate states.
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +---
> +...
>   box.space._bucket:replace({1, vshard.consts.BUCKET.SENDING, util.replicasets[1]})
>   ---
>   - [1, 'sending', '<replicaset_1>']
> @@ -319,6 +324,9 @@ box.space._bucket:replace({1, vshard.consts.BUCKET.SENDING, util.replicasets[1]}
>   _ = test_run:switch('storage_1_a')
>   ---
>   ...
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +---
> +...
>   box.space._bucket:replace({1, vshard.consts.BUCKET.RECEIVING, util.replicasets[2]})
>   ---
>   - [1, 'receiving', '<replicaset_2>']
> @@ -342,19 +350,21 @@ util.check_error(vshard.router.call, 1, 'write', 'echo', {123})
>     name: TRANSFER_IS_IN_PROGRESS
>     message: Bucket 1 is transferring to replicaset <replicaset_1>
>   ...
> -_ = test_run:switch('storage_2_a')
> +_ = test_run:switch('storage_1_a')
> +---
> +...
> +box.space._bucket:delete({1})
>   ---
> +- [1, 'receiving', '<replicaset_2>']
>   ...
> -box.space._bucket:replace({1, vshard.consts.BUCKET.ACTIVE})
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
>   ---
> -- [1, 'active']
>   ...
> -_ = test_run:switch('storage_1_a')
> +_ = test_run:switch('storage_2_a')
>   ---
>   ...
> -box.space._bucket:delete({1})
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
>   ---
> -- [1, 'receiving', '<replicaset_2>']
>   ...
>   _ = test_run:switch('router_1')
>   ---
> diff --git a/test/router/router.test.lua b/test/router/router.test.lua
> index 154310b..aa3eb3b 100644
> --- a/test/router/router.test.lua
> +++ b/test/router/router.test.lua
> @@ -114,19 +114,28 @@ replicaset, err = vshard.router.bucket_discovery(1); return err == nil or err
>   replicaset, err = vshard.router.bucket_discovery(2); return err == nil or err
>   
>   _ = test_run:switch('storage_2_a')
> +-- Pause recovery. It is too aggressive, and the test needs to see buckets in
> +-- their intermediate states.
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
>   box.space._bucket:replace({1, vshard.consts.BUCKET.SENDING, util.replicasets[1]})
> +
>   _ = test_run:switch('storage_1_a')
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
>   box.space._bucket:replace({1, vshard.consts.BUCKET.RECEIVING, util.replicasets[2]})
> +
>   _ = test_run:switch('router_1')
>   -- Ok to read sending bucket.
>   vshard.router.call(1, 'read', 'echo', {123})
>   -- Not ok to write sending bucket.
>   util.check_error(vshard.router.call, 1, 'write', 'echo', {123})
>   
> -_ = test_run:switch('storage_2_a')
> -box.space._bucket:replace({1, vshard.consts.BUCKET.ACTIVE})
>   _ = test_run:switch('storage_1_a')
>   box.space._bucket:delete({1})
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
> +
> +_ = test_run:switch('storage_2_a')
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
> +
>   _ = test_run:switch('router_1')
>   
>   -- Check unavailability of master of a replicaset.
> diff --git a/test/storage/recovery.result b/test/storage/recovery.result
> index 8ccb0b9..fa92bca 100644
> --- a/test/storage/recovery.result
> +++ b/test/storage/recovery.result
> @@ -28,12 +28,20 @@ util.push_rs_filters(test_run)
>   _ = test_run:switch("storage_2_a")
>   ---
>   ...
> +-- Pause until restart. Otherwise recovery does its job too fast and does not
> +-- allow to simulate the intermediate state.
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +---
> +...
>   vshard.storage.rebalancer_disable()
>   ---
>   ...
>   _ = test_run:switch("storage_1_a")
>   ---
>   ...
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +---
> +...
>   -- Create buckets sending to rs2 and restart - recovery must
>   -- garbage some of them and activate others. Receiving buckets
>   -- must be garbaged on bootstrap.
> diff --git a/test/storage/recovery.test.lua b/test/storage/recovery.test.lua
> index a0651e8..93cec68 100644
> --- a/test/storage/recovery.test.lua
> +++ b/test/storage/recovery.test.lua
> @@ -10,8 +10,13 @@ util.wait_master(test_run, REPLICASET_2, 'storage_2_a')
>   util.push_rs_filters(test_run)
>   
>   _ = test_run:switch("storage_2_a")
> +-- Pause until restart. Otherwise recovery does its job too fast and does not
> +-- allow to simulate the intermediate state.
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
>   vshard.storage.rebalancer_disable()
> +
>   _ = test_run:switch("storage_1_a")
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
>   
>   -- Create buckets sending to rs2 and restart - recovery must
>   -- garbage some of them and activate others. Receiving buckets
> diff --git a/test/storage/recovery_errinj.result b/test/storage/recovery_errinj.result
> index 3e9a9bf..8c178d5 100644
> --- a/test/storage/recovery_errinj.result
> +++ b/test/storage/recovery_errinj.result
> @@ -35,9 +35,17 @@ _ = test_run:switch('storage_2_a')
>   vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = true
>   ---
>   ...
> +-- Pause recovery. Otherwise it does its job too fast and does not allow to
> +-- simulate the intermediate state.
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +---
> +...
>   _ = test_run:switch('storage_1_a')
>   ---
>   ...
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +---
> +...
>   _bucket = box.space._bucket
>   ---
>   ...
> @@ -76,10 +84,16 @@ _bucket:get{1}
>   ---
>   - [1, 'active']
>   ...
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
> +---
> +...
>   _ = test_run:switch('storage_1_a')
>   ---
>   ...
> -while _bucket:count() ~= 0 do vshard.storage.recovery_wakeup() fiber.sleep(0.1) end
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
> +---
> +...
> +wait_bucket_is_collected(1)
>   ---
>   ...
>   _ = test_run:switch("default")
> diff --git a/test/storage/recovery_errinj.test.lua b/test/storage/recovery_errinj.test.lua
> index 8c1a9d2..c730560 100644
> --- a/test/storage/recovery_errinj.test.lua
> +++ b/test/storage/recovery_errinj.test.lua
> @@ -14,7 +14,12 @@ util.push_rs_filters(test_run)
>   --
>   _ = test_run:switch('storage_2_a')
>   vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = true
> +-- Pause recovery. Otherwise it does its job too fast and does not allow to
> +-- simulate the intermediate state.
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +
>   _ = test_run:switch('storage_1_a')
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
>   _bucket = box.space._bucket
>   _bucket:replace{1, vshard.consts.BUCKET.ACTIVE, util.replicasets[2]}
>   ret, err = vshard.storage.bucket_send(1, util.replicasets[2], {timeout = 0.1})
> @@ -27,9 +32,11 @@ vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = false
>   _bucket = box.space._bucket
>   while _bucket:get{1}.status ~= vshard.consts.BUCKET.ACTIVE do fiber.sleep(0.01) end
>   _bucket:get{1}
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
>   
>   _ = test_run:switch('storage_1_a')
> -while _bucket:count() ~= 0 do vshard.storage.recovery_wakeup() fiber.sleep(0.1) end
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
> +wait_bucket_is_collected(1)
>   
>   _ = test_run:switch("default")
>   test_run:drop_cluster(REPLICASET_2)
> diff --git a/vshard/consts.lua b/vshard/consts.lua
> index 3f1585a..cf3f422 100644
> --- a/vshard/consts.lua
> +++ b/vshard/consts.lua
> @@ -39,7 +39,7 @@ return {
>       DEFAULT_SYNC_TIMEOUT = 1;
>       RECONNECT_TIMEOUT = 0.5;
>       GC_BACKOFF_INTERVAL = 5,
> -    RECOVERY_INTERVAL = 5;
> +    RECOVERY_BACKOFF_INTERVAL = 5,
>       COLLECT_LUA_GARBAGE_INTERVAL = 100;
>   
>       DISCOVERY_IDLE_INTERVAL = 10,
> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
> index 31a6fc7..85f5024 100644
> --- a/vshard/storage/init.lua
> +++ b/vshard/storage/init.lua
> @@ -634,13 +634,16 @@ end
>   -- Infinite function to resolve status of buckets, whose 'sending'
>   -- has failed due to tarantool or network problems. Restarts on
>   -- reload.
> --- @param module_version Module version, on which the current
> ---        function had been started. If the actual module version
> ---        appears to be changed, then stop recovery. It is
> ---        restarted in reloadable_fiber.
>   --
>   local function recovery_f()
>       local module_version = M.module_version
> +    -- Changes of _bucket increments bucket generation. Recovery has its own
> +    -- bucket generation which is <= actual. Recovery is finished, when its
> +    -- generation == bucket generation. In such a case the fiber does nothing
> +    -- until next _bucket change.
> +    local bucket_generation_recovered = -1
> +    local bucket_generation_current = M.bucket_generation
> +    local ok, sleep_time, is_all_recovered, total, recovered
>       -- Interrupt recovery if a module has been reloaded. Perhaps,
>       -- there was found a bug, and reload fixes it.
>       while module_version == M.module_version do
> @@ -648,22 +651,57 @@ local function recovery_f()
>               lfiber.yield()
>               goto continue
>           end
> -        local ok, total, recovered = pcall(recovery_step_by_type,
> -                                           consts.BUCKET.SENDING)
> +        is_all_recovered = true
> +        if bucket_generation_recovered == bucket_generation_current then
> +            goto sleep
> +        end
> +
> +        ok, total, recovered = pcall(recovery_step_by_type,
> +                                     consts.BUCKET.SENDING)
>           if not ok then
> +            is_all_recovered = false
>               log.error('Error during sending buckets recovery: %s', total)
> +        elseif total ~= recovered then
> +            is_all_recovered = false
>           end
> +
>           ok, total, recovered = pcall(recovery_step_by_type,
>                                        consts.BUCKET.RECEIVING)
>           if not ok then
> +            is_all_recovered = false
>               log.error('Error during receiving buckets recovery: %s', total)
>           elseif total == 0 then
>               bucket_receiving_quota_reset()
>           else
>               bucket_receiving_quota_add(recovered)
> +            if total ~= recovered then
> +                is_all_recovered = false
> +            end
> +        end
> +
> +    ::sleep::
> +        if not is_all_recovered then
> +            bucket_generation_recovered = -1
> +        else
> +            bucket_generation_recovered = bucket_generation_current
> +        end
> +        bucket_generation_current = M.bucket_generation
> +
> +        if not is_all_recovered then
> +            -- One option - some buckets are not broken. Their transmission is
> +            -- still in progress. Don't need to retry immediately. Another
> +            -- option - network errors when tried to repair the buckets. Also no
> +            -- need to retry often. It won't help.
> +            sleep_time = consts.RECOVERY_BACKOFF_INTERVAL
> +        elseif bucket_generation_recovered ~= bucket_generation_current then
> +            sleep_time = 0
> +        else
> +            sleep_time = consts.TIMEOUT_INFINITY
> +        end
> +        if module_version == M.module_version then
> +            M.bucket_generation_cond:wait(sleep_time)
>           end
> -        lfiber.sleep(consts.RECOVERY_INTERVAL)
> -        ::continue::
> +    ::continue::
>       end
>   end
>   

  reply	other threads:[~2021-02-10  9:01 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-09 23:46 [Tarantool-patches] [PATCH 0/9] VShard Map-Reduce, part 1, preparations Vladislav Shpilevoy via Tarantool-patches
2021-02-09 23:46 ` [Tarantool-patches] [PATCH 1/9] rlist: move rlist to a new module Vladislav Shpilevoy via Tarantool-patches
2021-02-10  8:57   ` Oleg Babin via Tarantool-patches
2021-02-11  6:50     ` Oleg Babin via Tarantool-patches
2021-02-12  0:09       ` Vladislav Shpilevoy via Tarantool-patches
2021-02-09 23:46 ` [Tarantool-patches] [PATCH 2/9] Use fiber.clock() instead of .time() everywhere Vladislav Shpilevoy via Tarantool-patches
2021-02-10  8:57   ` Oleg Babin via Tarantool-patches
2021-02-10 22:33     ` Vladislav Shpilevoy via Tarantool-patches
2021-02-09 23:46 ` [Tarantool-patches] [PATCH 3/9] test: introduce a helper to wait for bucket GC Vladislav Shpilevoy via Tarantool-patches
2021-02-10  8:57   ` Oleg Babin via Tarantool-patches
2021-02-10 22:33     ` Vladislav Shpilevoy via Tarantool-patches
2021-02-11  6:50       ` Oleg Babin via Tarantool-patches
2021-02-09 23:46 ` [Tarantool-patches] [PATCH 4/9] storage: bucket_recv() should check rs lock Vladislav Shpilevoy via Tarantool-patches
2021-02-10  8:59   ` Oleg Babin via Tarantool-patches
2021-02-09 23:46 ` [Tarantool-patches] [PATCH 5/9] util: introduce yielding table functions Vladislav Shpilevoy via Tarantool-patches
2021-02-10  8:59   ` Oleg Babin via Tarantool-patches
2021-02-10 22:34     ` Vladislav Shpilevoy via Tarantool-patches
2021-02-11  6:50       ` Oleg Babin via Tarantool-patches
2021-02-09 23:46 ` [Tarantool-patches] [PATCH 6/9] cfg: introduce 'deprecated option' feature Vladislav Shpilevoy via Tarantool-patches
2021-02-10  8:59   ` Oleg Babin via Tarantool-patches
2021-02-10 22:34     ` Vladislav Shpilevoy via Tarantool-patches
2021-02-11  6:50       ` Oleg Babin via Tarantool-patches
2021-02-09 23:46 ` [Tarantool-patches] [PATCH 7/9] gc: introduce reactive garbage collector Vladislav Shpilevoy via Tarantool-patches
2021-02-10  9:00   ` Oleg Babin via Tarantool-patches
2021-02-10 22:35     ` Vladislav Shpilevoy via Tarantool-patches
2021-02-11  6:50       ` Oleg Babin via Tarantool-patches
2021-02-09 23:46 ` [Tarantool-patches] [PATCH 8/9] recovery: introduce reactive recovery Vladislav Shpilevoy via Tarantool-patches
2021-02-10  9:00   ` Oleg Babin via Tarantool-patches [this message]
2021-02-09 23:46 ` [Tarantool-patches] [PATCH 9/9] util: introduce binary heap data structure Vladislav Shpilevoy via Tarantool-patches
2021-02-10  9:01   ` Oleg Babin via Tarantool-patches
2021-02-10 22:36     ` Vladislav Shpilevoy via Tarantool-patches
2021-02-11  6:51       ` Oleg Babin via Tarantool-patches
2021-02-12  0:09         ` Vladislav Shpilevoy via Tarantool-patches
2021-03-05 22:03   ` Vladislav Shpilevoy via Tarantool-patches
2021-02-09 23:51 ` [Tarantool-patches] [PATCH 0/9] VShard Map-Reduce, part 1, preparations Vladislav Shpilevoy via Tarantool-patches
2021-02-12 11:02   ` Oleg Babin via Tarantool-patches

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2deb1054-3bc0-2f97-aeb2-db9577fdd84a@tarantool.org \
    --to=tarantool-patches@dev.tarantool.org \
    --cc=olegrok@tarantool.org \
    --cc=v.shpilevoy@tarantool.org \
    --cc=yaroslav.dynnikov@tarantool.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Tarantool development patches archive

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://lists.tarantool.org/tarantool-patches/0 tarantool-patches/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 tarantool-patches tarantool-patches/ https://lists.tarantool.org/tarantool-patches \
		tarantool-patches@dev.tarantool.org.
	public-inbox-index tarantool-patches

Example config snippet for mirrors.


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git