From mboxrd@z Thu Jan  1 00:00:00 1970
To: Vladislav Shpilevoy, tarantool-patches@dev.tarantool.org, yaroslav.dynnikov@tarantool.org
Message-ID: <2deb1054-3bc0-2f97-aeb2-db9577fdd84a@tarantool.org>
Date: Wed, 10 Feb 2021 12:00:31 +0300
Subject: Re: [Tarantool-patches] [PATCH 8/9] recovery: introduce reactive recovery
From: Oleg Babin via Tarantool-patches
Reply-To: Oleg Babin

Thanks for your patch. LGTM.

On 10/02/2021 02:46, Vladislav Shpilevoy wrote:
> Recovery is a fiber on a master node which tries to resolve
> SENDING/RECEIVING buckets into GARBAGE or ACTIVE, in case they are
> stuck. Usually it happens due to a conflict on the receiving side,
> or if a restart happens during bucket send.
>
> Recovery was proactive. It used to wakeup with a constant period
> to find and resolve the needed buckets.
>
> But this won't work with the future feature called 'map-reduce'.
> Map-reduce as a preparation stage will need to ensure that all
> buckets on a storage are readable and writable. With the current
> recovery algorithm if a bucket is broken, it won't be recovered
> for the next 5 seconds by default. During this time all new
> map-reduce requests can't execute.
>
> This is not acceptable. As well as too frequent wakeup of recovery
> fiber because it would waste TX thread time.
>
> The patch makes recovery fiber wakeup not by a timeout but by
> events happening with _bucket space. Recovery fiber sleeps on a
> condition variable which is signaled when _bucket is changed.
>
> This is very similar to the reactive GC feature in a previous
> commit.
>
> It is worth mentioning that the backoff happens not only when a
> bucket couldn't be recovered (its transfer is still in progress,
> for example), but also when a network error happened and recovery
> couldn't check state of the bucket on the other storage.
>
> It would be a useless busy loop to retry network errors
> immediately after their appearance. Recovery uses a backoff
> interval for them as well.
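For readers following the thread: the generation counter plus condition
variable scheme described above can be sketched outside Tarantool too. Below
is a rough Python analogue using threading.Condition in place of a fiber
cond; all names here are illustrative, not vshard or Tarantool API:

```python
import threading

class BucketSpace:
    """Toy stand-in for _bucket: every change bumps a generation
    counter and signals waiters, mirroring the patch's design."""
    def __init__(self):
        self.generation = 0
        self.cond = threading.Condition()

    def change(self):
        # Analogue of an on_replace trigger on _bucket.
        with self.cond:
            self.generation += 1
            self.cond.notify_all()

    def wait_for_change(self, seen_generation, timeout=None):
        # Sleep until the generation moves past what the caller saw,
        # or until the timeout (the backoff case) expires.
        with self.cond:
            self.cond.wait_for(lambda: self.generation != seen_generation,
                               timeout=timeout)
            return self.generation

space = BucketSpace()
gen = space.generation
threading.Timer(0.05, space.change).start()  # simulate a _bucket update
new_gen = space.wait_for_change(gen, timeout=1.0)
assert new_gen == gen + 1
```

The recovery fiber is the single waiter here; passing a finite timeout to
the wait also models the backoff case, where the fiber wakes up even if no
_bucket change was signaled.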
>
> Needed for #147
> ---
>  test/router/router.result             | 22 ++++++++---
>  test/router/router.test.lua           | 13 ++++++-
>  test/storage/recovery.result          |  8 ++++
>  test/storage/recovery.test.lua        |  5 +++
>  test/storage/recovery_errinj.result   | 16 +++++++-
>  test/storage/recovery_errinj.test.lua |  9 ++++-
>  vshard/consts.lua                     |  2 +-
>  vshard/storage/init.lua               | 54 +++++++++++++++++++++++----
>  8 files changed, 110 insertions(+), 19 deletions(-)
>
> diff --git a/test/router/router.result b/test/router/router.result
> index b2efd6d..3c1d073 100644
> --- a/test/router/router.result
> +++ b/test/router/router.result
> @@ -312,6 +312,11 @@ replicaset, err = vshard.router.bucket_discovery(2); return err == nil or err
>  _ = test_run:switch('storage_2_a')
>  ---
>  ...
> +-- Pause recovery. It is too aggressive, and the test needs to see buckets in
> +-- their intermediate states.
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +---
> +...
>  box.space._bucket:replace({1, vshard.consts.BUCKET.SENDING, util.replicasets[1]})
>  ---
>  - [1, 'sending', '']
> @@ -319,6 +324,9 @@ box.space._bucket:replace({1, vshard.consts.BUCKET.SENDING, util.replicasets[1]}
>  _ = test_run:switch('storage_1_a')
>  ---
>  ...
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +---
> +...
>  box.space._bucket:replace({1, vshard.consts.BUCKET.RECEIVING, util.replicasets[2]})
>  ---
>  - [1, 'receiving', '']
> @@ -342,19 +350,21 @@ util.check_error(vshard.router.call, 1, 'write', 'echo', {123})
>    name: TRANSFER_IS_IN_PROGRESS
>    message: Bucket 1 is transferring to replicaset
>  ...
> -_ = test_run:switch('storage_2_a')
> +_ = test_run:switch('storage_1_a')
> +---
> +...
> +box.space._bucket:delete({1})
>  ---
> +- [1, 'receiving', '']
>  ...
> -box.space._bucket:replace({1, vshard.consts.BUCKET.ACTIVE})
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
>  ---
> -- [1, 'active']
>  ...
> -_ = test_run:switch('storage_1_a')
> +_ = test_run:switch('storage_2_a')
>  ---
>  ...
> -box.space._bucket:delete({1})
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
>  ---
> -- [1, 'receiving', '']
>  ...
>  _ = test_run:switch('router_1')
>  ---
> diff --git a/test/router/router.test.lua b/test/router/router.test.lua
> index 154310b..aa3eb3b 100644
> --- a/test/router/router.test.lua
> +++ b/test/router/router.test.lua
> @@ -114,19 +114,28 @@ replicaset, err = vshard.router.bucket_discovery(1); return err == nil or err
>  replicaset, err = vshard.router.bucket_discovery(2); return err == nil or err
>
>  _ = test_run:switch('storage_2_a')
> +-- Pause recovery. It is too aggressive, and the test needs to see buckets in
> +-- their intermediate states.
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
>  box.space._bucket:replace({1, vshard.consts.BUCKET.SENDING, util.replicasets[1]})
> +
>  _ = test_run:switch('storage_1_a')
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
>  box.space._bucket:replace({1, vshard.consts.BUCKET.RECEIVING, util.replicasets[2]})
> +
>  _ = test_run:switch('router_1')
>  -- Ok to read sending bucket.
>  vshard.router.call(1, 'read', 'echo', {123})
>  -- Not ok to write sending bucket.
>  util.check_error(vshard.router.call, 1, 'write', 'echo', {123})
>
> -_ = test_run:switch('storage_2_a')
> -box.space._bucket:replace({1, vshard.consts.BUCKET.ACTIVE})
>  _ = test_run:switch('storage_1_a')
>  box.space._bucket:delete({1})
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
> +
> +_ = test_run:switch('storage_2_a')
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
> +
>  _ = test_run:switch('router_1')
>
>  -- Check unavailability of master of a replicaset.
> diff --git a/test/storage/recovery.result b/test/storage/recovery.result
> index 8ccb0b9..fa92bca 100644
> --- a/test/storage/recovery.result
> +++ b/test/storage/recovery.result
> @@ -28,12 +28,20 @@ util.push_rs_filters(test_run)
>  _ = test_run:switch("storage_2_a")
>  ---
>  ...
> +-- Pause until restart. Otherwise recovery does its job too fast and does not
> +-- allow to simulate the intermediate state.
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +---
> +...
>  vshard.storage.rebalancer_disable()
>  ---
>  ...
>  _ = test_run:switch("storage_1_a")
>  ---
>  ...
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +---
> +...
>  -- Create buckets sending to rs2 and restart - recovery must
>  -- garbage some of them and activate others. Receiving buckets
>  -- must be garbaged on bootstrap.
> diff --git a/test/storage/recovery.test.lua b/test/storage/recovery.test.lua
> index a0651e8..93cec68 100644
> --- a/test/storage/recovery.test.lua
> +++ b/test/storage/recovery.test.lua
> @@ -10,8 +10,13 @@ util.wait_master(test_run, REPLICASET_2, 'storage_2_a')
>  util.push_rs_filters(test_run)
>
>  _ = test_run:switch("storage_2_a")
> +-- Pause until restart. Otherwise recovery does its job too fast and does not
> +-- allow to simulate the intermediate state.
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
>  vshard.storage.rebalancer_disable()
> +
>  _ = test_run:switch("storage_1_a")
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
>
>  -- Create buckets sending to rs2 and restart - recovery must
>  -- garbage some of them and activate others. Receiving buckets
> diff --git a/test/storage/recovery_errinj.result b/test/storage/recovery_errinj.result
> index 3e9a9bf..8c178d5 100644
> --- a/test/storage/recovery_errinj.result
> +++ b/test/storage/recovery_errinj.result
> @@ -35,9 +35,17 @@ _ = test_run:switch('storage_2_a')
>  vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = true
>  ---
>  ...
> +-- Pause recovery. Otherwise it does its job too fast and does not allow to
> +-- simulate the intermediate state.
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +---
> +...
>  _ = test_run:switch('storage_1_a')
>  ---
>  ...
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +---
> +...
>  _bucket = box.space._bucket
>  ---
>  ...
> @@ -76,10 +84,16 @@ _bucket:get{1}
>  ---
>  - [1, 'active']
>  ...
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
> +---
> +...
>  _ = test_run:switch('storage_1_a')
>  ---
>  ...
> -while _bucket:count() ~= 0 do vshard.storage.recovery_wakeup() fiber.sleep(0.1) end
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
> +---
> +...
> +wait_bucket_is_collected(1)
>  ---
>  ...
>  _ = test_run:switch("default")
> diff --git a/test/storage/recovery_errinj.test.lua b/test/storage/recovery_errinj.test.lua
> index 8c1a9d2..c730560 100644
> --- a/test/storage/recovery_errinj.test.lua
> +++ b/test/storage/recovery_errinj.test.lua
> @@ -14,7 +14,12 @@ util.push_rs_filters(test_run)
>  --
>  _ = test_run:switch('storage_2_a')
>  vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = true
> +-- Pause recovery. Otherwise it does its job too fast and does not allow to
> +-- simulate the intermediate state.
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
> +
>  _ = test_run:switch('storage_1_a')
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
>  _bucket = box.space._bucket
>  _bucket:replace{1, vshard.consts.BUCKET.ACTIVE, util.replicasets[2]}
>  ret, err = vshard.storage.bucket_send(1, util.replicasets[2], {timeout = 0.1})
> @@ -27,9 +32,11 @@ vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = false
>  _bucket = box.space._bucket
>  while _bucket:get{1}.status ~= vshard.consts.BUCKET.ACTIVE do fiber.sleep(0.01) end
>  _bucket:get{1}
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
>
>  _ = test_run:switch('storage_1_a')
> -while _bucket:count() ~= 0 do vshard.storage.recovery_wakeup() fiber.sleep(0.1) end
> +vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
> +wait_bucket_is_collected(1)
>
>  _ = test_run:switch("default")
>  test_run:drop_cluster(REPLICASET_2)
> diff --git a/vshard/consts.lua b/vshard/consts.lua
> index 3f1585a..cf3f422 100644
> --- a/vshard/consts.lua
> +++ b/vshard/consts.lua
> @@ -39,7 +39,7 @@ return {
>      DEFAULT_SYNC_TIMEOUT = 1;
>      RECONNECT_TIMEOUT = 0.5;
>      GC_BACKOFF_INTERVAL = 5,
> -    RECOVERY_INTERVAL = 5;
> +    RECOVERY_BACKOFF_INTERVAL = 5,
>      COLLECT_LUA_GARBAGE_INTERVAL = 100;
>
>      DISCOVERY_IDLE_INTERVAL = 10,
> diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
> index 31a6fc7..85f5024 100644
> --- a/vshard/storage/init.lua
> +++ b/vshard/storage/init.lua
> @@ -634,13 +634,16 @@ end
>  -- Infinite function to resolve status of buckets, whose 'sending'
>  -- has failed due to tarantool or network problems. Restarts on
>  -- reload.
> --- @param module_version Module version, on which the current
> ---     function had been started. If the actual module version
> ---     appears to be changed, then stop recovery. It is
> ---     restarted in reloadable_fiber.
>  --
>  local function recovery_f()
>      local module_version = M.module_version
> +    -- Changes of _bucket increments bucket generation. Recovery has its own
> +    -- bucket generation which is <= actual. Recovery is finished, when its
> +    -- generation == bucket generation. In such a case the fiber does nothing
> +    -- until next _bucket change.
> +    local bucket_generation_recovered = -1
> +    local bucket_generation_current = M.bucket_generation
> +    local ok, sleep_time, is_all_recovered, total, recovered
>      -- Interrupt recovery if a module has been reloaded. Perhaps,
>      -- there was found a bug, and reload fixes it.
>      while module_version == M.module_version do
> @@ -648,22 +651,57 @@ local function recovery_f()
>              lfiber.yield()
>              goto continue
>          end
> -        local ok, total, recovered = pcall(recovery_step_by_type,
> -                                           consts.BUCKET.SENDING)
> +        is_all_recovered = true
> +        if bucket_generation_recovered == bucket_generation_current then
> +            goto sleep
> +        end
> +
> +        ok, total, recovered = pcall(recovery_step_by_type,
> +                                     consts.BUCKET.SENDING)
>          if not ok then
> +            is_all_recovered = false
>              log.error('Error during sending buckets recovery: %s', total)
> +        elseif total ~= recovered then
> +            is_all_recovered = false
>          end
> +
>          ok, total, recovered = pcall(recovery_step_by_type,
>                                       consts.BUCKET.RECEIVING)
>          if not ok then
> +            is_all_recovered = false
>              log.error('Error during receiving buckets recovery: %s', total)
>          elseif total == 0 then
>              bucket_receiving_quota_reset()
>          else
>              bucket_receiving_quota_add(recovered)
> +            if total ~= recovered then
> +                is_all_recovered = false
> +            end
> +        end
> +
> +    ::sleep::
> +        if not is_all_recovered then
> +            bucket_generation_recovered = -1
> +        else
> +            bucket_generation_recovered = bucket_generation_current
> +        end
> +        bucket_generation_current = M.bucket_generation
> +
> +        if not is_all_recovered then
> +            -- One option - some buckets are not broken. Their transmission is
> +            -- still in progress. Don't need to retry immediately. Another
> +            -- option - network errors when tried to repair the buckets. Also no
> +            -- need to retry often. It won't help.
> +            sleep_time = consts.RECOVERY_BACKOFF_INTERVAL
> +        elseif bucket_generation_recovered ~= bucket_generation_current then
> +            sleep_time = 0
> +        else
> +            sleep_time = consts.TIMEOUT_INFINITY
> +        end
> +        if module_version == M.module_version then
> +            M.bucket_generation_cond:wait(sleep_time)
>          end
> -        lfiber.sleep(consts.RECOVERY_INTERVAL)
> -        ::continue::
> +    ::continue::
>      end
>  end
>
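One more note for the archive: the sleep-time selection at the bottom of
recovery_f() is the heart of the backoff logic, so it may help to restate it
as a pure function. A Python sketch under names of my own choosing (only
RECOVERY_BACKOFF_INTERVAL and TIMEOUT_INFINITY mirror the patch's constants,
and the 5-second value comes from vshard/consts.lua):

```python
import math

RECOVERY_BACKOFF_INTERVAL = 5.0   # mirrors consts.RECOVERY_BACKOFF_INTERVAL
TIMEOUT_INFINITY = math.inf       # stand-in for consts.TIMEOUT_INFINITY

def next_sleep_time(is_all_recovered, recovered_gen, current_gen):
    """Decide how long the recovery fiber waits on the cond.

    - Some buckets are still unresolved (or a network error happened):
      back off, retrying immediately would be a busy loop.
    - New _bucket changes arrived while recovering: rerun right away.
    - Everything recovered and nothing changed: sleep until signaled.
    """
    if not is_all_recovered:
        return RECOVERY_BACKOFF_INTERVAL
    if recovered_gen != current_gen:
        return 0
    return TIMEOUT_INFINITY

assert next_sleep_time(False, 3, 3) == RECOVERY_BACKOFF_INTERVAL
assert next_sleep_time(True, 2, 3) == 0
assert next_sleep_time(True, 3, 3) == TIMEOUT_INFINITY
```

In all three cases the cond wait can still return early when _bucket changes,
which is what makes the fiber reactive rather than purely periodic.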