From mboxrd@z Thu Jan 1 00:00:00 1970
To: tarantool-patches@dev.tarantool.org, olegrok@tarantool.org,
	yaroslav.dynnikov@tarantool.org
Date: Wed, 10 Feb 2021 00:46:14 +0100
Message-Id: 
X-Mailer: git-send-email 2.24.3 (Apple Git-128)
In-Reply-To: 
References: 
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Subject: [Tarantool-patches] [PATCH 8/9] recovery: introduce
 reactive recovery
List-Id: Tarantool development patches
From: Vladislav Shpilevoy via Tarantool-patches
Reply-To: Vladislav Shpilevoy

Recovery is a fiber on a master node which tries to resolve
SENDING/RECEIVING buckets into GARBAGE or ACTIVE when they are stuck.
Usually this happens due to a conflict on the receiving side, or when a
restart occurs during a bucket send.

Recovery used to be proactive: it woke up at a constant interval to find
and resolve the buckets needing it.

But this won't work with the future feature called 'map-reduce'. As a
preparation stage, map-reduce will need to ensure that all buckets on a
storage are readable and writable. With the current recovery algorithm a
broken bucket may stay unrecovered for up to 5 seconds (the default
interval), and during that time no new map-reduce request can execute.
This is not acceptable. Neither is waking the recovery fiber up much more
frequently, because that would waste TX thread time.

The patch makes the recovery fiber wake up not on a timeout, but on
events happening with the _bucket space. The recovery fiber sleeps on a
condition variable which is signaled when _bucket is changed. This is
very similar to the reactive GC feature in the previous commit.

It is worth mentioning that the backoff happens not only when a bucket
couldn't be recovered (its transfer is still in progress, for example),
but also when a network error prevented recovery from checking the
bucket's state on the other storage. Retrying network errors immediately
after they appear would be a useless busy loop, so recovery uses the
backoff interval for them as well.

Needed for #147
---
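For illustration, a minimal sketch of the wakeup mechanism described
above. It builds on the bucket_generation counter and the
bucket_generation_cond condition variable from the previous (reactive
GC) commit; bucket_on_replace() and wait_bucket_change() are
illustrative names, and the exact trigger wiring in vshard may differ:

    local fiber = require('fiber')

    local M = {
        bucket_generation = 0,
        bucket_generation_cond = fiber.cond(),
    }

    -- On_replace trigger for box.space._bucket. Any change of the space
    -- bumps the generation and wakes up every fiber sleeping on the
    -- condition variable: recovery, GC, and so on.
    local function bucket_on_replace()
        M.bucket_generation = M.bucket_generation + 1
        M.bucket_generation_cond:broadcast()
    end

    box.space._bucket:on_replace(bucket_on_replace)

    -- A reactive fiber then sleeps until _bucket changes or the timeout
    -- expires, instead of polling at a fixed interval.
    local function wait_bucket_change(timeout)
        M.bucket_generation_cond:wait(timeout)
    end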
 test/router/router.result             | 22 ++++++++---
 test/router/router.test.lua           | 13 ++++++-
 test/storage/recovery.result          |  8 ++++
 test/storage/recovery.test.lua        |  5 +++
 test/storage/recovery_errinj.result   | 16 +++++++-
 test/storage/recovery_errinj.test.lua |  9 ++++-
 vshard/consts.lua                     |  2 +-
 vshard/storage/init.lua               | 54 +++++++++++++++++++++++----
 8 files changed, 110 insertions(+), 19 deletions(-)

diff --git a/test/router/router.result b/test/router/router.result
index b2efd6d..3c1d073 100644
--- a/test/router/router.result
+++ b/test/router/router.result
@@ -312,6 +312,11 @@ replicaset, err = vshard.router.bucket_discovery(2); return err == nil or err
 _ = test_run:switch('storage_2_a')
 ---
 ...
+-- Pause recovery. It is too aggressive, and the test needs to see buckets in
+-- their intermediate states.
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
+---
+...
 box.space._bucket:replace({1, vshard.consts.BUCKET.SENDING, util.replicasets[1]})
 ---
 - [1, 'sending', '']
@@ -319,6 +324,9 @@ box.space._bucket:replace({1, vshard.consts.BUCKET.SENDING, util.replicasets[1]})
 _ = test_run:switch('storage_1_a')
 ---
 ...
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
+---
+...
 box.space._bucket:replace({1, vshard.consts.BUCKET.RECEIVING, util.replicasets[2]})
 ---
 - [1, 'receiving', '']
 ...
@@ -342,19 +350,21 @@ util.check_error(vshard.router.call, 1, 'write', 'echo', {123})
   name: TRANSFER_IS_IN_PROGRESS
   message: Bucket 1 is transferring to replicaset
 ...
-_ = test_run:switch('storage_2_a')
+_ = test_run:switch('storage_1_a')
+---
+...
+box.space._bucket:delete({1})
 ---
+- [1, 'receiving', '']
 ...
-box.space._bucket:replace({1, vshard.consts.BUCKET.ACTIVE})
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
 ---
-- [1, 'active']
 ...
-_ = test_run:switch('storage_1_a')
+_ = test_run:switch('storage_2_a')
 ---
 ...
-box.space._bucket:delete({1})
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
 ---
-- [1, 'receiving', '']
 ...
 _ = test_run:switch('router_1')
 ---
diff --git a/test/router/router.test.lua b/test/router/router.test.lua
index 154310b..aa3eb3b 100644
--- a/test/router/router.test.lua
+++ b/test/router/router.test.lua
@@ -114,19 +114,28 @@ replicaset, err = vshard.router.bucket_discovery(1); return err == nil or err
 replicaset, err = vshard.router.bucket_discovery(2); return err == nil or err
 
 _ = test_run:switch('storage_2_a')
+-- Pause recovery. It is too aggressive, and the test needs to see buckets in
+-- their intermediate states.
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
 box.space._bucket:replace({1, vshard.consts.BUCKET.SENDING, util.replicasets[1]})
+
 _ = test_run:switch('storage_1_a')
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
 box.space._bucket:replace({1, vshard.consts.BUCKET.RECEIVING, util.replicasets[2]})
+
 _ = test_run:switch('router_1')
 -- Ok to read sending bucket.
 vshard.router.call(1, 'read', 'echo', {123})
 -- Not ok to write sending bucket.
 util.check_error(vshard.router.call, 1, 'write', 'echo', {123})
-_ = test_run:switch('storage_2_a')
-box.space._bucket:replace({1, vshard.consts.BUCKET.ACTIVE})
 _ = test_run:switch('storage_1_a')
 box.space._bucket:delete({1})
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
+
+_ = test_run:switch('storage_2_a')
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
+
 _ = test_run:switch('router_1')
 
 -- Check unavailability of master of a replicaset.
diff --git a/test/storage/recovery.result b/test/storage/recovery.result
index 8ccb0b9..fa92bca 100644
--- a/test/storage/recovery.result
+++ b/test/storage/recovery.result
@@ -28,12 +28,20 @@ util.push_rs_filters(test_run)
 _ = test_run:switch("storage_2_a")
 ---
 ...
+-- Pause until restart. Otherwise recovery does its job too fast and does not
+-- allow to simulate the intermediate state.
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
+---
+...
 vshard.storage.rebalancer_disable()
 ---
 ...
 _ = test_run:switch("storage_1_a")
 ---
 ...
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
+---
+...
 -- Create buckets sending to rs2 and restart - recovery must
 -- garbage some of them and activate others. Receiving buckets
 -- must be garbaged on bootstrap.
diff --git a/test/storage/recovery.test.lua b/test/storage/recovery.test.lua
index a0651e8..93cec68 100644
--- a/test/storage/recovery.test.lua
+++ b/test/storage/recovery.test.lua
@@ -10,8 +10,13 @@ util.wait_master(test_run, REPLICASET_2, 'storage_2_a')
 util.push_rs_filters(test_run)
 
 _ = test_run:switch("storage_2_a")
+-- Pause until restart. Otherwise recovery does its job too fast and does not
+-- allow to simulate the intermediate state.
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
 vshard.storage.rebalancer_disable()
+
 _ = test_run:switch("storage_1_a")
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
 -- Create buckets sending to rs2 and restart - recovery must
 -- garbage some of them and activate others. Receiving buckets
 -- must be garbaged on bootstrap.
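All the test changes above rely on one technique: setting
ERRINJ_NO_RECOVERY freezes the recovery fiber, so a test can observe
buckets in their intermediate states. A simplified sketch of that gate;
the real check sits at the top of the recovery loop in the
vshard/storage/init.lua hunk further below, and M here stands for the
module-internal state table that tests reach as vshard.storage.internal:

    local lfiber = require('fiber')
    -- On a storage node, the same table the tests poke from outside.
    local M = require('vshard.storage').internal

    local function recovery_loop_sketch()
        while true do
            if M.errinj.ERRINJ_NO_RECOVERY then
                -- Injected pause: skip the whole iteration and only
                -- yield, so other fibers can keep running.
                lfiber.yield()
                goto continue
            end
            -- ... the actual recovery steps would go here, followed by
            -- the sleep/wait logic shown in the init.lua hunk below ...
            ::continue::
        end
    end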
diff --git a/test/storage/recovery_errinj.result b/test/storage/recovery_errinj.result
index 3e9a9bf..8c178d5 100644
--- a/test/storage/recovery_errinj.result
+++ b/test/storage/recovery_errinj.result
@@ -35,9 +35,17 @@ _ = test_run:switch('storage_2_a')
 vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = true
 ---
 ...
+-- Pause recovery. Otherwise it does its job too fast and does not allow to
+-- simulate the intermediate state.
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
+---
+...
 _ = test_run:switch('storage_1_a')
 ---
 ...
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
+---
+...
 _bucket = box.space._bucket
 ---
 ...
@@ -76,10 +84,16 @@ _bucket:get{1}
 ---
 - [1, 'active']
 ...
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
+---
+...
 _ = test_run:switch('storage_1_a')
 ---
 ...
-while _bucket:count() ~= 0 do vshard.storage.recovery_wakeup() fiber.sleep(0.1) end
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
+---
+...
+wait_bucket_is_collected(1)
 ---
 ...
 _ = test_run:switch("default")
diff --git a/test/storage/recovery_errinj.test.lua b/test/storage/recovery_errinj.test.lua
index 8c1a9d2..c730560 100644
--- a/test/storage/recovery_errinj.test.lua
+++ b/test/storage/recovery_errinj.test.lua
@@ -14,7 +14,12 @@ util.push_rs_filters(test_run)
 --
 _ = test_run:switch('storage_2_a')
 vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = true
+-- Pause recovery. Otherwise it does its job too fast and does not allow to
+-- simulate the intermediate state.
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
+
 _ = test_run:switch('storage_1_a')
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = true
 _bucket = box.space._bucket
 _bucket:replace{1, vshard.consts.BUCKET.ACTIVE, util.replicasets[2]}
 ret, err = vshard.storage.bucket_send(1, util.replicasets[2], {timeout = 0.1})
@@ -27,9 +32,11 @@ vshard.storage.internal.errinj.ERRINJ_LAST_RECEIVE_DELAY = false
 _bucket = box.space._bucket
 while _bucket:get{1}.status ~= vshard.consts.BUCKET.ACTIVE do fiber.sleep(0.01) end
 _bucket:get{1}
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
 
 _ = test_run:switch('storage_1_a')
-while _bucket:count() ~= 0 do vshard.storage.recovery_wakeup() fiber.sleep(0.1) end
+vshard.storage.internal.errinj.ERRINJ_NO_RECOVERY = false
+wait_bucket_is_collected(1)
 
 _ = test_run:switch("default")
 test_run:drop_cluster(REPLICASET_2)
diff --git a/vshard/consts.lua b/vshard/consts.lua
index 3f1585a..cf3f422 100644
--- a/vshard/consts.lua
+++ b/vshard/consts.lua
@@ -39,7 +39,7 @@ return {
     DEFAULT_SYNC_TIMEOUT = 1;
     RECONNECT_TIMEOUT = 0.5;
     GC_BACKOFF_INTERVAL = 5,
-    RECOVERY_INTERVAL = 5;
+    RECOVERY_BACKOFF_INTERVAL = 5,
     COLLECT_LUA_GARBAGE_INTERVAL = 100;
     DISCOVERY_IDLE_INTERVAL = 10,
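The errinj tests above also switch from manually pumping
vshard.storage.recovery_wakeup() in a loop to a wait_bucket_is_collected()
helper. The helper itself is not part of this patch; a plausible sketch of
it, assuming test-run's wait_cond() and the existing vshard wakeup
functions:

    function wait_bucket_is_collected(id)
        test_run:wait_cond(function()
            if box.space._bucket:get{id} == nil then
                return true
            end
            -- Nudge the background fibers so the test does not depend
            -- on their wakeup timing.
            vshard.storage.recovery_wakeup()
            vshard.storage.garbage_collector_wakeup()
            return false
        end)
    end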
diff --git a/vshard/storage/init.lua b/vshard/storage/init.lua
index 31a6fc7..85f5024 100644
--- a/vshard/storage/init.lua
+++ b/vshard/storage/init.lua
@@ -634,13 +634,16 @@ end
 -- Infinite function to resolve status of buckets, whose 'sending'
 -- has failed due to tarantool or network problems. Restarts on
 -- reload.
--- @param module_version Module version, on which the current
---     function had been started. If the actual module version
---     appears to be changed, then stop recovery. It is
---     restarted in reloadable_fiber.
 --
 local function recovery_f()
     local module_version = M.module_version
+    -- Changes of _bucket increment the bucket generation. Recovery has its
+    -- own bucket generation which is <= the actual one. Recovery is finished
+    -- when its generation == the bucket generation. In that case the fiber
+    -- does nothing until the next _bucket change.
+    local bucket_generation_recovered = -1
+    local bucket_generation_current = M.bucket_generation
+    local ok, sleep_time, is_all_recovered, total, recovered
     -- Interrupt recovery if a module has been reloaded. Perhaps,
     -- there was found a bug, and reload fixes it.
     while module_version == M.module_version do
@@ -648,22 +651,57 @@
         if M.errinj.ERRINJ_NO_RECOVERY then
             lfiber.yield()
             goto continue
         end
-        local ok, total, recovered = pcall(recovery_step_by_type,
-                                           consts.BUCKET.SENDING)
+        is_all_recovered = true
+        if bucket_generation_recovered == bucket_generation_current then
+            goto sleep
+        end
+
+        ok, total, recovered = pcall(recovery_step_by_type,
+                                     consts.BUCKET.SENDING)
         if not ok then
+            is_all_recovered = false
             log.error('Error during sending buckets recovery: %s', total)
+        elseif total ~= recovered then
+            is_all_recovered = false
         end
+
         ok, total, recovered = pcall(recovery_step_by_type,
                                      consts.BUCKET.RECEIVING)
         if not ok then
+            is_all_recovered = false
             log.error('Error during receiving buckets recovery: %s', total)
         elseif total == 0 then
             bucket_receiving_quota_reset()
         else
             bucket_receiving_quota_add(recovered)
+            if total ~= recovered then
+                is_all_recovered = false
+            end
         end
+
+    ::sleep::
+        if not is_all_recovered then
+            bucket_generation_recovered = -1
+        else
+            bucket_generation_recovered = bucket_generation_current
+        end
+        bucket_generation_current = M.bucket_generation
+
+        if not is_all_recovered then
+            -- One option: some buckets are not broken, their transmission is
+            -- simply still in progress. No need to retry immediately. Another
+            -- option: network errors while trying to repair the buckets. No
+            -- need to retry those often either, it would not help.
+            sleep_time = consts.RECOVERY_BACKOFF_INTERVAL
+        elseif bucket_generation_recovered ~= bucket_generation_current then
+            sleep_time = 0
+        else
+            sleep_time = consts.TIMEOUT_INFINITY
+        end
+        if module_version == M.module_version then
+            M.bucket_generation_cond:wait(sleep_time)
+        end
-        lfiber.sleep(consts.RECOVERY_INTERVAL)
-    ::continue::
+    ::continue::
     end
 end
-- 
2.24.3 (Apple Git-128)
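To summarize the control flow above: after each pass, recovery either
backs off, rescans immediately, or sleeps until the next _bucket change.
A condensed, illustrative model of that decision (the real logic is
inline in recovery_f() above; next_sleep_time() is a made-up name, and
TIMEOUT_INFINITY is assumed to be a very large number of seconds):

    local consts = require('vshard.consts')

    local function next_sleep_time(is_all_recovered, generation_recovered,
                                   generation_current)
        if not is_all_recovered then
            -- Transfers still in progress or network errors: retry with
            -- a backoff, a busy loop would not help.
            return consts.RECOVERY_BACKOFF_INTERVAL
        elseif generation_recovered ~= generation_current then
            -- _bucket changed while recovery was working: rescan at once.
            return 0
        else
            -- Everything is recovered: sleep on the condition variable
            -- until the next _bucket change wakes the fiber up.
            return consts.TIMEOUT_INFINITY
        end
    end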