From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vladislav Shpilevoy via Tarantool-patches
Reply-To: Vladislav Shpilevoy
To: tarantool-patches@dev.tarantool.org, olegrok@tarantool.org
Date: Fri, 17 Dec 2021 01:25:31 +0100
Message-Id: <6de64cef8a095f578bea59b02f8b5ec4ceac3684.1639700518.git.v.shpilevoy@tarantool.org>
Subject: [Tarantool-patches] [PATCH vshard 5/5] router: backoff on storage being disabled
List-Id: Tarantool development patches
X-Mailer: git-send-email 2.24.3 (Apple Git-128)

If a storage reports that it is disabled, it will probably take some
time before it can accept new requests.

This patch makes the STORAGE_IS_DISABLED error put the connection into
backoff, in line with the 'access denied' and 'no such function'
errors. The reason for all three is the same: the storage is not ready
to accept requests yet.

Such requests are transparently retried now.

Closes #298

@TarantoolBot document
Title: vshard.storage.enable/disable()
`vshard.storage.disable()` makes most of the `vshard.storage`
functions throw an error. The error is raised as a Lua exception, not
returned via the `nil, err` pattern.

`vshard.storage.enable()` reverts the disable.

By default the storage is enabled.

Additionally, the storage is forcefully disabled automatically until
`vshard.storage.cfg()` has finished and the instance has finished
recovery (its `box.info.status` is `'running'`, for example).

Auto-disable protects against usage of vshard functions before the
storage's global state is fully created.

Manual `vshard.storage.disable()` helps to achieve the same for the
user's application. For instance, a user might want to do some
preparatory work after `vshard.storage.cfg` before the application is
ready for requests. Then the flow would be:

```Lua
vshard.storage.disable()
vshard.storage.cfg(...)
-- Do your preparatory work here ...
vshard.storage.enable()
```

The routers handle the errors signaling that the storage is disabled
in a special way. They put connections to such instances into a
backoff state for some time and try to use other replicas. For
example, assume a replicaset has replicas 'replica_1' and 'replica_2'.
Assume 'replica_1' is disabled for any reason. If a router tries to
talk to 'replica_1', it gets a special error and transparently retries
the request on 'replica_2'.

When 'replica_1' is enabled again, the router notices it and sends
requests to it again.
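
The retry behaviour described above can be sketched in plain Lua. This
is only an illustrative model under stated assumptions, not vshard's
implementation: `make_replica`, `storage_call`, `call_ro`, the
`is_disabled` flag, and the `BACKOFF_INTERVAL` value are hypothetical
names introduced for the example, and the current time is passed in
explicitly instead of being read from a clock.

```Lua
-- Hypothetical model of read-only routing with backoff.
-- None of these names are part of the vshard API.
local BACKOFF_INTERVAL = 5 -- seconds; assumed value for the sketch

local function make_replica(name)
    return {name = name, is_disabled = false, backoff_ts = nil, calls = 0}
end

-- Simulates the storage side: a disabled storage raises an exception.
local function storage_call(replica, value)
    if replica.is_disabled then
        error({type = 'ShardingError', name = 'STORAGE_IS_DISABLED'})
    end
    replica.calls = replica.calls + 1
    return value
end

-- Tries replicas in priority order, skipping those still in backoff.
local function call_ro(replicas, value, now)
    for _, replica in ipairs(replicas) do
        local in_backoff = replica.backoff_ts ~= nil and
                           now - replica.backoff_ts < BACKOFF_INTERVAL
        if not in_backoff then
            replica.backoff_ts = nil
            local ok, res = pcall(storage_call, replica, value)
            if ok then
                return res
            end
            if type(res) == 'table' and res.name == 'STORAGE_IS_DISABLED' then
                -- Remember when the replica was found disabled, try the next.
                replica.backoff_ts = now
            else
                return nil, res
            end
        end
    end
    return nil, 'no available replicas'
end

local replica_1 = make_replica('replica_1')
local replica_2 = make_replica('replica_2')
local replicas = {replica_1, replica_2}

replica_1.is_disabled = true
local res_disabled = call_ro(replicas, 100, 0)  -- retried on replica_2
replica_1.is_disabled = false
local res_backoff = call_ro(replicas, 100, 1)   -- replica_1 still in backoff
local res_enabled = call_ro(replicas, 100, 10)  -- replica_1 is used again
```

In this model the first two calls are served by 'replica_2' and the
third one by 'replica_1', once its backoff interval has passed.
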
It all works exclusively for read-only requests. Read-write requests can
only be sent to a master, which is one per replicaset. They are not
retried.
---
 test/router/router2.result   | 88 ++++++++++++++++++++++++++++++++++++
 test/router/router2.test.lua | 35 ++++++++++++++
 vshard/replicaset.lua        | 19 +++++++-
 3 files changed, 140 insertions(+), 2 deletions(-)

diff --git a/test/router/router2.result b/test/router/router2.result
index a501dbf..ebf0b3f 100644
--- a/test/router/router2.result
+++ b/test/router/router2.result
@@ -548,6 +548,94 @@ vshard.storage.call = old_storage_call
  | ---
  | ...
 
+--
+-- Storage is disabled = backoff.
+--
+vshard.storage.disable()
+ | ---
+ | ...
+
+test_run:switch('router_1')
+ | ---
+ | - true
+ | ...
+-- Drop old backoffs.
+fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
+ | ---
+ | ...
+-- Success, but internally the request was retried.
+res, err = vshard.router.callro(1, 'echo', {100}, long_timeout)
+ | ---
+ | ...
+assert(res == 100)
+ | ---
+ | - true
+ | ...
+-- The best replica entered backoff state.
+util = require('util')
+ | ---
+ | ...
+storage_2 = vshard.router.static.replicasets[replicasets[2]]
+ | ---
+ | ...
+storage_2_a = storage_2.replicas[util.name_to_uuid.storage_2_a]
+ | ---
+ | ...
+assert(storage_2_a.backoff_ts ~= nil)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('storage_2_b')
+ | ---
+ | - true
+ | ...
+assert(echo_count == 1)
+ | ---
+ | - true
+ | ...
+echo_count = 0
+ | ---
+ | ...
+
+test_run:switch('storage_2_a')
+ | ---
+ | - true
+ | ...
+assert(echo_count == 0)
+ | ---
+ | - true
+ | ...
+vshard.storage.enable()
+ | ---
+ | ...
+
+test_run:switch('router_1')
+ | ---
+ | - true
+ | ...
+-- Drop the backoff.
+fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
+ | ---
+ | ...
+-- Now goes to the best replica - it is enabled again.
+res, err = vshard.router.callro(1, 'echo', {100}, long_timeout)
+ | ---
+ | ...
+assert(res == 100)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('storage_2_a')
+ | ---
+ | - true
+ | ...
+assert(echo_count == 1)
+ | ---
+ | - true
+ | ...
+
 _ = test_run:switch("default")
  | ---
  | ...
diff --git a/test/router/router2.test.lua b/test/router/router2.test.lua
index fb0c3b2..1c21876 100644
--- a/test/router/router2.test.lua
+++ b/test/router/router2.test.lua
@@ -216,6 +216,41 @@ test_run:switch('storage_2_a')
 assert(echo_count == 0)
 vshard.storage.call = old_storage_call
 
+--
+-- Storage is disabled = backoff.
+--
+vshard.storage.disable()
+
+test_run:switch('router_1')
+-- Drop old backoffs.
+fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
+-- Success, but internally the request was retried.
+res, err = vshard.router.callro(1, 'echo', {100}, long_timeout)
+assert(res == 100)
+-- The best replica entered backoff state.
+util = require('util')
+storage_2 = vshard.router.static.replicasets[replicasets[2]]
+storage_2_a = storage_2.replicas[util.name_to_uuid.storage_2_a]
+assert(storage_2_a.backoff_ts ~= nil)
+
+test_run:switch('storage_2_b')
+assert(echo_count == 1)
+echo_count = 0
+
+test_run:switch('storage_2_a')
+assert(echo_count == 0)
+vshard.storage.enable()
+
+test_run:switch('router_1')
+-- Drop the backoff.
+fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
+-- Now goes to the best replica - it is enabled again.
+res, err = vshard.router.callro(1, 'echo', {100}, long_timeout)
+assert(res == 100)
+
+test_run:switch('storage_2_a')
+assert(echo_count == 1)
+
 _ = test_run:switch("default")
 _ = test_run:cmd("stop server router_1")
 _ = test_run:cmd("cleanup server router_1")
diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
index 573a555..623d24d 100644
--- a/vshard/replicaset.lua
+++ b/vshard/replicaset.lua
@@ -347,9 +347,21 @@ local function replica_call(replica, func, args, opts)
         if opts.timeout >= replica.net_timeout then
             replica_on_failed_request(replica)
         end
+        local err = storage_status
+        -- VShard functions can throw exceptions using error() function. When
+        -- it reaches the network layer, it is wrapped into LuajitError. Try to
+        -- extract the original error if this is the case. Not always is
+        -- possible - the string representation could be truncated.
+        --
+        -- In old Tarantool versions LuajitError turned into ClientError on the
+        -- client. Check both types.
+        if func:startswith('vshard.') and (err.type == 'LuajitError' or
+           err.type == 'ClientError') then
+            err = lerror.from_string(err.message) or err
+        end
         log.error("Exception during calling '%s' on '%s': %s", func, replica,
-                  storage_status)
-        return false, nil, lerror.make(storage_status)
+                  err)
+        return false, nil, lerror.make(err)
     else
         replica_on_success_request(replica)
     end
@@ -472,6 +484,9 @@ local function can_backoff_after_error(e, func)
             return e.message:startswith("Procedure 'vshard.")
         end
     end
+    if e.type == 'ShardingError' then
+        return e.code == vshard.error.code.STORAGE_IS_DISABLED
+    end
     return false
 end
 
-- 
2.24.3 (Apple Git-128)