From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id E20CE21BBF for ; Tue, 26 Jun 2018 07:11:48 -0400 (EDT) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 44fMsJLYPHIh for ; Tue, 26 Jun 2018 07:11:48 -0400 (EDT) Received: from smtp62.i.mail.ru (smtp62.i.mail.ru [217.69.128.42]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 5283C21A35 for ; Tue, 26 Jun 2018 07:11:46 -0400 (EDT) Subject: [tarantool-patches] Re: [PATCH 2/2] Fix discovery/reconfigure race References: From: Vladislav Shpilevoy Message-ID: Date: Tue, 26 Jun 2018 14:11:42 +0300 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset="utf-8"; format="flowed" Content-Language: en-US Content-Transfer-Encoding: 8bit Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: Alex Khatskevich , tarantool-patches@freelists.org Thanks for the patch! See 5 comments below. >> >> 3. This cycle makes no sense. With set LONG_DISCOVERY it is >> equivalent to calling router.discovery_wakeup() once. > Disagree. > I need to stuck on the first routemap element. However I need to wakeup it #replicasets > times, because the discovery_fiber sleeps after each iteration. 1. I removed this cycle together with any wakeups and the test passed. So you are wrong. Please, make the test be testing that discovery really works. This is my diff that does not break the test: @@ -419,17 +419,11 @@ end; -- Perform #replicasets phases of discovery, to update replicasets -- object in for loop of discovery fiber since previous cfg. -for _, __ in pairs(vshard.router.internal.replicasets) do - vshard.router.discovery_wakeup() - fiber.sleep(0.02) -end; +vshard.router.discovery_wakeup(); -- Simulate long `callro`. -- Stuck on first rs in replicasets. vshard.router.internal.errinj.ERRINJ_LONG_DISCOVERY = true; -for _, __ in pairs(vshard.router.internal.replicasets) do - vshard.router.discovery_wakeup() - fiber.sleep(0.02) -end; +fiber.sleep(0.02); vshard.router.cfg(cfg); vshard.router.internal.route_map[333] = nil >> >>> +--- >>> +... >>> +vshard.router.cfg(cfg); >>> +--- >>> +... >>> +vshard.router.internal.errinj.LONG_DISCOVERY = nil; >>> +--- >>> +... >>> +-- Do discovery iteration. >>> +vshard.router.discovery_wakeup() >>> +fiber.sleep(0.02) >> >> 4. Concrete timeouts are the way to create an unstable test. >> Please, get rid of them and replace with 'while not cond do wait end' >> where necessary. > It is important that discovery fiber did only one iteration of discovery. So, If I want some `while` mechanism I want to introduce several synchronization and block/unblock phases. > I get you point and suggest compromise here. > 1. I have places an assert here, which checks that expected discovery has occurred. > 2. I have launched the test 100 times and it did not fail 2. Please, just remove this single sleeps and do 'while cond do fiber.sleep(0.01) end'. The fact that the test is passed on your computer means nothing. It can always pass on your machine, but sometimes fail on mine, easily. Or on the travis, or on a customer machine. You do not need any 'synchronization and block/unblock phases'. You already know what do you expect to see after the sleep, so just put this condition in the while. If you want to test some deadlines like it is done for auto network timeouts for the router, then just remember the current time before while, afterwards and test that the duration was < your limit. I have already debugged several of such '100% passing tests with hard timeouts' and do not want to do it again. >> >>>   - true >>> diff --git a/vshard/router/init.lua b/vshard/router/init.lua >>> index 7e765fa..df5b343 100644 >>> --- a/vshard/router/init.lua >>> +++ b/vshard/router/init.lua >>> @@ -127,10 +127,23 @@ local function discovery_f(module_version) >>> +            -- Renew replicasets object in case of reconfigure >>> +            -- and reload events. >> >> 5. You do not renew here anything. >> >>> +            if M.replicasets ~= old_replicasets then >>> +                break >>> +            end > I have rewritten the comment. The replicasets variable captured by the > for loop should be updated by calling `break` and exiting the for loop. > > 3. Please, put a new patch version here. > diff --git a/vshard/router/init.lua b/vshard/router/init.lua > index 7e765fa..5fc0f59 100644 > --- a/vshard/router/init.lua > +++ b/vshard/router/init.lua > @@ -127,10 +128,23 @@ local function discovery_f(module_version) > local iterations_until_lua_gc = > consts.COLLECT_LUA_GARBAGE_INTERVAL / consts.DISCOVERY_INTERVAL > while module_version == M.module_version do > - for _, replicaset in pairs(M.replicasets) do > + local old_replicasets = M.replicasets > + for rs_uuid, replicaset in pairs(M.replicasets) do > local active_buckets, err = > replicaset:callro('vshard.storage.buckets_discovery', {}, > {timeout = 2}) > + while M.errinj.ERRINJ_LONG_DISCOVERY do > + -- Stuck on the first replicaset. 4. When you set this errinj, it sticks on the first rs with no special checks and blocks other replicasets too. So remove this check. Just do sleep until the errinj is set. To prevent your dissension: I removed the check and the test passed too. Together with the diff above about wakeups removal. > + if rs_uuid ~= select(1, next(M.replicasets)) then 5. select(1... here does nothing. It is equivalent to rs_uuid ~= next(M.replicasets) You can start arguing, but do not. I have already checked it. > + break > + end > + lfiber.sleep(0.01) > + end > + -- Renew replicasets object captured by the for loop > + -- in case of reconfigure and reload events. > + if M.replicasets ~= old_replicasets then > + break > + end > if not active_buckets then > log.error('Error during discovery %s: %s', replicaset, err) > else