From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp36.i.mail.ru (smtp36.i.mail.ru [94.100.177.96]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 41F0945C305 for ; Mon, 7 Dec 2020 15:22:26 +0300 (MSK) Date: Mon, 7 Dec 2020 12:22:23 +0000 From: Nikita Pettik Message-ID: <20201207122223.GA21104@tarantool.org> References: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: Subject: Re: [Tarantool-patches] [PATCH v2] test: fix hang of vinyl/select_consistency.* test List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: "Alexander V. Tikhonov" Cc: tarantool-patches@dev.tarantool.org On 02 Dec 21:09, Alexander V. Tikhonov wrote: > --- vinyl/select_consistency.result Thu Jun 11 11:59:48 2020 > +++ var/152_vinyl/select_consistency.result Tue Oct 20 09:13:02 2020 > @@ -137,16 +137,3 @@ > for i = 1, ch:size() do > ch:get() > end; > ---- > -... > -test_run:cmd("setopt delimiter ''"); > ---- > -- true > -... > > This issue stops the testing till the complete timeout of the test-run > will stop all the testing process. Hm, could we patch test-run so that timeout of single test won't stop the whole testing process? > It happened because on heavy loaded hosts may occure the situation > when the previous snapshot was inprogress when the new snapshot came > with the same file name *.snap.inprogress. It happens before the > current snapshot completed and printed "dump completed" in log file. > Also this file *.snap.inprogress was seen left on manual debug, when > the test hanged before this patch. To resolve the test issue fiber > sleep delay after it can be increased, but we want to save the issue > reproducable. The current patch corrects the test to avoid of hang on > box snapshot, to be able to continue testing after it failed. > > Needed for #4385 > --- > > Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-4385_hang_select_consistency > Issue: https://github.com/tarantool/tarantool/issues/4385 > > test/vinyl/select_consistency.result | 67 ++++++++++++++++++++------ > test/vinyl/select_consistency.test.lua | 61 ++++++++++++++++++----- > test/vinyl/suite.ini | 3 +- > 3 files changed, 103 insertions(+), 28 deletions(-) > > diff --git a/test/vinyl/select_consistency.result b/test/vinyl/select_consistency.result > index f27d0ec68..f6d96473d 100644 > --- a/test/vinyl/select_consistency.result > +++ b/test/vinyl/select_consistency.result > @@ -4,6 +4,9 @@ test_run = require('test_run').new() > fiber = require 'fiber' > --- > ... > +log = require 'log' > +--- > +... > math.randomseed(os.time()) > --- > ... > @@ -52,6 +55,8 @@ function gen_update() > end; > --- > ... > +failed = false > + > function dml_loop() > local insert = true > while not stop do > @@ -75,8 +80,18 @@ end; > ... > function snap_loop() > while not stop do > - box.snapshot() > - fiber.sleep(0.1) > + local ok = fiber.create(function() Still don't understand necessity of creating another one fiber. snap_loop() is itself executed in a separate fiber. I suggest you to dig a bit into this problem, it may hide another issue. > + local ok, err = pcall(box.snapshot) > + if ok == false then > + log.error("error: box.snapshot failed with error " .. err) > + end > + return ok > + end) > + if ok == false then > + failed = true > + break > + end > + fiber.sleep(0.5) > end > ch:put(true) > end; > @@ -97,10 +112,7 @@ _ = fiber.create(dml_loop); > _ = fiber.create(snap_loop); > --- > ... > -failed = {}; > ---- > -... > -for i = 1, 10000 do > +function select_loop() > local val = math.random(MAX_VAL) > box.begin() > local res1 = s.index.i1:select({val}) > @@ -117,25 +129,48 @@ for i = 1, 10000 do > end > end > if not found then > + log.error("error: equal not found for res1 and res2:") > + local ts1 = '' > + local ts2 = '' > + for _, t1 in ipairs(res1) do > + ts1 = ts1 .. " " .. t1[1] > + end > + log.info("error: equal not found for res1" .. ts1) Why the first log is error meanwhile second and third are info? I suggest to make them both error to make it consistent. > + for _, t2 in ipairs(res2) do > + ts2 = ts2 .. " " .. t2[1] > + end > + log.info("error: equal not found for res2" .. ts2) > equal = false > test_run:cmd("setopt delimiter ''"); > > -#failed == 0 or failed > +test_run:wait_cond(function() return check_get() end) > +failed In previous review iter you approved that there's no need in :wait_cond() here, but still it is presented. > - "issues": [ "gh-4385" ] > + "issues": [ "gh-4385" ], > + "checksums": [ "94d5359d451ee7bfdb35a037f80b14ff", "f7e8ebc72f4d59941b4ca10fde474963" ] Why there are two checksums? How to point out corresponding to particular checksum error? Moreover, with increased timeout this test shouldn't be flacky anymore, should it?