From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp3.mail.ru (smtp3.mail.ru [94.100.179.58]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id DB36245C304 for ; Wed, 2 Dec 2020 21:57:45 +0300 (MSK) Date: Wed, 2 Dec 2020 21:57:43 +0300 From: "Alexander V. Tikhonov" Message-ID: <20201202185743.GA232692@hpalx> References: <5b204cb5b32bb96c6a2cec2b06f399b360137250.1605465603.git.avtikhon@tarantool.org> <20201116210654.GA13996@tarantool.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20201116210654.GA13996@tarantool.org> Subject: Re: [Tarantool-patches] [PATCH v1] test: fix hang of vinyl/select_consistency.* test List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Nikita Pettik Cc: tarantool-patches@dev.tarantool.org Hi Nikita, thanks for the review. I've made all of your suggestions, please review below. On Mon, Nov 16, 2020 at 09:06:54PM +0000, Nikita Pettik wrote: > On 15 Nov 21:43, Alexander V. Tikhonov wrote: > > > > It happened because on heavy loaded hosts may occure the situation > > when the previous snapshot was inprogress when the new snapshot came > > with the same file name *.snap.inprogress. It happens before the > > current snapshot completed and printed "dump completed" in log file. > > Also this file *.snap.inprogress was seen left on manual debug, when > > the test hanged before this patch. To resolve the test issue fiber > > sleep delay after it can be increased, but we want to save the issue > > reproducable. The current patch corrects the test to avoid of hang on > > I guess increasing sleep and fileing exact repro with 'bug' label is > enough to deal with this test. > Ok, I've added instructions how to reproduce the issue in the issue and increased the sleep from 0.1 to 0.5. > > box snapshot, to be able to continue testing after it failed. Fiber > > sleep was even decreased after adding fiber for box.snapshot to be > > able to reproduce the issue. > > > > Needed for #4385 > > --- > > @@ -75,8 +80,13 @@ end; > > ... > > function snap_loop() > > while not stop do > > - box.snapshot() > > - fiber.sleep(0.1) > > + local ok, err = fiber.create(function() local ok, err = pcall(box.snapshot) return ok, err end) > > Why not simply wrap box.snapshot() in pcall? Why do you need another > one separate fiber for it? > Found that the test hangs running without it. Also this part, I'd corrected a bit. > > @@ -97,10 +107,7 @@ _ = fiber.create(dml_loop); > > _ = fiber.create(snap_loop); > > --- > > ... > > -failed = {}; > > ---- > > -... > > -for i = 1, 10000 do > > +function run_iter() > > I'd call it select_loop() (as dml/snap loops above). > Sure, changed. > > local val = math.random(MAX_VAL) > > box.begin() > > local res1 = s.index.i1:select({val}) > > @@ -117,25 +124,37 @@ for i = 1, 10000 do > > end > > end > > if not found then > > + log.error("error: equal not found for #res1 = " .. #res1 .. ", #res2 = " .. #res2) > > What's the point of printing number of elements in res1/res2? > Why not print the whole result set? > Right, added. > > +end; > > +--- > > +... > > +for i = 1, 10000 do > > + if failed or not run_iter(i) then > > + log.error("error: failed on iteration " .. i) > > + failed = true > > + break > > + end > > end; > > +function check_get() > > + for i = 1, ch:size() do > > + if not test_run:wait_cond(function() return ch:get() ~= nil end) then > > As I see from docs, ch:get() has optional timeout param. Why not use it? > I mean not only this particular test, but in general. > Added timeout using local setup as define to make it easy to find all the tests with the same changes in the future. > > + log.error("error: hanged on ch:get() on iteration " .. i) > > + return false > > + end > > + end > > + return true > > end; > > --- > > ... > > @@ -143,10 +162,14 @@ test_run:cmd("setopt delimiter ''"); > > --- > > - true > > ... > > -#failed == 0 or failed > > +test_run:wait_cond(function() return check_get() end) > > Why do you need another one wait_cond over check_get? > Right, no need for now.