From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp32.i.mail.ru (smtp32.i.mail.ru [94.100.177.92]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id A4F184429E1 for ; Tue, 23 Jun 2020 17:53:00 +0300 (MSK) Date: Tue, 23 Jun 2020 17:52:10 +0300 From: Alexander Turenko Message-ID: <20200623145210.ytsv7xc2iz2fvjcg@tkn_work_nb> References: <2074a5617eb0da1c16830aab2f64f51f22ecb9bf.1592231572.git.avtikhon@tarantool.org> <20200618205046.hklilhvpapongixz@tkn_work_nb> <20200619133800.GA26690@hpalx> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20200619133800.GA26690@hpalx> Subject: Re: [Tarantool-patches] [PATCH v1] test: fix flaky replication/wal_rw_stress.test.lua List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: "Alexander V. Tikhonov" Cc: tarantool-patches@dev.tarantool.org On Fri, Jun 19, 2020 at 04:38:00PM +0300, Alexander V. Tikhonov wrote: > Hi Alexander, thanks for the review, please check my comments. > Also found that the comment in the original test mistakenly has > issue number 3893 instead of 3883 - I've fixed it. > > On Thu, Jun 18, 2020 at 11:50:46PM +0300, Alexander Turenko wrote: > > TL;DR: Can you verify that the problem we want to detect with the test > > still may be detected after the fix? > > > > (More details are below.) > > > > WBR, Alexander Turenko. > > > > > diff --git a/test/replication/wal_rw_stress.test.lua b/test/replication/wal_rw_stress.test.lua > > > index 08570b285..48d68c5ac 100644 > > > --- a/test/replication/wal_rw_stress.test.lua > > > +++ b/test/replication/wal_rw_stress.test.lua > > > @@ -38,7 +38,7 @@ test_run:cmd("setopt delimiter ''"); > > > -- are running in different threads, there shouldn't be any rw errors. > > > test_run:cmd("switch replica") > > > box.cfg{replication = replication} > > > -box.info.replication[1].downstream.status ~= 'stopped' or box.info > > > +test_run:wait_cond(function() return box.info.replication[1].downstream.status ~= 'stopped' end) or box.info > > > test_run:cmd("switch default") > > > > The comment above says 'there shouldn't be any rw errors'. Your fix > > hides a transient 'writev(1), <...>', which I guess is a temporary > > connectivity problem. But I guess it also may hide an rw error for which > > the test case was added (related to disc). Or such error should keep the > > relay in the stopped state forever? > > I've checked the error for which the test was added. I've reverted the > b9db91e1cdcc97c269703420c7b292e0f125f0ec ('xlog: fix fallocate vs read > race') patch and successfully got the needed error "tx checksum > mismatch": Thanks for the verification! Okay so. LGTM. Nit: I would add a comment to the test that wait_cond() allows to overcome a transient network connectivity errors, but 'tx checksum mismatch' is persistent one and will be catched. WBR, Alexander Turenko.