From: Alexander Turenko
Date: Thu, 18 Jun 2020 23:50:46 +0300
Subject: Re: [Tarantool-patches] [PATCH v1] test: fix flaky replication/wal_rw_stress.test.lua
To: "Alexander V. Tikhonov"
Cc: tarantool-patches@dev.tarantool.org

TL;DR: Can you verify that the problem we want to detect with the test
still may be detected after the fix? (More details are below.)

WBR, Alexander Turenko.

> diff --git a/test/replication/wal_rw_stress.test.lua b/test/replication/wal_rw_stress.test.lua
> index 08570b285..48d68c5ac 100644
> --- a/test/replication/wal_rw_stress.test.lua
> +++ b/test/replication/wal_rw_stress.test.lua
> @@ -38,7 +38,7 @@ test_run:cmd("setopt delimiter ''");
>  -- are running in different threads, there shouldn't be any rw errors.
>  test_run:cmd("switch replica")
>  box.cfg{replication = replication}
> -box.info.replication[1].downstream.status ~= 'stopped' or box.info
> +test_run:wait_cond(function() return box.info.replication[1].downstream.status ~= 'stopped' end) or box.info
>  test_run:cmd("switch default")

The comment above says 'there shouldn't be any rw errors'. Your fix
hides a transient 'writev(1), <...>', which I guess is a temporary
connectivity problem.
But I guess it may also hide an rw error of the kind the test case was
added for (disk related). Or should such an error keep the relay in the
stopped state forever?

I tried to revert b9db91e1cdcc97c269703420c7b292e0f125f0ec ('xlog: fix
fallocate vs read race') (only src/, not test/), removed the test from
the fragile list, cleaned the repository (to ensure that we'll run with
the new HAVE_POSIX_FALLOCATE value) and ran the test 1000 times in 32
parallel jobs:

$ (cd test && ./test-run.py -j 32 $(yes replication/wal_rw_stress | head -n 1000))
<...>
Statistics:
* pass: 1000

My plan was: reproduce the original issue (#3883) and verify that your
fix does not hide it. However, the plan fails at the first step.

Can you check whether #3883 is reproducible for you after reverting the
fix?

Even if your fix would hide the original problem, the error message
should differ, so I guess we can tell connectivity problems apart from
disk rw problems in the wait_cond() function.

BTW, I also checked whether #4977 is reproducible on master for me, and
it seems not: 1000 tests passed in 32 parallel jobs. Maybe it is
reproducible only in some specific environment? On FreeBSD and/or in
VirtualBox? I tried it on a Linux laptop (initially I missed that it
occurs on FreeBSD, sorry).

Side note: I suggest using something like the following to wrap long
lines:

 | test_run:wait_cond(function()                                          \
 |     return box.info.replication[1].downstream.status ~= 'stopped'      \
 | end) or box.info
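
To make the idea of separating connectivity problems from disk rw
problems in wait_cond() more concrete, here is a rough sketch. It was
not run against tarantool itself. It assumes that a stopped relay
exposes its error text in downstream.message and that a connectivity
problem can be recognized by a substring such as 'writev'. The name
classify() and the pattern list are hypothetical, they do not come from
tarantool or test-run.

```lua
-- ASSUMPTION: substrings that mark a transient connectivity error.
-- Only 'writev' is taken from the failure discussed in this thread;
-- extend the list after looking at real error messages.
local transient_patterns = {'writev'}

-- Hypothetical helper. Takes a table shaped like
-- box.info.replication[n].downstream and returns:
--   'ok'        - the relay is not stopped;
--   'transient' - stopped, looks like a connectivity problem;
--   'fatal'     - stopped for any other reason (e.g. a disk rw error).
local function classify(downstream)
    if downstream.status ~= 'stopped' then
        return 'ok'
    end
    local message = downstream.message
    if type(message) == 'string' then
        for _, pattern in ipairs(transient_patterns) do
            -- Plain substring search, no Lua pattern magic.
            if message:find(pattern, 1, true) then
                return 'transient'
            end
        end
    end
    return 'fatal'
end

-- Possible use in the test (runs only under test-run):
--
-- test_run:wait_cond(function()                                         \
--     return classify(box.info.replication[1].downstream) ~= 'transient' \
-- end)
-- classify(box.info.replication[1].downstream) == 'ok' or box.info
```

With this shape the wait only skips over errors that match the
connectivity patterns. Any other reason for the 'stopped' status ends
the wait at once and box.info is still dumped into the result file, so a
disk rw error would not be silently retried.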