From: "Alexander V. Tikhonov" <avtikhon@tarantool.org>
To: Alexander Turenko <alexander.turenko@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH v1] test: fix flaky replication/wal_rw_stress.test.lua
Date: Fri, 19 Jun 2020 16:38:00 +0300
Message-ID: <20200619133800.GA26690@hpalx>
In-Reply-To: <20200618205046.hklilhvpapongixz@tkn_work_nb>

Hi Alexander,

thanks for the review, please check my comments. I also found that the
comment in the original test mistakenly has issue number 3893 instead of
3883; I've fixed it.

On Thu, Jun 18, 2020 at 11:50:46PM +0300, Alexander Turenko wrote:
> TL;DR: Can you verify that the problem we want to detect with the test
> still may be detected after the fix?
>
> (More details are below.)
>
> WBR, Alexander Turenko.
>
> > diff --git a/test/replication/wal_rw_stress.test.lua b/test/replication/wal_rw_stress.test.lua
> > index 08570b285..48d68c5ac 100644
> > --- a/test/replication/wal_rw_stress.test.lua
> > +++ b/test/replication/wal_rw_stress.test.lua
> > @@ -38,7 +38,7 @@ test_run:cmd("setopt delimiter ''");
> >  -- are running in different threads, there shouldn't be any rw errors.
> >  test_run:cmd("switch replica")
> >  box.cfg{replication = replication}
> > -box.info.replication[1].downstream.status ~= 'stopped' or box.info
> > +test_run:wait_cond(function() return box.info.replication[1].downstream.status ~= 'stopped' end) or box.info
> >  test_run:cmd("switch default")
>
> The comment above says 'there shouldn't be any rw errors'. Your fix
> hides a transient 'writev(1), <...>', which I guess is a temporary
> connectivity problem. But I guess it also may hide an rw error for which
> the test case was added (related to disc). Or such error should keep the
> relay in the stopped state forever?

I've checked the error for which the test was added. I reverted the
b9db91e1cdcc97c269703420c7b292e0f125f0ec ('xlog: fix fallocate vs read
race') patch and successfully got the expected error "tx checksum
mismatch":

[153] --- replication/wal_rw_stress.result	Fri Jun 19 15:01:49 2020
[153] +++ replication/wal_rw_stress.reject	Fri Jun 19 15:04:02 2020
[153] @@ -73,7 +73,43 @@
[153]  ...
[153]  test_run:wait_cond(function() return box.info.replication[1].downstream.status ~= 'stopped' end) or box.info
[153]  ---
[153] -- true
[153] +- version: 2.5.0-147-ge7a70be
[153] +  id: 2
[153] +  ro: false
[153] +  uuid: ce5bf8e7-9147-4753-813f-fd1f28e1b6e6
[153] +  package: Tarantool
[153] +  cluster:
[153] +    uuid: 879f16f2-4d0b-4d00-a3e5-0e4c5fffb8e2
[153] +  listen: unix/:(socket)
[153] +  replication_anon:
[153] +    count: 0
[153] +  replication:
[153] +    1:
[153] +      id: 1
[153] +      uuid: c9062e75-97e5-44e5-82fd-226864f95415
[153] +      lsn: 20016
[153] +      upstream:
[153] +        status: follow
[153] +        idle: 0.032541885972023
[153] +        peer: unix/:/export/avtikhon/bld/test/var/153_replication/master.socket-iproto
[153] +        lag: 3.6001205444336e-05
[153] +      downstream:
[153] +        status: stopped
[153] +        message: tx checksum mismatch

>
> I tried to revert b9db91e1cdcc97c269703420c7b292e0f125f0ec ('xlog: fix
> fallocate vs read race') (only src/, not test/), removed the test from
> the fragile list, clean the repository (to ensure that we'll run with
> the new HAVE_POSIX_FALLOCATE value) and run the test 1000 times in 32
> parallel jobs:
>
> $ (cd test && ./test-run.py -j 32 $(yes replication/wal_rw_stress | head -n 1000))
> <...>
> Statistics:
> * pass: 1000
>

I used the following script command:

( cd /export/avtikhon/src && \
git reset --hard && \
git checkout master -f && \
patch -R -p1 -i revert3883_wo_tests.patch && \
patch -p1 -i 4977.patch && \
cd ../bld/ && \
rm -rf * && \
cmake ../src -DCMAKE_BUILD_TYPE=Debug && \
make -j && \
cd ../src/test && \
( l=0 ; while ./test-run.py --long --builddir ../../bld --vardir ../../bld/test/var -j200 \
`for r in {1..400} ; do echo replication/wal_rw_stress.test.lua ; done 2>/dev/null` ; \
do l=$(($l+1)) ; echo ======== $l ============= ; done | tee a.log 2>&1 ) && \
grep "tx checksum mismatch" a.log )

> My plan was: reproduce the original issue (#3883) and verify that your
> fix does not hide it. However the plan fails on the first step.
>
> Can you check, whether #3883 is reproducible for you after reverting the
> fix?
>
> Even if it will hide the original problem, the error message should
> differ. I guess we can filter out connectivity problems from disc rw
> problems in the wait_cond() function.
>

Please check the comments above.

> BTW, I also checked whether #4977 reproduced on master for me: and it
> seems, no: 1000 tests passed in 32 parallel jobs.
>
> Maybe it is reproducible only in some specific environment? On FreeBSD
> and/or in VirtualBox? I tried it on Linux laptop (initially I missed
> that it occurs on FreeBSD, sorry).
>

Right, the issue is really hard to reproduce; it was reproduced only on
the FreeBSD VBox machine.

> Side note: I suggest to use something like the following to carry long
> lines:
>
>  | test_run:wait_cond(function()                                          \
>  |     return box.info.replication[1].downstream.status ~= 'stopped'      \
>  | end or box.info

Ok, sure, I've fixed it.
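The log check at the end of the script above only greps for "tx checksum mismatch". A minimal sketch of how the two failure kinds discussed in this thread could be told apart in a.log is given below. The function names classify_downstream_message and classify_log are hypothetical helpers (not part of test-run.py or Tarantool), and the sketch assumes the relay error shows up on a 'message:' line, as in the log excerpt above, and that the connectivity error contains the 'writev(' prefix quoted in the review.

```shell
#!/bin/sh
# Hypothetical helper: map a downstream 'message:' string to a coarse
# failure class. The patterns are assumptions taken from this thread only.
classify_downstream_message() {
    case "$1" in
        *"tx checksum mismatch"*) echo "disk_rw" ;;       # error from #3883
        *"writev("*)              echo "connectivity" ;;  # transient relay write error
        *)                        echo "unknown" ;;
    esac
}

# Hypothetical helper: print one class per 'message:' line found in a log file.
classify_log() {
    grep 'message:' "$1" | while IFS= read -r line; do
        classify_downstream_message "$line"
    done
}
```

For example, `classify_log a.log | sort | uniq -c` would print a count per class, so a run that only hit connectivity errors is easy to distinguish from one that reproduced the checksum mismatch.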
Thread overview: 5+ messages
2020-06-15 14:34 Alexander V. Tikhonov
2020-06-18 20:50 ` Alexander Turenko
2020-06-19 13:38 ` Alexander V. Tikhonov [this message]
2020-06-23 14:52 ` Alexander Turenko
2020-06-26  9:32 ` Kirill Yukhin