Tarantool development patches archive
From: "Alexander V. Tikhonov" <avtikhon@tarantool.org>
To: Alexander Turenko <alexander.turenko@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH v1] test: fix flaky replication/wal_rw_stress.test.lua
Date: Fri, 19 Jun 2020 16:38:00 +0300
Message-ID: <20200619133800.GA26690@hpalx>
In-Reply-To: <20200618205046.hklilhvpapongixz@tkn_work_nb>

Hi Alexander, thanks for the review; please see my comments below.
I also found that the comment in the original test mistakenly refers
to issue 3893 instead of 3883; I've fixed it.

On Thu, Jun 18, 2020 at 11:50:46PM +0300, Alexander Turenko wrote:
> TL;DR: Can you verify that the problem we want to detect with the test
> still may be detected after the fix?
> 
> (More details are below.)
> 
> WBR, Alexander Turenko.
> 
> > diff --git a/test/replication/wal_rw_stress.test.lua b/test/replication/wal_rw_stress.test.lua
> > index 08570b285..48d68c5ac 100644
> > --- a/test/replication/wal_rw_stress.test.lua
> > +++ b/test/replication/wal_rw_stress.test.lua
> > @@ -38,7 +38,7 @@ test_run:cmd("setopt delimiter ''");
> >  -- are running in different threads, there shouldn't be any rw errors.
> >  test_run:cmd("switch replica")
> >  box.cfg{replication = replication}
> > -box.info.replication[1].downstream.status ~= 'stopped' or box.info
> > +test_run:wait_cond(function() return box.info.replication[1].downstream.status ~= 'stopped' end) or box.info
> >  test_run:cmd("switch default")
> 
> The comment above says 'there shouldn't be any rw errors'. Your fix
> hides a transient 'writev(1), <...>', which I guess is a temporary
> connectivity problem. But I guess it also may hide an rw error for which
> the test case was added (related to disc). Or such error should keep the
> relay in the stopped state forever?

I've checked the error for which the test was added. I reverted the
b9db91e1cdcc97c269703420c7b292e0f125f0ec ('xlog: fix fallocate vs read
race') patch and, with my fix applied, still got the expected
"tx checksum mismatch" error:

[153] --- replication/wal_rw_stress.result      Fri Jun 19 15:01:49 2020
[153] +++ replication/wal_rw_stress.reject      Fri Jun 19 15:04:02 2020
[153] @@ -73,7 +73,43 @@
[153]  ...
[153]  test_run:wait_cond(function() return box.info.replication[1].downstream.status ~= 'stopped' end) or box.info
[153]  ---
[153] -- true
[153] +- version: 2.5.0-147-ge7a70be
[153] +  id: 2
[153] +  ro: false
[153] +  uuid: ce5bf8e7-9147-4753-813f-fd1f28e1b6e6
[153] +  package: Tarantool
[153] +  cluster:
[153] +    uuid: 879f16f2-4d0b-4d00-a3e5-0e4c5fffb8e2
[153] +  listen: unix/:(socket)
[153] +  replication_anon:
[153] +    count: 0
[153] +  replication:
[153] +    1:
[153] +      id: 1
[153] +      uuid: c9062e75-97e5-44e5-82fd-226864f95415
[153] +      lsn: 20016
[153] +      upstream:
[153] +        status: follow
[153] +        idle: 0.032541885972023
[153] +        peer: unix/:/export/avtikhon/bld/test/var/153_replication/master.socket-iproto
[153] +        lag: 3.6001205444336e-05
[153] +      downstream:
[153] +        status: stopped
[153] +        message: tx checksum mismatch
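
The two failure modes can be told apart by the message text. Below is a minimal sketch, not part of the patch: `classify_log` is a hypothetical helper, and matching on the literal prefix "writev(1)" is an assumption, since the full text of the transient error is elided in the review.

```shell
#!/bin/sh
# Hypothetical helper (not part of test-run): classify a log file by
# the error message it contains.
classify_log() {
    if grep -qF "tx checksum mismatch" "$1"; then
        # The error the test was added for (#3883).
        echo "disk-rw"
    elif grep -qF "writev(1)" "$1"; then
        # Assumed prefix of the transient error mentioned in the review.
        echo "connectivity"
    else
        echo "none"
    fi
}
```

For example, `classify_log a.log` would print `disk-rw` for a log containing the reject output above.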

> 
> I tried to revert b9db91e1cdcc97c269703420c7b292e0f125f0ec ('xlog: fix
> fallocate vs read race') (only src/, not test/), removed the test from
> the fragile list, clean the repository (to ensure that we'll run with
> the new HAVE_POSIX_FALLOCATE value) and run the test 1000 times in 32
> parallel jobs:
> 
> $ (cd test && ./test-run.py -j 32 $(yes replication/wal_rw_stress | head -n 1000))
> <...>
> Statistics:
> * pass: 1000
>

I used the following command (400 copies of the test in 200 parallel
jobs, repeated until a run fails):

( cd /export/avtikhon/src && \
git reset --hard && \
git checkout master -f && \
patch -R -p1 -i revert3883_wo_tests.patch && \
patch -p1 -i 4977.patch && \
cd ../bld/ && \
rm -rf * && \
cmake ../src -DCMAKE_BUILD_TYPE=Debug && \
make -j && \
cd ../src/test && \
( l=0 ; while ./test-run.py --long --builddir ../../bld --vardir ../../bld/test/var -j200 \
`for r in {1..400} ; do echo replication/wal_rw_stress.test.lua ; done 2>/dev/null` ; \
do l=$(($l+1)) ; echo ======== $l ============= ; done | tee a.log 2>&1 ) && \
grep "tx checksum mismatch" a.log )
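
The repeat-until-failure part of the command can be sketched as a standalone function. This is only an illustration of the pattern: `run_until_fail` and its arguments are hypothetical names, and the iteration cap is an addition that the original loop does not have.

```shell
#!/bin/sh
# Hypothetical helper: rerun a command until it fails or until the
# iteration cap is reached, appending all output to a log file.
# Usage: run_until_fail LOGFILE MAX_RUNS COMMAND [ARGS...]
run_until_fail() {
    log=$1
    max=$2
    shift 2
    i=0
    : > "$log"                      # start with an empty log
    while [ "$i" -lt "$max" ]; do
        if ! "$@" >> "$log" 2>&1; then
            echo "failed on iteration $((i + 1))"
            return 0                # a failure was reproduced
        fi
        i=$((i + 1))
    done
    echo "no failure in $max iterations"
    return 1
}
```

With this helper, the search could look like `run_until_fail a.log 100 ./test-run.py ... && grep "tx checksum mismatch" a.log`, with the test-run.py arguments taken from the command above.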

> My plan was: reproduce the original issue (#3883) and verify that your
> fix does not hide it. However the plan fails on the first step.
> 
> Can you check, whether #3883 is reproducible for you after reverting the
> fix?
> 
> Even if it will hide the original problem, the error message should
> differ. I guess we can filter out connectivity problems from disc rw
> problems in the wait_cond() function.
>

Please check comments above.

> BTW, I also checked whether #4977 reproduced on master for me: and it
> seems, no: 1000 tests passed in 32 parallel jobs.
> 
> Maybe it is reproducible only in some specific environment? On FreeBSD
> and/or in VirtualBox? I tried it on Linux laptop (initially I missed
> that it occurs on FreeBSD, sorry).
>

Right, the issue is really hard to reproduce, and it was reproduced
only on a FreeBSD VirtualBox machine.

> Side note: I suggest to use something like the following to carry long
> lines:
> 
>  | test_run:wait_cond(function()                                     \
>  |     return box.info.replication[1].downstream.status ~= 'stopped' \
>  | end) or box.info

Ok, sure, I've fixed it.
