* [Tarantool-patches] [PATCH v1] test: fix flaky replication/wal_rw_stress.test.lua
@ 2020-06-15 14:34 Alexander V. Tikhonov
2020-06-18 20:50 ` Alexander Turenko
2020-06-26 9:32 ` Kirill Yukhin
0 siblings, 2 replies; 5+ messages in thread
From: Alexander V. Tikhonov @ 2020-06-15 14:34 UTC (permalink / raw)
To: Sergey Bronnikov, Alexander Turenko, Kirill Yukhin; +Cc: tarantool-patches
Found issue:
[016] --- replication/wal_rw_stress.result Fri Feb 21 11:53:21 2020
[016] +++ replication/wal_rw_stress.reject Fri May 8 08:23:56 2020
[016] @@ -73,7 +73,42 @@
[016] ...
[016] box.info.replication[1].downstream.status ~= 'stopped' or box.info
[016] ---
[016] -- true
[016] +- version: 2.5.0-27-g32f59756a
[016] + id: 2
[016] + ro: false
[016] + uuid: 41cbebcc-9105-11ea-96ab-08002739cbd6
[016] + package: Tarantool
[016] + cluster:
[016] + uuid: 397c196f-9105-11ea-96ab-08002739cbd6
[016] + listen: unix/:/home/vagrant/tarantool/test/var/016_replication/replica.socket-iproto
[016] + replication:
[016] + 1:
[016] + id: 1
[016] + uuid: 397a1886-9105-11ea-96ab-08002739cbd6
[016] + lsn: 10005
[016] + upstream:
[016] + status: follow
[016] + idle: 0.46353673400017
[016] + peer: unix/:/home/vagrant/tarantool/test/var/016_replication/master.socket-iproto
[016] + lag: -0.45732522010803
[016] + downstream:
[016] + status: stopped
[016] + message: writev(1), called on fd 24, aka unix/:/home/vagrant/tarantool/test/var/016_replicati
[016] + system_message: Broken pipe
[016] + 2:
[016] + id: 2
[016] + uuid: 41cbebcc-9105-11ea-96ab-08002739cbd6
[016] + lsn: 0
[016] + signature: 10005
[016] + status: running
[016] + vinyl: []
[016] + uptime: 2
[016] + lsn: 0
[016] + sql: []
[016] + gc: []
[016] + pid: 41231
[016] + memory: []
[016] + vclock: {1: 10005}
[016] ...
[016] test_run:cmd("switch default")
[016] ---
To check the downstream status and its message, we need to wait until
the downstream appears. This prevents an attempt to index a nil value
when the check runs before a record about the peer appears in
box.info.replication. It was observed on the
replication/show_error_on_disconnect test after commit
c6bea65f8ef5f6c737cf70c0127189d0ebcbc36e ('replication: recfg with 0
quorum returns immediately').
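The idea, roughly (a sketch only; it assumes test-run's
wait_cond(cond[, timeout]) polls cond until it returns a truthy value
or the timeout expires, and the nil guards illustrate the failure mode):

    test_run:wait_cond(function()
        -- box.info.replication[1] and its downstream field may not
        -- exist yet right after box.cfg{replication = ...}.
        local r = box.info.replication[1]
        return r and r.downstream and r.downstream.status ~= 'stopped'
    end) or box.info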
Closes #4977
---
Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-4977-replication-wal_rw_stress
Issue: https://github.com/tarantool/tarantool/issues/4977
test/replication/suite.ini | 1 -
test/replication/wal_rw_stress.result | 2 +-
test/replication/wal_rw_stress.test.lua | 2 +-
3 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index 6a8944020..231bedaaf 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -18,7 +18,6 @@ fragile = errinj.test.lua ; gh-3870
sync.test.lua ; gh-3835 gh-3877
transaction.test.lua ; gh-4312
wal_off.test.lua ; gh-4355
- wal_rw_stress.test.lua ; gh-4977
replica_rejoin.test.lua ; gh-4985
recover_missing_xlog.test.lua ; gh-4989
box_set_replication_stress ; gh-4992
diff --git a/test/replication/wal_rw_stress.result b/test/replication/wal_rw_stress.result
index cc68877b0..cfb2f8a9e 100644
--- a/test/replication/wal_rw_stress.result
+++ b/test/replication/wal_rw_stress.result
@@ -71,7 +71,7 @@ test_run:cmd("switch replica")
box.cfg{replication = replication}
---
...
-box.info.replication[1].downstream.status ~= 'stopped' or box.info
+test_run:wait_cond(function() return box.info.replication[1].downstream.status ~= 'stopped' end) or box.info
---
- true
...
diff --git a/test/replication/wal_rw_stress.test.lua b/test/replication/wal_rw_stress.test.lua
index 08570b285..48d68c5ac 100644
--- a/test/replication/wal_rw_stress.test.lua
+++ b/test/replication/wal_rw_stress.test.lua
@@ -38,7 +38,7 @@ test_run:cmd("setopt delimiter ''");
-- are running in different threads, there shouldn't be any rw errors.
test_run:cmd("switch replica")
box.cfg{replication = replication}
-box.info.replication[1].downstream.status ~= 'stopped' or box.info
+test_run:wait_cond(function() return box.info.replication[1].downstream.status ~= 'stopped' end) or box.info
test_run:cmd("switch default")
-- Cleanup.
--
2.17.1
* Re: [Tarantool-patches] [PATCH v1] test: fix flaky replication/wal_rw_stress.test.lua
2020-06-15 14:34 [Tarantool-patches] [PATCH v1] test: fix flaky replication/wal_rw_stress.test.lua Alexander V. Tikhonov
@ 2020-06-18 20:50 ` Alexander Turenko
2020-06-19 13:38 ` Alexander V. Tikhonov
2020-06-26 9:32 ` Kirill Yukhin
1 sibling, 1 reply; 5+ messages in thread
From: Alexander Turenko @ 2020-06-18 20:50 UTC (permalink / raw)
To: Alexander V. Tikhonov; +Cc: tarantool-patches
TL;DR: Can you verify that the problem we want to detect with the test
still may be detected after the fix?
(More details are below.)
WBR, Alexander Turenko.
> diff --git a/test/replication/wal_rw_stress.test.lua b/test/replication/wal_rw_stress.test.lua
> index 08570b285..48d68c5ac 100644
> --- a/test/replication/wal_rw_stress.test.lua
> +++ b/test/replication/wal_rw_stress.test.lua
> @@ -38,7 +38,7 @@ test_run:cmd("setopt delimiter ''");
> -- are running in different threads, there shouldn't be any rw errors.
> test_run:cmd("switch replica")
> box.cfg{replication = replication}
> -box.info.replication[1].downstream.status ~= 'stopped' or box.info
> +test_run:wait_cond(function() return box.info.replication[1].downstream.status ~= 'stopped' end) or box.info
> test_run:cmd("switch default")
The comment above says 'there shouldn't be any rw errors'. Your fix
hides a transient 'writev(1), <...>' error, which I guess is a temporary
connectivity problem. But I guess it may also hide the rw error for
which the test case was added (a disk-related one). Or should such an
error keep the relay in the stopped state forever?
I tried to revert b9db91e1cdcc97c269703420c7b292e0f125f0ec ('xlog: fix
fallocate vs read race') (only src/, not test/), removed the test from
the fragile list, cleaned the repository (to ensure that we'll run with
the new HAVE_POSIX_FALLOCATE value) and ran the test 1000 times in 32
parallel jobs:
$ (cd test && ./test-run.py -j 32 $(yes replication/wal_rw_stress | head -n 1000))
<...>
Statistics:
* pass: 1000
My plan was: reproduce the original issue (#3883) and verify that your
fix does not hide it. However, the plan failed at the first step.
Can you check whether #3883 is reproducible for you after reverting the
fix?
Even if it hides the original problem, the error message should
differ. I guess we can filter connectivity problems out from disk rw
problems in the wait_cond() function.
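Roughly like this (just a sketch: the message patterns are assumptions
based on the logs above, and it assumes an error raised inside the
condition propagates out of wait_cond()):

    test_run:wait_cond(function()
        local d = box.info.replication[1].downstream
        if d == nil then return false end           -- peer not seen yet
        if d.status ~= 'stopped' then return true end
        -- A disk rw error is persistent: fail fast instead of waiting
        -- out the timeout on it.
        if d.message and d.message:match('checksum mismatch') then
            error(d.message)
        end
        return false -- transient connectivity error: keep polling
    end) or box.info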
BTW, I also checked whether #4977 is reproducible on master for me, and
it seems not: 1000 tests passed in 32 parallel jobs.
Maybe it is reproducible only in some specific environment? On FreeBSD
and/or in VirtualBox? I tried it on a Linux laptop (initially I missed
that it occurs on FreeBSD, sorry).
Side note: I suggest using something like the following to carry long
lines:
| test_run:wait_cond(function()                                       \
|     return box.info.replication[1].downstream.status ~= 'stopped'   \
| end) or box.info
* Re: [Tarantool-patches] [PATCH v1] test: fix flaky replication/wal_rw_stress.test.lua
2020-06-18 20:50 ` Alexander Turenko
@ 2020-06-19 13:38 ` Alexander V. Tikhonov
2020-06-23 14:52 ` Alexander Turenko
0 siblings, 1 reply; 5+ messages in thread
From: Alexander V. Tikhonov @ 2020-06-19 13:38 UTC (permalink / raw)
To: Alexander Turenko; +Cc: tarantool-patches
Hi Alexander, thanks for the review; please check my comments.
I also found that the comment in the original test mistakenly used
issue number 3893 instead of 3883 - I've fixed it.
On Thu, Jun 18, 2020 at 11:50:46PM +0300, Alexander Turenko wrote:
> TL;DR: Can you verify that the problem we want to detect with the test
> still may be detected after the fix?
>
> (More details are below.)
>
> WBR, Alexander Turenko.
>
> > diff --git a/test/replication/wal_rw_stress.test.lua b/test/replication/wal_rw_stress.test.lua
> > index 08570b285..48d68c5ac 100644
> > --- a/test/replication/wal_rw_stress.test.lua
> > +++ b/test/replication/wal_rw_stress.test.lua
> > @@ -38,7 +38,7 @@ test_run:cmd("setopt delimiter ''");
> > -- are running in different threads, there shouldn't be any rw errors.
> > test_run:cmd("switch replica")
> > box.cfg{replication = replication}
> > -box.info.replication[1].downstream.status ~= 'stopped' or box.info
> > +test_run:wait_cond(function() return box.info.replication[1].downstream.status ~= 'stopped' end) or box.info
> > test_run:cmd("switch default")
>
> The comment above says 'there shouldn't be any rw errors'. Your fix
> hides a transient 'writev(1), <...>' error, which I guess is a temporary
> connectivity problem. But I guess it may also hide the rw error for
> which the test case was added (a disk-related one). Or should such an
> error keep the relay in the stopped state forever?
I've checked the error for which the test was added: I reverted the
b9db91e1cdcc97c269703420c7b292e0f125f0ec ('xlog: fix fallocate vs read
race') patch and successfully got the expected "tx checksum
mismatch" error:
[153] --- replication/wal_rw_stress.result Fri Jun 19 15:01:49 2020
[153] +++ replication/wal_rw_stress.reject Fri Jun 19 15:04:02 2020
[153] @@ -73,7 +73,43 @@
[153] ...
[153] test_run:wait_cond(function() return box.info.replication[1].downstream.status ~= 'stopped' end) or box.info
[153] ---
[153] -- true
[153] +- version: 2.5.0-147-ge7a70be
[153] + id: 2
[153] + ro: false
[153] + uuid: ce5bf8e7-9147-4753-813f-fd1f28e1b6e6
[153] + package: Tarantool
[153] + cluster:
[153] + uuid: 879f16f2-4d0b-4d00-a3e5-0e4c5fffb8e2
[153] + listen: unix/:(socket)
[153] + replication_anon:
[153] + count: 0
[153] + replication:
[153] + 1:
[153] + id: 1
[153] + uuid: c9062e75-97e5-44e5-82fd-226864f95415
[153] + lsn: 20016
[153] + upstream:
[153] + status: follow
[153] + idle: 0.032541885972023
[153] + peer: unix/:/export/avtikhon/bld/test/var/153_replication/master.socket-iproto
[153] + lag: 3.6001205444336e-05
[153] + downstream:
[153] + status: stopped
[153] + message: tx checksum mismatch
>
> I tried to revert b9db91e1cdcc97c269703420c7b292e0f125f0ec ('xlog: fix
> fallocate vs read race') (only src/, not test/), removed the test from
> the fragile list, cleaned the repository (to ensure that we'll run with
> the new HAVE_POSIX_FALLOCATE value) and ran the test 1000 times in 32
> parallel jobs:
>
> $ (cd test && ./test-run.py -j 32 $(yes replication/wal_rw_stress | head -n 1000))
> <...>
> Statistics:
> * pass: 1000
>
I've used the following script:
( cd /export/avtikhon/src && \
git reset --hard && \
git checkout master -f && \
patch -R -p1 -i revert3883_wo_tests.patch && \
patch -p1 -i 4977.patch && \
cd ../bld/ && \
rm -rf * && \
cmake ../src -DCMAKE_BUILD_TYPE=Debug && \
make -j && \
cd ../src/test && \
( l=0 ; while ./test-run.py --long --builddir ../../bld --vardir ../../bld/test/var -j200 \
`for r in {1..400} ; do echo replication/wal_rw_stress.test.lua ; done 2>/dev/null` ; \
do l=$(($l+1)) ; echo ======== $l ============= ; done | tee a.log 2>&1 ) && \
grep "tx checksum mismatch" a.log )
> My plan was: reproduce the original issue (#3883) and verify that your
> fix does not hide it. However, the plan failed at the first step.
>
> Can you check whether #3883 is reproducible for you after reverting the
> fix?
>
> Even if it hides the original problem, the error message should
> differ. I guess we can filter connectivity problems out from disk rw
> problems in the wait_cond() function.
>
Please check comments above.
> BTW, I also checked whether #4977 is reproducible on master for me, and
> it seems not: 1000 tests passed in 32 parallel jobs.
>
> Maybe it is reproducible only in some specific environment? On FreeBSD
> and/or in VirtualBox? I tried it on a Linux laptop (initially I missed
> that it occurs on FreeBSD, sorry).
>
Right, the issue is really hard to reproduce; it was reproduced only on
the FreeBSD VirtualBox machine.
> Side note: I suggest using something like the following to carry long
> lines:
>
> | test_run:wait_cond(function()                                       \
> |     return box.info.replication[1].downstream.status ~= 'stopped'   \
> | end) or box.info
Ok, sure, I've fixed it.
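For reference, the wrapped statement might look like this (a sketch; the
trailing backslashes are the console's line-continuation convention, as
in the suggestion above):

    test_run:wait_cond(function()                                      \
        return box.info.replication[1].downstream.status ~= 'stopped'  \
    end) or box.info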
* Re: [Tarantool-patches] [PATCH v1] test: fix flaky replication/wal_rw_stress.test.lua
2020-06-19 13:38 ` Alexander V. Tikhonov
@ 2020-06-23 14:52 ` Alexander Turenko
0 siblings, 0 replies; 5+ messages in thread
From: Alexander Turenko @ 2020-06-23 14:52 UTC (permalink / raw)
To: Alexander V. Tikhonov; +Cc: tarantool-patches
On Fri, Jun 19, 2020 at 04:38:00PM +0300, Alexander V. Tikhonov wrote:
> Hi Alexander, thanks for the review; please check my comments.
> I also found that the comment in the original test mistakenly used
> issue number 3893 instead of 3883 - I've fixed it.
>
> On Thu, Jun 18, 2020 at 11:50:46PM +0300, Alexander Turenko wrote:
> > TL;DR: Can you verify that the problem we want to detect with the test
> > still may be detected after the fix?
> >
> > (More details are below.)
> >
> > WBR, Alexander Turenko.
> >
> > > diff --git a/test/replication/wal_rw_stress.test.lua b/test/replication/wal_rw_stress.test.lua
> > > index 08570b285..48d68c5ac 100644
> > > --- a/test/replication/wal_rw_stress.test.lua
> > > +++ b/test/replication/wal_rw_stress.test.lua
> > > @@ -38,7 +38,7 @@ test_run:cmd("setopt delimiter ''");
> > > -- are running in different threads, there shouldn't be any rw errors.
> > > test_run:cmd("switch replica")
> > > box.cfg{replication = replication}
> > > -box.info.replication[1].downstream.status ~= 'stopped' or box.info
> > > +test_run:wait_cond(function() return box.info.replication[1].downstream.status ~= 'stopped' end) or box.info
> > > test_run:cmd("switch default")
> >
> > The comment above says 'there shouldn't be any rw errors'. Your fix
> > hides a transient 'writev(1), <...>' error, which I guess is a temporary
> > connectivity problem. But I guess it may also hide the rw error for
> > which the test case was added (a disk-related one). Or should such an
> > error keep the relay in the stopped state forever?
>
> I've checked the error for which the test was added: I reverted the
> b9db91e1cdcc97c269703420c7b292e0f125f0ec ('xlog: fix fallocate vs read
> race') patch and successfully got the expected "tx checksum
> mismatch" error:
Thanks for the verification! Okay then, LGTM.
Nit: I would add a comment to the test saying that wait_cond() lets us
ride out transient network connectivity errors, while a 'tx checksum
mismatch' is persistent and will still be caught.
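A possible wording for such a comment (a sketch only; the exact text is
up to the author):

    -- wait_cond() rides out transient connectivity errors (writev(),
    -- Broken pipe): the relay restarts and the downstream leaves the
    -- 'stopped' state. A persistent rw error such as 'tx checksum
    -- mismatch' keeps it stopped, so the condition times out and the
    -- test still fails as intended.
    test_run:wait_cond(function()                                      \
        return box.info.replication[1].downstream.status ~= 'stopped'  \
    end) or box.info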
WBR, Alexander Turenko.
* Re: [Tarantool-patches] [PATCH v1] test: fix flaky replication/wal_rw_stress.test.lua
2020-06-15 14:34 [Tarantool-patches] [PATCH v1] test: fix flaky replication/wal_rw_stress.test.lua Alexander V. Tikhonov
2020-06-18 20:50 ` Alexander Turenko
@ 2020-06-26 9:32 ` Kirill Yukhin
1 sibling, 0 replies; 5+ messages in thread
From: Kirill Yukhin @ 2020-06-26 9:32 UTC (permalink / raw)
To: Alexander V. Tikhonov; +Cc: tarantool-patches, Alexander Turenko
Hello,
On 15 Jun 17:34, Alexander V. Tikhonov wrote:
> Found issue:
>
> [016] --- replication/wal_rw_stress.result Fri Feb 21 11:53:21 2020
> [016] +++ replication/wal_rw_stress.reject Fri May 8 08:23:56 2020
> [016] @@ -73,7 +73,42 @@
> [016] ...
> [016] box.info.replication[1].downstream.status ~= 'stopped' or box.info
> [016] ---
> [016] -- true
> [016] +- version: 2.5.0-27-g32f59756a
> [016] + id: 2
> [016] + ro: false
> [016] + uuid: 41cbebcc-9105-11ea-96ab-08002739cbd6
> [016] + package: Tarantool
> [016] + cluster:
> [016] + uuid: 397c196f-9105-11ea-96ab-08002739cbd6
> [016] + listen: unix/:/home/vagrant/tarantool/test/var/016_replication/replica.socket-iproto
> [016] + replication:
> [016] + 1:
> [016] + id: 1
> [016] + uuid: 397a1886-9105-11ea-96ab-08002739cbd6
> [016] + lsn: 10005
> [016] + upstream:
> [016] + status: follow
> [016] + idle: 0.46353673400017
> [016] + peer: unix/:/home/vagrant/tarantool/test/var/016_replication/master.socket-iproto
> [016] + lag: -0.45732522010803
> [016] + downstream:
> [016] + status: stopped
> [016] + message: writev(1), called on fd 24, aka unix/:/home/vagrant/tarantool/test/var/016_replicati
> [016] + system_message: Broken pipe
> [016] + 2:
> [016] + id: 2
> [016] + uuid: 41cbebcc-9105-11ea-96ab-08002739cbd6
> [016] + lsn: 0
> [016] + signature: 10005
> [016] + status: running
> [016] + vinyl: []
> [016] + uptime: 2
> [016] + lsn: 0
> [016] + sql: []
> [016] + gc: []
> [016] + pid: 41231
> [016] + memory: []
> [016] + vclock: {1: 10005}
> [016] ...
> [016] test_run:cmd("switch default")
> [016] ---
>
> To check the downstream status and its message, we need to wait until
> the downstream appears. This prevents an attempt to index a nil value
> when the check runs before a record about the peer appears in
> box.info.replication. It was observed on the
> replication/show_error_on_disconnect test after commit
> c6bea65f8ef5f6c737cf70c0127189d0ebcbc36e ('replication: recfg with 0
> quorum returns immediately').
>
> Closes #4977
I've checked your patch into 1.10, 2.3, 2.4 and master.
--
Regards, Kirill Yukhin