* [Tarantool-patches] [PATCH v1] test: fix flaky replication/wal_rw_stress.test.lua
@ 2020-06-15 14:34 Alexander V. Tikhonov
2020-06-18 20:50 ` Alexander Turenko
2020-06-26 9:32 ` Kirill Yukhin
0 siblings, 2 replies; 5+ messages in thread
From: Alexander V. Tikhonov @ 2020-06-15 14:34 UTC (permalink / raw)
To: Sergey Bronnikov, Alexander Turenko, Kirill Yukhin; +Cc: tarantool-patches
Found issue:
[016] --- replication/wal_rw_stress.result Fri Feb 21 11:53:21 2020
[016] +++ replication/wal_rw_stress.reject Fri May 8 08:23:56 2020
[016] @@ -73,7 +73,42 @@
[016] ...
[016] box.info.replication[1].downstream.status ~= 'stopped' or box.info
[016] ---
[016] -- true
[016] +- version: 2.5.0-27-g32f59756a
[016] + id: 2
[016] + ro: false
[016] + uuid: 41cbebcc-9105-11ea-96ab-08002739cbd6
[016] + package: Tarantool
[016] + cluster:
[016] + uuid: 397c196f-9105-11ea-96ab-08002739cbd6
[016] + listen: unix/:/home/vagrant/tarantool/test/var/016_replication/replica.socket-iproto
[016] + replication:
[016] + 1:
[016] + id: 1
[016] + uuid: 397a1886-9105-11ea-96ab-08002739cbd6
[016] + lsn: 10005
[016] + upstream:
[016] + status: follow
[016] + idle: 0.46353673400017
[016] + peer: unix/:/home/vagrant/tarantool/test/var/016_replication/master.socket-iproto
[016] + lag: -0.45732522010803
[016] + downstream:
[016] + status: stopped
[016] + message: writev(1), called on fd 24, aka unix/:/home/vagrant/tarantool/test/var/016_replicati
[016] + system_message: Broken pipe
[016] + 2:
[016] + id: 2
[016] + uuid: 41cbebcc-9105-11ea-96ab-08002739cbd6
[016] + lsn: 0
[016] + signature: 10005
[016] + status: running
[016] + vinyl: []
[016] + uptime: 2
[016] + lsn: 0
[016] + sql: []
[016] + gc: []
[016] + pid: 41231
[016] + memory: []
[016] + vclock: {1: 10005}
[016] ...
[016] test_run:cmd("switch default")
[016] ---
To check the downstream status and its message, we need to wait until
the downstream appears. This prevents an attempt to index a nil value
when the check runs before a record about the peer appears in
box.info.replication. It was observed on the
replication/show_error_on_disconnect test after commit
c6bea65f8ef5f6c737cf70c0127189d0ebcbc36e ('replication: recfg with 0
quorum returns immediately').
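The idea, roughly (a sketch only; it assumes test-run's
wait_cond(cond[, timeout]) polls cond until it returns a truthy value
or the timeout expires, and the nil guards illustrate the failure mode):

    test_run:wait_cond(function()
        -- box.info.replication[1] and its downstream field may not
        -- exist yet right after box.cfg{replication = ...}.
        local r = box.info.replication[1]
        return r and r.downstream and r.downstream.status ~= 'stopped'
    end) or box.info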
Closes #4977
---
Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-4977-replication-wal_rw_stress
Issue: https://github.com/tarantool/tarantool/issues/4977
test/replication/suite.ini | 1 -
test/replication/wal_rw_stress.result | 2 +-
test/replication/wal_rw_stress.test.lua | 2 +-
3 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index 6a8944020..231bedaaf 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -18,7 +18,6 @@ fragile = errinj.test.lua ; gh-3870
sync.test.lua ; gh-3835 gh-3877
transaction.test.lua ; gh-4312
wal_off.test.lua ; gh-4355
- wal_rw_stress.test.lua ; gh-4977
replica_rejoin.test.lua ; gh-4985
recover_missing_xlog.test.lua ; gh-4989
box_set_replication_stress ; gh-4992
diff --git a/test/replication/wal_rw_stress.result b/test/replication/wal_rw_stress.result
index cc68877b0..cfb2f8a9e 100644
--- a/test/replication/wal_rw_stress.result
+++ b/test/replication/wal_rw_stress.result
@@ -71,7 +71,7 @@ test_run:cmd("switch replica")
box.cfg{replication = replication}
---
...
-box.info.replication[1].downstream.status ~= 'stopped' or box.info
+test_run:wait_cond(function() return box.info.replication[1].downstream.status ~= 'stopped' end) or box.info
---
- true
...
diff --git a/test/replication/wal_rw_stress.test.lua b/test/replication/wal_rw_stress.test.lua
index 08570b285..48d68c5ac 100644
--- a/test/replication/wal_rw_stress.test.lua
+++ b/test/replication/wal_rw_stress.test.lua
@@ -38,7 +38,7 @@ test_run:cmd("setopt delimiter ''");
-- are running in different threads, there shouldn't be any rw errors.
test_run:cmd("switch replica")
box.cfg{replication = replication}
-box.info.replication[1].downstream.status ~= 'stopped' or box.info
+test_run:wait_cond(function() return box.info.replication[1].downstream.status ~= 'stopped' end) or box.info
test_run:cmd("switch default")
-- Cleanup.
--
2.17.1
* Re: [Tarantool-patches] [PATCH v1] test: fix flaky replication/wal_rw_stress.test.lua
2020-06-15 14:34 [Tarantool-patches] [PATCH v1] test: fix flaky replication/wal_rw_stress.test.lua Alexander V. Tikhonov
@ 2020-06-18 20:50 ` Alexander Turenko
2020-06-19 13:38 ` Alexander V. Tikhonov
2020-06-26 9:32 ` Kirill Yukhin
1 sibling, 1 reply; 5+ messages in thread
From: Alexander Turenko @ 2020-06-18 20:50 UTC (permalink / raw)
To: Alexander V. Tikhonov; +Cc: tarantool-patches
TL;DR: Can you verify that the problem we want to detect with the test
still may be detected after the fix?
(More details are below.)
WBR, Alexander Turenko.
> diff --git a/test/replication/wal_rw_stress.test.lua b/test/replication/wal_rw_stress.test.lua
> index 08570b285..48d68c5ac 100644
> --- a/test/replication/wal_rw_stress.test.lua
> +++ b/test/replication/wal_rw_stress.test.lua
> @@ -38,7 +38,7 @@ test_run:cmd("setopt delimiter ''");
> -- are running in different threads, there shouldn't be any rw errors.
> test_run:cmd("switch replica")
> box.cfg{replication = replication}
> -box.info.replication[1].downstream.status ~= 'stopped' or box.info
> +test_run:wait_cond(function() return box.info.replication[1].downstream.status ~= 'stopped' end) or box.info
> test_run:cmd("switch default")
The comment above says 'there shouldn't be any rw errors'. Your fix
hides a transient 'writev(1), <...>' error, which I guess is a temporary
connectivity problem. But I guess it may also hide the rw error for
which the test case was added (a disk-related one). Or should such an
error keep the relay in the stopped state forever?
I tried to revert b9db91e1cdcc97c269703420c7b292e0f125f0ec ('xlog: fix
fallocate vs read race') (only src/, not test/), removed the test from
the fragile list, cleaned the repository (to ensure that we'll run with
the new HAVE_POSIX_FALLOCATE value) and ran the test 1000 times in 32
parallel jobs:
$ (cd test && ./test-run.py -j 32 $(yes replication/wal_rw_stress | head -n 1000))
<...>
Statistics:
* pass: 1000
My plan was: reproduce the original issue (#3883) and verify that your
fix does not hide it. However, the plan failed at the first step.
Can you check whether #3883 is reproducible for you after reverting the
fix?
Even if it hides the original problem, the error message should
differ. I guess we can filter connectivity problems out from disk rw
problems in the wait_cond() function.
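Roughly like this (just a sketch: the message patterns are assumptions
based on the logs above, and it assumes an error raised inside the
condition propagates out of wait_cond()):

    test_run:wait_cond(function()
        local d = box.info.replication[1].downstream
        if d == nil then return false end           -- peer not seen yet
        if d.status ~= 'stopped' then return true end
        -- A disk rw error is persistent: fail fast instead of waiting
        -- out the timeout on it.
        if d.message and d.message:match('checksum mismatch') then
            error(d.message)
        end
        return false -- transient connectivity error: keep polling
    end) or box.info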
BTW, I also checked whether #4977 is reproducible on master for me, and
it seems not: 1000 tests passed in 32 parallel jobs.
Maybe it is reproducible only in some specific environment? On FreeBSD
and/or in VirtualBox? I tried it on a Linux laptop (initially I missed
that it occurs on FreeBSD, sorry).
Side note: I suggest using something like the following to carry long
lines:
| test_run:wait_cond(function()                                       \
|     return box.info.replication[1].downstream.status ~= 'stopped'   \
| end) or box.info
* Re: [Tarantool-patches] [PATCH v1] test: fix flaky replication/wal_rw_stress.test.lua
2020-06-18 20:50 ` Alexander Turenko
@ 2020-06-19 13:38 ` Alexander V. Tikhonov
2020-06-23 14:52 ` Alexander Turenko
0 siblings, 1 reply; 5+ messages in thread
From: Alexander V. Tikhonov @ 2020-06-19 13:38 UTC (permalink / raw)
To: Alexander Turenko; +Cc: tarantool-patches
Hi Alexander, thanks for the review; please check my comments.
I also found that the comment in the original test mistakenly used
issue number 3893 instead of 3883 - I've fixed it.
On Thu, Jun 18, 2020 at 11:50:46PM +0300, Alexander Turenko wrote:
> TL;DR: Can you verify that the problem we want to detect with the test
> still may be detected after the fix?
>
> (More details are below.)
>
> WBR, Alexander Turenko.
>
> > diff --git a/test/replication/wal_rw_stress.test.lua b/test/replication/wal_rw_stress.test.lua
> > index 08570b285..48d68c5ac 100644
> > --- a/test/replication/wal_rw_stress.test.lua
> > +++ b/test/replication/wal_rw_stress.test.lua
> > @@ -38,7 +38,7 @@ test_run:cmd("setopt delimiter ''");
> > -- are running in different threads, there shouldn't be any rw errors.
> > test_run:cmd("switch replica")
> > box.cfg{replication = replication}
> > -box.info.replication[1].downstream.status ~= 'stopped' or box.info
> > +test_run:wait_cond(function() return box.info.replication[1].downstream.status ~= 'stopped' end) or box.info
> > test_run:cmd("switch default")
>
> The comment above says 'there shouldn't be any rw errors'. Your fix
> hides a transient 'writev(1), <...>' error, which I guess is a temporary
> connectivity problem. But I guess it may also hide the rw error for
> which the test case was added (a disk-related one). Or should such an
> error keep the relay in the stopped state forever?
I've checked the error for which the test was added: I reverted the
b9db91e1cdcc97c269703420c7b292e0f125f0ec ('xlog: fix fallocate vs read
race') patch and successfully got the expected "tx checksum
mismatch" error:
[153] --- replication/wal_rw_stress.result Fri Jun 19 15:01:49 2020
[153] +++ replication/wal_rw_stress.reject Fri Jun 19 15:04:02 2020
[153] @@ -73,7 +73,43 @@
[153] ...
[153] test_run:wait_cond(function() return box.info.replication[1].downstream.status ~= 'stopped' end) or box.info
[153] ---
[153] -- true
[153] +- version: 2.5.0-147-ge7a70be
[153] + id: 2
[153] + ro: false
[153] + uuid: ce5bf8e7-9147-4753-813f-fd1f28e1b6e6
[153] + package: Tarantool
[153] + cluster:
[153] + uuid: 879f16f2-4d0b-4d00-a3e5-0e4c5fffb8e2
[153] + listen: unix/:(socket)
[153] + replication_anon:
[153] + count: 0
[153] + replication:
[153] + 1:
[153] + id: 1
[153] + uuid: c9062e75-97e5-44e5-82fd-226864f95415
[153] + lsn: 20016
[153] + upstream:
[153] + status: follow
[153] + idle: 0.032541885972023
[153] + peer: unix/:/export/avtikhon/bld/test/var/153_replication/master.socket-iproto
[153] + lag: 3.6001205444336e-05
[153] + downstream:
[153] + status: stopped
[153] + message: tx checksum mismatch
>
> I tried to revert b9db91e1cdcc97c269703420c7b292e0f125f0ec ('xlog: fix
> fallocate vs read race') (only src/, not test/), removed the test from
> the fragile list, cleaned the repository (to ensure that we'll run with
> the new HAVE_POSIX_FALLOCATE value) and ran the test 1000 times in 32
> parallel jobs:
>
> $ (cd test && ./test-run.py -j 32 $(yes replication/wal_rw_stress | head -n 1000))
> <...>
> Statistics:
> * pass: 1000
>
I've used the following script:
( cd /export/avtikhon/src && \
git reset --hard && \
git checkout master -f && \
patch -R -p1 -i revert3883_wo_tests.patch && \
patch -p1 -i 4977.patch && \
cd ../bld/ && \
rm -rf * && \
cmake ../src -DCMAKE_BUILD_TYPE=Debug && \
make -j && \
cd ../src/test && \
( l=0 ; while ./test-run.py --long --builddir ../../bld --vardir ../../bld/test/var -j200 \
`for r in {1..400} ; do echo replication/wal_rw_stress.test.lua ; done 2>/dev/null` ; \
do l=$(($l+1)) ; echo ======== $l ============= ; done | tee a.log 2>&1 ) && \
grep "tx checksum mismatch" a.log )
> My plan was: reproduce the original issue (#3883) and verify that your
> fix does not hide it. However, the plan failed at the first step.
>
> Can you check whether #3883 is reproducible for you after reverting the
> fix?
>
> Even if it hides the original problem, the error message should
> differ. I guess we can filter connectivity problems out from disk rw
> problems in the wait_cond() function.
>
Please check comments above.
> BTW, I also checked whether #4977 is reproducible on master for me, and
> it seems not: 1000 tests passed in 32 parallel jobs.
>
> Maybe it is reproducible only in some specific environment? On FreeBSD
> and/or in VirtualBox? I tried it on a Linux laptop (initially I missed
> that it occurs on FreeBSD, sorry).
>
Right, the issue is really hard to reproduce; it was reproduced only on
the FreeBSD VirtualBox machine.
> Side note: I suggest using something like the following to carry long
> lines:
>
> | test_run:wait_cond(function()                                       \
> |     return box.info.replication[1].downstream.status ~= 'stopped'   \
> | end) or box.info
Ok, sure, I've fixed it.
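For reference, the wrapped statement might look like this (a sketch; the
trailing backslashes are the console's line-continuation convention, as
in the suggestion above):

    test_run:wait_cond(function()                                      \
        return box.info.replication[1].downstream.status ~= 'stopped'  \
    end) or box.info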
* Re: [Tarantool-patches] [PATCH v1] test: fix flaky replication/wal_rw_stress.test.lua
2020-06-19 13:38 ` Alexander V. Tikhonov
@ 2020-06-23 14:52 ` Alexander Turenko
0 siblings, 0 replies; 5+ messages in thread
From: Alexander Turenko @ 2020-06-23 14:52 UTC (permalink / raw)
To: Alexander V. Tikhonov; +Cc: tarantool-patches
On Fri, Jun 19, 2020 at 04:38:00PM +0300, Alexander V. Tikhonov wrote:
> Hi Alexander, thanks for the review; please check my comments.
> I also found that the comment in the original test mistakenly used
> issue number 3893 instead of 3883 - I've fixed it.
>
> On Thu, Jun 18, 2020 at 11:50:46PM +0300, Alexander Turenko wrote:
> > TL;DR: Can you verify that the problem we want to detect with the test
> > still may be detected after the fix?
> >
> > (More details are below.)
> >
> > WBR, Alexander Turenko.
> >
> > > diff --git a/test/replication/wal_rw_stress.test.lua b/test/replication/wal_rw_stress.test.lua
> > > index 08570b285..48d68c5ac 100644
> > > --- a/test/replication/wal_rw_stress.test.lua
> > > +++ b/test/replication/wal_rw_stress.test.lua
> > > @@ -38,7 +38,7 @@ test_run:cmd("setopt delimiter ''");
> > > -- are running in different threads, there shouldn't be any rw errors.
> > > test_run:cmd("switch replica")
> > > box.cfg{replication = replication}
> > > -box.info.replication[1].downstream.status ~= 'stopped' or box.info
> > > +test_run:wait_cond(function() return box.info.replication[1].downstream.status ~= 'stopped' end) or box.info
> > > test_run:cmd("switch default")
> >
> > The comment above says 'there shouldn't be any rw errors'. Your fix
> > hides a transient 'writev(1), <...>' error, which I guess is a temporary
> > connectivity problem. But I guess it may also hide the rw error for
> > which the test case was added (a disk-related one). Or should such an
> > error keep the relay in the stopped state forever?
>
> I've checked the error for which the test was added: I reverted the
> b9db91e1cdcc97c269703420c7b292e0f125f0ec ('xlog: fix fallocate vs read
> race') patch and successfully got the expected "tx checksum
> mismatch" error:
Thanks for the verification! Okay then, LGTM.
Nit: I would add a comment to the test saying that wait_cond() lets us
ride out transient network connectivity errors, while a 'tx checksum
mismatch' is persistent and will still be caught.
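A possible wording for such a comment (a sketch only; the exact text is
up to the author):

    -- wait_cond() rides out transient connectivity errors (writev(),
    -- Broken pipe): the relay restarts and the downstream leaves the
    -- 'stopped' state. A persistent rw error such as 'tx checksum
    -- mismatch' keeps it stopped, so the condition times out and the
    -- test still fails as intended.
    test_run:wait_cond(function()                                      \
        return box.info.replication[1].downstream.status ~= 'stopped'  \
    end) or box.info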
WBR, Alexander Turenko.
* Re: [Tarantool-patches] [PATCH v1] test: fix flaky replication/wal_rw_stress.test.lua
2020-06-15 14:34 [Tarantool-patches] [PATCH v1] test: fix flaky replication/wal_rw_stress.test.lua Alexander V. Tikhonov
2020-06-18 20:50 ` Alexander Turenko
@ 2020-06-26 9:32 ` Kirill Yukhin
1 sibling, 0 replies; 5+ messages in thread
From: Kirill Yukhin @ 2020-06-26 9:32 UTC (permalink / raw)
To: Alexander V. Tikhonov; +Cc: tarantool-patches, Alexander Turenko
Hello,
On 15 Jun 17:34, Alexander V. Tikhonov wrote:
> Found issue:
>
> [016] --- replication/wal_rw_stress.result Fri Feb 21 11:53:21 2020
> [016] +++ replication/wal_rw_stress.reject Fri May 8 08:23:56 2020
> [016] @@ -73,7 +73,42 @@
> [016] ...
> [016] box.info.replication[1].downstream.status ~= 'stopped' or box.info
> [016] ---
> [016] -- true
> [016] +- version: 2.5.0-27-g32f59756a
> [016] + id: 2
> [016] + ro: false
> [016] + uuid: 41cbebcc-9105-11ea-96ab-08002739cbd6
> [016] + package: Tarantool
> [016] + cluster:
> [016] + uuid: 397c196f-9105-11ea-96ab-08002739cbd6
> [016] + listen: unix/:/home/vagrant/tarantool/test/var/016_replication/replica.socket-iproto
> [016] + replication:
> [016] + 1:
> [016] + id: 1
> [016] + uuid: 397a1886-9105-11ea-96ab-08002739cbd6
> [016] + lsn: 10005
> [016] + upstream:
> [016] + status: follow
> [016] + idle: 0.46353673400017
> [016] + peer: unix/:/home/vagrant/tarantool/test/var/016_replication/master.socket-iproto
> [016] + lag: -0.45732522010803
> [016] + downstream:
> [016] + status: stopped
> [016] + message: writev(1), called on fd 24, aka unix/:/home/vagrant/tarantool/test/var/016_replicati
> [016] + system_message: Broken pipe
> [016] + 2:
> [016] + id: 2
> [016] + uuid: 41cbebcc-9105-11ea-96ab-08002739cbd6
> [016] + lsn: 0
> [016] + signature: 10005
> [016] + status: running
> [016] + vinyl: []
> [016] + uptime: 2
> [016] + lsn: 0
> [016] + sql: []
> [016] + gc: []
> [016] + pid: 41231
> [016] + memory: []
> [016] + vclock: {1: 10005}
> [016] ...
> [016] test_run:cmd("switch default")
> [016] ---
>
> To check the downstream status and its message, we need to wait until
> the downstream appears. This prevents an attempt to index a nil value
> when the check runs before a record about the peer appears in
> box.info.replication. It was observed on the
> replication/show_error_on_disconnect test after commit
> c6bea65f8ef5f6c737cf70c0127189d0ebcbc36e ('replication: recfg with 0
> quorum returns immediately').
>
> Closes #4977
I've checked your patch into 1.10, 2.3, 2.4 and master.
--
Regards, Kirill Yukhin