Tarantool development patches archive
* [Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test
@ 2019-11-18 13:19 Ilya Kosarev
  2019-11-18 13:19 ` [Tarantool-patches] [PATCH 1/2] replication: make anon replicas iteration safe Ilya Kosarev
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Ilya Kosarev @ 2019-11-18 13:19 UTC (permalink / raw)
  To: tarantool-patches

This patchset fixes an anonymous replica iteration issue in
replicaset_follow and stabilizes the quorum test.

Branch: https://github.com/tarantool/tarantool/tree/i.kosarev/gh-4586-fix-quorum-test
Issue: https://github.com/tarantool/tarantool/issues/4586

Ilya Kosarev (2):
  replication: make anon replicas iteration safe
  test: stabilize quorum test conditions

 src/box/replication.cc           |  3 ++-
 test/replication/quorum.result   | 28 +++++++++++++++++++---------
 test/replication/quorum.test.lua | 16 +++++++++-------
 3 files changed, 30 insertions(+), 17 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Tarantool-patches] [PATCH 1/2] replication: make anon replicas iteration safe
  2019-11-18 13:19 [Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test Ilya Kosarev
@ 2019-11-18 13:19 ` Ilya Kosarev
  2019-11-18 13:19 ` [Tarantool-patches] [PATCH 2/2] test: stabilize quorum test conditions Ilya Kosarev
  2019-11-19 23:13 ` [Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test Vladislav Shpilevoy
  2 siblings, 0 replies; 12+ messages in thread
From: Ilya Kosarev @ 2019-11-18 13:19 UTC (permalink / raw)
  To: tarantool-patches

In replicaset_follow we iterate over the anon replicas list: the list of
replicas that haven't received a UUID yet. On a successful connect the
replica link is removed from the anon list. If that happens immediately,
without a yield in the applier, the iteration breaks. It is now fixed by
using rlist_foreach_entry_safe, which caches the next entry before the
loop body runs, instead of the plain rlist_foreach_entry.
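
For illustration, a minimal self-contained sketch of the same pattern
(the list type and the two macros below are hypothetical stand-ins, not
Tarantool's rlist API): the plain foreach has to read the current entry
to advance, so unlinking or freeing that entry inside the loop body
breaks the iteration, while the safe variant remembers the successor
before the body runs.

    /*
     * Hypothetical stand-in for the rlist pattern, not Tarantool code:
     * list_foreach() reads it->next after the body may have freed it,
     * while list_foreach_safe() caches the successor in tmp first.
     */
    #include <stdio.h>
    #include <stdlib.h>

    struct node {
        int id;
        struct node *next;
    };

    /* Unsafe: advancing dereferences an entry the body may have removed. */
    #define list_foreach(it, head) \
        for ((it) = (head); (it) != NULL; (it) = (it)->next)

    /* Safe: the successor is saved in tmp before the body runs. */
    #define list_foreach_safe(it, head, tmp) \
        for ((it) = (head); (it) != NULL && (((tmp) = (it)->next), 1); \
             (it) = (tmp))

    int main(void)
    {
        /* Build the list 1 -> 2 -> 3. */
        struct node *head = NULL;
        for (int id = 3; id >= 1; id--) {
            struct node *n = malloc(sizeof(*n));
            n->id = id;
            n->next = head;
            head = n;
        }
        /*
         * Drop every entry while iterating, similar to a replica leaving
         * the anon list right inside the loop body in replicaset_follow().
         */
        struct node *it, *tmp;
        list_foreach_safe(it, head, tmp) {
            printf("visiting %d\n", it->id);
            /* With list_foreach() the next advance would read freed memory. */
            free(it);
        }
        return 0;
    }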

Part of #4586
---
 src/box/replication.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/box/replication.cc b/src/box/replication.cc
index 509187cd3..495a01c4f 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -800,7 +800,8 @@ replicaset_follow(void)
 		if (replica->applier != NULL)
 			applier_resume(replica->applier);
 	}
-	rlist_foreach_entry(replica, &replicaset.anon, in_anon) {
+	struct replica *tmp;
+	rlist_foreach_entry_safe(replica, &replicaset.anon, in_anon, tmp) {
 		/* Restart appliers that failed to connect. */
 		applier_start(replica->applier);
 	}
-- 
2.17.1

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Tarantool-patches] [PATCH 2/2] test: stabilize quorum test conditions
  2019-11-18 13:19 [Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test Ilya Kosarev
  2019-11-18 13:19 ` [Tarantool-patches] [PATCH 1/2] replication: make anon replicas iteration safe Ilya Kosarev
@ 2019-11-18 13:19 ` Ilya Kosarev
  2019-11-19 23:13 ` [Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test Vladislav Shpilevoy
  2 siblings, 0 replies; 12+ messages in thread
From: Ilya Kosarev @ 2019-11-18 13:19 UTC (permalink / raw)
  To: tarantool-patches

Some pass conditions in the quorum test could take a while to be
satisfied. They are now wrapped in test_run:wait_cond, which polls the
wrapped function until it returns true or the timeout expires, to make
the test stable.

Closes #4586
---
 test/replication/quorum.result   | 28 +++++++++++++++++++---------
 test/replication/quorum.test.lua | 16 +++++++++-------
 2 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/test/replication/quorum.result b/test/replication/quorum.result
index ff5fa0150..73def113f 100644
--- a/test/replication/quorum.result
+++ b/test/replication/quorum.result
@@ -115,15 +115,15 @@ box.info.status -- running
 - running
 ...
 -- Check that the replica follows all masters.
-box.info.id == 1 or box.info.replication[1].upstream.status == 'follow'
+box.info.id == 1 or test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end, 20)
 ---
 - true
 ...
-box.info.id == 2 or box.info.replication[2].upstream.status == 'follow'
+box.info.id == 2 or test_run:wait_cond(function() return box.info.replication[2].upstream.status == 'follow' end, 20)
 ---
 - true
 ...
-box.info.id == 3 or box.info.replication[3].upstream.status == 'follow'
+box.info.id == 3 or test_run:wait_cond(function() return box.info.replication[3].upstream.status == 'follow' end, 20)
 ---
 - true
 ...
@@ -149,6 +149,10 @@ test_run:cmd('stop server quorum1')
 ---
 - true
 ...
+test_run:wait_cond(function() return box.space.test.index.primary ~= nil end, 20)
+---
+- true
+...
 for i = 1, 100 do box.space.test:insert{i} end
 ---
 ...
@@ -166,9 +170,9 @@ test_run:cmd('switch quorum1')
 ---
 - true
 ...
-box.space.test:count() -- 100
+test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
 ---
-- 100
+- true
 ...
 -- Rebootstrap one node of the cluster and check that others follow.
 -- Note, due to ERRINJ_RELAY_TIMEOUT there is a substantial delay
@@ -197,9 +201,9 @@ test_run:cmd('switch quorum1')
 - true
 ...
 test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
-box.space.test:count() -- 100
+test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
 ---
-- 100
+- true
 ...
 -- The rebootstrapped replica will be assigned id = 4,
 -- because ids 1..3 are busy.
@@ -210,8 +214,9 @@ test_run:cmd('switch quorum2')
 fiber = require('fiber')
 ---
 ...
-while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
+test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
 ---
+- true
 ...
 box.info.replication[4].upstream.status
 ---
@@ -224,8 +229,13 @@ test_run:cmd('switch quorum3')
 fiber = require('fiber')
 ---
 ...
-while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
+test_run:wait_cond(function() return box.info.replication ~= nil end, 20)
 ---
+- true
+...
+test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
+---
+- true
 ...
 box.info.replication[4].upstream.status
 ---
diff --git a/test/replication/quorum.test.lua b/test/replication/quorum.test.lua
index 98febb367..1842cdffe 100644
--- a/test/replication/quorum.test.lua
+++ b/test/replication/quorum.test.lua
@@ -47,9 +47,9 @@ box.info.ro -- false
 box.info.status -- running
 
 -- Check that the replica follows all masters.
-box.info.id == 1 or box.info.replication[1].upstream.status == 'follow'
-box.info.id == 2 or box.info.replication[2].upstream.status == 'follow'
-box.info.id == 3 or box.info.replication[3].upstream.status == 'follow'
+box.info.id == 1 or test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end, 20)
+box.info.id == 2 or test_run:wait_cond(function() return box.info.replication[2].upstream.status == 'follow' end, 20)
+box.info.id == 3 or test_run:wait_cond(function() return box.info.replication[3].upstream.status == 'follow' end, 20)
 
 -- Check that box.cfg() doesn't return until the instance
 -- catches up with all configured replicas.
@@ -59,13 +59,14 @@ test_run:cmd('switch quorum2')
 box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0.001)
 test_run:cmd('stop server quorum1')
 
+test_run:wait_cond(function() return box.space.test.index.primary ~= nil end, 20)
 for i = 1, 100 do box.space.test:insert{i} end
 fiber = require('fiber')
 fiber.sleep(0.1)
 
 test_run:cmd('start server quorum1 with args="0.1  0.5"')
 test_run:cmd('switch quorum1')
-box.space.test:count() -- 100
+test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
 
 -- Rebootstrap one node of the cluster and check that others follow.
 -- Note, due to ERRINJ_RELAY_TIMEOUT there is a substantial delay
@@ -81,17 +82,18 @@ box.snapshot()
 test_run:cmd('switch quorum1')
 test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
 
-box.space.test:count() -- 100
+test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
 
 -- The rebootstrapped replica will be assigned id = 4,
 -- because ids 1..3 are busy.
 test_run:cmd('switch quorum2')
 fiber = require('fiber')
-while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
+test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
 box.info.replication[4].upstream.status
 test_run:cmd('switch quorum3')
 fiber = require('fiber')
-while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
+test_run:wait_cond(function() return box.info.replication ~= nil end, 20)
+test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
 box.info.replication[4].upstream.status
 
 -- Cleanup.
-- 
2.17.1

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test
  2019-11-18 13:19 [Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test Ilya Kosarev
  2019-11-18 13:19 ` [Tarantool-patches] [PATCH 1/2] replication: make anon replicas iteration safe Ilya Kosarev
  2019-11-18 13:19 ` [Tarantool-patches] [PATCH 2/2] test: stabilize quorum test conditions Ilya Kosarev
@ 2019-11-19 23:13 ` Vladislav Shpilevoy
  2019-11-20  1:29   ` Ilya Kosarev
  2 siblings, 1 reply; 12+ messages in thread
From: Vladislav Shpilevoy @ 2019-11-19 23:13 UTC (permalink / raw)
  To: Ilya Kosarev, tarantool-patches

Hi! Thanks for the patch!

The commits LGTM. But it looks like there are more problems. I tried on
the branch:

    python test-run.py replication/quorum. replication/quorum. replication/quorum. replication/quorum. replication/quorum. --conf memtx

Got a crash one time, and wrong results another time. But overall
the test works more stably now, IMO.

For the crash I have a core file, but it is not for gdb.
I can send it to you, or extract any info you need.

In summary, I think we can't write 'Closes' in the
last commit yet. We need to improve these commits, or
add more fixes on top.

=====================================================================================

Crash (looks really strange, may be an independent bug):

(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff688be2c6 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff68973bf1 libsystem_pthread.dylib`pthread_kill + 284
    frame #2: 0x00007fff688286a6 libsystem_c.dylib`abort + 127
    frame #3: 0x00000001026fd136 tarantool`sig_fatal_cb(signo=11, siginfo=0x000000010557fa38, context=0x000000010557faa0) at main.cc:300:2
    frame #4: 0x00007fff68968b5d libsystem_platform.dylib`_sigtramp + 29
    frame #5: 0x00007fff6896f138 libsystem_pthread.dylib`pthread_mutex_lock + 1
    frame #6: 0x0000000102b0600a tarantool`etp_submit(user=0x00007fc8500072c0, req=0x00007fc851901340) at etp.c:533:7
    frame #7: 0x0000000102b05eca tarantool`eio_submit(req=0x00007fc851901340) at eio.c:482:3
    frame #8: 0x000000010288d004 tarantool`coio_task_execute(task=0x00007fc851901340, timeout=15768000000) at coio_task.c:245:2
    frame #9: 0x000000010288da2b tarantool`coio_getaddrinfo(host="localhost", port="56816", hints=0x000000010557fd98, res=0x000000010557fd90, timeout=15768000000) at coio_task.c:412:6
    frame #10: 0x000000010285d787 tarantool`lbox_socket_getaddrinfo(L=0x00000001051ae590) at socket.c:814:12
    frame #11: 0x00000001028b932d tarantool`lj_BC_FUNCC + 68
    frame #12: 0x00000001028e79ae tarantool`lua_pcall(L=0x00000001051ae590, nargs=3, nresults=-1, errfunc=0) at lj_api.c:1139:12
    frame #13: 0x000000010285af73 tarantool`luaT_call(L=0x00000001051ae590, nargs=3, nreturns=-1) at utils.c:1036:6
    frame #14: 0x00000001028532c6 tarantool`lua_fiber_run_f(ap=0x00000001054003e8) at fiber.c:433:11
    frame #15: 0x00000001026fc91a tarantool`fiber_cxx_invoke(f=(tarantool`lua_fiber_run_f at fiber.c:427), ap=0x00000001054003e8)(__va_list_tag*), __va_list_tag*) at fiber.h:742:10
    frame #16: 0x000000010287765b tarantool`fiber_loop(data=0x0000000000000000) at fiber.c:737:18
    frame #17: 0x0000000102b0c787 tarantool`coro_init at coro.c:110:3
(lldb) 

=====================================================================================

Wrong results:

[005] replication/quorum.test.lua                     memtx           [ pass ]
[002] replication/quorum.test.lua                     memtx           [ fail ]
[002] 
[002] Test failed! Result content mismatch:
[002] --- replication/quorum.result	Tue Nov 19 23:57:26 2019
[002] +++ replication/quorum.reject	Wed Nov 20 00:00:20 2019
[002] @@ -42,7 +42,8 @@
[002]  ...
[002]  box.space.test:replace{100} -- error
[002]  ---
[002] -- error: Can't modify data because this instance is in read-only mode.
[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
[002] +    field ''test'' (a nil value)'
[002]  ...
[002]  box.cfg{replication={}}
[002]  ---
[002] @@ -66,7 +67,8 @@
[002]  ...
[002]  box.space.test:replace{100} -- error
[002]  ---
[002] -- error: Can't modify data because this instance is in read-only mode.
[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
[002] +    field ''test'' (a nil value)'
[002]  ...
[002]  box.cfg{replication_connect_quorum = 2}
[002]  ---
[002] @@ -97,7 +99,8 @@
[002]  ...
[002]  box.space.test:replace{100} -- error
[002]  ---
[002] -- error: Can't modify data because this instance is in read-only mode.
[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
[002] +    field ''test'' (a nil value)'
[002]  ...
[002]  test_run:cmd('start server quorum1 with args="0.1 0.5"')
[002]  ---
[002] @@ -151,10 +154,13 @@
[002]  ...
[002]  test_run:wait_cond(function() return box.space.test.index.primary ~= nil end, 20)
[002]  ---
[002] -- true
[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
[002] +    index field ''test'' (a nil value)'
[002]  ...
[002]  for i = 1, 100 do box.space.test:insert{i} end
[002]  ---
[002] +- error: '[string "for i = 1, 100 do box.space.test:insert{i} end "]:1: attempt to
[002] +    index field ''test'' (a nil value)'
[002]  ...
[002]  fiber = require('fiber')
[002]  ---
[002] @@ -172,7 +178,8 @@
[002]  ...
[002]  test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
[002]  ---
[002] -- true
[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
[002] +    index field ''test'' (a nil value)'
[002]  ...
[002]  -- Rebootstrap one node of the cluster and check that others follow.
[002]  -- Note, due to ERRINJ_RELAY_TIMEOUT there is a substantial delay
[002] @@ -203,7 +210,8 @@
[002]  test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
[002]  test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
[002]  ---
[002] -- true
[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
[002] +    index field ''test'' (a nil value)'
[002]  ...
[002]  -- The rebootstrapped replica will be assigned id = 4,
[002]  -- because ids 1..3 are busy.
[002] 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test
  2019-11-19 23:13 ` [Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test Vladislav Shpilevoy
@ 2019-11-20  1:29   ` Ilya Kosarev
  2019-11-20  1:41     ` Ilya Kosarev
  2019-11-20 22:18     ` Vladislav Shpilevoy
  0 siblings, 2 replies; 12+ messages in thread
From: Ilya Kosarev @ 2019-11-20  1:29 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

[-- Attachment #1: Type: text/plain, Size: 7527 bytes --]

Hi!

Thanks for your review.

Did you run the tests on exactly this patchset or on the branch
https://github.com/tarantool/tarantool/tree/i.kosarev/gh-4586-fix-quorum-test
which also contains "relay: fix join vclock obtainment in
relay_initial_join"?
That patch is not yet in master (and might not get there, since Georgy
has an alternative fix as part of sync replication), but it is vital
for the stability of the test.

On my machine (Ubuntu 18.04.3 LTS) the quorum test works perfectly on
the mentioned branch. I use the following bash one-liner to run it
under 'load':
l=0 ; while ./test-run.py -j20 `for r in {1..64} ; do echo quorum ; done` 2>/dev/null ; do l=$(($l+1)) ; echo ======== $l ============= ; done

Anyway, I guess the reported problems are not connected with the
join_vclock patch but are macOS-specific, since I can't reproduce them
locally. I believe we have some macOS machines; I will ask for access.

It seems to me that the wrong results problem is quite easy to handle.
I have no idea for now how to handle the reported segfault; however, I
am sure it has nothing to do with the segfault mentioned in the issue,
since that one was caused by the unsafe iteration of anon replicas. The
other wrong results problems mentioned in the ticket are also handled
by this patchset.

Therefore I propose to close this ticket with the provided patchset,
although there are some other problems. Then I will open a new issue
with the error info you provided and start working on it as soon as I
get remote access to a macOS machine.

>Wednesday, November 20, 2019, 2:06 +03:00 from Vladislav Shpilevoy <v.shpilevoy@tarantool.org>:
>
>Hi! Thanks for the patch!
>
>The commits LGTM. But looks like there are more problems. I tried on
>the branch:
>
>    python test-run.py replication/quorum. replication/quorum. replication/quorum. replication/quorum. replication/quorum. --conf memtx
>
>Got a crash one time, and wrong results other time. But overall
>the test works more stable now, IMO.
>
>For the crash I have a core file. But it is not for gdb.
>I can send it to you, or extract any info if you need.
>
>On the summary, I think we can't write 'Closes' in the
>last commit yet. We need to improve these commits, or
>add more fixes on the top.
>
>=====================================================================================
>
>Crash (looks really strange, may be an independent bug):
>
>(lldb) bt
>* thread #1, stop reason = signal SIGSTOP
>  * frame #0: 0x00007fff688be2c6 libsystem_kernel.dylib`__pthread_kill + 10
>    frame #1: 0x00007fff68973bf1 libsystem_pthread.dylib`pthread_kill + 284
>    frame #2: 0x00007fff688286a6 libsystem_c.dylib`abort + 127
>    frame #3: 0x00000001026fd136 tarantool`sig_fatal_cb(signo=11, siginfo=0x000000010557fa38, context=0x000000010557faa0) at main.cc:300:2
>    frame #4: 0x00007fff68968b5d libsystem_platform.dylib`_sigtramp + 29
>    frame #5: 0x00007fff6896f138 libsystem_pthread.dylib`pthread_mutex_lock + 1
>    frame #6: 0x0000000102b0600a tarantool`etp_submit(user=0x00007fc8500072c0, req=0x00007fc851901340) at etp.c:533:7
>    frame #7: 0x0000000102b05eca tarantool`eio_submit(req=0x00007fc851901340) at eio.c:482:3
>    frame #8: 0x000000010288d004 tarantool`coio_task_execute(task=0x00007fc851901340, timeout=15768000000) at coio_task.c:245:2
>    frame #9: 0x000000010288da2b tarantool`coio_getaddrinfo(host="localhost", port="56816", hints=0x000000010557fd98, res=0x000000010557fd90, timeout=15768000000) at coio_task.c:412:6
>    frame #10: 0x000000010285d787 tarantool`lbox_socket_getaddrinfo(L=0x00000001051ae590) at socket.c:814:12
>    frame #11: 0x00000001028b932d tarantool`lj_BC_FUNCC + 68
>    frame #12: 0x00000001028e79ae tarantool`lua_pcall(L=0x00000001051ae590, nargs=3, nresults=-1, errfunc=0) at lj_api.c:1139:12
>    frame #13: 0x000000010285af73 tarantool`luaT_call(L=0x00000001051ae590, nargs=3, nreturns=-1) at utils.c:1036:6
>    frame #14: 0x00000001028532c6 tarantool`lua_fiber_run_f(ap=0x00000001054003e8) at fiber.c:433:11
>    frame #15: 0x00000001026fc91a tarantool`fiber_cxx_invoke(f=(tarantool`lua_fiber_run_f at fiber.c:427), ap=0x00000001054003e8)(__va_list_tag*), __va_list_tag*) at fiber.h:742:10
>    frame #16: 0x000000010287765b tarantool`fiber_loop(data=0x0000000000000000) at fiber.c:737:18
>    frame #17: 0x0000000102b0c787 tarantool`coro_init at coro.c:110:3
>(lldb) 
>
>=====================================================================================
>
>Wrong results:
>
>[005] replication/quorum.test.lua                     memtx           [ pass ]
>[002] replication/quorum.test.lua                     memtx           [ fail ]
>[002] 
>[002] Test failed! Result content mismatch:
>[002] --- replication/quorum.result	Tue Nov 19 23:57:26 2019
>[002] +++ replication/quorum.reject	Wed Nov 20 00:00:20 2019
>[002] @@ -42,7 +42,8 @@
>[002]  ...
>[002]  box.space.test:replace{100} -- error
>[002]  ---
>[002] -- error: Can't modify data because this instance is in read-only mode.
>[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
>[002] +    field ''test'' (a nil value)'
>[002]  ...
>[002]  box.cfg{replication={}}
>[002]  ---
>[002] @@ -66,7 +67,8 @@
>[002]  ...
>[002]  box.space.test:replace{100} -- error
>[002]  ---
>[002] -- error: Can't modify data because this instance is in read-only mode.
>[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
>[002] +    field ''test'' (a nil value)'
>[002]  ...
>[002]  box.cfg{replication_connect_quorum = 2}
>[002]  ---
>[002] @@ -97,7 +99,8 @@
>[002]  ...
>[002]  box.space.test:replace{100} -- error
>[002]  ---
>[002] -- error: Can't modify data because this instance is in read-only mode.
>[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
>[002] +    field ''test'' (a nil value)'
>[002]  ...
>[002]  test_run:cmd('start server quorum1 with args="0.1 0.5"')
>[002]  ---
>[002] @@ -151,10 +154,13 @@
>[002]  ...
>[002]  test_run:wait_cond(function() return box.space.test.index.primary ~= nil end, 20)
>[002]  ---
>[002] -- true
>[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
>[002] +    index field ''test'' (a nil value)'
>[002]  ...
>[002]  for i = 1, 100 do box.space.test:insert{i} end
>[002]  ---
>[002] +- error: '[string "for i = 1, 100 do box.space.test:insert{i} end "]:1: attempt to
>[002] +    index field ''test'' (a nil value)'
>[002]  ...
>[002]  fiber = require('fiber')
>[002]  ---
>[002] @@ -172,7 +178,8 @@
>[002]  ...
>[002]  test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
>[002]  ---
>[002] -- true
>[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
>[002] +    index field ''test'' (a nil value)'
>[002]  ...
>[002]  -- Rebootstrap one node of the cluster and check that others follow.
>[002]  -- Note, due to ERRINJ_RELAY_TIMEOUT there is a substantial delay
>[002] @@ -203,7 +210,8 @@
>[002]  test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
>[002]  test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
>[002]  ---
>[002] -- true
>[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
>[002] +    index field ''test'' (a nil value)'
>[002]  ...
>[002]  -- The rebootstrapped replica will be assigned id = 4,
>[002]  -- because ids 1..3 are busy.
>[002] 


-- 
Ilya Kosarev

[-- Attachment #2: Type: text/html, Size: 8858 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test
  2019-11-20  1:29   ` Ilya Kosarev
@ 2019-11-20  1:41     ` Ilya Kosarev
  2019-11-20 22:18     ` Vladislav Shpilevoy
  1 sibling, 0 replies; 12+ messages in thread
From: Ilya Kosarev @ 2019-11-20  1:41 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

[-- Attachment #1: Type: text/plain, Size: 7921 bytes --]

Sorry, initially I didn't notice that you ran the test on the branch.
Though my answer is still valid except for the first paragraph.


>Wednesday, November 20, 2019, 4:29 +03:00 from Ilya Kosarev <i.kosarev@tarantool.org>:
>
>Hi!
>
>Thanks for your review.
>
>Did u run tests on exactly this patchset or on the branch,
>https://github.com/tarantool/tarantool/tree/i.kosarev/gh-4586-fix-quorum-test
>which also contains  relay: fix join vclock obtainment in 
>relay_initial_join ?
>It is not yet checked in master (and might not get there, as far as
>Georgy has alternative fix as a part of sync replication), but
>is vital for the stability of the test.
>On my machine (Ubuntu 18.04.3 LTS) quorum test works perfectly on
>mentioned branch. I use next bash instruction to run it under 'load':
>l=0 ; while ./test-run.py -j20 `for r in {1..64} ; do echo quorum ; done` 2>/dev/null ; do l=$(($l+1)) ; echo ======== $l ============= ; done
>Anyway, I guess provided problems are not connected with join_vclock
>patch but are mac os specific, as far as i can't reproduce them locally.
>Guess we have some mac os machines, i will ask for access.
>It seems to me that wrong results problem is quite easy to handle.
>I have no idea for now how to handle provided segfault, however, i am
>sure it has nothing to do with the segfault mentioned in the issue, as
>far as it was caused by unsafe iteration of anon replicas. Other wrong
>results problems, mentioned in the ticket, are also handled in the
>patchset.
>Therefore i propose to close this ticket with the provided patchset
>although there are some other problems. Then i will open new issue with
>error info you provided and start to work on it as soon as i get remote
>access to some mac os machine.
>
>>Wednesday, November 20, 2019, 2:06 +03:00 from Vladislav Shpilevoy < v.shpilevoy@tarantool.org >:
>>
>>Hi! Thanks for the patch!
>>
>>The commits LGTM. But looks like there are more problems. I tried on
>>the branch:
>>
>>    python test-run.py replication/quorum. replication/quorum. replication/quorum. replication/quorum. replication/quorum. --conf memtx
>>
>>Got a crash one time, and wrong results other time. But overall
>>the test works more stable now, IMO.
>>
>>For the crash I have a core file. But it is not for gdb.
>>I can send it to you, or extract any info if you need.
>>
>>On the summary, I think we can't write 'Closes' in the
>>last commit yet. We need to improve these commits, or
>>add more fixes on the top.
>>
>>=====================================================================================
>>
>>Crash (looks really strange, may be an independent bug):
>>
>>(lldb) bt
>>* thread #1, stop reason = signal SIGSTOP
>>  * frame #0: 0x00007fff688be2c6 libsystem_kernel.dylib`__pthread_kill + 10
>>    frame #1: 0x00007fff68973bf1 libsystem_pthread.dylib`pthread_kill + 284
>>    frame #2: 0x00007fff688286a6 libsystem_c.dylib`abort + 127
>>    frame #3: 0x00000001026fd136 tarantool`sig_fatal_cb(signo=11, siginfo=0x000000010557fa38, context=0x000000010557faa0) at main.cc:300:2
>>    frame #4: 0x00007fff68968b5d libsystem_platform.dylib`_sigtramp + 29
>>    frame #5: 0x00007fff6896f138 libsystem_pthread.dylib`pthread_mutex_lock + 1
>>    frame #6: 0x0000000102b0600a tarantool`etp_submit(user=0x00007fc8500072c0, req=0x00007fc851901340) at etp.c:533:7
>>    frame #7: 0x0000000102b05eca tarantool`eio_submit(req=0x00007fc851901340) at eio.c:482:3
>>    frame #8: 0x000000010288d004 tarantool`coio_task_execute(task=0x00007fc851901340, timeout=15768000000) at coio_task.c:245:2
>>    frame #9: 0x000000010288da2b tarantool`coio_getaddrinfo(host="localhost", port="56816", hints=0x000000010557fd98, res=0x000000010557fd90, timeout=15768000000) at coio_task.c:412:6
>>    frame #10: 0x000000010285d787 tarantool`lbox_socket_getaddrinfo(L=0x00000001051ae590) at socket.c:814:12
>>    frame #11: 0x00000001028b932d tarantool`lj_BC_FUNCC + 68
>>    frame #12: 0x00000001028e79ae tarantool`lua_pcall(L=0x00000001051ae590, nargs=3, nresults=-1, errfunc=0) at lj_api.c:1139:12
>>    frame #13: 0x000000010285af73 tarantool`luaT_call(L=0x00000001051ae590, nargs=3, nreturns=-1) at utils.c:1036:6
>>    frame #14: 0x00000001028532c6 tarantool`lua_fiber_run_f(ap=0x00000001054003e8) at fiber.c:433:11
>>    frame #15: 0x00000001026fc91a tarantool`fiber_cxx_invoke(f=(tarantool`lua_fiber_run_f at fiber.c:427), ap=0x00000001054003e8)(__va_list_tag*), __va_list_tag*) at fiber.h:742:10
>>    frame #16: 0x000000010287765b tarantool`fiber_loop(data=0x0000000000000000) at fiber.c:737:18
>>    frame #17: 0x0000000102b0c787 tarantool`coro_init at coro.c:110:3
>>(lldb) 
>>
>>=====================================================================================
>>
>>Wrong results:
>>
>>[005] replication/quorum.test.lua                     memtx           [ pass ]
>>[002] replication/quorum.test.lua                     memtx           [ fail ]
>>[002] 
>>[002] Test failed! Result content mismatch:
>>[002] --- replication/quorum.result	Tue Nov 19 23:57:26 2019
>>[002] +++ replication/quorum.reject	Wed Nov 20 00:00:20 2019
>>[002] @@ -42,7 +42,8 @@
>>[002]  ...
>>[002]  box.space.test:replace{100} -- error
>>[002]  ---
>>[002] -- error: Can't modify data because this instance is in read-only mode.
>>[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
>>[002] +    field ''test'' (a nil value)'
>>[002]  ...
>>[002]  box.cfg{replication={}}
>>[002]  ---
>>[002] @@ -66,7 +67,8 @@
>>[002]  ...
>>[002]  box.space.test:replace{100} -- error
>>[002]  ---
>>[002] -- error: Can't modify data because this instance is in read-only mode.
>>[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
>>[002] +    field ''test'' (a nil value)'
>>[002]  ...
>>[002]  box.cfg{replication_connect_quorum = 2}
>>[002]  ---
>>[002] @@ -97,7 +99,8 @@
>>[002]  ...
>>[002]  box.space.test:replace{100} -- error
>>[002]  ---
>>[002] -- error: Can't modify data because this instance is in read-only mode.
>>[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
>>[002] +    field ''test'' (a nil value)'
>>[002]  ...
>>[002]  test_run:cmd('start server quorum1 with args="0.1 0.5"')
>>[002]  ---
>>[002] @@ -151,10 +154,13 @@
>>[002]  ...
>>[002]  test_run:wait_cond(function() return box.space.test.index.primary ~= nil end, 20)
>>[002]  ---
>>[002] -- true
>>[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
>>[002] +    index field ''test'' (a nil value)'
>>[002]  ...
>>[002]  for i = 1, 100 do box.space.test:insert{i} end
>>[002]  ---
>>[002] +- error: '[string "for i = 1, 100 do box.space.test:insert{i} end "]:1: attempt to
>>[002] +    index field ''test'' (a nil value)'
>>[002]  ...
>>[002]  fiber = require('fiber')
>>[002]  ---
>>[002] @@ -172,7 +178,8 @@
>>[002]  ...
>>[002]  test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
>>[002]  ---
>>[002] -- true
>>[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
>>[002] +    index field ''test'' (a nil value)'
>>[002]  ...
>>[002]  -- Rebootstrap one node of the cluster and check that others follow.
>>[002]  -- Note, due to ERRINJ_RELAY_TIMEOUT there is a substantial delay
>>[002] @@ -203,7 +210,8 @@
>>[002]  test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
>>[002]  test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
>>[002]  ---
>>[002] -- true
>>[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
>>[002] +    index field ''test'' (a nil value)'
>>[002]  ...
>>[002]  -- The rebootstrapped replica will be assigned id = 4,
>>[002]  -- because ids 1..3 are busy.
>>[002] 
>
>
>-- 
>Ilya Kosarev


-- 
Ilya Kosarev

[-- Attachment #2: Type: text/html, Size: 9799 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test
  2019-11-20  1:29   ` Ilya Kosarev
  2019-11-20  1:41     ` Ilya Kosarev
@ 2019-11-20 22:18     ` Vladislav Shpilevoy
  2019-11-20 22:48       ` Ilya Kosarev
  1 sibling, 1 reply; 12+ messages in thread
From: Vladislav Shpilevoy @ 2019-11-20 22:18 UTC (permalink / raw)
  To: Ilya Kosarev; +Cc: tarantool-patches

On 20/11/2019 02:29, Ilya Kosarev wrote:
> Hi!
> 
> Thanks for your review.
> 
> Did u run tests on exactly this patchset or on the branch,
> https://github.com/tarantool/tarantool/tree/i.kosarev/gh-4586-fix-quorum-test
> which also contains relay: fix join vclock obtainment in <https://github.com/tarantool/tarantool/commit/83fa385f2e82911dee1a9189f90b1a94c8455023>
> relay_initial_join <https://github.com/tarantool/tarantool/commit/83fa385f2e82911dee1a9189f90b1a94c8455023>?
> It is not yet checked in master (and might not get there, as far as
> Georgy has alternative fix as a part of sync replication), but
> is vital for the stability of the test.
> 
> On my machine (Ubuntu 18.04.3 LTS) quorum test works perfectly on
> mentioned branch. I use next bash instruction to run it under 'load':
> l=0 ; while ./test-run.py -j20 `for r in {1..64} ; do echo quorum ; done` 2>/dev/null ; do l=$(($l+1)) ; echo ======== $l ============= ; done

I ran it on your branch, and it fails. Even without these
complex bash scripts. I ran it 5-10 times, and it fails/crashes.

> 
> Anyway, I guess provided problems are not connected with join_vclock
> patch but are mac os specific, as far as i can't reproduce them locally.
> Guess we have some mac os machines, i will ask for access.

Maybe they are not related. But at least the wrong test results mean
that the ticket can't be closed. Because the ticket is called:

    test: flaky segfault on replication/quorum test under high load

And it is still flaky.

For the crash you can open a new ticket.

> 
> It seems to me that wrong results problem is quite easy to handle.
> I have no idea for now how to handle provided segfault, however, i am
> sure it has nothing to do with the segfault mentioned in the issue, as
> far as it was caused by unsafe iteration of anon replicas. Other wrong
> results problems, mentioned in the ticket, are also handled in the
> patchset.

The problems mentioned in the ticket are just comments. Helpers.
The ticket is about flakiness, not about several concrete fails. If the
test is still flaky, what is the point of closing this issue and opening
exactly the same one again?

> 
> Therefore i propose to close this ticket with the provided patchset
> although there are some other problems. Then i will open new issue with
> error info you provided and start to work on it as soon as i get remote
> access to some mac os machine

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test
  2019-11-20 22:18     ` Vladislav Shpilevoy
@ 2019-11-20 22:48       ` Ilya Kosarev
  0 siblings, 0 replies; 12+ messages in thread
From: Ilya Kosarev @ 2019-11-20 22:48 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

[-- Attachment #1: Type: text/plain, Size: 2663 bytes --]


Thanks for the answer.

Sent the patch as 'Part of' instead of 'Closes'.

>Thursday, November 21, 2019, 1:11 +03:00 from Vladislav Shpilevoy < v.shpilevoy@tarantool.org >:
>
>On 20/11/2019 02:29, Ilya Kosarev wrote:
>> Hi!
>> 
>> Thanks for your review.
>> 
>> Did u run tests on exactly this patchset or on the branch,
>>  https://github.com/tarantool/tarantool/tree/i.kosarev/gh-4586-fix-quorum-test
>> which also contains relay: fix join vclock obtainment in < https://github.com/tarantool/tarantool/commit/83fa385f2e82911dee1a9189f90b1a94c8455023 >
>> relay_initial_join < https://github.com/tarantool/tarantool/commit/83fa385f2e82911dee1a9189f90b1a94c8455023 >?
>> It is not yet checked in master (and might not get there, as far as
>> Georgy has alternative fix as a part of sync replication), but
>> is vital for the stability of the test.
>> 
>> On my machine (Ubuntu 18.04.3 LTS) quorum test works perfectly on
>> mentioned branch. I use next bash instruction to run it under 'load':
>> l=0 ; while ./test-run.py -j20 `for r in {1..64} ; do echo quorum ; done` 2>/dev/null ; do l=$(($l+1)) ; echo ======== $l ============= ; done
>I run it on your branch, and it fails. Even without these
>complex bash scripts. I run it 5-10 times, and it fails/crashes.
>
>> 
>> Anyway, I guess provided problems are not connected with join_vclock
>> patch but are mac os specific, as far as i can't reproduce them locally.
>> Guess we have some mac os machines, i will ask for access.
>
>Maybe they are not related. But at least the wrong test results mean,
>that the ticket can't be closed. Because ticket is called:
>
>    test: flaky segfault on replication/quorum test under high load
>
>And it is still flaky.
>
>For the crash you can open a new ticket.
>
>> 
>> It seems to me that wrong results problem is quite easy to handle.
>> I have no idea for now how to handle provided segfault, however, i am
>> sure it has nothing to do with the segfault mentioned in the issue, as
>> far as it was caused by unsafe iteration of anon replicas. Other wrong
>> results problems, mentioned in the ticket, are also handled in the
>> patchset.
>
>The problems, mentioned in the ticket, are just comments. Helpers.
>The ticket is about flakiness. Not about several concrete fails. If the
>test is still flaky, what is a point of closing this issue and opening
>exactly the same again?
>
>> 
>> Therefore i propose to close this ticket with the provided patchset
>> although there are some other problems. Then i will open new issue with
>> error info you provided and start to work on it as soon as i get remote
>> access to some mac os machine


-- 
Ilya Kosarev

[-- Attachment #2: Type: text/html, Size: 4181 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH 2/2] test: stabilize quorum test conditions
  2019-11-20 23:07   ` Vladislav Shpilevoy
@ 2019-11-21  0:12     ` Ilya Kosarev
  0 siblings, 0 replies; 12+ messages in thread
From: Ilya Kosarev @ 2019-11-21  0:12 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

[-- Attachment #1: Type: text/plain, Size: 6651 bytes --]


I am going to fix it as soon as I am able to test it, since it does not
seem convenient to fix something I can't reproduce. For now I have just
brought the patch up to date.
>Thursday, November 21, 2019, 2:01 +03:00 from Vladislav Shpilevoy <v.shpilevoy@tarantool.org>:
>
>Hm. So you are not going to fix the flaky error I mentioned
>in the previous thread about this commit?
>
>Seems like it is also about 'conditions which need time to be
>satisfied'.
>
>On 20/11/2019 23:47, Ilya Kosarev wrote:
>> There were some pass conditions in quorum test which could take some
>> time to be satisfied. Now they are wrapped using test_run:wait_cond to
>> make the test stable.
>> 
>> Part of #4586
>> ---
>>  test/replication/quorum.result   | 30 +++++++++++++++++-------------
>>  test/replication/quorum.test.lua | 18 +++++++++---------
>>  2 files changed, 26 insertions(+), 22 deletions(-)
>> 
>> diff --git a/test/replication/quorum.result b/test/replication/quorum.result
>> index ff5fa0150..12604c8de 100644
>> --- a/test/replication/quorum.result
>> +++ b/test/replication/quorum.result
>> @@ -115,15 +115,15 @@ box.info.status -- running
>>  - running
>>  ...
>>  -- Check that the replica follows all masters.
>> -box.info.id == 1 or box.info.replication[1].upstream.status == 'follow'
>> +box.info.id == 1 or test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end, 20)
>>  ---
>>  - true
>>  ...
>> -box.info.id == 2 or box.info.replication[2].upstream.status == 'follow'
>> +box.info.id == 2 or test_run:wait_cond(function() return box.info.replication[2].upstream.status == 'follow' end, 20)
>>  ---
>>  - true
>>  ...
>> -box.info.id == 3 or box.info.replication[3].upstream.status == 'follow'
>> +box.info.id == 3 or test_run:wait_cond(function() return box.info.replication[3].upstream.status == 'follow' end, 20)
>>  ---
>>  - true
>>  ...
>> @@ -149,6 +149,10 @@ test_run:cmd('stop server quorum1')
>>  ---
>>  - true
>>  ...
>> +test_run:wait_cond(function() return box.space.test.index.primary ~= nil end, 20)
>> +---
>> +- true
>> +...
>>  for i = 1, 100 do box.space.test:insert{i} end
>>  ---
>>  ...
>> @@ -166,9 +170,9 @@ test_run:cmd('switch quorum1')
>>  ---
>>  - true
>>  ...
>> -box.space.test:count() -- 100
>> +test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
>>  ---
>> -- 100
>> +- true
>>  ...
>>  -- Rebootstrap one node of the cluster and check that others follow.
>>  -- Note, due to ERRINJ_RELAY_TIMEOUT there is a substantial delay
>> @@ -197,9 +201,9 @@ test_run:cmd('switch quorum1')
>>  - true
>>  ...
>>  test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
>> -box.space.test:count() -- 100
>> +test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
>>  ---
>> -- 100
>> +- true
>>  ...
>>  -- The rebootstrapped replica will be assigned id = 4,
>>  -- because ids 1..3 are busy.
>> @@ -207,11 +211,9 @@ test_run:cmd('switch quorum2')
>>  ---
>>  - true
>>  ...
>> -fiber = require('fiber')
>> ----
>> -...
>> -while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
>> +test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
>>  ---
>> +- true
>>  ...
>>  box.info.replication[4].upstream.status
>>  ---
>> @@ -221,11 +223,13 @@ test_run:cmd('switch quorum3')
>>  ---
>>  - true
>>  ...
>> -fiber = require('fiber')
>> +test_run:wait_cond(function() return box.info.replication ~= nil end, 20)
>>  ---
>> +- true
>>  ...
>> -while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
>> +test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
>>  ---
>> +- true
>>  ...
>>  box.info.replication[4].upstream.status
>>  ---
>> diff --git a/test/replication/quorum.test.lua b/test/replication/quorum.test.lua
>> index 98febb367..be23200d3 100644
>> --- a/test/replication/quorum.test.lua
>> +++ b/test/replication/quorum.test.lua
>> @@ -47,9 +47,9 @@ box.info.ro -- false
>>  box.info.status -- running
>> 
>>  -- Check that the replica follows all masters.
>> -box.info.id == 1 or box.info.replication[1].upstream.status == 'follow'
>> -box.info.id == 2 or box.info.replication[2].upstream.status == 'follow'
>> -box.info.id == 3 or box.info.replication[3].upstream.status == 'follow'
>> +box.info.id == 1 or test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end, 20)
>> +box.info.id == 2 or test_run:wait_cond(function() return box.info.replication[2].upstream.status == 'follow' end, 20)
>> +box.info.id == 3 or test_run:wait_cond(function() return box.info.replication[3].upstream.status == 'follow' end, 20)
>> 
>>  -- Check that box.cfg() doesn't return until the instance
>>  -- catches up with all configured replicas.
>> @@ -59,13 +59,14 @@ test_run:cmd('switch quorum2')
>>  box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0.001)
>>  test_run:cmd('stop server quorum1')
>> 
>> +test_run:wait_cond(function() return box.space.test.index.primary ~= nil end, 20)
>>  for i = 1, 100 do box.space.test:insert{i} end
>>  fiber = require('fiber')
>>  fiber.sleep(0.1)
>> 
>>  test_run:cmd('start server quorum1 with args="0.1  0.5"')
>>  test_run:cmd('switch quorum1')
>> -box.space.test:count() -- 100
>> +test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
>> 
>>  -- Rebootstrap one node of the cluster and check that others follow.
>>  -- Note, due to ERRINJ_RELAY_TIMEOUT there is a substantial delay
>> @@ -81,17 +82,16 @@ box.snapshot()
>>  test_run:cmd('switch quorum1')
>>  test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
>> 
>> -box.space.test:count() -- 100
>> +test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
>> 
>>  -- The rebootstrapped replica will be assigned id = 4,
>>  -- because ids 1..3 are busy.
>>  test_run:cmd('switch quorum2')
>> -fiber = require('fiber')
>> -while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
>> +test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
>>  box.info.replication[4].upstream.status
>>  test_run:cmd('switch quorum3')
>> -fiber = require('fiber')
>> -while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
>> +test_run:wait_cond(function() return box.info.replication ~= nil end, 20)
>> +test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
>>  box.info.replication[4].upstream.status
>> 
>>  -- Cleanup.
>> 


-- 
Ilya Kosarev

[-- Attachment #2: Type: text/html, Size: 7991 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH 2/2] test: stabilize quorum test conditions
  2019-11-20 22:47 ` Ilya Kosarev
@ 2019-11-20 23:07   ` Vladislav Shpilevoy
  2019-11-21  0:12     ` Ilya Kosarev
  0 siblings, 1 reply; 12+ messages in thread
From: Vladislav Shpilevoy @ 2019-11-20 23:07 UTC (permalink / raw)
  To: Ilya Kosarev, tarantool-patches

Hm. So you are not going to fix the flaky error I mentioned
in the previous thread about this commit?

Seems like it is also about 'conditions which need time to be
satisfied'.

On 20/11/2019 23:47, Ilya Kosarev wrote:
> There were some pass conditions in quorum test which could take some
> time to be satisfied. Now they are wrapped using test_run:wait_cond to
> make the test stable.
> 
> Part of #4586
> ---
>  test/replication/quorum.result   | 30 +++++++++++++++++-------------
>  test/replication/quorum.test.lua | 18 +++++++++---------
>  2 files changed, 26 insertions(+), 22 deletions(-)
> 
> diff --git a/test/replication/quorum.result b/test/replication/quorum.result
> index ff5fa0150..12604c8de 100644
> --- a/test/replication/quorum.result
> +++ b/test/replication/quorum.result
> @@ -115,15 +115,15 @@ box.info.status -- running
>  - running
>  ...
>  -- Check that the replica follows all masters.
> -box.info.id == 1 or box.info.replication[1].upstream.status == 'follow'
> +box.info.id == 1 or test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end, 20)
>  ---
>  - true
>  ...
> -box.info.id == 2 or box.info.replication[2].upstream.status == 'follow'
> +box.info.id == 2 or test_run:wait_cond(function() return box.info.replication[2].upstream.status == 'follow' end, 20)
>  ---
>  - true
>  ...
> -box.info.id == 3 or box.info.replication[3].upstream.status == 'follow'
> +box.info.id == 3 or test_run:wait_cond(function() return box.info.replication[3].upstream.status == 'follow' end, 20)
>  ---
>  - true
>  ...
> @@ -149,6 +149,10 @@ test_run:cmd('stop server quorum1')
>  ---
>  - true
>  ...
> +test_run:wait_cond(function() return box.space.test.index.primary ~= nil end, 20)
> +---
> +- true
> +...
>  for i = 1, 100 do box.space.test:insert{i} end
>  ---
>  ...
> @@ -166,9 +170,9 @@ test_run:cmd('switch quorum1')
>  ---
>  - true
>  ...
> -box.space.test:count() -- 100
> +test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
>  ---
> -- 100
> +- true
>  ...
>  -- Rebootstrap one node of the cluster and check that others follow.
>  -- Note, due to ERRINJ_RELAY_TIMEOUT there is a substantial delay
> @@ -197,9 +201,9 @@ test_run:cmd('switch quorum1')
>  - true
>  ...
>  test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
> -box.space.test:count() -- 100
> +test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
>  ---
> -- 100
> +- true
>  ...
>  -- The rebootstrapped replica will be assigned id = 4,
>  -- because ids 1..3 are busy.
> @@ -207,11 +211,9 @@ test_run:cmd('switch quorum2')
>  ---
>  - true
>  ...
> -fiber = require('fiber')
> ----
> -...
> -while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
> +test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
>  ---
> +- true
>  ...
>  box.info.replication[4].upstream.status
>  ---
> @@ -221,11 +223,13 @@ test_run:cmd('switch quorum3')
>  ---
>  - true
>  ...
> -fiber = require('fiber')
> +test_run:wait_cond(function() return box.info.replication ~= nil end, 20)
>  ---
> +- true
>  ...
> -while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
> +test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
>  ---
> +- true
>  ...
>  box.info.replication[4].upstream.status
>  ---
> diff --git a/test/replication/quorum.test.lua b/test/replication/quorum.test.lua
> index 98febb367..be23200d3 100644
> --- a/test/replication/quorum.test.lua
> +++ b/test/replication/quorum.test.lua
> @@ -47,9 +47,9 @@ box.info.ro -- false
>  box.info.status -- running
>  
>  -- Check that the replica follows all masters.
> -box.info.id == 1 or box.info.replication[1].upstream.status == 'follow'
> -box.info.id == 2 or box.info.replication[2].upstream.status == 'follow'
> -box.info.id == 3 or box.info.replication[3].upstream.status == 'follow'
> +box.info.id == 1 or test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end, 20)
> +box.info.id == 2 or test_run:wait_cond(function() return box.info.replication[2].upstream.status == 'follow' end, 20)
> +box.info.id == 3 or test_run:wait_cond(function() return box.info.replication[3].upstream.status == 'follow' end, 20)
>  
>  -- Check that box.cfg() doesn't return until the instance
>  -- catches up with all configured replicas.
> @@ -59,13 +59,14 @@ test_run:cmd('switch quorum2')
>  box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0.001)
>  test_run:cmd('stop server quorum1')
>  
> +test_run:wait_cond(function() return box.space.test.index.primary ~= nil end, 20)
>  for i = 1, 100 do box.space.test:insert{i} end
>  fiber = require('fiber')
>  fiber.sleep(0.1)
>  
>  test_run:cmd('start server quorum1 with args="0.1  0.5"')
>  test_run:cmd('switch quorum1')
> -box.space.test:count() -- 100
> +test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
>  
>  -- Rebootstrap one node of the cluster and check that others follow.
>  -- Note, due to ERRINJ_RELAY_TIMEOUT there is a substantial delay
> @@ -81,17 +82,16 @@ box.snapshot()
>  test_run:cmd('switch quorum1')
>  test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
>  
> -box.space.test:count() -- 100
> +test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
>  
>  -- The rebootstrapped replica will be assigned id = 4,
>  -- because ids 1..3 are busy.
>  test_run:cmd('switch quorum2')
> -fiber = require('fiber')
> -while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
> +test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
>  box.info.replication[4].upstream.status
>  test_run:cmd('switch quorum3')
> -fiber = require('fiber')
> -while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
> +test_run:wait_cond(function() return box.info.replication ~= nil end, 20)
> +test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
>  box.info.replication[4].upstream.status
>  
>  -- Cleanup.
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Tarantool-patches] [PATCH 2/2] test: stabilize quorum test conditions
       [not found] <cover.1574290043.git.i.kosarev@tarantool.org>
@ 2019-11-20 22:47 ` Ilya Kosarev
  2019-11-20 23:07   ` Vladislav Shpilevoy
  0 siblings, 1 reply; 12+ messages in thread
From: Ilya Kosarev @ 2019-11-20 22:47 UTC (permalink / raw)
  To: tarantool-patches; +Cc: v.shpilevoy

Some pass conditions in the quorum test could take a while to be
satisfied. They are now wrapped in test_run:wait_cond to make the test
stable.

Part of #4586
---
 test/replication/quorum.result   | 30 +++++++++++++++++-------------
 test/replication/quorum.test.lua | 18 +++++++++---------
 2 files changed, 26 insertions(+), 22 deletions(-)

diff --git a/test/replication/quorum.result b/test/replication/quorum.result
index ff5fa0150..12604c8de 100644
--- a/test/replication/quorum.result
+++ b/test/replication/quorum.result
@@ -115,15 +115,15 @@ box.info.status -- running
 - running
 ...
 -- Check that the replica follows all masters.
-box.info.id == 1 or box.info.replication[1].upstream.status == 'follow'
+box.info.id == 1 or test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end, 20)
 ---
 - true
 ...
-box.info.id == 2 or box.info.replication[2].upstream.status == 'follow'
+box.info.id == 2 or test_run:wait_cond(function() return box.info.replication[2].upstream.status == 'follow' end, 20)
 ---
 - true
 ...
-box.info.id == 3 or box.info.replication[3].upstream.status == 'follow'
+box.info.id == 3 or test_run:wait_cond(function() return box.info.replication[3].upstream.status == 'follow' end, 20)
 ---
 - true
 ...
@@ -149,6 +149,10 @@ test_run:cmd('stop server quorum1')
 ---
 - true
 ...
+test_run:wait_cond(function() return box.space.test.index.primary ~= nil end, 20)
+---
+- true
+...
 for i = 1, 100 do box.space.test:insert{i} end
 ---
 ...
@@ -166,9 +170,9 @@ test_run:cmd('switch quorum1')
 ---
 - true
 ...
-box.space.test:count() -- 100
+test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
 ---
-- 100
+- true
 ...
 -- Rebootstrap one node of the cluster and check that others follow.
 -- Note, due to ERRINJ_RELAY_TIMEOUT there is a substantial delay
@@ -197,9 +201,9 @@ test_run:cmd('switch quorum1')
 - true
 ...
 test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
-box.space.test:count() -- 100
+test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
 ---
-- 100
+- true
 ...
 -- The rebootstrapped replica will be assigned id = 4,
 -- because ids 1..3 are busy.
@@ -207,11 +211,9 @@ test_run:cmd('switch quorum2')
 ---
 - true
 ...
-fiber = require('fiber')
----
-...
-while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
+test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
 ---
+- true
 ...
 box.info.replication[4].upstream.status
 ---
@@ -221,11 +223,13 @@ test_run:cmd('switch quorum3')
 ---
 - true
 ...
-fiber = require('fiber')
+test_run:wait_cond(function() return box.info.replication ~= nil end, 20)
 ---
+- true
 ...
-while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
+test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
 ---
+- true
 ...
 box.info.replication[4].upstream.status
 ---
diff --git a/test/replication/quorum.test.lua b/test/replication/quorum.test.lua
index 98febb367..be23200d3 100644
--- a/test/replication/quorum.test.lua
+++ b/test/replication/quorum.test.lua
@@ -47,9 +47,9 @@ box.info.ro -- false
 box.info.status -- running
 
 -- Check that the replica follows all masters.
-box.info.id == 1 or box.info.replication[1].upstream.status == 'follow'
-box.info.id == 2 or box.info.replication[2].upstream.status == 'follow'
-box.info.id == 3 or box.info.replication[3].upstream.status == 'follow'
+box.info.id == 1 or test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end, 20)
+box.info.id == 2 or test_run:wait_cond(function() return box.info.replication[2].upstream.status == 'follow' end, 20)
+box.info.id == 3 or test_run:wait_cond(function() return box.info.replication[3].upstream.status == 'follow' end, 20)
 
 -- Check that box.cfg() doesn't return until the instance
 -- catches up with all configured replicas.
@@ -59,13 +59,14 @@ test_run:cmd('switch quorum2')
 box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0.001)
 test_run:cmd('stop server quorum1')
 
+test_run:wait_cond(function() return box.space.test.index.primary ~= nil end, 20)
 for i = 1, 100 do box.space.test:insert{i} end
 fiber = require('fiber')
 fiber.sleep(0.1)
 
 test_run:cmd('start server quorum1 with args="0.1  0.5"')
 test_run:cmd('switch quorum1')
-box.space.test:count() -- 100
+test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
 
 -- Rebootstrap one node of the cluster and check that others follow.
 -- Note, due to ERRINJ_RELAY_TIMEOUT there is a substantial delay
@@ -81,17 +82,16 @@ box.snapshot()
 test_run:cmd('switch quorum1')
 test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
 
-box.space.test:count() -- 100
+test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
 
 -- The rebootstrapped replica will be assigned id = 4,
 -- because ids 1..3 are busy.
 test_run:cmd('switch quorum2')
-fiber = require('fiber')
-while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
+test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
 box.info.replication[4].upstream.status
 test_run:cmd('switch quorum3')
-fiber = require('fiber')
-while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
+test_run:wait_cond(function() return box.info.replication ~= nil end, 20)
+test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
 box.info.replication[4].upstream.status
 
 -- Cleanup.
-- 
2.17.1


* [Tarantool-patches] [PATCH 2/2] test: stabilize quorum test conditions
       [not found] <cover.1574159473.git.i.kosarev@tarantool.org>
@ 2019-11-19 10:31 ` Ilya Kosarev
  0 siblings, 0 replies; 12+ messages in thread
From: Ilya Kosarev @ 2019-11-19 10:31 UTC (permalink / raw)
  To: tarantool-patches; +Cc: v.shpilevoy

Some pass conditions in the quorum test could take a while to be
satisfied. They are now wrapped in test_run:wait_cond() calls to make
the test stable.

Closes #4586
---
 test/replication/quorum.result   | 30 +++++++++++++++++-------------
 test/replication/quorum.test.lua | 18 +++++++++---------
 2 files changed, 26 insertions(+), 22 deletions(-)
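
For reference, a minimal sketch of the polling pattern the patch relies
on; it is not part of the patch, and the space name, expected count and
timeout below are illustrative only:

    -- Illustrative sketch: the test-run harness handle is created once
    -- at the top of each test file.
    test_run = require('test_run').new()
    -- wait_cond() re-evaluates the predicate until it returns true or
    -- the timeout (in seconds) expires; the expected .result output of
    -- the successful case is "true".
    test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)

Compared to the open-coded fiber.sleep() loops removed below, the wait
is bounded, so a condition that never becomes true shows up as a result
mismatch instead of an indefinite hang.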

diff --git a/test/replication/quorum.result b/test/replication/quorum.result
index ff5fa0150..12604c8de 100644
--- a/test/replication/quorum.result
+++ b/test/replication/quorum.result
@@ -115,15 +115,15 @@ box.info.status -- running
 - running
 ...
 -- Check that the replica follows all masters.
-box.info.id == 1 or box.info.replication[1].upstream.status == 'follow'
+box.info.id == 1 or test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end, 20)
 ---
 - true
 ...
-box.info.id == 2 or box.info.replication[2].upstream.status == 'follow'
+box.info.id == 2 or test_run:wait_cond(function() return box.info.replication[2].upstream.status == 'follow' end, 20)
 ---
 - true
 ...
-box.info.id == 3 or box.info.replication[3].upstream.status == 'follow'
+box.info.id == 3 or test_run:wait_cond(function() return box.info.replication[3].upstream.status == 'follow' end, 20)
 ---
 - true
 ...
@@ -149,6 +149,10 @@ test_run:cmd('stop server quorum1')
 ---
 - true
 ...
+test_run:wait_cond(function() return box.space.test.index.primary ~= nil end, 20)
+---
+- true
+...
 for i = 1, 100 do box.space.test:insert{i} end
 ---
 ...
@@ -166,9 +170,9 @@ test_run:cmd('switch quorum1')
 ---
 - true
 ...
-box.space.test:count() -- 100
+test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
 ---
-- 100
+- true
 ...
 -- Rebootstrap one node of the cluster and check that others follow.
 -- Note, due to ERRINJ_RELAY_TIMEOUT there is a substantial delay
@@ -197,9 +201,9 @@ test_run:cmd('switch quorum1')
 - true
 ...
 test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
-box.space.test:count() -- 100
+test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
 ---
-- 100
+- true
 ...
 -- The rebootstrapped replica will be assigned id = 4,
 -- because ids 1..3 are busy.
@@ -207,11 +211,9 @@ test_run:cmd('switch quorum2')
 ---
 - true
 ...
-fiber = require('fiber')
----
-...
-while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
+test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
 ---
+- true
 ...
 box.info.replication[4].upstream.status
 ---
@@ -221,11 +223,13 @@ test_run:cmd('switch quorum3')
 ---
 - true
 ...
-fiber = require('fiber')
+test_run:wait_cond(function() return box.info.replication ~= nil end, 20)
 ---
+- true
 ...
-while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
+test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
 ---
+- true
 ...
 box.info.replication[4].upstream.status
 ---
diff --git a/test/replication/quorum.test.lua b/test/replication/quorum.test.lua
index 98febb367..be23200d3 100644
--- a/test/replication/quorum.test.lua
+++ b/test/replication/quorum.test.lua
@@ -47,9 +47,9 @@ box.info.ro -- false
 box.info.status -- running
 
 -- Check that the replica follows all masters.
-box.info.id == 1 or box.info.replication[1].upstream.status == 'follow'
-box.info.id == 2 or box.info.replication[2].upstream.status == 'follow'
-box.info.id == 3 or box.info.replication[3].upstream.status == 'follow'
+box.info.id == 1 or test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end, 20)
+box.info.id == 2 or test_run:wait_cond(function() return box.info.replication[2].upstream.status == 'follow' end, 20)
+box.info.id == 3 or test_run:wait_cond(function() return box.info.replication[3].upstream.status == 'follow' end, 20)
 
 -- Check that box.cfg() doesn't return until the instance
 -- catches up with all configured replicas.
@@ -59,13 +59,14 @@ test_run:cmd('switch quorum2')
 box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0.001)
 test_run:cmd('stop server quorum1')
 
+test_run:wait_cond(function() return box.space.test.index.primary ~= nil end, 20)
 for i = 1, 100 do box.space.test:insert{i} end
 fiber = require('fiber')
 fiber.sleep(0.1)
 
 test_run:cmd('start server quorum1 with args="0.1  0.5"')
 test_run:cmd('switch quorum1')
-box.space.test:count() -- 100
+test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
 
 -- Rebootstrap one node of the cluster and check that others follow.
 -- Note, due to ERRINJ_RELAY_TIMEOUT there is a substantial delay
@@ -81,17 +82,16 @@ box.snapshot()
 test_run:cmd('switch quorum1')
 test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
 
-box.space.test:count() -- 100
+test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
 
 -- The rebootstrapped replica will be assigned id = 4,
 -- because ids 1..3 are busy.
 test_run:cmd('switch quorum2')
-fiber = require('fiber')
-while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
+test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
 box.info.replication[4].upstream.status
 test_run:cmd('switch quorum3')
-fiber = require('fiber')
-while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
+test_run:wait_cond(function() return box.info.replication ~= nil end, 20)
+test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
 box.info.replication[4].upstream.status
 
 -- Cleanup.
-- 
2.17.1

