* [Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test
From: Ilya Kosarev @ 2019-11-18 13:19 UTC
To: tarantool-patches

This patchset fixes an anon replicas iteration issue in replicaset_follow
and stabilizes the quorum test.

Branch: https://github.com/tarantool/tarantool/tree/i.kosarev/gh-4586-fix-quorum-test
Issue: https://github.com/tarantool/tarantool/issues/4586

Ilya Kosarev (2):
  replication: make anon replicas iteration safe
  test: stabilize quorum test conditions

 src/box/replication.cc           |  3 ++-
 test/replication/quorum.result   | 28 +++++++++++++++++++---------
 test/replication/quorum.test.lua | 16 +++++++++-------
 3 files changed, 30 insertions(+), 17 deletions(-)

-- 
2.17.1
* [Tarantool-patches] [PATCH 1/2] replication: make anon replicas iteration safe
From: Ilya Kosarev @ 2019-11-18 13:19 UTC
To: tarantool-patches

In replicaset_follow we iterate the list of anon replicas: replicas that
haven't received a UUID yet. On a successful connect the replica is
removed from the anon list. If that happens immediately, without a yield
in the applier, the iteration breaks. Fix this by using
rlist_foreach_entry_safe instead of the plain rlist_foreach_entry.

Part of #4586
---
 src/box/replication.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/box/replication.cc b/src/box/replication.cc
index 509187cd3..495a01c4f 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -800,7 +800,8 @@ replicaset_follow(void)
 		if (replica->applier != NULL)
 			applier_resume(replica->applier);
 	}
-	rlist_foreach_entry(replica, &replicaset.anon, in_anon) {
+	struct replica *tmp;
+	rlist_foreach_entry_safe(replica, &replicaset.anon, in_anon, tmp) {
 		/* Restart appliers that failed to connect. */
 		applier_start(replica->applier);
 	}
-- 
2.17.1
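[Editor's illustration] For readers unfamiliar with the failure mode the commit
message describes: if the loop body unlinks the current element, a plain
foreach advances through a node that is no longer in the list, while the
"safe" variant caches the next element before running the body. Below is a
minimal, self-contained sketch of that pattern using an ordinary
singly-linked list; it is only an analogue of the idea, not Tarantool's
actual rlist macros or replication code.

#include <stdio.h>
#include <stdlib.h>

struct node {
	int id;
	struct node *next;
};

int
main(void)
{
	/* Build a small list: 1 -> 2 -> 3. */
	struct node *head = NULL;
	for (int id = 3; id >= 1; id--) {
		struct node *n = malloc(sizeof(*n));
		n->id = id;
		n->next = head;
		head = n;
	}
	/*
	 * Unsafe pattern: for (cur = head; cur != NULL; cur = cur->next)
	 * breaks if the body unlinks/frees cur, because the step
	 * expression then reads a dead node. The safe pattern caches the
	 * next element before the body runs, so the body is free to
	 * remove the current one.
	 */
	struct node *cur, *tmp;
	for (cur = head; cur != NULL; cur = tmp) {
		tmp = cur->next;	/* remember next before the body */
		printf("visiting %d\n", cur->id);
		free(cur);		/* current node is removed here */
	}
	return 0;
}

The extra tmp variable added by the patch plays the same role as the cached
next pointer in this sketch.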
* [Tarantool-patches] [PATCH 2/2] test: stabilize quorum test conditions
From: Ilya Kosarev @ 2019-11-18 13:19 UTC
To: tarantool-patches

There were some pass conditions in the quorum test which could take some
time to be satisfied. Now they are wrapped in test_run:wait_cond to make
the test stable.

Closes #4586
---
 test/replication/quorum.result   | 28 +++++++++++++++++++---------
 test/replication/quorum.test.lua | 16 +++++++++-------
 2 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/test/replication/quorum.result b/test/replication/quorum.result
index ff5fa0150..73def113f 100644
--- a/test/replication/quorum.result
+++ b/test/replication/quorum.result
@@ -115,15 +115,15 @@ box.info.status -- running
 - running
 ...
 -- Check that the replica follows all masters.
-box.info.id == 1 or box.info.replication[1].upstream.status == 'follow'
+box.info.id == 1 or test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end, 20)
 ---
 - true
 ...
-box.info.id == 2 or box.info.replication[2].upstream.status == 'follow'
+box.info.id == 2 or test_run:wait_cond(function() return box.info.replication[2].upstream.status == 'follow' end, 20)
 ---
 - true
 ...
-box.info.id == 3 or box.info.replication[3].upstream.status == 'follow'
+box.info.id == 3 or test_run:wait_cond(function() return box.info.replication[3].upstream.status == 'follow' end, 20)
 ---
 - true
 ...
@@ -149,6 +149,10 @@ test_run:cmd('stop server quorum1')
 ---
 - true
 ...
+test_run:wait_cond(function() return box.space.test.index.primary ~= nil end, 20)
+---
+- true
+...
 for i = 1, 100 do box.space.test:insert{i} end
 ---
 ...
@@ -166,9 +170,9 @@ test_run:cmd('switch quorum1')
 ---
 - true
 ...
-box.space.test:count() -- 100
+test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
 ---
-- 100
+- true
 ...
 -- Rebootstrap one node of the cluster and check that others follow.
 -- Note, due to ERRINJ_RELAY_TIMEOUT there is a substantial delay
@@ -197,9 +201,9 @@ test_run:cmd('switch quorum1')
 - true
 ...
 test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
-box.space.test:count() -- 100
+test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
 ---
-- 100
+- true
 ...
 -- The rebootstrapped replica will be assigned id = 4,
 -- because ids 1..3 are busy.
@@ -210,8 +214,9 @@ test_run:cmd('switch quorum2')
 fiber = require('fiber')
 ---
 ...
-while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
+test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
 ---
+- true
 ...
 box.info.replication[4].upstream.status
 ---
@@ -224,8 +229,13 @@ test_run:cmd('switch quorum3')
 fiber = require('fiber')
 ---
 ...
-while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
+test_run:wait_cond(function() return box.info.replication ~= nil end, 20)
 ---
+- true
+...
+test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
+---
+- true
 ...
 box.info.replication[4].upstream.status
 ---
diff --git a/test/replication/quorum.test.lua b/test/replication/quorum.test.lua
index 98febb367..1842cdffe 100644
--- a/test/replication/quorum.test.lua
+++ b/test/replication/quorum.test.lua
@@ -47,9 +47,9 @@ box.info.ro -- false
 box.info.status -- running
 
 -- Check that the replica follows all masters.
-box.info.id == 1 or box.info.replication[1].upstream.status == 'follow'
-box.info.id == 2 or box.info.replication[2].upstream.status == 'follow'
-box.info.id == 3 or box.info.replication[3].upstream.status == 'follow'
+box.info.id == 1 or test_run:wait_cond(function() return box.info.replication[1].upstream.status == 'follow' end, 20)
+box.info.id == 2 or test_run:wait_cond(function() return box.info.replication[2].upstream.status == 'follow' end, 20)
+box.info.id == 3 or test_run:wait_cond(function() return box.info.replication[3].upstream.status == 'follow' end, 20)
 
 -- Check that box.cfg() doesn't return until the instance
 -- catches up with all configured replicas.
@@ -59,13 +59,14 @@ test_run:cmd('switch quorum2')
 box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0.001)
 
 test_run:cmd('stop server quorum1')
+test_run:wait_cond(function() return box.space.test.index.primary ~= nil end, 20)
 for i = 1, 100 do box.space.test:insert{i} end
 fiber = require('fiber')
 fiber.sleep(0.1)
 
 test_run:cmd('start server quorum1 with args="0.1 0.5"')
 test_run:cmd('switch quorum1')
-box.space.test:count() -- 100
+test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
 
 -- Rebootstrap one node of the cluster and check that others follow.
 -- Note, due to ERRINJ_RELAY_TIMEOUT there is a substantial delay
@@ -81,17 +82,18 @@ box.snapshot()
 test_run:cmd('switch quorum1')
 test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
-box.space.test:count() -- 100
+test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
 
 -- The rebootstrapped replica will be assigned id = 4,
 -- because ids 1..3 are busy.
 test_run:cmd('switch quorum2')
 fiber = require('fiber')
-while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
+test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
 box.info.replication[4].upstream.status
 
 test_run:cmd('switch quorum3')
 fiber = require('fiber')
-while box.info.replication[4].upstream.status ~= 'follow' do fiber.sleep(0.001) end
+test_run:wait_cond(function() return box.info.replication ~= nil end, 20)
+test_run:wait_cond(function() return box.info.replication[4].upstream.status == 'follow' end, 20)
 box.info.replication[4].upstream.status
 
 -- Cleanup.
-- 
2.17.1
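[Editor's illustration] A note on why wait_cond stabilizes these checks: it
turns a one-shot assertion into a bounded poll, so a condition that merely
needs a little more time no longer fails the test. The sketch below shows
roughly what such a helper does; it is an illustration only, not the actual
test_run:wait_cond implementation (whose exact defaults and return values may
differ), and it assumes Tarantool's fiber.clock()/fiber.sleep() are available.

local fiber = require('fiber')

local function wait_cond(cond, timeout, delay)
    -- Poll cond() until it returns a truthy value or `timeout` seconds
    -- pass; `delay` is the pause between attempts.
    timeout = timeout or 60
    delay = delay or 0.001
    local deadline = fiber.clock() + timeout
    while true do
        if cond() then
            return true
        end
        if fiber.clock() >= deadline then
            return false
        end
        fiber.sleep(delay)
    end
end

-- Usage analogous to the test change: wait up to 20 seconds for the
-- upstream to reach 'follow' instead of checking the status exactly once.
-- wait_cond(function()
--     return box.info.replication[1].upstream.status == 'follow'
-- end, 20)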
* Re: [Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test
From: Vladislav Shpilevoy @ 2019-11-19 23:13 UTC
To: Ilya Kosarev, tarantool-patches

Hi! Thanks for the patch!

The commits LGTM. But it looks like there are more problems. I tried on
the branch:

    python test-run.py replication/quorum. replication/quorum. replication/quorum. replication/quorum. replication/quorum. --conf memtx

Got a crash one time, and wrong results another time. But overall
the test works more stably now, IMO.

For the crash I have a core file. But it is not for gdb.
I can send it to you, or extract any info you need.

In summary, I think we can't write 'Closes' in the last commit yet.
We need to improve these commits, or add more fixes on top.

=====================================================================================

Crash (looks really strange, may be an independent bug):

(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff688be2c6 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff68973bf1 libsystem_pthread.dylib`pthread_kill + 284
    frame #2: 0x00007fff688286a6 libsystem_c.dylib`abort + 127
    frame #3: 0x00000001026fd136 tarantool`sig_fatal_cb(signo=11, siginfo=0x000000010557fa38, context=0x000000010557faa0) at main.cc:300:2
    frame #4: 0x00007fff68968b5d libsystem_platform.dylib`_sigtramp + 29
    frame #5: 0x00007fff6896f138 libsystem_pthread.dylib`pthread_mutex_lock + 1
    frame #6: 0x0000000102b0600a tarantool`etp_submit(user=0x00007fc8500072c0, req=0x00007fc851901340) at etp.c:533:7
    frame #7: 0x0000000102b05eca tarantool`eio_submit(req=0x00007fc851901340) at eio.c:482:3
    frame #8: 0x000000010288d004 tarantool`coio_task_execute(task=0x00007fc851901340, timeout=15768000000) at coio_task.c:245:2
    frame #9: 0x000000010288da2b tarantool`coio_getaddrinfo(host="localhost", port="56816", hints=0x000000010557fd98, res=0x000000010557fd90, timeout=15768000000) at coio_task.c:412:6
    frame #10: 0x000000010285d787 tarantool`lbox_socket_getaddrinfo(L=0x00000001051ae590) at socket.c:814:12
    frame #11: 0x00000001028b932d tarantool`lj_BC_FUNCC + 68
    frame #12: 0x00000001028e79ae tarantool`lua_pcall(L=0x00000001051ae590, nargs=3, nresults=-1, errfunc=0) at lj_api.c:1139:12
    frame #13: 0x000000010285af73 tarantool`luaT_call(L=0x00000001051ae590, nargs=3, nreturns=-1) at utils.c:1036:6
    frame #14: 0x00000001028532c6 tarantool`lua_fiber_run_f(ap=0x00000001054003e8) at fiber.c:433:11
    frame #15: 0x00000001026fc91a tarantool`fiber_cxx_invoke(f=(tarantool`lua_fiber_run_f at fiber.c:427), ap=0x00000001054003e8)(__va_list_tag*), __va_list_tag*) at fiber.h:742:10
    frame #16: 0x000000010287765b tarantool`fiber_loop(data=0x0000000000000000) at fiber.c:737:18
    frame #17: 0x0000000102b0c787 tarantool`coro_init at coro.c:110:3
(lldb)

=====================================================================================

Wrong results:

[005] replication/quorum.test.lua memtx [ pass ]
[002] replication/quorum.test.lua memtx [ fail ]
[002]
[002] Test failed! Result content mismatch:
[002] --- replication/quorum.result Tue Nov 19 23:57:26 2019
[002] +++ replication/quorum.reject Wed Nov 20 00:00:20 2019
[002] @@ -42,7 +42,8 @@
[002]  ...
[002]  box.space.test:replace{100} -- error
[002]  ---
[002] -- error: Can't modify data because this instance is in read-only mode.
[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
[002] +  field ''test'' (a nil value)'
[002]  ...
[002]  box.cfg{replication={}}
[002]  ---
[002] @@ -66,7 +67,8 @@
[002]  ...
[002]  box.space.test:replace{100} -- error
[002]  ---
[002] -- error: Can't modify data because this instance is in read-only mode.
[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
[002] +  field ''test'' (a nil value)'
[002]  ...
[002]  box.cfg{replication_connect_quorum = 2}
[002]  ---
[002] @@ -97,7 +99,8 @@
[002]  ...
[002]  box.space.test:replace{100} -- error
[002]  ---
[002] -- error: Can't modify data because this instance is in read-only mode.
[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
[002] +  field ''test'' (a nil value)'
[002]  ...
[002]  test_run:cmd('start server quorum1 with args="0.1 0.5"')
[002]  ---
[002] @@ -151,10 +154,13 @@
[002]  ...
[002]  test_run:wait_cond(function() return box.space.test.index.primary ~= nil end, 20)
[002]  ---
[002] -- true
[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
[002] +  index field ''test'' (a nil value)'
[002]  ...
[002]  for i = 1, 100 do box.space.test:insert{i} end
[002]  ---
[002] +- error: '[string "for i = 1, 100 do box.space.test:insert{i} end "]:1: attempt to
[002] +  index field ''test'' (a nil value)'
[002]  ...
[002]  fiber = require('fiber')
[002]  ---
[002] @@ -172,7 +178,8 @@
[002]  ...
[002]  test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
[002]  ---
[002] -- true
[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
[002] +  index field ''test'' (a nil value)'
[002]  ...
[002]  -- Rebootstrap one node of the cluster and check that others follow.
[002]  -- Note, due to ERRINJ_RELAY_TIMEOUT there is a substantial delay
[002] @@ -203,7 +210,8 @@
[002]  test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
[002]  test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
[002]  ---
[002] -- true
[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
[002] +  index field ''test'' (a nil value)'
[002]  ...
[002]  -- The rebootstrapped replica will be assigned id = 4,
[002]  -- because ids 1..3 are busy.
[002]
* Re: [Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test
From: Ilya Kosarev @ 2019-11-20 1:29 UTC
To: Vladislav Shpilevoy; +Cc: tarantool-patches

Hi!

Thanks for your review.

Did you run the tests on exactly this patchset or on the branch
https://github.com/tarantool/tarantool/tree/i.kosarev/gh-4586-fix-quorum-test,
which also contains "relay: fix join vclock obtainment in
relay_initial_join"? That patch is not yet in master (and might not get
there, since Georgy has an alternative fix as part of sync replication),
but it is vital for the stability of the test.

On my machine (Ubuntu 18.04.3 LTS) the quorum test works perfectly on the
mentioned branch. I use the following bash one-liner to run it under
'load':

l=0 ; while ./test-run.py -j20 `for r in {1..64} ; do echo quorum ; done` 2>/dev/null ; do l=$(($l+1)) ; echo ======== $l ============= ; done

Anyway, I guess the reported problems are not connected with the
join_vclock patch but are macOS-specific, since I can't reproduce them
locally. I believe we have some macOS machines; I will ask for access.

It seems to me that the wrong-results problem is quite easy to handle.
I have no idea for now how to handle the reported segfault; however, I am
sure it has nothing to do with the segfault mentioned in the issue, since
that one was caused by unsafe iteration of anon replicas. The other
wrong-results problems mentioned in the ticket are also handled by the
patchset.

Therefore I propose to close the ticket with this patchset even though
there are some other problems. Then I will open a new issue with the error
info you provided and start working on it as soon as I get remote access
to a macOS machine.

>Wednesday, November 20, 2019, 2:06 +03:00 from Vladislav Shpilevoy <v.shpilevoy@tarantool.org>:
>
>Hi! Thanks for the patch!
>
>The commits LGTM. But it looks like there are more problems. I tried on
>the branch:
>
>    python test-run.py replication/quorum. replication/quorum. replication/quorum. replication/quorum. replication/quorum. --conf memtx
>
>Got a crash one time, and wrong results another time. But overall
>the test works more stably now, IMO.
>
>For the crash I have a core file. But it is not for gdb.
>I can send it to you, or extract any info you need.
>
>In summary, I think we can't write 'Closes' in the last commit yet.
>We need to improve these commits, or add more fixes on top.

-- 
Ilya Kosarev
* Re: [Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test
From: Ilya Kosarev @ 2019-11-20 1:41 UTC
To: Vladislav Shpilevoy; +Cc: tarantool-patches

Sorry, initially I didn't notice that you ran the test on the branch.
My answer is still valid except for the first paragraph.

>Wednesday, November 20, 2019, 4:29 +03:00 from Ilya Kosarev <i.kosarev@tarantool.org>:
>
>Hi!
>
>Thanks for your review.
>
>Did you run the tests on exactly this patchset or on the branch
>https://github.com/tarantool/tarantool/tree/i.kosarev/gh-4586-fix-quorum-test,
>which also contains "relay: fix join vclock obtainment in
>relay_initial_join"? That patch is not yet in master (and might not get
>there, since Georgy has an alternative fix as part of sync replication),
>but it is vital for the stability of the test.

-- 
Ilya Kosarev
* Re: [Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test
From: Vladislav Shpilevoy @ 2019-11-20 22:18 UTC
To: Ilya Kosarev; +Cc: tarantool-patches

On 20/11/2019 02:29, Ilya Kosarev wrote:
> Hi!
>
> Thanks for your review.
>
> Did you run the tests on exactly this patchset or on the branch
> https://github.com/tarantool/tarantool/tree/i.kosarev/gh-4586-fix-quorum-test,
> which also contains "relay: fix join vclock obtainment in
> relay_initial_join"
> (https://github.com/tarantool/tarantool/commit/83fa385f2e82911dee1a9189f90b1a94c8455023)?
> That patch is not yet in master (and might not get there, since Georgy
> has an alternative fix as part of sync replication), but it is vital
> for the stability of the test.
>
> On my machine (Ubuntu 18.04.3 LTS) the quorum test works perfectly on the
> mentioned branch. I use the following bash one-liner to run it under
> 'load':
> l=0 ; while ./test-run.py -j20 `for r in {1..64} ; do echo quorum ; done` 2>/dev/null ; do l=$(($l+1)) ; echo ======== $l ============= ; done

I ran it on your branch, and it fails. Even without these
complex bash scripts. I ran it 5-10 times, and it fails/crashes.

>
> Anyway, I guess the reported problems are not connected with the
> join_vclock patch but are macOS-specific, since I can't reproduce them
> locally. I believe we have some macOS machines; I will ask for access.

Maybe they are not related. But at least the wrong test results mean
that the ticket can't be closed. The ticket is called:

    test: flaky segfault on replication/quorum test under high load

And it is still flaky.

For the crash you can open a new ticket.

>
> It seems to me that the wrong-results problem is quite easy to handle.
> I have no idea for now how to handle the reported segfault; however, I am
> sure it has nothing to do with the segfault mentioned in the issue, since
> that one was caused by unsafe iteration of anon replicas. The other
> wrong-results problems mentioned in the ticket are also handled by the
> patchset.

The problems mentioned in the ticket are just comments. Helpers.
The ticket is about flakiness, not about several concrete fails. If the
test is still flaky, what is the point of closing this issue and opening
exactly the same one again?

>
> Therefore I propose to close the ticket with this patchset even though
> there are some other problems. Then I will open a new issue with the error
> info you provided and start working on it as soon as I get remote access
> to a macOS machine.
* Re: [Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test
From: Ilya Kosarev @ 2019-11-20 22:48 UTC
To: Vladislav Shpilevoy; +Cc: tarantool-patches

Thanks for the answer. Sent the patch as 'Part of' instead of 'Closes'.

-- 
Ilya Kosarev