From: Vladislav Shpilevoy
Message-ID: <4a5e0fa4-02bd-8022-a5a9-32177392b2e8@tarantool.org>
Date: Wed, 20 Nov 2019 00:13:02 +0100
Subject: Re: [Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test
List-Id: Tarantool development patches
To: Ilya Kosarev, tarantool-patches@dev.tarantool.org

Hi! Thanks for the patch!

The commits LGTM. But it looks like there are more problems. I tried
this on the branch:

    python test-run.py replication/quorum. replication/quorum. replication/quorum. replication/quorum. replication/quorum. --conf memtx

I got a crash one time, and wrong results another time. But overall
the test works more stably now, IMO.

For the crash I have a core file. But it is not for gdb. I can send it
to you, or extract any info if you need.

In summary, I think we can't write 'Closes' in the last commit yet.
We need to improve these commits, or add more fixes on top.

=====================================================================================

Crash (looks really strange, may be an independent bug):

(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff688be2c6 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff68973bf1 libsystem_pthread.dylib`pthread_kill + 284
    frame #2: 0x00007fff688286a6 libsystem_c.dylib`abort + 127
    frame #3: 0x00000001026fd136 tarantool`sig_fatal_cb(signo=11, siginfo=0x000000010557fa38, context=0x000000010557faa0) at main.cc:300:2
    frame #4: 0x00007fff68968b5d libsystem_platform.dylib`_sigtramp + 29
    frame #5: 0x00007fff6896f138 libsystem_pthread.dylib`pthread_mutex_lock + 1
    frame #6: 0x0000000102b0600a tarantool`etp_submit(user=0x00007fc8500072c0, req=0x00007fc851901340) at etp.c:533:7
    frame #7: 0x0000000102b05eca tarantool`eio_submit(req=0x00007fc851901340) at eio.c:482:3
    frame #8: 0x000000010288d004 tarantool`coio_task_execute(task=0x00007fc851901340, timeout=15768000000) at coio_task.c:245:2
    frame #9: 0x000000010288da2b tarantool`coio_getaddrinfo(host="localhost", port="56816", hints=0x000000010557fd98, res=0x000000010557fd90, timeout=15768000000) at coio_task.c:412:6
    frame #10: 0x000000010285d787 tarantool`lbox_socket_getaddrinfo(L=0x00000001051ae590) at socket.c:814:12
    frame #11: 0x00000001028b932d tarantool`lj_BC_FUNCC + 68
    frame #12: 0x00000001028e79ae tarantool`lua_pcall(L=0x00000001051ae590, nargs=3, nresults=-1, errfunc=0) at lj_api.c:1139:12
    frame #13: 0x000000010285af73 tarantool`luaT_call(L=0x00000001051ae590, nargs=3, nreturns=-1) at utils.c:1036:6
    frame #14: 0x00000001028532c6 tarantool`lua_fiber_run_f(ap=0x00000001054003e8) at fiber.c:433:11
    frame #15: 0x00000001026fc91a tarantool`fiber_cxx_invoke(f=(tarantool`lua_fiber_run_f at fiber.c:427), ap=0x00000001054003e8)(__va_list_tag*), __va_list_tag*) at fiber.h:742:10
    frame #16: 0x000000010287765b tarantool`fiber_loop(data=0x0000000000000000) at fiber.c:737:18
    frame #17: 0x0000000102b0c787 tarantool`coro_init at coro.c:110:3
(lldb)

=====================================================================================

Wrong results:

[005] replication/quorum.test.lua                                     memtx           [ pass ]
[002] replication/quorum.test.lua                                     memtx           [ fail ]
[002]
[002] Test failed! Result content mismatch:
[002] --- replication/quorum.result	Tue Nov 19 23:57:26 2019
[002] +++ replication/quorum.reject	Wed Nov 20 00:00:20 2019
[002] @@ -42,7 +42,8 @@
[002]  ...
[002]  box.space.test:replace{100} -- error
[002]  ---
[002] -- error: Can't modify data because this instance is in read-only mode.
[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
[002] + field ''test'' (a nil value)'
[002]  ...
[002]  box.cfg{replication={}}
[002]  ---
[002] @@ -66,7 +67,8 @@
[002]  ...
[002]  box.space.test:replace{100} -- error
[002]  ---
[002] -- error: Can't modify data because this instance is in read-only mode.
[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
[002] + field ''test'' (a nil value)'
[002]  ...
[002]  box.cfg{replication_connect_quorum = 2}
[002]  ---
[002] @@ -97,7 +99,8 @@
[002]  ...
[002]  box.space.test:replace{100} -- error
[002]  ---
[002] -- error: Can't modify data because this instance is in read-only mode.
[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
[002] + field ''test'' (a nil value)'
[002]  ...
[002]  test_run:cmd('start server quorum1 with args="0.1 0.5"')
[002]  ---
[002] @@ -151,10 +154,13 @@
[002]  ...
[002]  test_run:wait_cond(function() return box.space.test.index.primary ~= nil end, 20)
[002]  ---
[002] -- true
[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
[002] + index field ''test'' (a nil value)'
[002]  ...
[002]  for i = 1, 100 do box.space.test:insert{i} end
[002]  ---
[002] +- error: '[string "for i = 1, 100 do box.space.test:insert{i} end "]:1: attempt to
[002] + index field ''test'' (a nil value)'
[002]  ...
[002]  fiber = require('fiber')
[002]  ---
[002] @@ -172,7 +178,8 @@
[002]  ...
[002]  test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
[002]  ---
[002] -- true
[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
[002] + index field ''test'' (a nil value)'
[002]  ...
[002]  -- Rebootstrap one node of the cluster and check that others follow.
[002]  -- Note, due to ERRINJ_RELAY_TIMEOUT there is a substantial delay
[002] @@ -203,7 +210,8 @@
[002]  test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
[002]  test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
[002]  ---
[002] -- true
[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
[002] + index field ''test'' (a nil value)'
[002]  ...
[002]  -- The rebootstrapped replica will be assigned id = 4,
[002]  -- because ids 1..3 are busy.
[002]