[Tarantool-patches] [PATCH 0/2] fix replica iteration issue & stabilize quorum test

Ilya Kosarev i.kosarev at tarantool.org
Wed Nov 20 04:29:08 MSK 2019


Hi!

Thanks for your review.

Did you run the tests on exactly this patchset or on the branch
https://github.com/tarantool/tarantool/tree/i.kosarev/gh-4586-fix-quorum-test
which also contains "relay: fix join vclock obtainment in
relay_initial_join"?
That commit is not yet in master (and might not get there, since
Georgy has an alternative fix as part of sync replication), but it
is vital for the stability of the test.
On my machine (Ubuntu 18.04.3 LTS) the quorum test works perfectly on
the mentioned branch. I use the following bash command to run it under
load:
l=0 ; while ./test-run.py -j20 `for r in {1..64} ; do echo quorum ; done` 2>/dev/null ; do l=$(($l+1)) ; echo ======== $l ============= ; done
Anyway, I guess the problems you hit are not connected with the
join_vclock patch but are macOS-specific, since I can't reproduce them
locally. I believe we have some macOS machines; I will ask for access.
It seems to me that the wrong-results problem is quite easy to handle.
For now I have no idea how to handle the segfault you reported;
however, I am sure it has nothing to do with the segfault mentioned in
the issue, since that one was caused by unsafe iteration over anonymous
replicas. The other wrong-results problems mentioned in the ticket are
also handled by the patchset.
Therefore I propose to close this ticket with the provided patchset,
although there are some other problems. I will then open a new issue
with the error info you provided and start working on it as soon as I
get remote access to a macOS machine.

>Wednesday, November 20, 2019, 2:06 +03:00, Vladislav Shpilevoy <v.shpilevoy at tarantool.org> wrote:
>
>Hi! Thanks for the patch!
>
>The commits LGTM. But looks like there are more problems. I tried on
>the branch:
>
>    python test-run.py replication/quorum. replication/quorum. replication/quorum. replication/quorum. replication/quorum. --conf memtx
>
>Got a crash one time, and wrong results another time. But overall
>the test works more stably now, IMO.
>
>For the crash I have a core file. But it is not for gdb.
>I can send it to you, or extract any info if you need.
>
>On the summary, I think we can't write 'Closes' in the
>last commit yet. We need to improve these commits, or
>add more fixes on the top.
>
>=====================================================================================
>
>Crash (looks really strange, may be an independent bug):
>
>(lldb) bt
>* thread #1, stop reason = signal SIGSTOP
>  * frame #0: 0x00007fff688be2c6 libsystem_kernel.dylib`__pthread_kill + 10
>    frame #1: 0x00007fff68973bf1 libsystem_pthread.dylib`pthread_kill + 284
>    frame #2: 0x00007fff688286a6 libsystem_c.dylib`abort + 127
>    frame #3: 0x00000001026fd136 tarantool`sig_fatal_cb(signo=11, siginfo=0x000000010557fa38, context=0x000000010557faa0) at main.cc:300:2
>    frame #4: 0x00007fff68968b5d libsystem_platform.dylib`_sigtramp + 29
>    frame #5: 0x00007fff6896f138 libsystem_pthread.dylib`pthread_mutex_lock + 1
>    frame #6: 0x0000000102b0600a tarantool`etp_submit(user=0x00007fc8500072c0, req=0x00007fc851901340) at etp.c:533:7
>    frame #7: 0x0000000102b05eca tarantool`eio_submit(req=0x00007fc851901340) at eio.c:482:3
>    frame #8: 0x000000010288d004 tarantool`coio_task_execute(task=0x00007fc851901340, timeout=15768000000) at coio_task.c:245:2
>    frame #9: 0x000000010288da2b tarantool`coio_getaddrinfo(host="localhost", port="56816", hints=0x000000010557fd98, res=0x000000010557fd90, timeout=15768000000) at coio_task.c:412:6
>    frame #10: 0x000000010285d787 tarantool`lbox_socket_getaddrinfo(L=0x00000001051ae590) at socket.c:814:12
>    frame #11: 0x00000001028b932d tarantool`lj_BC_FUNCC + 68
>    frame #12: 0x00000001028e79ae tarantool`lua_pcall(L=0x00000001051ae590, nargs=3, nresults=-1, errfunc=0) at lj_api.c:1139:12
>    frame #13: 0x000000010285af73 tarantool`luaT_call(L=0x00000001051ae590, nargs=3, nreturns=-1) at utils.c:1036:6
>    frame #14: 0x00000001028532c6 tarantool`lua_fiber_run_f(ap=0x00000001054003e8) at fiber.c:433:11
>    frame #15: 0x00000001026fc91a tarantool`fiber_cxx_invoke(f=(tarantool`lua_fiber_run_f at fiber.c:427), ap=0x00000001054003e8)(__va_list_tag*), __va_list_tag*) at fiber.h:742:10
>    frame #16: 0x000000010287765b tarantool`fiber_loop(data=0x0000000000000000) at fiber.c:737:18
>    frame #17: 0x0000000102b0c787 tarantool`coro_init at coro.c:110:3
>(lldb) 
>
>=====================================================================================
>
>Wrong results:
>
>[005] replication/quorum.test.lua                     memtx           [ pass ]
>[002] replication/quorum.test.lua                     memtx           [ fail ]
>[002] 
>[002] Test failed! Result content mismatch:
>[002] --- replication/quorum.result	Tue Nov 19 23:57:26 2019
>[002] +++ replication/quorum.reject	Wed Nov 20 00:00:20 2019
>[002] @@ -42,7 +42,8 @@
>[002]  ...
>[002]  box.space.test:replace{100} -- error
>[002]  ---
>[002] -- error: Can't modify data because this instance is in read-only mode.
>[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
>[002] +    field ''test'' (a nil value)'
>[002]  ...
>[002]  box.cfg{replication={}}
>[002]  ---
>[002] @@ -66,7 +67,8 @@
>[002]  ...
>[002]  box.space.test:replace{100} -- error
>[002]  ---
>[002] -- error: Can't modify data because this instance is in read-only mode.
>[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
>[002] +    field ''test'' (a nil value)'
>[002]  ...
>[002]  box.cfg{replication_connect_quorum = 2}
>[002]  ---
>[002] @@ -97,7 +99,8 @@
>[002]  ...
>[002]  box.space.test:replace{100} -- error
>[002]  ---
>[002] -- error: Can't modify data because this instance is in read-only mode.
>[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
>[002] +    field ''test'' (a nil value)'
>[002]  ...
>[002]  test_run:cmd('start server quorum1 with args="0.1 0.5"')
>[002]  ---
>[002] @@ -151,10 +154,13 @@
>[002]  ...
>[002]  test_run:wait_cond(function() return box.space.test.index.primary ~= nil end, 20)
>[002]  ---
>[002] -- true
>[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
>[002] +    index field ''test'' (a nil value)'
>[002]  ...
>[002]  for i = 1, 100 do box.space.test:insert{i} end
>[002]  ---
>[002] +- error: '[string "for i = 1, 100 do box.space.test:insert{i} end "]:1: attempt to
>[002] +    index field ''test'' (a nil value)'
>[002]  ...
>[002]  fiber = require('fiber')
>[002]  ---
>[002] @@ -172,7 +178,8 @@
>[002]  ...
>[002]  test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
>[002]  ---
>[002] -- true
>[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
>[002] +    index field ''test'' (a nil value)'
>[002]  ...
>[002]  -- Rebootstrap one node of the cluster and check that others follow.
>[002]  -- Note, due to ERRINJ_RELAY_TIMEOUT there is a substantial delay
>[002] @@ -203,7 +210,8 @@
>[002]  test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
>[002]  test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
>[002]  ---
>[002] -- true
>[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
>[002] +    index field ''test'' (a nil value)'
>[002]  ...
>[002]  -- The rebootstrapped replica will be assigned id = 4,
>[002]  -- because ids 1..3 are busy.
>[002] 


-- 
Ilya Kosarev
