Sorry, initially I didn't notice you ran the test on the branch.
Still, my answer remains valid except for the first paragraph.


Wednesday, November 20, 2019, 4:29 +03:00 from Ilya Kosarev <i.kosarev@tarantool.org>:

Hi!

Thanks for your review.

Did you run the tests on exactly this patchset or on the branch,
https://github.com/tarantool/tarantool/tree/i.kosarev/gh-4586-fix-quorum-test
which also contains the commit "relay: fix join vclock obtainment in
relay_initial_join"? It is not yet in master (and might never get there,
since Georgy has an alternative fix as part of sync replication), but it
is vital for the stability of the test.

On my machine (Ubuntu 18.04.3 LTS) the quorum test works perfectly on
the mentioned branch. I use the following bash loop to run it under load
(64 quorum instances, 20 parallel workers, repeated until a run fails):
    l=0
    while ./test-run.py -j20 $(for r in {1..64}; do echo quorum; done) 2>/dev/null; do
        l=$(($l+1))
        echo "======== $l ============="
    done

Anyway, I guess the reported problems are not connected with the
join_vclock patch but are macOS specific, since I can't reproduce them
locally. I believe we have some macOS machines; I will ask for access.

It seems to me that the wrong-results problem is quite easy to handle.
I have no idea for now how to handle the reported segfault; however, I
am sure it has nothing to do with the segfault mentioned in the issue,
since that one was caused by unsafe iteration over anonymous replicas.
The other wrong-results problems mentioned in the ticket are also
handled in the patchset.
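
By the way, the "attempt to index field 'test' (a nil value)" errors in
the reject output below simply mean that box.space.test is still nil on
the replica at the moment the condition runs. Here is a minimal sketch of
a nil-safe condition (only an illustration of where the error comes from,
not necessarily the fix that will end up in the patchset):

    -- While the 'test' space has not been replicated yet, box.space.test
    -- is nil, so indexing it inside the condition raises an error instead
    -- of letting wait_cond() return false and retry. Guarding the lookup
    -- makes the condition evaluate to false until the space appears.
    test_run:wait_cond(function()
        local s = box.space.test
        return s ~= nil and s.index.primary ~= nil
    end, 20)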

Therefore I propose to close this ticket with the provided patchset,
although there are some other problems. Then I will open a new issue
with the error info you provided and start working on it as soon as I
get remote access to a macOS machine.



Wednesday, November 20, 2019, 2:06 +03:00 from Vladislav Shpilevoy <v.shpilevoy@tarantool.org>:

Hi! Thanks for the patch!

The commits LGTM. But it looks like there are more problems. I tried
this on the branch:

    python test-run.py replication/quorum. replication/quorum. replication/quorum. replication/quorum. replication/quorum. --conf memtx

I got a crash one time and wrong results another time. But overall the
test works more stably now, IMO.

For the crash I have a core file, but it is not for gdb. I can send it
to you, or extract any info from it if you need.

To summarize, I think we can't write 'Closes' in the last commit yet.
We need to improve these commits or add more fixes on top.

=====================================================================================

Crash (looks really strange, may be an independent bug):

(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff688be2c6 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff68973bf1 libsystem_pthread.dylib`pthread_kill + 284
    frame #2: 0x00007fff688286a6 libsystem_c.dylib`abort + 127
    frame #3: 0x00000001026fd136 tarantool`sig_fatal_cb(signo=11, siginfo=0x000000010557fa38, context=0x000000010557faa0) at main.cc:300:2
    frame #4: 0x00007fff68968b5d libsystem_platform.dylib`_sigtramp + 29
    frame #5: 0x00007fff6896f138 libsystem_pthread.dylib`pthread_mutex_lock + 1
    frame #6: 0x0000000102b0600a tarantool`etp_submit(user=0x00007fc8500072c0, req=0x00007fc851901340) at etp.c:533:7
    frame #7: 0x0000000102b05eca tarantool`eio_submit(req=0x00007fc851901340) at eio.c:482:3
    frame #8: 0x000000010288d004 tarantool`coio_task_execute(task=0x00007fc851901340, timeout=15768000000) at coio_task.c:245:2
    frame #9: 0x000000010288da2b tarantool`coio_getaddrinfo(host="localhost", port="56816", hints=0x000000010557fd98, res=0x000000010557fd90, timeout=15768000000) at coio_task.c:412:6
    frame #10: 0x000000010285d787 tarantool`lbox_socket_getaddrinfo(L=0x00000001051ae590) at socket.c:814:12
    frame #11: 0x00000001028b932d tarantool`lj_BC_FUNCC + 68
    frame #12: 0x00000001028e79ae tarantool`lua_pcall(L=0x00000001051ae590, nargs=3, nresults=-1, errfunc=0) at lj_api.c:1139:12
    frame #13: 0x000000010285af73 tarantool`luaT_call(L=0x00000001051ae590, nargs=3, nreturns=-1) at utils.c:1036:6
    frame #14: 0x00000001028532c6 tarantool`lua_fiber_run_f(ap=0x00000001054003e8) at fiber.c:433:11
    frame #15: 0x00000001026fc91a tarantool`fiber_cxx_invoke(f=(tarantool`lua_fiber_run_f at fiber.c:427), ap=0x00000001054003e8)(__va_list_tag*), __va_list_tag*) at fiber.h:742:10
    frame #16: 0x000000010287765b tarantool`fiber_loop(data=0x0000000000000000) at fiber.c:737:18
    frame #17: 0x0000000102b0c787 tarantool`coro_init at coro.c:110:3
(lldb)

=====================================================================================

Wrong results:

[005] replication/quorum.test.lua memtx [ pass ]
[002] replication/quorum.test.lua memtx [ fail ]
[002]
[002] Test failed! Result content mismatch:
[002] --- replication/quorum.result Tue Nov 19 23:57:26 2019
[002] +++ replication/quorum.reject Wed Nov 20 00:00:20 2019
[002] @@ -42,7 +42,8 @@
[002] ...
[002] box.space.test:replace{100} -- error
[002] ---
[002] -- error: Can't modify data because this instance is in read-only mode.
[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
[002] + field ''test'' (a nil value)'
[002] ...
[002] box.cfg{replication={}}
[002] ---
[002] @@ -66,7 +67,8 @@
[002] ...
[002] box.space.test:replace{100} -- error
[002] ---
[002] -- error: Can't modify data because this instance is in read-only mode.
[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
[002] + field ''test'' (a nil value)'
[002] ...
[002] box.cfg{replication_connect_quorum = 2}
[002] ---
[002] @@ -97,7 +99,8 @@
[002] ...
[002] box.space.test:replace{100} -- error
[002] ---
[002] -- error: Can't modify data because this instance is in read-only mode.
[002] +- error: '[string "return box.space.test:replace{100} -- error "]:1: attempt to index
[002] + field ''test'' (a nil value)'
[002] ...
[002] test_run:cmd('start server quorum1 with args="0.1 0.5"')
[002] ---
[002] @@ -151,10 +154,13 @@
[002] ...
[002] test_run:wait_cond(function() return box.space.test.index.primary ~= nil end, 20)
[002] ---
[002] -- true
[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
[002] + index field ''test'' (a nil value)'
[002] ...
[002] for i = 1, 100 do box.space.test:insert{i} end
[002] ---
[002] +- error: '[string "for i = 1, 100 do box.space.test:insert{i} end "]:1: attempt to
[002] + index field ''test'' (a nil value)'
[002] ...
[002] fiber = require('fiber')
[002] ---
[002] @@ -172,7 +178,8 @@
[002] ...
[002] test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
[002] ---
[002] -- true
[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
[002] + index field ''test'' (a nil value)'
[002] ...
[002] -- Rebootstrap one node of the cluster and check that others follow.
[002] -- Note, due to ERRINJ_RELAY_TIMEOUT there is a substantial delay
[002] @@ -203,7 +210,8 @@
[002] test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
[002] test_run:wait_cond(function() return box.space.test:count() == 100 end, 20)
[002] ---
[002] -- true
[002] +- error: '[string "return test_run:wait_cond(function() return b..."]:1: attempt to
[002] + index field ''test'' (a nil value)'
[002] ...
[002] -- The rebootstrapped replica will be assigned id = 4,
[002] -- because ids 1..3 are busy.
[002]



--
Ilya Kosarev