[Tarantool-patches] [PATCH] test: fix flaky socket test
Ilya Kosarev
i.kosarev at tarantool.org
Tue Dec 10 17:36:53 MSK 2019
Hi!
Thanks for the comments!
5 answers are below.
Sent v3 considering discussed questions.
>Воскресенье, 8 декабря 2019, 18:52 +03:00 от Vladislav Shpilevoy <v.shpilevoy at tarantool.org>:
>
>Hi! Thanks for the patch!
>
>Please, keep my comments near your answers. Because
>I don't remember all my comments, and for each your
>answer I need to lookup my original email to remember,
>what was the problem.
>
>This is usually simpler both to you and a reviewer,
>when your answers are inlined into the original email.
>
>Also don't drop the comments. Here is my comment 2:
>
>>> 2. I understand, why it is 'connect time' in the previous
>>> hunk. But here the socket is already connected. So it is
>>> not connect time.
>>>
>>> On the other hand, there are places right after connect,
>>> which don't use this timeout. For example, line 457
>>> in socket.result file:
>>>
>>> sc:sysconnect('127.0.0.1', s:name().port) or errno() == errno.EINPROGRESS
>>> ---
>>> - true
>>> ...
>>> sc:writable(10)
>>> ---
>>> - true
>>> ...
>>>
>>> Here it is 10 seconds, while in other places you started
>>> using 240.
>And it is still actual. Why is it called 'CONNECT_TIME' if
>it has nothing to do with connect in most of the places?
>The problem is in the name. I would introduce one timeout
>for conds, connects, readable, etc. You even already have
>it - COND_TIME.
2. Ok, I see the problem here. I will carefully reconsider such
places to make them consistent.
>
>
>On 06/12/2019 17:01, Ilya Kosarev wrote:
>> Hi!
>>
>> Thanks for your review.
>>
>> 4. I run socket.test to simulate "high load" in parallel with itself
>> (while ./test-run.py -j20 `for r in {1..64} ; do echo socket ; done` 2>/dev/null ; do l=$(($l+1)) ; echo ======== $l ============= ; done).
>
>I tried this example on your branch and got that:
>
>=========================================================================
>[009] --- app/socket.result Sun Dec 8 15:40:58 2019
>[009] +++ app/socket.reject Sun Dec 8 16:09:23 2019
>[009] @@ -2670,13 +2670,17 @@
>[009] -- case: recvfrom reads first 512 bytes from the message with tcp
>[009] received_message, from = receiving_socket:recvfrom()
>[009] ---
>[009] +- error: '[string "received_message, from = receiving_socket:rec..."]:1: attempt to
>[009] + index global ''receiving_socket'' (a nil value)'
>[009] ...
>[009] e = receiving_socket:errno()
>[009] ---
>[009] +- error: '[string "e = receiving_socket:errno() "]:1: attempt to index global ''receiving_socket''
>[009] + (a nil value)'
>[009] ...
>[009] received_message == message:sub(1, 512) -- expected true
>[009] ---
>[009] -- true
>[009] +- false
>[009] ...
>[009] received_message:len() == 512 -- expected true
>[009] ---
>[009] @@ -2684,7 +2688,7 @@
>[009] ...
>[009] received_message
>[009] ---
>[009] -- xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>[009] +- yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
>[009] ...
>[009] received_message:len()
>[009] ---
>[009] @@ -2692,11 +2696,13 @@
>[009] ...
>[009] from == nil -- expected true
>[009] ---
>[009] -- true
>[009] +- false
>[009] ...
>[009] from
>[009] ---
>[009] -- null
>[009] +- host: 127.0.0.1
>[009] + family: AF_INET
>[009] + port: 57721
>[009] ...
>[009] e == 0 -- expected true
>[009] ---
>[009] @@ -2709,33 +2715,39 @@
>[009] -- case: recvfrom does not discard the message tail with tcp
>[009] received_message, from = receiving_socket:recvfrom()
>[009] ---
>[009] +- error: '[string "received_message, from = receiving_socket:rec..."]:1: attempt to
>[009] + index global ''receiving_socket'' (a nil value)'
>[009] ...
>[009] e = receiving_socket:errno()
>[009] ---
>[009] +- error: '[string "e = receiving_socket:errno() "]:1: attempt to index global ''receiving_socket''
>[009] + (a nil value)'
>[009] ...
>[009] received_message == message:sub(513, 513) -- expected true
>[009] ---
>[009] -- true
>[009] +- false
>[009] ...
>[009] received_message:len() == 1 -- expected true
>[009] ---
>[009] -- true
>[009] +- false
>[009] ...
>[009] received_message
>[009] ---
>[009] -- x
>[009] +- yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
>[009] ...
>[009] received_message:len()
>[009] ---
>[009] -- 1
>[009] +- 512
>[009] ...
>[009] from == nil -- expected true
>[009] ---
>[009] -- true
>[009] +- false
>[009] ...
>[009] from
>[009] ---
>[009] -- null
>[009] +- host: 127.0.0.1
>[009] + family: AF_INET
>[009] + port: 57721
>[009] ...
>[009] e == 0 -- expected true
>[009] ---
>[009] @@ -2747,7 +2759,8 @@
>[009] ...
>[009] receiving_socket:close()
>[009] ---
>[009] -- true
>[009] +- error: '[string "return receiving_socket:close() "]:1: attempt to index global ''receiving_socket''
>[009] + (a nil value)'
>[009] ...
>[009] sending_socket:close()
>[009] ---
>[009]
>=========================================================================
>
>Sometimes I get this:
>
>=========================================================================
>[008] app/socket.test.lua [ fail ]
>[008]
>[008] Test failed! Result content mismatch:
>[008] --- app/socket.result Sun Dec 8 15:40:58 2019
>[008] +++ app/socket.reject Sun Dec 8 16:11:27 2019
>[008] @@ -2340,23 +2340,24 @@
>[008] ...
>[008] received_message == '' -- expected true
>[008] ---
>[008] -- true
>[008] +- false
>[008] ...
>[008] received_message
>[008] ---
>[008] --
>[008] +- null
>[008] ...
>[008] from ~= nil -- expected true
>[008] ---
>[008] -- true
>[008] +- false
>[008] ...
>[008] from.host == '127.0.0.1' -- expected true
>[008] ---
>[008] -- true
>[008] +- error: '[string "return from.host == ''127.0.0.1'' -- expected t..."]:1: attempt
>[008] + to index global ''from'' (a nil value)'
>[008] ...
>[008] e == 0 -- expected true
>[008] ---
>[008] -- true
>[008] +- false
>[008] ...
>[008] -- case: recv, no datagram, explicit size
>[008] received_message = receiving_socket:recv(512)
>[008] @@ -2386,23 +2387,25 @@
>[008] ...
>[008] received_message == nil -- expected true
>[008] ---
>[008] -- true
>[008] +- false
>[008] ...
>[008] received_message
>[008] ---
>[008] -- null
>[008] +-
>[008] ...
>[008] from == nil -- expected true
>[008] ---
>[008] -- true
>[008] +- false
>[008] ...
>[008] from
>[008] ---
>[008] -- null
>[008] +- host: 127.0.0.1
>[008] + family: AF_INET
>[008] + port: 59867
>[008] ...
>[008] e == errno.EAGAIN -- expected true
>[008] ---
>[008] -- true
>[008] +- false
>[008] ...
>[008] -- case: recv, zero datagram, explicit size
>[008] sendto_zero(sending_socket, '127.0.0.1', receiving_socket_port)
>=========================================================================
>
>So socket.test.lua still is unstable. The command with 64
>socket.test.lua gives me an error in 100% cases.
4. Ok, I guess these are mac os specific issues. I haven't got mac os
machine access yet, however, I hope to get it soon and fix these problems.
I still tried to stabilize these cases; if it don't work, I will reconsider them later.
>
>
>> Using fiber.time() as random seed leads to the same results produced
>> on parallel tasks. Using instance uuid as a seed seems to be a good
>> decision.
>
>I thought that it is strange, because UUIDs are also generated
>randomly, and by the same random generator as is used in Lua.
>But now I see why math.random() gives you the same numbers
>even though UUIDs are different: Lua math.random() has own
>random number generator. Not from the standard library.
>
>> 5. Well, considering the new seed generation, 10s seem to be enough.
>
>So you could skip this strange extraction of a number from
>UUID, but increase the timeout? I vote for it. Because the
>failing example with 64 instances of socket.test.lua is very
>artificial, and even then it will work in case the timeout
>is big enough, right? UUID parsing adds unnecessary complexity
>and raises questions like I asked.
5. Ok, I agree.
>
>
>> 6. If we close socket in the main fiber, we might attempt to use closed
>> socket in background fiber.
>
>And what? We don't care anymore for the background
>fiber after the socket is not needed anymore. Ok, it will
>stumble into errors in that fiber, but how does it affect
>further execution of the test?
6. It affects further test execution, as far as it leads
to the following error:
[015] c:receive("*l", "Line: ")
[015] ---
[015] - null
[015] -- closed
[015] -- 'Line: Fu'
[015] +- Connection reset by peer
>
>
>>
>> 7. Of course it shouldn't be nil and test will fail in this case. We
>> just want to be sure that test won't hang even if server is still nil.
>> Otherwise we will hang on chan:put(), which is not nice.
>
>If it will always be nil in case of an error, we won't notice the
>problem. Please, drop it. Increase a timeout somewhere if necessary,
>but consideration of a test passed, if in fact it didn't, is worse.
7. We don't consider the test passed because of this construction.
We will notice the problem, as far as test_run:wait_cond()
will return false instead of true, and also receive1 & receive2
values will be nil instead of '00000' & '1111122222'.
We just don't want to hang on chan:put and wait for extra 120 sec.
for test to fail, that is why it is hidden under if clause.
>
--
Ilya Kosarev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.tarantool.org/pipermail/tarantool-patches/attachments/20191210/6a0acf21/attachment.html>
More information about the Tarantool-patches
mailing list