[Tarantool-patches] [PATCH] test: fix flaky socket test

Ilya Kosarev i.kosarev at tarantool.org
Tue Dec 10 17:36:53 MSK 2019


Hi!

Thanks for the comments!

5 answers are below.

Sent v3 considering discussed questions.

>Воскресенье,  8 декабря 2019, 18:52 +03:00 от Vladislav Shpilevoy <v.shpilevoy at tarantool.org>:
>
>Hi! Thanks for the patch!
>
>Please, keep my comments near your answers. Because
>I don't remember all my comments, and for each your
>answer I need to lookup my original email to remember,
>what was the problem.
>
>This is usually simpler both to you and a reviewer,
>when your answers are inlined into the original email.
>
>Also don't drop the comments. Here is my comment 2:
>
>>> 2. I understand, why it is 'connect time' in the previous
>>> hunk. But here the socket is already connected. So it is
>>> not connect time.
>>> 
>>> On the other hand, there are places right after connect,
>>> which don't use this timeout. For example, line 457
>>> in socket.result file:
>>> 
>>>     sc:sysconnect('127.0.0.1', s:name().port) or errno() == errno.EINPROGRESS
>>>     ---
>>>     - true
>>>     ...
>>>     sc:writable(10)
>>>     ---
>>>     - true
>>>     ...
>>> 
>>> Here it is 10 seconds, while in other places you started
>>> using 240.
>And it is still actual. Why is it called 'CONNECT_TIME' if
>it has nothing to do with connect in most of the places?
>The problem is in the name. I would introduce one timeout
>for conds, connects, readable, etc. You even already have
>it - COND_TIME. 
2. Ok, I see the problem here. I will carefully reconsider such
places to make them consistent.
>
>
>On 06/12/2019 17:01, Ilya Kosarev wrote:
>> Hi!
>> 
>> Thanks for your review.
>> 
>> 4. I run socket.test to simulate "high load" in parallel with itself
>> (while ./test-run.py -j20 `for r in {1..64} ; do echo socket ; done` 2>/dev/null ; do l=$(($l+1)) ; echo ======== $l ============= ; done).
>
>I tried this example on your branch and got that:
>
>=========================================================================
>[009] --- app/socket.result	Sun Dec  8 15:40:58 2019
>[009] +++ app/socket.reject	Sun Dec  8 16:09:23 2019
>[009] @@ -2670,13 +2670,17 @@
>[009]  -- case: recvfrom reads first 512 bytes from the message with tcp
>[009]  received_message, from = receiving_socket:recvfrom()
>[009]  ---
>[009] +- error: '[string "received_message, from = receiving_socket:rec..."]:1: attempt to
>[009] +    index global ''receiving_socket'' (a nil value)'
>[009]  ...
>[009]  e = receiving_socket:errno()
>[009]  ---
>[009] +- error: '[string "e = receiving_socket:errno() "]:1: attempt to index global ''receiving_socket''
>[009] +    (a nil value)'
>[009]  ...
>[009]  received_message == message:sub(1, 512) -- expected true
>[009]  ---
>[009] -- true
>[009] +- false
>[009]  ...
>[009]  received_message:len() == 512 -- expected true
>[009]  ---
>[009] @@ -2684,7 +2688,7 @@
>[009]  ...
>[009]  received_message
>[009]  ---
>[009] -- xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>[009] +- yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
>[009]  ...
>[009]  received_message:len()
>[009]  ---
>[009] @@ -2692,11 +2696,13 @@
>[009]  ...
>[009]  from == nil -- expected true
>[009]  ---
>[009] -- true
>[009] +- false
>[009]  ...
>[009]  from
>[009]  ---
>[009] -- null
>[009] +- host: 127.0.0.1
>[009] +  family: AF_INET
>[009] +  port: 57721
>[009]  ...
>[009]  e == 0 -- expected true
>[009]  ---
>[009] @@ -2709,33 +2715,39 @@
>[009]  -- case: recvfrom does not discard the message tail with tcp
>[009]  received_message, from = receiving_socket:recvfrom()
>[009]  ---
>[009] +- error: '[string "received_message, from = receiving_socket:rec..."]:1: attempt to
>[009] +    index global ''receiving_socket'' (a nil value)'
>[009]  ...
>[009]  e = receiving_socket:errno()
>[009]  ---
>[009] +- error: '[string "e = receiving_socket:errno() "]:1: attempt to index global ''receiving_socket''
>[009] +    (a nil value)'
>[009]  ...
>[009]  received_message == message:sub(513, 513) -- expected true
>[009]  ---
>[009] -- true
>[009] +- false
>[009]  ...
>[009]  received_message:len() == 1 -- expected true
>[009]  ---
>[009] -- true
>[009] +- false
>[009]  ...
>[009]  received_message
>[009]  ---
>[009] -- x
>[009] +- yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
>[009]  ...
>[009]  received_message:len()
>[009]  ---
>[009] -- 1
>[009] +- 512
>[009]  ...
>[009]  from == nil -- expected true
>[009]  ---
>[009] -- true
>[009] +- false
>[009]  ...
>[009]  from
>[009]  ---
>[009] -- null
>[009] +- host: 127.0.0.1
>[009] +  family: AF_INET
>[009] +  port: 57721
>[009]  ...
>[009]  e == 0 -- expected true
>[009]  ---
>[009] @@ -2747,7 +2759,8 @@
>[009]  ...
>[009]  receiving_socket:close()
>[009]  ---
>[009] -- true
>[009] +- error: '[string "return receiving_socket:close() "]:1: attempt to index global ''receiving_socket''
>[009] +    (a nil value)'
>[009]  ...
>[009]  sending_socket:close()
>[009]  ---
>[009] 
>=========================================================================
>
>Sometimes I get this:
>
>=========================================================================
>[008] app/socket.test.lua                                             [ fail ]
>[008] 
>[008] Test failed! Result content mismatch:
>[008] --- app/socket.result	Sun Dec  8 15:40:58 2019
>[008] +++ app/socket.reject	Sun Dec  8 16:11:27 2019
>[008] @@ -2340,23 +2340,24 @@
>[008]  ...
>[008]  received_message == '' -- expected true
>[008]  ---
>[008] -- true
>[008] +- false
>[008]  ...
>[008]  received_message
>[008]  ---
>[008] -- 
>[008] +- null
>[008]  ...
>[008]  from ~= nil -- expected true
>[008]  ---
>[008] -- true
>[008] +- false
>[008]  ...
>[008]  from.host == '127.0.0.1' -- expected true
>[008]  ---
>[008] -- true
>[008] +- error: '[string "return from.host == ''127.0.0.1'' -- expected t..."]:1: attempt
>[008] +    to index global ''from'' (a nil value)'
>[008]  ...
>[008]  e == 0 -- expected true
>[008]  ---
>[008] -- true
>[008] +- false
>[008]  ...
>[008]  -- case: recv, no datagram, explicit size
>[008]  received_message = receiving_socket:recv(512)
>[008] @@ -2386,23 +2387,25 @@
>[008]  ...
>[008]  received_message == nil -- expected true
>[008]  ---
>[008] -- true
>[008] +- false
>[008]  ...
>[008]  received_message
>[008]  ---
>[008] -- null
>[008] +- 
>[008]  ...
>[008]  from == nil -- expected true
>[008]  ---
>[008] -- true
>[008] +- false
>[008]  ...
>[008]  from
>[008]  ---
>[008] -- null
>[008] +- host: 127.0.0.1
>[008] +  family: AF_INET
>[008] +  port: 59867
>[008]  ...
>[008]  e == errno.EAGAIN -- expected true
>[008]  ---
>[008] -- true
>[008] +- false
>[008]  ...
>[008]  -- case: recv, zero datagram, explicit size
>[008]  sendto_zero(sending_socket, '127.0.0.1', receiving_socket_port)
>=========================================================================
>
>So socket.test.lua still is unstable. The command with 64
>socket.test.lua gives me an error in 100% cases. 
4. Ok, I guess these are mac os specific issues. I haven't got mac os
machine access yet, however, I hope to get it soon and fix these problems.
I still tried to stabilize these cases; if it don't work, I will reconsider them later.
>
>
>> Using fiber.time() as random seed leads to the same results produced
>> on parallel tasks. Using instance uuid as a seed seems to be a good
>> decision.
>
>I thought that it is strange, because UUIDs are also generated
>randomly, and by the same random generator as is used in Lua.
>But now I see why math.random() gives you the same numbers
>even though UUIDs are different: Lua math.random() has own
>random number generator. Not from the standard library.
>
>> 5. Well, considering the new seed generation, 10s seem to be enough.
>
>So you could skip this strange extraction of a number from
>UUID, but increase the timeout? I vote for it. Because the
>failing example with 64 instances of socket.test.lua is very
>artificial, and even then it will work in case the timeout
>is big enough, right? UUID parsing adds unnecessary complexity
>and raises questions like I asked. 
5. Ok, I agree.
>
>
>> 6. If we close socket in the main fiber, we might attempt to use closed
>> socket in background fiber.
>
>And what? We don't care anymore for the background
>fiber after the socket is not needed anymore. Ok, it will
>stumble into errors in that fiber, but how does it affect
>further execution of the test? 
6. It affects further test execution, as far as it leads
to the following error:
[015]  c:receive("*l", "Line: ")
[015]  ---
[015]  - null
[015] -- closed
[015] -- 'Line: Fu'
[015] +- Connection reset by peer
>
>
>> 
>> 7. Of course it shouldn't be nil and test will fail in this case. We
>> just want to be sure that test won't hang even if server is still nil.
>> Otherwise we will hang on chan:put(), which is not nice.
>
>If it will always be nil in case of an error, we won't notice the
>problem. Please, drop it. Increase a timeout somewhere if necessary,
>but consideration of a test passed, if in fact it didn't, is worse. 
7. We don't consider the test passed because of this construction.
We will notice the problem, as far as  test_run:wait_cond()
will return false instead of true, and also  receive1 &  receive2
values will be nil instead of  '00000' &  '1111122222'.
We just don't want to hang on chan:put and wait for extra 120 sec.
for test to fail, that is why it is hidden under if clause.
>


-- 
Ilya Kosarev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.tarantool.org/pipermail/tarantool-patches/attachments/20191210/6a0acf21/attachment.html>


More information about the Tarantool-patches mailing list