[Tarantool-patches] [PATCH] test: fix flaky socket test

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Sun Dec 8 18:52:28 MSK 2019


Hi! Thanks for the patch!

Please, keep my comments near your answers. Because
I don't remember all my comments, and for each your
answer I need to lookup my original email to remember,
what was the problem.

This is usually simpler both to you and a reviewer,
when your answers are inlined into the original email.

Also don't drop the comments. Here is my comment 2:

>> 2. I understand, why it is 'connect time' in the previous
>> hunk. But here the socket is already connected. So it is
>> not connect time.
>> 
>> On the other hand, there are places right after connect,
>> which don't use this timeout. For example, line 457
>> in socket.result file:
>> 
>>     sc:sysconnect('127.0.0.1', s:name().port) or errno() == errno.EINPROGRESS
>>     ---
>>     - true
>>     ...
>>     sc:writable(10)
>>     ---
>>     - true
>>     ...
>> 
>> Here it is 10 seconds, while in other places you started
>> using 240.

And it is still actual. Why is it called 'CONNECT_TIME' if
it has nothing to do with connect in most of the places?
The problem is in the name. I would introduce one timeout
for conds, connects, readable, etc. You even already have
it - COND_TIME.

On 06/12/2019 17:01, Ilya Kosarev wrote:
> Hi!
> 
> Thanks for your review.
> 
> 4. I run socket.test to simulate "high load" in parallel with itself
> (while ./test-run.py -j20 `for r in {1..64} ; do echo socket ; done` 2>/dev/null ; do l=$(($l+1)) ; echo ======== $l ============= ; done).

I tried this example on your branch and got that:

=========================================================================
[009] --- app/socket.result	Sun Dec  8 15:40:58 2019
[009] +++ app/socket.reject	Sun Dec  8 16:09:23 2019
[009] @@ -2670,13 +2670,17 @@
[009]  -- case: recvfrom reads first 512 bytes from the message with tcp
[009]  received_message, from = receiving_socket:recvfrom()
[009]  ---
[009] +- error: '[string "received_message, from = receiving_socket:rec..."]:1: attempt to
[009] +    index global ''receiving_socket'' (a nil value)'
[009]  ...
[009]  e = receiving_socket:errno()
[009]  ---
[009] +- error: '[string "e = receiving_socket:errno() "]:1: attempt to index global ''receiving_socket''
[009] +    (a nil value)'
[009]  ...
[009]  received_message == message:sub(1, 512) -- expected true
[009]  ---
[009] -- true
[009] +- false
[009]  ...
[009]  received_message:len() == 512 -- expected true
[009]  ---
[009] @@ -2684,7 +2688,7 @@
[009]  ...
[009]  received_message
[009]  ---
[009] -- xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
[009] +- yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
[009]  ...
[009]  received_message:len()
[009]  ---
[009] @@ -2692,11 +2696,13 @@
[009]  ...
[009]  from == nil -- expected true
[009]  ---
[009] -- true
[009] +- false
[009]  ...
[009]  from
[009]  ---
[009] -- null
[009] +- host: 127.0.0.1
[009] +  family: AF_INET
[009] +  port: 57721
[009]  ...
[009]  e == 0 -- expected true
[009]  ---
[009] @@ -2709,33 +2715,39 @@
[009]  -- case: recvfrom does not discard the message tail with tcp
[009]  received_message, from = receiving_socket:recvfrom()
[009]  ---
[009] +- error: '[string "received_message, from = receiving_socket:rec..."]:1: attempt to
[009] +    index global ''receiving_socket'' (a nil value)'
[009]  ...
[009]  e = receiving_socket:errno()
[009]  ---
[009] +- error: '[string "e = receiving_socket:errno() "]:1: attempt to index global ''receiving_socket''
[009] +    (a nil value)'
[009]  ...
[009]  received_message == message:sub(513, 513) -- expected true
[009]  ---
[009] -- true
[009] +- false
[009]  ...
[009]  received_message:len() == 1 -- expected true
[009]  ---
[009] -- true
[009] +- false
[009]  ...
[009]  received_message
[009]  ---
[009] -- x
[009] +- yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
[009]  ...
[009]  received_message:len()
[009]  ---
[009] -- 1
[009] +- 512
[009]  ...
[009]  from == nil -- expected true
[009]  ---
[009] -- true
[009] +- false
[009]  ...
[009]  from
[009]  ---
[009] -- null
[009] +- host: 127.0.0.1
[009] +  family: AF_INET
[009] +  port: 57721
[009]  ...
[009]  e == 0 -- expected true
[009]  ---
[009] @@ -2747,7 +2759,8 @@
[009]  ...
[009]  receiving_socket:close()
[009]  ---
[009] -- true
[009] +- error: '[string "return receiving_socket:close() "]:1: attempt to index global ''receiving_socket''
[009] +    (a nil value)'
[009]  ...
[009]  sending_socket:close()
[009]  ---
[009] 
=========================================================================

Sometimes I get this:

=========================================================================
[008] app/socket.test.lua                                             [ fail ]
[008] 
[008] Test failed! Result content mismatch:
[008] --- app/socket.result	Sun Dec  8 15:40:58 2019
[008] +++ app/socket.reject	Sun Dec  8 16:11:27 2019
[008] @@ -2340,23 +2340,24 @@
[008]  ...
[008]  received_message == '' -- expected true
[008]  ---
[008] -- true
[008] +- false
[008]  ...
[008]  received_message
[008]  ---
[008] -- 
[008] +- null
[008]  ...
[008]  from ~= nil -- expected true
[008]  ---
[008] -- true
[008] +- false
[008]  ...
[008]  from.host == '127.0.0.1' -- expected true
[008]  ---
[008] -- true
[008] +- error: '[string "return from.host == ''127.0.0.1'' -- expected t..."]:1: attempt
[008] +    to index global ''from'' (a nil value)'
[008]  ...
[008]  e == 0 -- expected true
[008]  ---
[008] -- true
[008] +- false
[008]  ...
[008]  -- case: recv, no datagram, explicit size
[008]  received_message = receiving_socket:recv(512)
[008] @@ -2386,23 +2387,25 @@
[008]  ...
[008]  received_message == nil -- expected true
[008]  ---
[008] -- true
[008] +- false
[008]  ...
[008]  received_message
[008]  ---
[008] -- null
[008] +- 
[008]  ...
[008]  from == nil -- expected true
[008]  ---
[008] -- true
[008] +- false
[008]  ...
[008]  from
[008]  ---
[008] -- null
[008] +- host: 127.0.0.1
[008] +  family: AF_INET
[008] +  port: 59867
[008]  ...
[008]  e == errno.EAGAIN -- expected true
[008]  ---
[008] -- true
[008] +- false
[008]  ...
[008]  -- case: recv, zero datagram, explicit size
[008]  sendto_zero(sending_socket, '127.0.0.1', receiving_socket_port)
=========================================================================

So socket.test.lua still is unstable. The command with 64
socket.test.lua gives me an error in 100% cases.

> Using fiber.time() as random seed leads to the same results produced
> on parallel tasks. Using instance uuid as a seed seems to be a good
> decision.

I thought that it is strange, because UUIDs are also generated
randomly, and by the same random generator as is used in Lua.
But now I see why math.random() gives you the same numbers
even though UUIDs are different: Lua math.random() has own
random number generator. Not from the standard library.

> 5. Well, considering the new seed generation, 10s seem to be enough.

So you could skip this strange extraction of a number from
UUID, but increase the timeout? I vote for it. Because the
failing example with 64 instances of socket.test.lua is very
artificial, and even then it will work in case the timeout
is big enough, right? UUID parsing adds unnecessary complexity
and raises questions like I asked.

> 6. If we close socket in the main fiber, we might attempt to use closed
> socket in background fiber.

And what? We don't care anymore for the background
fiber after the socket is not needed anymore. Ok, it will
stumble into errors in that fiber, but how does it affect
further execution of the test?

> 
> 7. Of course it shouldn't be nil and test will fail in this case. We
> just want to be sure that test won't hang even if server is still nil.
> Otherwise we will hang on chan:put(), which is not nice.

If it will always be nil in case of an error, we won't notice the
problem. Please, drop it. Increase a timeout somewhere if necessary,
but consideration of a test passed, if in fact it didn't, is worse.


More information about the Tarantool-patches mailing list