From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp62.i.mail.ru (smtp62.i.mail.ru [217.69.128.42]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id E8BF446971A for ; Sun, 8 Dec 2019 18:52:31 +0300 (MSK) References: <20191204134320.27760-1-i.kosarev@tarantool.org> <1575648087.956039140@f128.i.mail.ru> From: Vladislav Shpilevoy Message-ID: <075c5646-976c-b673-2639-634333bf1eaf@tarantool.org> Date: Sun, 8 Dec 2019 16:52:28 +0100 MIME-Version: 1.0 In-Reply-To: <1575648087.956039140@f128.i.mail.ru> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Subject: Re: [Tarantool-patches] [PATCH] test: fix flaky socket test List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Ilya Kosarev Cc: tarantool-patches@dev.tarantool.org Hi! Thanks for the patch! Please, keep my comments near your answers. Because I don't remember all my comments, and for each your answer I need to lookup my original email to remember, what was the problem. This is usually simpler both to you and a reviewer, when your answers are inlined into the original email. Also don't drop the comments. Here is my comment 2: >> 2. I understand, why it is 'connect time' in the previous >> hunk. But here the socket is already connected. So it is >> not connect time. >> >> On the other hand, there are places right after connect, >> which don't use this timeout. For example, line 457 >> in socket.result file: >> >> sc:sysconnect('127.0.0.1', s:name().port) or errno() == errno.EINPROGRESS >> --- >> - true >> ... >> sc:writable(10) >> --- >> - true >> ... >> >> Here it is 10 seconds, while in other places you started >> using 240. And it is still actual. Why is it called 'CONNECT_TIME' if it has nothing to do with connect in most of the places? The problem is in the name. I would introduce one timeout for conds, connects, readable, etc. You even already have it - COND_TIME. On 06/12/2019 17:01, Ilya Kosarev wrote: > Hi! > > Thanks for your review. > > 4. I run socket.test to simulate "high load" in parallel with itself > (while ./test-run.py -j20 `for r in {1..64} ; do echo socket ; done` 2>/dev/null ; do l=$(($l+1)) ; echo ======== $l ============= ; done). I tried this example on your branch and got that: ========================================================================= [009] --- app/socket.result Sun Dec 8 15:40:58 2019 [009] +++ app/socket.reject Sun Dec 8 16:09:23 2019 [009] @@ -2670,13 +2670,17 @@ [009] -- case: recvfrom reads first 512 bytes from the message with tcp [009] received_message, from = receiving_socket:recvfrom() [009] --- [009] +- error: '[string "received_message, from = receiving_socket:rec..."]:1: attempt to [009] + index global ''receiving_socket'' (a nil value)' [009] ... [009] e = receiving_socket:errno() [009] --- [009] +- error: '[string "e = receiving_socket:errno() "]:1: attempt to index global ''receiving_socket'' [009] + (a nil value)' [009] ... [009] received_message == message:sub(1, 512) -- expected true [009] --- [009] -- true [009] +- false [009] ... [009] received_message:len() == 512 -- expected true [009] --- [009] @@ -2684,7 +2688,7 @@ [009] ... [009] received_message [009] --- [009] -- xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx [009] +- yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy [009] ... [009] received_message:len() [009] --- [009] @@ -2692,11 +2696,13 @@ [009] ... [009] from == nil -- expected true [009] --- [009] -- true [009] +- false [009] ... [009] from [009] --- [009] -- null [009] +- host: 127.0.0.1 [009] + family: AF_INET [009] + port: 57721 [009] ... [009] e == 0 -- expected true [009] --- [009] @@ -2709,33 +2715,39 @@ [009] -- case: recvfrom does not discard the message tail with tcp [009] received_message, from = receiving_socket:recvfrom() [009] --- [009] +- error: '[string "received_message, from = receiving_socket:rec..."]:1: attempt to [009] + index global ''receiving_socket'' (a nil value)' [009] ... [009] e = receiving_socket:errno() [009] --- [009] +- error: '[string "e = receiving_socket:errno() "]:1: attempt to index global ''receiving_socket'' [009] + (a nil value)' [009] ... [009] received_message == message:sub(513, 513) -- expected true [009] --- [009] -- true [009] +- false [009] ... [009] received_message:len() == 1 -- expected true [009] --- [009] -- true [009] +- false [009] ... [009] received_message [009] --- [009] -- x [009] +- yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy [009] ... [009] received_message:len() [009] --- [009] -- 1 [009] +- 512 [009] ... [009] from == nil -- expected true [009] --- [009] -- true [009] +- false [009] ... [009] from [009] --- [009] -- null [009] +- host: 127.0.0.1 [009] + family: AF_INET [009] + port: 57721 [009] ... [009] e == 0 -- expected true [009] --- [009] @@ -2747,7 +2759,8 @@ [009] ... [009] receiving_socket:close() [009] --- [009] -- true [009] +- error: '[string "return receiving_socket:close() "]:1: attempt to index global ''receiving_socket'' [009] + (a nil value)' [009] ... [009] sending_socket:close() [009] --- [009] ========================================================================= Sometimes I get this: ========================================================================= [008] app/socket.test.lua [ fail ] [008] [008] Test failed! Result content mismatch: [008] --- app/socket.result Sun Dec 8 15:40:58 2019 [008] +++ app/socket.reject Sun Dec 8 16:11:27 2019 [008] @@ -2340,23 +2340,24 @@ [008] ... [008] received_message == '' -- expected true [008] --- [008] -- true [008] +- false [008] ... [008] received_message [008] --- [008] -- [008] +- null [008] ... [008] from ~= nil -- expected true [008] --- [008] -- true [008] +- false [008] ... [008] from.host == '127.0.0.1' -- expected true [008] --- [008] -- true [008] +- error: '[string "return from.host == ''127.0.0.1'' -- expected t..."]:1: attempt [008] + to index global ''from'' (a nil value)' [008] ... [008] e == 0 -- expected true [008] --- [008] -- true [008] +- false [008] ... [008] -- case: recv, no datagram, explicit size [008] received_message = receiving_socket:recv(512) [008] @@ -2386,23 +2387,25 @@ [008] ... [008] received_message == nil -- expected true [008] --- [008] -- true [008] +- false [008] ... [008] received_message [008] --- [008] -- null [008] +- [008] ... [008] from == nil -- expected true [008] --- [008] -- true [008] +- false [008] ... [008] from [008] --- [008] -- null [008] +- host: 127.0.0.1 [008] + family: AF_INET [008] + port: 59867 [008] ... [008] e == errno.EAGAIN -- expected true [008] --- [008] -- true [008] +- false [008] ... [008] -- case: recv, zero datagram, explicit size [008] sendto_zero(sending_socket, '127.0.0.1', receiving_socket_port) ========================================================================= So socket.test.lua still is unstable. The command with 64 socket.test.lua gives me an error in 100% cases. > Using fiber.time() as random seed leads to the same results produced > on parallel tasks. Using instance uuid as a seed seems to be a good > decision. I thought that it is strange, because UUIDs are also generated randomly, and by the same random generator as is used in Lua. But now I see why math.random() gives you the same numbers even though UUIDs are different: Lua math.random() has own random number generator. Not from the standard library. > 5. Well, considering the new seed generation, 10s seem to be enough. So you could skip this strange extraction of a number from UUID, but increase the timeout? I vote for it. Because the failing example with 64 instances of socket.test.lua is very artificial, and even then it will work in case the timeout is big enough, right? UUID parsing adds unnecessary complexity and raises questions like I asked. > 6. If we close socket in the main fiber, we might attempt to use closed > socket in background fiber. And what? We don't care anymore for the background fiber after the socket is not needed anymore. Ok, it will stumble into errors in that fiber, but how does it affect further execution of the test? > > 7. Of course it shouldn't be nil and test will fail in this case. We > just want to be sure that test won't hang even if server is still nil. > Otherwise we will hang on chan:put(), which is not nice. If it will always be nil in case of an error, we won't notice the problem. Please, drop it. Increase a timeout somewhere if necessary, but consideration of a test passed, if in fact it didn't, is worse.