Hi! Thanks for your review! Sent v5 including all proposed enhancements. >Среда, 25 декабря 2019, 4:29 +03:00 от Alexander Turenko : > >Thanks for the patch! > >I really appreciate the work in direction toward stable tests. Unstable >CI sometimes hides real problems, so actually it is direction to make >tarantool itself more stable. > >All changes look correct. There are places however that we can polish a >bit. Please, consider my comments below. > >How I tested it: > >$ (cd test && ./test-run.py -j 30 --force $(yes app/socket | head -n 1000)) > >Observed the following fails: > >Case 1. > >[008] Test failed! Result content mismatch: >[008] --- app/socket.result Tue Dec 24 15:55:40 2019 >[008] +++ app/socket.reject Tue Dec 24 15:56:24 2019 >[008] @@ -1371,7 +1371,7 @@ >[008] ... >[008] cnt >[008] --- >[008] -- 2 >[008] +- 3 >[008] ... >[008] client:write('hi') >[008] --- > >Another instance is connected to our server. I found that it is >socket.connect() from 'case: socket receive inconsistent behavior' (and >this is the reason of case 3 below). The code the the following: > > | srv = socket.tcp_server('0.0.0.0', 0, fn) > | s = socket.connect('localhost', srv:name().port) > >localhost may be resolved into ::1 (ipv6) and 127.0.0.1 (ipv4). Linux >allows to bind the same port number on different addresses. >getaddrinfo() on Linux give ipv6 addr/port first (usually, depends >on settings) and if another process listens on ::1 on the same port as >we bound on ipv4, we'll connect to this process. > >The change that eliminates the fail above for me: > >@@ -979,7 +983,7 @@ chan = fiber.channel() > counter = 0 > fn = function(s) counter = 0; while true do s:write((tostring(counter)):rep(chan:get())); counter = counter + 1 end end > srv = socket.tcp_server('0.0.0.0', 0, fn) >-s = socket.connect('localhost', srv:name().port) >+s = socket.connect('127.0.0.1', srv:name().port) > chan:put(5) > chan:put(5) > s:receive(5) > >(CCed Maria as author of the test case.) > >However it is possible that there are other similar cases. I suggest to >verify the test against them. Please, beware of changing negative cases >(when we expect that a timeout will occur): setting large timeout on >them would significantly increase the test run time (we'll always wait >100 seconds at the point). > >BTW, why not to listen on 127.0.0.1 rather then on all interfaces >(0.0.0.0)? > >BTW, unix sockets may help to don't bother with clashes of different >kinds. > >---- > >[018] --- app/socket.result Tue Dec 24 15:58:20 2019 >[018] +++ app/socket.reject Tue Dec 24 15:59:29 2019 >[018] @@ -2822,7 +2822,7 @@ >[018] ... >[018] client:read(1, 5) == '' >[018] --- >[018] -- true >[018] +- false >[018] ... >[018] server:close() >[018] --- > >I printed client_.errno. It is 110 (ETIMEDOUT). It seems it worth to >replace 5 with WAIT_COND_TIME (or, as TIMEOUT as I proposed to name it >below). Maybe there are other such places. > >---- > >Case 3. > >Test hung! Result content mismatch: >--- app/socket.result Wed Dec 25 01:04:26 2019 >+++ var/003_app/socket.result Wed Dec 25 01:05:22 2019 >@@ -2857,34 +2857,3 @@ > --- > ... > chan:put(5) >---- >-- true >-... >-chan:put(5) >---- >-- true >-... >-s:receive(5) >---- >-- '00000' >-... >-chan:put(5) >---- >-- true >-... >-s:settimeout(1) >---- >-- 1 >-... >-s:receive('*a') >---- >-- '1111122222' >-... >-s:close() >---- >-- 1 >-... >-srv:close() >---- >-- true >-... > >Explained above (see case 1). > >---- > >[017] --- app/socket.result Wed Dec 25 01:18:29 2019 >[017] +++ app/socket.reject Wed Dec 25 01:19:30 2019 >[017] @@ -40,7 +40,7 @@ >[017] ... >[017] test_run:cmd("push filter '(error: .builtin/.*[.]lua):[0-9]+' to '\\1'") >[017] --- >[017] -- true >[017] +- Hello, world > >It seems we connected to some of our socket server rather then test-run. >Looks like the same ipv4 / ipv6 problem. > >After the following patch for test-run I don't see it anymore (don't >sure I made enough runs, however): > >diff --git a/lib/test_suite.py b/lib/test_suite.py >index 84b99f4..0a137c3 100644 >--- a/lib/test_suite.py >+++ b/lib/test_suite.py >@@ -197,7 +197,7 @@ class TestSuite: >     def start_server(self, server): >         # create inspector daemon for cluster tests >         inspector = TarantoolInspector( >- 'localhost', server.inspector_port >+ '127.0.0.1', server.inspector_port >         ) >         inspector.start() >         # fixme: remove this string if we fix all legacy tests >diff --git a/test_run.lua b/test_run.lua >index e37b559..fc1b833 100644 >--- a/test_run.lua >+++ b/test_run.lua >@@ -458,7 +458,7 @@ local inspector_methods = { > local function inspector_new(host, port) >     local inspector = {} >  >- inspector.host = host or 'localhost' >+ inspector.host = host or '127.0.0.1' >     inspector.port = port or tonumber(os.getenv('INSPECTOR')) >     if inspector.port == nil then >         error('Inspector not started') > >---- > >On Mon, Dec 23, 2019 at 11:48:33PM +0300, Ilya Kosarev wrote: >> socket.test had a number of flaky problems: >> - socket readiness expectation >> - race conditions on socket shutdown in emulation test cases >> - UDP datagrams losses on mac os >> - excessive random port searches >> Now they are solved. Socket test is not fragile anymore. >> >> Closes #4451, #4426, #4469 >> --- >> Branch: https://github.com/tarantool/tarantool/tree/i.kosarev/gh-4426-4451-fix-socket-test >> Issues: https://github.com/tarantool/tarantool/issues/4426 >> https://github.com/tarantool/tarantool/issues/4451 >> https://github.com/tarantool/tarantool/issues/4469 >> >> Changes in v2: >> - reconsidered socket readiness expectation >> - reduced conditions waiting time >> >> Changes in v3: >> - reconsidered expectations to unify them >> - simplified randomization >> >> Changes in v4: >> - left infinite timeouts alone >> - wrapped UDP datagrams awaiting with wait_cond > >That's my bad. As I see, other parts of the test follows the following >pattern for waiting UDP packets: > > | s:readable(TIMEOUT) > | s:recv() > >test_run:wait_cond() is correct solution too, but :readable() looks a >bit simpler. Anyway, let's use one that you like more. > >(It is the only change I propose, which I didn't verify.) > >> - replaced manual port randomization with zero bind > >Nice idea! > >> - removed extra wait_cond wrappers >> >> test/app/socket.result | 104 ++++++++++++++++++++++----------------- >> test/app/socket.test.lua | 94 ++++++++++++++++++++--------------- >> test/app/suite.ini | 1 - >> 3 files changed, 114 insertions(+), 85 deletions(-) > >> diff --git a/test/app/socket.test.lua b/test/app/socket.test.lua >> index c72d41763f4..4ed6c0f25f7 100644 >> --- a/test/app/socket.test.lua >> +++ b/test/app/socket.test.lua >> @@ -13,7 +13,7 @@ env = require('test_run') >> test_run = env.new() >> test_run:cmd("push filter '(error: .builtin/.*[.]lua):[0-9]+' to '\\1'") >> >> -WAIT_COND_TIME = 10 >> +WAIT_COND_TIME = 100 > >Nit: I would name it just TIMEOUT, since it is not only for >test_run:wait_cond() now. > >> @@ -619,8 +608,14 @@ s:settimeout(100500) >> rch, wch = fiber.channel(1), fiber.channel(1) >> sc = socket.connect(host, port) >> test_run:cmd("setopt delimiter ';'") >> +socket_opened = true >> cfiber = fiber.create(function(sc, rch, wch) >> - while sc:send(wch:get()) and rch:put(sc:receive("*l")) do end >> + while socket_opened do >> + sc:send(wch:get()) >> + local data = sc:receive("*l") >> + if not socket_opened then sc:close() end >> + rch:put(data) >> + end > >I would use existing channel to signal the fiber to close the socket: it >looks simpler then using a channel and a variable both. Proposed diff: > >diff --git a/test/app/socket.test.lua b/test/app/socket.test.lua >index 4ed6c0f25..0fe22709a 100644 >--- a/test/app/socket.test.lua >+++ b/test/app/socket.test.lua >@@ -608,13 +608,16 @@ s:settimeout(100500) > rch, wch = fiber.channel(1), fiber.channel(1) > sc = socket.connect(host, port) > test_run:cmd("setopt delimiter ';'") >-socket_opened = true > cfiber = fiber.create(function(sc, rch, wch) >- while socket_opened do >- sc:send(wch:get()) >- local data = sc:receive("*l") >- if not socket_opened then sc:close() end >- rch:put(data) >+ while true do >+ local outgoing_data = wch:get() >+ if outgoing_data == nil then >+ sc:close() >+ break >+ end >+ sc:send(outgoing_data) >+ local incoming_data = sc:receive("*l") >+ rch:put(incoming_data) >     end > end, sc, rch, wch); > test_run:cmd("setopt delimiter ''"); >@@ -646,9 +649,10 @@ rch:get() > wch:put("DATA\n") > c:receive(4) > c:receive("*l") >-socket_opened = false >+rch:get() > wch:put("Fu") > c:send("354 Please type your message\n") >+wch:put(nil) -- EOF > c:receive("*l", "Line: ") > c:receive() > c:receive(10) > >> end, sc, rch, wch); >> test_run:cmd("setopt delimiter ''"); >> >> @@ -651,9 +646,9 @@ rch:get() >> wch:put("DATA\n") >> c:receive(4) >> c:receive("*l") >> +socket_opened = false >> wch:put("Fu") >> c:send("354 Please type your message\n") >> -sc:close() >> c:receive("*l", "Line: ") >> c:receive() >> c:receive(10) > >> @@ -778,7 +773,9 @@ e == errno.EAGAIN -- expected true >> >> -- case: recv, zero datagram >> sendto_zero(sending_socket, '127.0.0.1', receiving_socket_port) >> -received_message = receiving_socket:recv() >> +received_message = test_run:wait_cond(function() \ >> + return receiving_socket:recv() \ >> +end, WAIT_COND_TIME) > >Proposed to use :readable() (see above). > >> --- case: sicket receive inconsistent behavior >> +-- case: socket receive inconsistent behavior >> chan = fiber.channel() >> counter = 0 >> fn = function(s) counter = 0; while true do s:write((tostring(counter)):rep(chan:get())); counter = counter + 1 end end >> -srv = socket.tcp_server('0.0.0.0', 8888, fn) >> -s = socket.connect('localhost', 8888) >> +srv = socket.tcp_server('0.0.0.0', 0, fn) >> +s = socket.connect('localhost', srv:name().port) > >Proposed to connect to 127.0.0.1 (see above). > >> chan:put(5) >> chan:put(5) >> s:receive(5) -- Ilya Kosarev