From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp38.i.mail.ru (smtp38.i.mail.ru [94.100.177.98]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 0737B46970E for ; Wed, 25 Dec 2019 04:29:39 +0300 (MSK) Date: Wed, 25 Dec 2019 04:29:35 +0300 From: Alexander Turenko Message-ID: <20191225012935.psngzpcp7zasjvr3@tkn_work_nb> References: <20191223204833.9192-1-i.kosarev@tarantool.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20191223204833.9192-1-i.kosarev@tarantool.org> Subject: Re: [Tarantool-patches] [PATCH v4] test: fix flaky socket test List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Ilya Kosarev Cc: tarantool-patches@dev.tarantool.org, v.shpilevoy@tarantool.org Thanks for the patch! I really appreciate the work in direction toward stable tests. Unstable CI sometimes hides real problems, so actually it is direction to make tarantool itself more stable. All changes look correct. There are places however that we can polish a bit. Please, consider my comments below. How I tested it: $ (cd test && ./test-run.py -j 30 --force $(yes app/socket | head -n 1000)) Observed the following fails: Case 1. [008] Test failed! Result content mismatch: [008] --- app/socket.result Tue Dec 24 15:55:40 2019 [008] +++ app/socket.reject Tue Dec 24 15:56:24 2019 [008] @@ -1371,7 +1371,7 @@ [008] ... [008] cnt [008] --- [008] -- 2 [008] +- 3 [008] ... [008] client:write('hi') [008] --- Another instance is connected to our server. I found that it is socket.connect() from 'case: socket receive inconsistent behavior' (and this is the reason of case 3 below). The code the the following: | srv = socket.tcp_server('0.0.0.0', 0, fn) | s = socket.connect('localhost', srv:name().port) localhost may be resolved into ::1 (ipv6) and 127.0.0.1 (ipv4). Linux allows to bind the same port number on different addresses. getaddrinfo() on Linux give ipv6 addr/port first (usually, depends on settings) and if another process listens on ::1 on the same port as we bound on ipv4, we'll connect to this process. The change that eliminates the fail above for me: @@ -979,7 +983,7 @@ chan = fiber.channel() counter = 0 fn = function(s) counter = 0; while true do s:write((tostring(counter)):rep(chan:get())); counter = counter + 1 end end srv = socket.tcp_server('0.0.0.0', 0, fn) -s = socket.connect('localhost', srv:name().port) +s = socket.connect('127.0.0.1', srv:name().port) chan:put(5) chan:put(5) s:receive(5) (CCed Maria as author of the test case.) However it is possible that there are other similar cases. I suggest to verify the test against them. Please, beware of changing negative cases (when we expect that a timeout will occur): setting large timeout on them would significantly increase the test run time (we'll always wait 100 seconds at the point). BTW, why not to listen on 127.0.0.1 rather then on all interfaces (0.0.0.0)? BTW, unix sockets may help to don't bother with clashes of different kinds. ---- [018] --- app/socket.result Tue Dec 24 15:58:20 2019 [018] +++ app/socket.reject Tue Dec 24 15:59:29 2019 [018] @@ -2822,7 +2822,7 @@ [018] ... [018] client:read(1, 5) == '' [018] --- [018] -- true [018] +- false [018] ... [018] server:close() [018] --- I printed client_.errno. It is 110 (ETIMEDOUT). It seems it worth to replace 5 with WAIT_COND_TIME (or, as TIMEOUT as I proposed to name it below). Maybe there are other such places. ---- Case 3. Test hung! Result content mismatch: --- app/socket.result Wed Dec 25 01:04:26 2019 +++ var/003_app/socket.result Wed Dec 25 01:05:22 2019 @@ -2857,34 +2857,3 @@ --- ... chan:put(5) ---- -- true -... -chan:put(5) ---- -- true -... -s:receive(5) ---- -- '00000' -... -chan:put(5) ---- -- true -... -s:settimeout(1) ---- -- 1 -... -s:receive('*a') ---- -- '1111122222' -... -s:close() ---- -- 1 -... -srv:close() ---- -- true -... Explained above (see case 1). ---- [017] --- app/socket.result Wed Dec 25 01:18:29 2019 [017] +++ app/socket.reject Wed Dec 25 01:19:30 2019 [017] @@ -40,7 +40,7 @@ [017] ... [017] test_run:cmd("push filter '(error: .builtin/.*[.]lua):[0-9]+' to '\\1'") [017] --- [017] -- true [017] +- Hello, world It seems we connected to some of our socket server rather then test-run. Looks like the same ipv4 / ipv6 problem. After the following patch for test-run I don't see it anymore (don't sure I made enough runs, however): diff --git a/lib/test_suite.py b/lib/test_suite.py index 84b99f4..0a137c3 100644 --- a/lib/test_suite.py +++ b/lib/test_suite.py @@ -197,7 +197,7 @@ class TestSuite: def start_server(self, server): # create inspector daemon for cluster tests inspector = TarantoolInspector( - 'localhost', server.inspector_port + '127.0.0.1', server.inspector_port ) inspector.start() # fixme: remove this string if we fix all legacy tests diff --git a/test_run.lua b/test_run.lua index e37b559..fc1b833 100644 --- a/test_run.lua +++ b/test_run.lua @@ -458,7 +458,7 @@ local inspector_methods = { local function inspector_new(host, port) local inspector = {} - inspector.host = host or 'localhost' + inspector.host = host or '127.0.0.1' inspector.port = port or tonumber(os.getenv('INSPECTOR')) if inspector.port == nil then error('Inspector not started') ---- On Mon, Dec 23, 2019 at 11:48:33PM +0300, Ilya Kosarev wrote: > socket.test had a number of flaky problems: > - socket readiness expectation > - race conditions on socket shutdown in emulation test cases > - UDP datagrams losses on mac os > - excessive random port searches > Now they are solved. Socket test is not fragile anymore. > > Closes #4451, #4426, #4469 > --- > Branch: https://github.com/tarantool/tarantool/tree/i.kosarev/gh-4426-4451-fix-socket-test > Issues: https://github.com/tarantool/tarantool/issues/4426 > https://github.com/tarantool/tarantool/issues/4451 > https://github.com/tarantool/tarantool/issues/4469 > > Changes in v2: > - reconsidered socket readiness expectation > - reduced conditions waiting time > > Changes in v3: > - reconsidered expectations to unify them > - simplified randomization > > Changes in v4: > - left infinite timeouts alone > - wrapped UDP datagrams awaiting with wait_cond That's my bad. As I see, other parts of the test follows the following pattern for waiting UDP packets: | s:readable(TIMEOUT) | s:recv() test_run:wait_cond() is correct solution too, but :readable() looks a bit simpler. Anyway, let's use one that you like more. (It is the only change I propose, which I didn't verify.) > - replaced manual port randomization with zero bind Nice idea! > - removed extra wait_cond wrappers > > test/app/socket.result | 104 ++++++++++++++++++++++----------------- > test/app/socket.test.lua | 94 ++++++++++++++++++++--------------- > test/app/suite.ini | 1 - > 3 files changed, 114 insertions(+), 85 deletions(-) > diff --git a/test/app/socket.test.lua b/test/app/socket.test.lua > index c72d41763f4..4ed6c0f25f7 100644 > --- a/test/app/socket.test.lua > +++ b/test/app/socket.test.lua > @@ -13,7 +13,7 @@ env = require('test_run') > test_run = env.new() > test_run:cmd("push filter '(error: .builtin/.*[.]lua):[0-9]+' to '\\1'") > > -WAIT_COND_TIME = 10 > +WAIT_COND_TIME = 100 Nit: I would name it just TIMEOUT, since it is not only for test_run:wait_cond() now. > @@ -619,8 +608,14 @@ s:settimeout(100500) > rch, wch = fiber.channel(1), fiber.channel(1) > sc = socket.connect(host, port) > test_run:cmd("setopt delimiter ';'") > +socket_opened = true > cfiber = fiber.create(function(sc, rch, wch) > - while sc:send(wch:get()) and rch:put(sc:receive("*l")) do end > + while socket_opened do > + sc:send(wch:get()) > + local data = sc:receive("*l") > + if not socket_opened then sc:close() end > + rch:put(data) > + end I would use existing channel to signal the fiber to close the socket: it looks simpler then using a channel and a variable both. Proposed diff: diff --git a/test/app/socket.test.lua b/test/app/socket.test.lua index 4ed6c0f25..0fe22709a 100644 --- a/test/app/socket.test.lua +++ b/test/app/socket.test.lua @@ -608,13 +608,16 @@ s:settimeout(100500) rch, wch = fiber.channel(1), fiber.channel(1) sc = socket.connect(host, port) test_run:cmd("setopt delimiter ';'") -socket_opened = true cfiber = fiber.create(function(sc, rch, wch) - while socket_opened do - sc:send(wch:get()) - local data = sc:receive("*l") - if not socket_opened then sc:close() end - rch:put(data) + while true do + local outgoing_data = wch:get() + if outgoing_data == nil then + sc:close() + break + end + sc:send(outgoing_data) + local incoming_data = sc:receive("*l") + rch:put(incoming_data) end end, sc, rch, wch); test_run:cmd("setopt delimiter ''"); @@ -646,9 +649,10 @@ rch:get() wch:put("DATA\n") c:receive(4) c:receive("*l") -socket_opened = false +rch:get() wch:put("Fu") c:send("354 Please type your message\n") +wch:put(nil) -- EOF c:receive("*l", "Line: ") c:receive() c:receive(10) > end, sc, rch, wch); > test_run:cmd("setopt delimiter ''"); > > @@ -651,9 +646,9 @@ rch:get() > wch:put("DATA\n") > c:receive(4) > c:receive("*l") > +socket_opened = false > wch:put("Fu") > c:send("354 Please type your message\n") > -sc:close() > c:receive("*l", "Line: ") > c:receive() > c:receive(10) > @@ -778,7 +773,9 @@ e == errno.EAGAIN -- expected true > > -- case: recv, zero datagram > sendto_zero(sending_socket, '127.0.0.1', receiving_socket_port) > -received_message = receiving_socket:recv() > +received_message = test_run:wait_cond(function() \ > + return receiving_socket:recv() \ > +end, WAIT_COND_TIME) Proposed to use :readable() (see above). > --- case: sicket receive inconsistent behavior > +-- case: socket receive inconsistent behavior > chan = fiber.channel() > counter = 0 > fn = function(s) counter = 0; while true do s:write((tostring(counter)):rep(chan:get())); counter = counter + 1 end end > -srv = socket.tcp_server('0.0.0.0', 8888, fn) > -s = socket.connect('localhost', 8888) > +srv = socket.tcp_server('0.0.0.0', 0, fn) > +s = socket.connect('localhost', srv:name().port) Proposed to connect to 127.0.0.1 (see above). > chan:put(5) > chan:put(5) > s:receive(5)