From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtpng2.m.smailru.net (smtpng2.m.smailru.net [94.100.179.3]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id B1C4446971A for ; Fri, 6 Dec 2019 00:31:19 +0300 (MSK) References: <20191204134320.27760-1-i.kosarev@tarantool.org> From: Vladislav Shpilevoy Message-ID: Date: Thu, 5 Dec 2019 22:31:17 +0100 MIME-Version: 1.0 In-Reply-To: <20191204134320.27760-1-i.kosarev@tarantool.org> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Subject: Re: [Tarantool-patches] [PATCH] test: fix flaky socket test List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Ilya Kosarev , tarantool-patches@dev.tarantool.org Hi! Thanks for the patch! See 7 comments below. On 04/12/2019 14:43, Ilya Kosarev wrote: > socket.test had a number of flaky problems: > - socket readableness expectation > - race conditions on socket shutdown in emulation test cases > - tcp_server stability in socket receive inconsistent behavior case > Now they are solved. Port randomization is improved. > Socket test is not fragile anymore. > > Closes #4451, #4426, #4469 > --- > Branch: https://github.com/tarantool/tarantool/tree/i.kosarev/gh-4426-4451-fix-socket-test > Issues: https://github.com/tarantool/tarantool/issues/4426 > https://github.com/tarantool/tarantool/issues/4451 > https://github.com/tarantool/tarantool/issues/4469 > > test/app/socket.result | 84 +++++++++++++++++++++++++--------------- > test/app/socket.test.lua | 60 +++++++++++++++++++--------- > test/app/suite.ini | 1 - > 3 files changed, 94 insertions(+), 51 deletions(-) > > diff --git a/test/app/socket.result b/test/app/socket.result > index fd299424c9..42dde8f375 100644 > --- a/test/app/socket.result > +++ b/test/app/socket.result > @@ -45,6 +45,9 @@ test_run:cmd("push filter '(error: .builtin/.*[.]lua):[0-9]+' to '\\1'") > WAIT_COND_TIME = 10 1. This is now unused. > --- > ... > +WAIT_TCP_CONNECT_TIME = 240 > +--- > +... > socket('PF_INET', 'SOCK_STREAM', 'tcp121222'); > --- > - null > @@ -107,7 +110,7 @@ s:nonblock(true) > --- > - true > ... > -s:readable(.1) > +s:readable(WAIT_TCP_CONNECT_TIME) > --- > - true > ... > @@ -227,7 +230,7 @@ s:syswrite(ffi.cast('const char *', ping), #ping) > --- > - 6 > ... > -s:readable(1) > +s:readable(WAIT_TCP_CONNECT_TIME) 2. I understand, why it is 'connect time' in the previous hunk. But here the socket is already connected. So it is not connect time. On the other hand, there are places right after connect, which don't use this timeout. For example, line 457 in socket.result file: sc:sysconnect('127.0.0.1', s:name().port) or errno() == errno.EINPROGRESS --- - true ... sc:writable(10) --- - true ... Here it is 10 seconds, while in other places you started using 240. > --- > - true > ... > @@ -830,7 +833,7 @@ sc:sendto('127.0.0.1', s:name().port, 'Hello, world') > --- > - 12 > ... > -s:readable(10) > +s:readable(WAIT_TCP_CONNECT_TIME) > --- > - true > ... 3. There is no connect time. This is DGRAM socket with UDP. No connections. The same below. Moreover, there are now 2 waits for 'connect time' on the same socket. > @@ -842,7 +845,7 @@ sc:sendto('127.0.0.1', s:name().port, 'Hello, world, 2') > --- > - 15 > ... > -s:readable(10) > +s:readable(WAIT_TCP_CONNECT_TIME) > --- > - true > ... > @@ -898,7 +901,7 @@ sc:sendto('127.0.0.1', s:name().port, 'Hello, World!') > --- > - 13 > ... > -s:readable(1) > +s:readable(WAIT_TCP_CONNECT_TIME) > --- > - true > ... > @@ -1074,6 +1077,15 @@ master:setsockopt('SOL_SOCKET', 'SO_REUSEADDR', true) > --- > - true > ... > +seed = '' > +--- > +... > +for d in string.gmatch(box.info.uuid, '%d') do seed = seed .. d end 4. Well, this is probably the most intricate way to get a seed. Have you considered fiber.time()? And why do you need it? Isn't it initialized already? > +--- > +... > +math.randomseed(tonumber(seed)) > +--- > +... > port = 32768 + math.random(0, 32767) > --- > ... > @@ -1092,7 +1104,7 @@ test_run:wait_cond(function() > return false, master:error() > end > return true > -end, WAIT_COND_TIME); > +end, 100); 5. Why? 10 seconds was not enough to try a few random ports? > --- > - true > ... > @@ -1822,8 +1834,14 @@ test_run:cmd("setopt delimiter ';'") > --- > - true > ... > +socket_opened = true > cfiber = fiber.create(function(sc, rch, wch) > - while sc:send(wch:get()) and rch:put(sc:receive("*l")) do end > + while socket_opened do > + sc:send(wch:get()) > + local data = sc:receive("*l") > + if not socket_opened then sc:close() end > + rch:put(data) > + end > end, sc, rch, wch); 6. What was a problem here? In the commit message you said 'race conditions'. Where? You just moved socket close to the background fiber from the main fiber. > --- > ... > @@ -1936,6 +1954,9 @@ c:receive("*l") > --- > - > ... > +socket_opened = false > +--- > +... > wch:put("Fu") > --- > - true > @@ -1944,10 +1965,6 @@ c:send("354 Please type your message\n") > --- > - 29 > ... > -sc:close() > ---- > -- 1 > -... > c:receive("*l", "Line: ") > --- > - null > @@ -2816,7 +2833,7 @@ test_run:cmd("clear filter") > --- > - true > ... > --- case: sicket receive inconsistent behavior > +-- case: socket receive inconsistent behavior > chan = fiber.channel() > --- > ... > @@ -2826,41 +2843,46 @@ counter = 0 > fn = function(s) counter = 0; while true do s:write((tostring(counter)):rep(chan:get())); counter = counter + 1 end end > --- > ... > -srv = socket.tcp_server('0.0.0.0', 8888, fn) > +srv = nil > --- > ... > -s = socket.connect('localhost', 8888) > +test_run:cmd("setopt delimiter ';'") > --- > +- true > ... > -chan:put(5) > +test_run:wait_cond(function() > + port = 32768 + math.random(0, 32767) > + srv = socket.tcp_server('0.0.0.0', port, fn) > + return srv ~= nil > +end, 100); > --- > - true > ... > -chan:put(5) > +receive1 = nil; receive2 = nil; > --- > -- true > ... > -s:receive(5) > +if srv ~= nil then > + s = socket.connect('localhost', port) > + chan:put(5) > + chan:put(5) > + receive1 = s:receive(5) > + chan:put(5) > + s:settimeout(1) > + receive2 = s:receive('*a') > + s:close() > + srv:close() > +end; 7. How is it possible, that the server still is nil here? At least one of 32k ports should have been free. > --- > -- '00000' > ... > -chan:put(5) > +test_run:cmd("setopt delimiter ''"); > --- > - true > ... > -s:settimeout(1) > +receive1 > --- > -- 1 > +- '00000' > ... > -s:receive('*a') > +receive2 > --- > - '1111122222' > ... > -s:close() > ---- > -- 1 > -... > -srv:close() > ---- > -- true > -...