[Tarantool-patches] [PATCH] test: fix flaky socket test

Ilya Kosarev i.kosarev at tarantool.org
Fri Dec 6 19:01:27 MSK 2019


Hi!

Thanks for your review.

4. I run socket.test to simulate "high load" in parallel with itself
(while ./test-run.py -j20 `for r in {1..64} ; do echo socket ; done` 2>/dev/null ; do l=$(($l+1)) ; echo ======== $l ============= ; done).
Using fiber.time() as random seed leads to the same results produced
on parallel tasks. Using instance uuid as a seed seems to be a good
decision.
5. Well, considering the new seed generation, 10s seem to be enough.
6. If we close socket in the main fiber, we might attempt to use closed
socket in background fiber.
7. Of course it shouldn't be nil and test will fail in this case. We
just want to be sure that test won't hang even if server is still nil.
Otherwise we will hang on chan:put(), which is not nice. Sent v2 considering mentioned drawbacks.

>Пятница,  6 декабря 2019, 0:31 +03:00 от Vladislav Shpilevoy < v.shpilevoy at tarantool.org >:
>
>Hi! Thanks for the patch!
>
>See 7 comments below.
>
>On 04/12/2019 14:43, Ilya Kosarev wrote:
>> socket.test had a number of flaky problems:
>> - socket readableness expectation
>> - race conditions on socket shutdown in emulation test cases
>> - tcp_server stability in socket receive inconsistent behavior case
>> Now they are solved. Port randomization is improved.
>> Socket test is not fragile anymore.
>> 
>> Closes #4451, #4426, #4469
>> ---
>> Branch:  https://github.com/tarantool/tarantool/tree/i.kosarev/gh-4426-4451-fix-socket-test
>> Issues:  https://github.com/tarantool/tarantool/issues/4426
>>  https://github.com/tarantool/tarantool/issues/4451
>>  https://github.com/tarantool/tarantool/issues/4469
>> 
>>  test/app/socket.result   | 84 +++++++++++++++++++++++++---------------
>>  test/app/socket.test.lua | 60 +++++++++++++++++++---------
>>  test/app/suite.ini       |  1 -
>>  3 files changed, 94 insertions(+), 51 deletions(-)
>> 
>> diff --git a/test/app/socket.result b/test/app/socket.result
>> index fd299424c9..42dde8f375 100644
>> --- a/test/app/socket.result
>> +++ b/test/app/socket.result
>> @@ -45,6 +45,9 @@ test_run:cmd("push filter '(error: .builtin/.*[.]lua):[0-9]+' to '\\1'")
>>  WAIT_COND_TIME = 10
>1. This is now unused.
>
>>  ---
>>  ...
>> +WAIT_TCP_CONNECT_TIME = 240
>> +---
>> +...
>>  socket('PF_INET', 'SOCK_STREAM', 'tcp121222');
>>  ---
>>  - null
>> @@ -107,7 +110,7 @@ s:nonblock(true)
>>  ---
>>  - true
>>  ...
>> -s:readable(.1)
>> +s:readable(WAIT_TCP_CONNECT_TIME)
>>  ---
>>  - true
>>  ...
>> @@ -227,7 +230,7 @@ s:syswrite(ffi.cast('const char *', ping), #ping)
>>  ---
>>  - 6
>>  ...
>> -s:readable(1)
>> +s:readable(WAIT_TCP_CONNECT_TIME)
>
>2. I understand, why it is 'connect time' in the previous
>hunk. But here the socket is already connected. So it is
>not connect time.
>
>On the other hand, there are places right after connect,
>which don't use this timeout. For example, line 457
>in socket.result file:
>
>    sc:sysconnect('127.0.0.1', s:name().port) or errno() == errno.EINPROGRESS
>    ---
>    - true
>    ...
>    sc:writable(10)
>    ---
>    - true
>    ...
>
>Here it is 10 seconds, while in other places you started
>using 240.
>
>>  ---
>>  - true
>>  ...
>> @@ -830,7 +833,7 @@ sc:sendto('127.0.0.1', s:name().port, 'Hello, world')
>>  ---
>>  - 12
>>  ...
>> -s:readable(10)
>> +s:readable(WAIT_TCP_CONNECT_TIME)
>>  ---
>>  - true
>>  ...
>
>3. There is no connect time. This is DGRAM socket with UDP.
>No connections.
>
>The same below. Moreover, there are now 2 waits for
>'connect time' on the same socket.
>
>> @@ -842,7 +845,7 @@ sc:sendto('127.0.0.1', s:name().port, 'Hello, world, 2')
>>  ---
>>  - 15
>>  ...
>> -s:readable(10)
>> +s:readable(WAIT_TCP_CONNECT_TIME)
>>  ---
>>  - true
>>  ...
>> @@ -898,7 +901,7 @@ sc:sendto('127.0.0.1', s:name().port, 'Hello, World!')
>>  ---
>>  - 13
>>  ...
>> -s:readable(1)
>> +s:readable(WAIT_TCP_CONNECT_TIME)
>>  ---
>>  - true
>>  ...
>> @@ -1074,6 +1077,15 @@ master:setsockopt('SOL_SOCKET', 'SO_REUSEADDR', true)
>>  ---
>>  - true
>>  ...
>> +seed = ''
>> +---
>> +...
>> +for d in string.gmatch(box.info.uuid, '%d') do  seed = seed .. d end
>
>4. Well, this is probably the most intricate way to
>get a seed. Have you considered fiber.time()? And why do
>you need it? Isn't it initialized already?
>
>> +---
>> +...
>> +math.randomseed(tonumber(seed))
>> +---
>> +...
>>  port = 32768 + math.random(0, 32767)
>>  ---
>>  ...
>> @@ -1092,7 +1104,7 @@ test_run:wait_cond(function()
>>          return false, master:error()
>>      end
>>      return true
>> -end, WAIT_COND_TIME);
>> +end, 100);
>
>5. Why? 10 seconds was not enough to try a few random
>ports?
>
>>  ---
>>  - true
>>  ...
>> @@ -1822,8 +1834,14 @@ test_run:cmd("setopt delimiter ';'")
>>  ---
>>  - true
>>  ...
>> +socket_opened = true
>>  cfiber = fiber.create(function(sc, rch, wch)
>> -    while sc:send(wch:get()) and rch:put(sc:receive("*l")) do end
>> +    while socket_opened do
>> +        sc:send(wch:get())
>> +        local data = sc:receive("*l")
>> +        if not socket_opened then sc:close() end
>> +        rch:put(data)
>> +    end
>>  end, sc, rch, wch);
>
>6. What was a problem here? In the commit message you said
>'race conditions'. Where? You just moved socket close to
>the background fiber from the main fiber.
>
>>  ---
>>  ...
>> @@ -1936,6 +1954,9 @@ c:receive("*l")
>>  ---
>>  - 
>>  ...
>> +socket_opened = false
>> +---
>> +...
>>  wch:put("Fu")
>>  ---
>>  - true
>> @@ -1944,10 +1965,6 @@ c:send("354 Please type your message\n")
>>  ---
>>  - 29
>>  ...
>> -sc:close()
>> ----
>> -- 1
>> -...
>>  c:receive("*l", "Line: ")
>>  ---
>>  - null
>> @@ -2816,7 +2833,7 @@ test_run:cmd("clear filter")
>>  ---
>>  - true
>>  ...
>> --- case: sicket receive inconsistent behavior
>> +-- case: socket receive inconsistent behavior
>>  chan = fiber.channel()
>>  ---
>>  ...
>> @@ -2826,41 +2843,46 @@ counter = 0
>>  fn = function(s) counter = 0; while true do s:write((tostring(counter)):rep(chan:get())); counter = counter + 1 end end
>>  ---
>>  ...
>> -srv = socket.tcp_server('0.0.0.0', 8888, fn)
>> +srv = nil
>>  ---
>>  ...
>> -s = socket.connect('localhost', 8888)
>> +test_run:cmd("setopt delimiter ';'")
>>  ---
>> +- true
>>  ...
>> -chan:put(5)
>> +test_run:wait_cond(function()
>> +    port = 32768 + math.random(0, 32767)
>> +    srv = socket.tcp_server('0.0.0.0', port, fn)
>> +    return srv ~= nil
>> +end, 100);
>>  ---
>>  - true
>>  ...
>> -chan:put(5)
>> +receive1 = nil; receive2 = nil;
>>  ---
>> -- true
>>  ...
>> -s:receive(5)
>> +if srv ~= nil then
>> +    s = socket.connect('localhost', port)
>> +    chan:put(5)
>> +    chan:put(5)
>> +    receive1 = s:receive(5)
>> +    chan:put(5)
>> +    s:settimeout(1)
>> +    receive2 = s:receive('*a')
>> +    s:close()
>> +    srv:close()
>> +end;
>
>7. How is it possible, that the server still is nil here?
>At least one of 32k ports should have been free.
>
>>  ---
>> -- '00000'
>>  ...
>> -chan:put(5)
>> +test_run:cmd("setopt delimiter ''");
>>  ---
>>  - true
>>  ...
>> -s:settimeout(1)
>> +receive1
>>  ---
>> -- 1
>> +- '00000'
>>  ...
>> -s:receive('*a')
>> +receive2
>>  ---
>>  - '1111122222'
>>  ...
>> -s:close()
>> ----
>> -- 1
>> -...
>> -srv:close()
>> ----
>> -- true
>> -...


-- 
Ilya Kosarev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.tarantool.org/pipermail/tarantool-patches/attachments/20191206/2bbd0767/attachment.html>


More information about the Tarantool-patches mailing list