[Tarantool-patches] [PATCH v4] test: fix flaky socket test

Ilya Kosarev i.kosarev at tarantool.org
Thu Dec 26 00:06:51 MSK 2019


Hi!

Thanks for your review!

Sent v5 including all proposed enhancements.
>Среда, 25 декабря 2019, 4:29 +03:00 от Alexander Turenko <alexander.turenko at tarantool.org>:
>
>Thanks for the patch!
>
>I really appreciate the work in direction toward stable tests. Unstable
>CI sometimes hides real problems, so actually it is direction to make
>tarantool itself more stable.
>
>All changes look correct. There are places however that we can polish a
>bit. Please, consider my comments below.
>
>How I tested it:
>
>$ (cd test && ./test-run.py -j 30 --force $(yes app/socket | head -n 1000))
>
>Observed the following fails:
>
>Case 1.
>
>[008] Test failed! Result content mismatch:
>[008] --- app/socket.result	Tue Dec 24 15:55:40 2019
>[008] +++ app/socket.reject	Tue Dec 24 15:56:24 2019
>[008] @@ -1371,7 +1371,7 @@
>[008]  ...
>[008]  cnt
>[008]  ---
>[008] -- 2
>[008] +- 3
>[008]  ...
>[008]  client:write('hi')
>[008]  ---
>
>Another instance is connected to our server. I found that it is
>socket.connect() from 'case: socket receive inconsistent behavior' (and
>this is the reason of case 3 below). The code the the following:
>
> | srv = socket.tcp_server('0.0.0.0', 0, fn)
> | s = socket.connect('localhost', srv:name().port)
>
>localhost may be resolved into ::1 (ipv6) and 127.0.0.1 (ipv4). Linux
>allows to bind the same port number on different addresses.
>getaddrinfo() on Linux give ipv6 addr/port first (usually, depends
>on settings) and if another process listens on ::1 on the same port as
>we bound on ipv4, we'll connect to this process.
>
>The change that eliminates the fail above for me:
>
>@@ -979,7 +983,7 @@ chan = fiber.channel()
> counter = 0
> fn = function(s) counter = 0; while true do s:write((tostring(counter)):rep(chan:get())); counter = counter + 1 end end
> srv = socket.tcp_server('0.0.0.0', 0, fn)
>-s = socket.connect('localhost', srv:name().port)
>+s = socket.connect('127.0.0.1', srv:name().port)
> chan:put(5)
> chan:put(5)
> s:receive(5)
>
>(CCed Maria as author of the test case.)
>
>However it is possible that there are other similar cases. I suggest to
>verify the test against them. Please, beware of changing negative cases
>(when we expect that a timeout will occur): setting large timeout on
>them would significantly increase the test run time (we'll always wait
>100 seconds at the point).
>
>BTW, why not to listen on 127.0.0.1 rather then on all interfaces
>(0.0.0.0)?
>
>BTW, unix sockets may help to don't bother with clashes of different
>kinds.
>
>----
>
>[018] --- app/socket.result	Tue Dec 24 15:58:20 2019
>[018] +++ app/socket.reject	Tue Dec 24 15:59:29 2019
>[018] @@ -2822,7 +2822,7 @@
>[018]  ...
>[018]  client:read(1, 5) == ''
>[018]  ---
>[018] -- true
>[018] +- false
>[018]  ...
>[018]  server:close()
>[018]  ---
>
>I printed client_.errno. It is 110 (ETIMEDOUT). It seems it worth to
>replace 5 with WAIT_COND_TIME (or, as TIMEOUT as I proposed to name it
>below). Maybe there are other such places.
>
>----
>
>Case 3.
>
>Test hung! Result content mismatch:
>--- app/socket.result	Wed Dec 25 01:04:26 2019
>+++ var/003_app/socket.result	Wed Dec 25 01:05:22 2019
>@@ -2857,34 +2857,3 @@
> ---
> ...
> chan:put(5)
>----
>-- true
>-...
>-chan:put(5)
>----
>-- true
>-...
>-s:receive(5)
>----
>-- '00000'
>-...
>-chan:put(5)
>----
>-- true
>-...
>-s:settimeout(1)
>----
>-- 1
>-...
>-s:receive('*a')
>----
>-- '1111122222'
>-...
>-s:close()
>----
>-- 1
>-...
>-srv:close()
>----
>-- true
>-...
>
>Explained above (see case 1).
>
>----
>
>[017] --- app/socket.result	Wed Dec 25 01:18:29 2019
>[017] +++ app/socket.reject	Wed Dec 25 01:19:30 2019
>[017] @@ -40,7 +40,7 @@
>[017]  ...
>[017]  test_run:cmd("push filter '(error: .builtin/.*[.]lua):[0-9]+' to '\\1'")
>[017]  ---
>[017] -- true
>[017] +- Hello, world
>
>It seems we connected to some of our socket server rather then test-run.
>Looks like the same ipv4 / ipv6 problem.
>
>After the following patch for test-run I don't see it anymore (don't
>sure I made enough runs, however):
>
>diff --git a/lib/test_suite.py b/lib/test_suite.py
>index 84b99f4..0a137c3 100644
>--- a/lib/test_suite.py
>+++ b/lib/test_suite.py
>@@ -197,7 +197,7 @@ class TestSuite:
>     def start_server(self, server):
>         # create inspector daemon for cluster tests
>         inspector = TarantoolInspector(
>-            'localhost', server.inspector_port
>+            '127.0.0.1', server.inspector_port
>         )
>         inspector.start()
>         # fixme: remove this string if we fix all legacy tests
>diff --git a/test_run.lua b/test_run.lua
>index e37b559..fc1b833 100644
>--- a/test_run.lua
>+++ b/test_run.lua
>@@ -458,7 +458,7 @@ local inspector_methods = {
> local function inspector_new(host, port)
>     local inspector = {}
> 
>-    inspector.host = host or 'localhost'
>+    inspector.host = host or '127.0.0.1'
>     inspector.port = port or tonumber(os.getenv('INSPECTOR'))
>     if inspector.port == nil then
>         error('Inspector not started')
>
>----
>
>On Mon, Dec 23, 2019 at 11:48:33PM +0300, Ilya Kosarev wrote:
>> socket.test had a number of flaky problems:
>> - socket readiness expectation
>> - race conditions on socket shutdown in emulation test cases
>> - UDP datagrams losses on mac os
>> - excessive random port searches
>> Now they are solved. Socket test is not fragile anymore.
>> 
>> Closes #4451, #4426, #4469
>> ---
>> Branch:  https://github.com/tarantool/tarantool/tree/i.kosarev/gh-4426-4451-fix-socket-test
>> Issues:  https://github.com/tarantool/tarantool/issues/4426
>>  https://github.com/tarantool/tarantool/issues/4451
>>  https://github.com/tarantool/tarantool/issues/4469
>> 
>> Changes in v2:
>> - reconsidered socket readiness expectation
>> - reduced conditions waiting time
>> 
>> Changes in v3:
>> - reconsidered expectations to unify them
>> - simplified randomization
>> 
>> Changes in v4:
>> - left infinite timeouts alone
>> - wrapped UDP datagrams awaiting with wait_cond
>
>That's my bad. As I see, other parts of the test follows the following
>pattern for waiting UDP packets:
>
> | s:readable(TIMEOUT)
> | s:recv()
>
>test_run:wait_cond() is correct solution too, but :readable() looks a
>bit simpler. Anyway, let's use one that you like more.
>
>(It is the only change I propose, which I didn't verify.)
>
>> - replaced manual port randomization with zero bind
>
>Nice idea!
>
>> - removed extra wait_cond wrappers
>> 
>>  test/app/socket.result   | 104 ++++++++++++++++++++++-----------------
>>  test/app/socket.test.lua |  94 ++++++++++++++++++++---------------
>>  test/app/suite.ini       |   1 -
>>  3 files changed, 114 insertions(+), 85 deletions(-)
>
>> diff --git a/test/app/socket.test.lua b/test/app/socket.test.lua
>> index c72d41763f4..4ed6c0f25f7 100644
>> --- a/test/app/socket.test.lua
>> +++ b/test/app/socket.test.lua
>> @@ -13,7 +13,7 @@ env = require('test_run')
>>  test_run = env.new()
>>  test_run:cmd("push filter '(error: .builtin/.*[.]lua):[0-9]+' to '\\1'")
>> 
>> -WAIT_COND_TIME = 10
>> +WAIT_COND_TIME = 100
>
>Nit: I would name it just TIMEOUT, since it is not only for
>test_run:wait_cond() now.
>
>> @@ -619,8 +608,14 @@ s:settimeout(100500)
>>  rch, wch = fiber.channel(1), fiber.channel(1)
>>  sc = socket.connect(host, port)
>>  test_run:cmd("setopt delimiter ';'")
>> +socket_opened = true
>>  cfiber = fiber.create(function(sc, rch, wch)
>> -    while sc:send(wch:get()) and rch:put(sc:receive("*l")) do end
>> +    while socket_opened do
>> +        sc:send(wch:get())
>> +        local data = sc:receive("*l")
>> +        if not socket_opened then sc:close() end
>> +        rch:put(data)
>> +    end
>
>I would use existing channel to signal the fiber to close the socket: it
>looks simpler then using a channel and a variable both. Proposed diff:
>
>diff --git a/test/app/socket.test.lua b/test/app/socket.test.lua
>index 4ed6c0f25..0fe22709a 100644
>--- a/test/app/socket.test.lua
>+++ b/test/app/socket.test.lua
>@@ -608,13 +608,16 @@ s:settimeout(100500)
> rch, wch = fiber.channel(1), fiber.channel(1)
> sc = socket.connect(host, port)
> test_run:cmd("setopt delimiter ';'")
>-socket_opened = true
> cfiber = fiber.create(function(sc, rch, wch)
>-    while socket_opened do
>-        sc:send(wch:get())
>-        local data = sc:receive("*l")
>-        if not socket_opened then sc:close() end
>-        rch:put(data)
>+    while true do
>+        local outgoing_data = wch:get()
>+        if outgoing_data == nil then
>+            sc:close()
>+            break
>+        end
>+        sc:send(outgoing_data)
>+        local incoming_data = sc:receive("*l")
>+        rch:put(incoming_data)
>     end
> end, sc, rch, wch);
> test_run:cmd("setopt delimiter ''");
>@@ -646,9 +649,10 @@ rch:get()
> wch:put("DATA\n")
> c:receive(4)
> c:receive("*l")
>-socket_opened = false
>+rch:get()
> wch:put("Fu")
> c:send("354 Please type your message\n")
>+wch:put(nil) -- EOF
> c:receive("*l", "Line: ")
> c:receive()
> c:receive(10)
>
>>  end, sc, rch, wch);
>>  test_run:cmd("setopt delimiter ''");
>> 
>> @@ -651,9 +646,9 @@ rch:get()
>>  wch:put("DATA\n")
>>  c:receive(4)
>>  c:receive("*l")
>> +socket_opened = false
>>  wch:put("Fu")
>>  c:send("354 Please type your message\n")
>> -sc:close()
>>  c:receive("*l", "Line: ")
>>  c:receive()
>>  c:receive(10)
>
>> @@ -778,7 +773,9 @@ e == errno.EAGAIN -- expected true
>> 
>>  -- case: recv, zero datagram
>>  sendto_zero(sending_socket, '127.0.0.1', receiving_socket_port)
>> -received_message = receiving_socket:recv()
>> +received_message = test_run:wait_cond(function()        \
>> +    return receiving_socket:recv()                      \
>> +end, WAIT_COND_TIME)
>
>Proposed to use :readable() (see above).
>
>> --- case: sicket receive inconsistent behavior
>> +-- case: socket receive inconsistent behavior
>>  chan = fiber.channel()
>>  counter = 0
>>  fn = function(s) counter = 0; while true do s:write((tostring(counter)):rep(chan:get())); counter = counter + 1 end end
>> -srv = socket.tcp_server('0.0.0.0', 8888, fn)
>> -s = socket.connect('localhost', 8888)
>> +srv = socket.tcp_server('0.0.0.0', 0, fn)
>> +s = socket.connect('localhost', srv:name().port)
>
>Proposed to connect to 127.0.0.1 (see above).
>
>>  chan:put(5)
>>  chan:put(5)
>>  s:receive(5)


-- 
Ilya Kosarev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.tarantool.org/pipermail/tarantool-patches/attachments/20191226/07e66b1d/attachment.html>


More information about the Tarantool-patches mailing list