[Tarantool-patches] [PATCH v4] test: fix flaky socket test

Alexander Turenko alexander.turenko at tarantool.org
Wed Dec 25 04:29:35 MSK 2019


Thanks for the patch!

I really appreciate the work in direction toward stable tests. Unstable
CI sometimes hides real problems, so actually it is direction to make
tarantool itself more stable.

All changes look correct. There are places however that we can polish a
bit. Please, consider my comments below.

How I tested it:

$ (cd test && ./test-run.py -j 30 --force $(yes app/socket | head -n 1000))

Observed the following fails:

Case 1.

[008] Test failed! Result content mismatch:
[008] --- app/socket.result	Tue Dec 24 15:55:40 2019
[008] +++ app/socket.reject	Tue Dec 24 15:56:24 2019
[008] @@ -1371,7 +1371,7 @@
[008]  ...
[008]  cnt
[008]  ---
[008] -- 2
[008] +- 3
[008]  ...
[008]  client:write('hi')
[008]  ---

Another instance is connected to our server. I found that it is
socket.connect() from 'case: socket receive inconsistent behavior' (and
this is the reason of case 3 below). The code the the following:

 | srv = socket.tcp_server('0.0.0.0', 0, fn)
 | s = socket.connect('localhost', srv:name().port)

localhost may be resolved into ::1 (ipv6) and 127.0.0.1 (ipv4). Linux
allows to bind the same port number on different addresses.
getaddrinfo() on Linux give ipv6 addr/port first (usually, depends
on settings) and if another process listens on ::1 on the same port as
we bound on ipv4, we'll connect to this process.

The change that eliminates the fail above for me:

@@ -979,7 +983,7 @@ chan = fiber.channel()
 counter = 0
 fn = function(s) counter = 0; while true do s:write((tostring(counter)):rep(chan:get())); counter = counter + 1 end end
 srv = socket.tcp_server('0.0.0.0', 0, fn)
-s = socket.connect('localhost', srv:name().port)
+s = socket.connect('127.0.0.1', srv:name().port)
 chan:put(5)
 chan:put(5)
 s:receive(5)

(CCed Maria as author of the test case.)

However it is possible that there are other similar cases. I suggest to
verify the test against them. Please, beware of changing negative cases
(when we expect that a timeout will occur): setting large timeout on
them would significantly increase the test run time (we'll always wait
100 seconds at the point).

BTW, why not to listen on 127.0.0.1 rather then on all interfaces
(0.0.0.0)?

BTW, unix sockets may help to don't bother with clashes of different
kinds.

----

[018] --- app/socket.result	Tue Dec 24 15:58:20 2019
[018] +++ app/socket.reject	Tue Dec 24 15:59:29 2019
[018] @@ -2822,7 +2822,7 @@
[018]  ...
[018]  client:read(1, 5) == ''
[018]  ---
[018] -- true
[018] +- false
[018]  ...
[018]  server:close()
[018]  ---

I printed client_.errno. It is 110 (ETIMEDOUT). It seems it worth to
replace 5 with WAIT_COND_TIME (or, as TIMEOUT as I proposed to name it
below). Maybe there are other such places.

----

Case 3.

Test hung! Result content mismatch:
--- app/socket.result	Wed Dec 25 01:04:26 2019
+++ var/003_app/socket.result	Wed Dec 25 01:05:22 2019
@@ -2857,34 +2857,3 @@
 ---
 ...
 chan:put(5)
----
-- true
-...
-chan:put(5)
----
-- true
-...
-s:receive(5)
----
-- '00000'
-...
-chan:put(5)
----
-- true
-...
-s:settimeout(1)
----
-- 1
-...
-s:receive('*a')
----
-- '1111122222'
-...
-s:close()
----
-- 1
-...
-srv:close()
----
-- true
-...

Explained above (see case 1).

----

[017] --- app/socket.result	Wed Dec 25 01:18:29 2019
[017] +++ app/socket.reject	Wed Dec 25 01:19:30 2019
[017] @@ -40,7 +40,7 @@
[017]  ...
[017]  test_run:cmd("push filter '(error: .builtin/.*[.]lua):[0-9]+' to '\\1'")
[017]  ---
[017] -- true
[017] +- Hello, world

It seems we connected to some of our socket server rather then test-run.
Looks like the same ipv4 / ipv6 problem.

After the following patch for test-run I don't see it anymore (don't
sure I made enough runs, however):

diff --git a/lib/test_suite.py b/lib/test_suite.py
index 84b99f4..0a137c3 100644
--- a/lib/test_suite.py
+++ b/lib/test_suite.py
@@ -197,7 +197,7 @@ class TestSuite:
     def start_server(self, server):
         # create inspector daemon for cluster tests
         inspector = TarantoolInspector(
-            'localhost', server.inspector_port
+            '127.0.0.1', server.inspector_port
         )
         inspector.start()
         # fixme: remove this string if we fix all legacy tests
diff --git a/test_run.lua b/test_run.lua
index e37b559..fc1b833 100644
--- a/test_run.lua
+++ b/test_run.lua
@@ -458,7 +458,7 @@ local inspector_methods = {
 local function inspector_new(host, port)
     local inspector = {}
 
-    inspector.host = host or 'localhost'
+    inspector.host = host or '127.0.0.1'
     inspector.port = port or tonumber(os.getenv('INSPECTOR'))
     if inspector.port == nil then
         error('Inspector not started')

----

On Mon, Dec 23, 2019 at 11:48:33PM +0300, Ilya Kosarev wrote:
> socket.test had a number of flaky problems:
> - socket readiness expectation
> - race conditions on socket shutdown in emulation test cases
> - UDP datagrams losses on mac os
> - excessive random port searches
> Now they are solved. Socket test is not fragile anymore.
> 
> Closes #4451, #4426, #4469
> ---
> Branch: https://github.com/tarantool/tarantool/tree/i.kosarev/gh-4426-4451-fix-socket-test
> Issues: https://github.com/tarantool/tarantool/issues/4426
>         https://github.com/tarantool/tarantool/issues/4451
>         https://github.com/tarantool/tarantool/issues/4469
> 
> Changes in v2:
> - reconsidered socket readiness expectation
> - reduced conditions waiting time
> 
> Changes in v3:
> - reconsidered expectations to unify them
> - simplified randomization
> 
> Changes in v4:
> - left infinite timeouts alone

> - wrapped UDP datagrams awaiting with wait_cond

That's my bad. As I see, other parts of the test follows the following
pattern for waiting UDP packets:

 | s:readable(TIMEOUT)
 | s:recv()

test_run:wait_cond() is correct solution too, but :readable() looks a
bit simpler. Anyway, let's use one that you like more.

(It is the only change I propose, which I didn't verify.)

> - replaced manual port randomization with zero bind

Nice idea!

> - removed extra wait_cond wrappers
> 
>  test/app/socket.result   | 104 ++++++++++++++++++++++-----------------
>  test/app/socket.test.lua |  94 ++++++++++++++++++++---------------
>  test/app/suite.ini       |   1 -
>  3 files changed, 114 insertions(+), 85 deletions(-)

> diff --git a/test/app/socket.test.lua b/test/app/socket.test.lua
> index c72d41763f4..4ed6c0f25f7 100644
> --- a/test/app/socket.test.lua
> +++ b/test/app/socket.test.lua
> @@ -13,7 +13,7 @@ env = require('test_run')
>  test_run = env.new()
>  test_run:cmd("push filter '(error: .builtin/.*[.]lua):[0-9]+' to '\\1'")
>  
> -WAIT_COND_TIME = 10
> +WAIT_COND_TIME = 100

Nit: I would name it just TIMEOUT, since it is not only for
test_run:wait_cond() now.

> @@ -619,8 +608,14 @@ s:settimeout(100500)
>  rch, wch = fiber.channel(1), fiber.channel(1)
>  sc = socket.connect(host, port)
>  test_run:cmd("setopt delimiter ';'")
> +socket_opened = true
>  cfiber = fiber.create(function(sc, rch, wch)
> -    while sc:send(wch:get()) and rch:put(sc:receive("*l")) do end
> +    while socket_opened do
> +        sc:send(wch:get())
> +        local data = sc:receive("*l")
> +        if not socket_opened then sc:close() end
> +        rch:put(data)
> +    end

I would use existing channel to signal the fiber to close the socket: it
looks simpler then using a channel and a variable both. Proposed diff:

diff --git a/test/app/socket.test.lua b/test/app/socket.test.lua
index 4ed6c0f25..0fe22709a 100644
--- a/test/app/socket.test.lua
+++ b/test/app/socket.test.lua
@@ -608,13 +608,16 @@ s:settimeout(100500)
 rch, wch = fiber.channel(1), fiber.channel(1)
 sc = socket.connect(host, port)
 test_run:cmd("setopt delimiter ';'")
-socket_opened = true
 cfiber = fiber.create(function(sc, rch, wch)
-    while socket_opened do
-        sc:send(wch:get())
-        local data = sc:receive("*l")
-        if not socket_opened then sc:close() end
-        rch:put(data)
+    while true do
+        local outgoing_data = wch:get()
+        if outgoing_data == nil then
+            sc:close()
+            break
+        end
+        sc:send(outgoing_data)
+        local incoming_data = sc:receive("*l")
+        rch:put(incoming_data)
     end
 end, sc, rch, wch);
 test_run:cmd("setopt delimiter ''");
@@ -646,9 +649,10 @@ rch:get()
 wch:put("DATA\n")
 c:receive(4)
 c:receive("*l")
-socket_opened = false
+rch:get()
 wch:put("Fu")
 c:send("354 Please type your message\n")
+wch:put(nil) -- EOF
 c:receive("*l", "Line: ")
 c:receive()
 c:receive(10)

>  end, sc, rch, wch);
>  test_run:cmd("setopt delimiter ''");
>  
> @@ -651,9 +646,9 @@ rch:get()
>  wch:put("DATA\n")
>  c:receive(4)
>  c:receive("*l")
> +socket_opened = false
>  wch:put("Fu")
>  c:send("354 Please type your message\n")
> -sc:close()
>  c:receive("*l", "Line: ")
>  c:receive()
>  c:receive(10)

> @@ -778,7 +773,9 @@ e == errno.EAGAIN -- expected true
>  
>  -- case: recv, zero datagram
>  sendto_zero(sending_socket, '127.0.0.1', receiving_socket_port)
> -received_message = receiving_socket:recv()
> +received_message = test_run:wait_cond(function()        \
> +    return receiving_socket:recv()                      \
> +end, WAIT_COND_TIME)

Proposed to use :readable() (see above).

> --- case: sicket receive inconsistent behavior
> +-- case: socket receive inconsistent behavior
>  chan = fiber.channel()
>  counter = 0
>  fn = function(s) counter = 0; while true do s:write((tostring(counter)):rep(chan:get())); counter = counter + 1 end end
> -srv = socket.tcp_server('0.0.0.0', 8888, fn)
> -s = socket.connect('localhost', 8888)
> +srv = socket.tcp_server('0.0.0.0', 0, fn)
> +s = socket.connect('localhost', srv:name().port)

Proposed to connect to 127.0.0.1 (see above).

>  chan:put(5)
>  chan:put(5)
>  s:receive(5)


More information about the Tarantool-patches mailing list