[Tarantool-patches] [PATCH v2] Stabilize tcp_connect in test_run:cmd()

Thu Dec 5 04:35:14 MSK 2019

Thank you for the investigation and the patch.

Simplified it a bit:

diff --git a/test_run.lua b/test_run.lua
index 804d42c..e37b559 100644
--- a/test_run.lua
+++ b/test_run.lua
@@ -10,10 +10,8 @@ local errno = require('errno')
 local clock = require('clock')
 
 local function cmd(self, msg)
-    local sock = nil
-    self:wait_cond(function()
-        sock = socket.tcp_connect(self.host, self.port)
-        return sock ~= nil
+    local sock = self:wait_cond(function()
+        return socket.tcp_connect(self.host, self.port)
     end, 100)
     local data = msg .. '\n'
     sock:send(data)

Pushed to test-run's master.

Updated in tarantool in master, 2.2 and 1.10.

WBR, Alexander Turenko.

----

TL;DR: I tried another approach, but then refuses it.

BTW, I was wonder why we connect to test-run's inspector each time and made
attempt to store a connected socket in test_run instance. I met several
problems with this approach:

* test-run/lib/inspector.py leans on the fact that tarantool closes a socket
  after reading of a response (blocks on self.sem if it does not close it)
* test-run instance is recreated in many places: in an instance file, in a
  test, after each test_run:cmd('switch <...>')
* test-run's pretest_clean cleans a global test_run variable

I found that if we'll eliminate self.sem.acquire() / self.sem.release()
(I'm very tentative it is right thing to do), will protect _G.test_run
using 'protected_globals' and reuse it in inspector_new(), will save all
established connections to inspector in a global weak table (and protect
it with 'protected_globals') using host:port as keys and will tweak
test_run:cmd() to use those connections when available (w/o wait_cond),
then the test will work reliably. (We also need a __gc metamethod to
close a connection.)

However when I tried to always use the pre-created connection in
test_run:cmd(), I got nils in self.sock. It seems that those connections
are sometimes fails anyway even when they are created from in
inspector_new() (when a test is just started).

If we'll try to keep the connections always open, but create them if not
available in test_run:cmd(), this looks as just increase of complexity
w/o enough gains. Aside of this, I'm tentative about removing locks
from inspector.py.

So I returned to your variant of the fix and pushed it.

On Tue, Nov 26, 2019 at 08:58:59PM +0300, Ilya Kosarev wrote:
> For some tests, for example, replication/box_set_replication_stress,
> socket.tcp_connect() in test_run:cmd() might sometimes fail with
> "Too many open files" error when running under high load. In case of
> box_set_replication_stress test it happens because of a huge number of
> file descriptors being opened during many box_set_replication calls.
> Under high load it takes time for them to be closed. Now we are trying
> to perform socket.tcp_connect() under test_run:wait_cond() clause until
> we succeed or time is gone.
> 
> Closes #193
> ---
> https://github.com/tarantool/test-run/tree/i.kosarev/gh-193-stabilize-test-run-cmd
> https://github.com/tarantool/test-run/issues/193
> 
> Changes in v2:
> - replaced infinite loop with wait_cond
> 
>  test_run.lua | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/test_run.lua b/test_run.lua
> index 63dfdef..804d42c 100644
> --- a/test_run.lua
> +++ b/test_run.lua
> @@ -10,7 +10,11 @@ local errno = require('errno')
>  local clock = require('clock')
>  
>  local function cmd(self, msg)
> -    local sock = socket.tcp_connect(self.host, self.port)
> +    local sock = nil
> +    self:wait_cond(function()
> +        sock = socket.tcp_connect(self.host, self.port)
> +        return sock ~= nil
> +    end, 100)
>      local data = msg .. '\n'
>      sock:send(data)