[tarantool-patches] Re: [PATCH] test: fix unix socket conflict in socket.test.lua

Mon Oct 29 14:29:42 MSK 2018

On Mon, Oct 29, 2018 at 12:32:29PM +0300, Vladimir Davydov wrote:
> On Mon, Oct 29, 2018 at 11:25:29AM +0300, Alexander Turenko wrote:
> > On Thu, Oct 25, 2018 at 12:43:39PM +0300, Vladimir Davydov wrote:
> > > On Thu, Oct 25, 2018 at 05:21:45AM +0300, Alexander Turenko wrote:
> > > > It is needed to run the test in parallel on several test-run workers to
> > > > investigate flaky failures of the test.
> > > 
> > > I don't think I understand. Do you mean that this patch doesn't fix the
> > > test flakiness and is only needed for further investigation?
> > > 
> > 
> > Yes (but now I have added some fixes that can eliminate some flaky
> > failures). It is often convenient to run something like...
> > 
> > ```
> > TEST_RUN_TESTS=$(for i in $(seq 1 100); do echo -n "app/ "; done) make test
> > ```
> > 
> > ...to prove the suite is ready to run in parallel mode.
> > 
> > > > diff --git a/test/app/socket.result b/test/app/socket.result
> > > > index 2f002a37e..1a570b9fa 100644
> > > > --- a/test/app/socket.result
> > > > +++ b/test/app/socket.result
> > > > @@ -42,6 +42,29 @@ test_run:cmd("push filter '(error: .builtin/.*[.]lua):[0-9]+' to '\\1'")
> > > >  ---
> > > >  - true
> > > >  ...
> > > > +test_run:cmd("push filter '(/tmp/tarantool-test-socket)-[0-9]+' to '\\1'")
> > > > +---
> > > > +- true
> > > > +...
> > > > +-- /tmp/tarantool-test-socket-${TEST_RUN_WORKER_ID}
> > > > +test_run:cmd("setopt delimiter ';'")
> > > > +---
> > > > +- true
> > > > +...
> > > > +function get_temp_socket_path()
> > > > +    local base_path = '/tmp/tarantool-test-socket'
> > > > +    local worker_id = os.getenv('TEST_RUN_WORKER_ID')
> > > > +    if not worker_id then
> > > > +        return base_path
> > > > +    end
> > > > +    return ('%s-%s'):format(base_path, worker_id)
> > > > +end;
> > > 
> > > Why don't you simply create the file in the worker directory?
> > 
> > Sure, I can. I have rewritten the test. But socket.skipcond still uses
> > /tmp/tarantool-test-socket-${WORKER_ID}, because its working directory
> > is `test`.
> > 
> > The whole new patch is below.
> > 
> > commit 224fcdf2a5e8f84971253f1d585b94213d06a384
> > Author: Alexander Turenko <alexander.turenko at tarantool.org>
> > Date:   Thu Oct 25 04:06:46 2018 +0300
> > 
> >     test: fix unix socket conflict in socket.test.lua
> >     
> >     Increased socket.readable / socket.wait timeouts.
> >     
> >     Rewritten port choosing: repeat bind+listen until succeeds, exclude
> >     incorrect port 65536 from the range.
> >     
> >     All these changes are needed to run the test in parallel on several
> >     test-run workers to investigate flaky failures of the test / of the
> >     suite. Some of these changes can also eliminate possible flaky failures.
> 
> ./test-run -j -1 app/socket.test.lua doesn't work with this patch:
> 
> } app/socket.test.lua
> } Test.run() received the following error:
> } Traceback (most recent call last):
> }   File "/home/vlad/src/tarantool/test-run/lib/test.py", line 174, in run
> }     execfile(self.skip_cond, dict(locals(), **server.__dict__))
> }   File "app/socket.skipcond", line 8, in <module>
> }     worker_id = os.environ['TEST_RUN_WORKER_ID']
> }   File "/usr/lib/python2.7/UserDict.py", line 40, in __getitem__
> }     raise KeyError(key)
> } KeyError: 'TEST_RUN_WORKER_ID'
> 
> I think you shouldn't use TEST_RUN_WORKER_ID in skipcond after all.
> Instead you should make a temporary file.

Thanks for catching it. Fixed; see the patch at the end of the email.
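
For reference, the KeyError in the traceback comes from indexing os.environ
directly when TEST_RUN_WORKER_ID is not set. A tolerant lookup could look
roughly like the sketch below. It only mirrors the Lua helper
get_temp_socket_path() from the earlier version of the patch; the Python
function name and the `environ` parameter are illustrative and are not part
of the patch, which takes the temporary file route instead.

```python
import os


def get_temp_socket_path(environ=os.environ):
    # Hypothetical helper: fall back to the bare path when the worker id
    # is not exported (for example, when the test is run without workers).
    base_path = '/tmp/tarantool-test-socket'
    worker_id = environ.get('TEST_RUN_WORKER_ID')  # None if unset, no KeyError
    if worker_id is None:
        return base_path
    return '%s-%s' % (base_path, worker_id)
```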

> Also, I've seen the following failure once. Not sure if it's related:
> 
> } [001] app/socket.test.lua                                             [ fail ]
> } [001]
> } [001] Test failed! Result content mismatch:
> } [001] --- app/socket.result     Mon Oct 29 12:06:39 2018
> } [001] +++ socket.reject Mon Oct 29 12:13:34 2018
> } [001] @@ -2810,7 +2810,7 @@
> } [001]  ...
> } [001]  echo_fiber ~= nil
> } [001]  ---
> } [001] -- true
> } [001] +- false
> } [001]  ...
> } [001]  client:write('hello')
> } [001]  ---
> } [001]

I've observed some failures too, but only when I run the test, say, 500 times
on 16 workers. I think it is now stable enough to unblock a testing
scenario like `./test-run.py app/ app/ app/ app/ app/`, at least on
Linux. (I see far more frequent failures on Mac OS.)

Changes:

diff --git a/test/app/socket.skipcond b/test/app/socket.skipcond
index c7d03a681..89a293f81 100644
--- a/test/app/socket.skipcond
+++ b/test/app/socket.skipcond
@@ -4,12 +4,10 @@ import re
 import os.path
 import socket
 import os
+import tempfile
 
-worker_id = os.environ['TEST_RUN_WORKER_ID']
-test_path = '/tmp/tarantool-test-socket-' + worker_id
-
-if os.path.exists(test_path):
-    os.remove(test_path)
+test_dir = tempfile.mkdtemp(prefix='tarantool-test-socket')
+test_path = os.path.join(test_dir, 'socket')
 
 s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
 try:
@@ -21,3 +19,4 @@ s.close()
 
 if os.path.exists(test_path):
     os.remove(test_path)
+    os.rmdir(test_dir)
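
A side note on the cleanup in the hunk above: os.rmdir() sits under the
os.path.exists() check, so the temporary directory appears to be left behind
when the bind fails. One way to make the cleanup unconditional is sketched
below. This is an illustration only, not the actual skipcond: the function
name and the `base_dir` parameter are hypothetical, and it assumes the probe
is a plain bind() on the socket path.

```python
import os
import shutil
import socket
import tempfile


def can_bind_unix_socket(base_dir=None):
    # Each call gets its own directory, so parallel workers cannot collide
    # on the socket path. base_dir=None means the default temp location.
    test_dir = tempfile.mkdtemp(prefix='tarantool-test-socket', dir=base_dir)
    test_path = os.path.join(test_dir, 'socket')
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.bind(test_path)
        return True
    except socket.error:
        # The bind failed; a skipcond would mark the test as skipped here.
        return False
    finally:
        s.close()
        # Remove the socket file (if any) and the directory in every case.
        shutil.rmtree(test_dir, ignore_errors=True)
```

The path stays short here, which matters because the length of an AF_UNIX
socket path is limited by the size of sun_path.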