From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <alexander.turenko@tarantool.org>
Received: from smtp29.i.mail.ru (smtp29.i.mail.ru [94.100.177.89])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by dev.tarantool.org (Postfix) with ESMTPS id 756B8469710
 for <tarantool-patches@dev.tarantool.org>;
 Tue, 24 Nov 2020 02:19:44 +0300 (MSK)
Date: Tue, 24 Nov 2020 02:19:53 +0300
From: Alexander Turenko <alexander.turenko@tarantool.org>
Message-ID: <20201123231953.uy5zkj65utmwmgu4@tkn_work_nb>
References: <5d0bad0df3023e15fd765962291ab4aa9a184558.1604585948.git.avtikhon@tarantool.org>
 <20201123221512.mocfziqbyu5ojjho@tkn_work_nb>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20201123221512.mocfziqbyu5ojjho@tkn_work_nb>
Subject: Re: [Tarantool-patches] [PATCH v1] tarantoolctl: fix pid file
	removement
List-Id: Tarantool development patches <tarantool-patches.dev.tarantool.org>
List-Unsubscribe: <https://lists.tarantool.org/mailman/options/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=unsubscribe>
List-Archive: <https://lists.tarantool.org/pipermail/tarantool-patches/>
List-Post: <mailto:tarantool-patches@dev.tarantool.org>
List-Help: <mailto:tarantool-patches-request@dev.tarantool.org?subject=help>
List-Subscribe: <https://lists.tarantool.org/mailman/listinfo/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=subscribe>
To: "Alexander V. Tikhonov" <avtikhon@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org

> I ran the test in parallel many times and didn't see any failures.
> However, when I ran two test-runs from two terminals with the same
> --vardir value, I got various failures, including tarantoolctl's 'The
> daemon is already running'. The failure you linked is on Mac OS, where
> a shared --vardir is used. I would look in this direction.
> 
> I don't know how the testing is run on those machines, but I would ask
> the following questions to myself:
> 
> 1. If two jobs are run simultaneously using one var directory, how can I
>    verify that it occurs? Some logs? If there are no such logs, can we
>    add them?
> 2. How can we verify that no 'orphan' tarantools (ones from a previous
>    testing round) are present when starting a new testing round?
> 3. Do we need some test-run support for detecting situations of this
>    kind? Say, using flock() on the var directory.
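
To make question 3 concrete, here is a minimal sketch of how a test
runner could take an exclusive flock() when it starts using a var
directory. It is not existing test-run code: the VardirLock class and
the '.test-run.lock' file name are hypothetical, and locking a file
inside the directory (rather than the directory itself) is my
assumption.

```python
import errno
import fcntl
import os


class VardirLock:
    """Exclusive advisory lock on a lock file inside the var directory.

    Hypothetical helper: neither the class nor the lock file name exists
    in test-run.
    """

    def __init__(self, vardir, name=".test-run.lock"):
        self.path = os.path.join(vardir, name)
        self.fd = None

    def acquire(self):
        """Return True if the lock was taken, False if someone holds it."""
        os.makedirs(os.path.dirname(self.path), exist_ok=True)
        fd = os.open(self.path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            # LOCK_NB: fail immediately instead of waiting for the owner.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as e:
            os.close(fd)
            if e.errno in (errno.EAGAIN, errno.EACCES):
                return False
            raise
        # Record the owner's PID so that a conflict can be logged usefully.
        os.ftruncate(fd, 0)
        os.write(fd, str(os.getpid()).encode())
        self.fd = fd
        return True

    def release(self):
        if self.fd is not None:
            fcntl.flock(self.fd, fcntl.LOCK_UN)
            os.close(self.fd)
            self.fd = None
```

A runner would call acquire() before touching the vardir and log (or
fail) when it returns False. The kernel drops the lock when the owning
process exits, so a crashed run does not leave the directory locked.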

More details:

The job [1] started on Nov 5, 2020 at 1:35pm GMT+0300 on the
tarantool_mac4 runner and ran for 13 minutes 34 seconds, so it ended
around 1:48pm-1:49pm. The vardir is /tmp/tnt. The worker that runs
replication/ddl.test.lua is named 096_replication.

The job [2] (the one you refer to below) started on Nov 5, 2020 at
1:49pm GMT+0300 on the tarantool_mac4 runner and ran for 13 minutes 45
seconds, so it ended around 2:02pm-2:03pm. The vardir is /tmp/tnt. The
worker that runs replication/ddl.test.lua is named 096_replication.

The logs give more timestamps:

[087] +++ /tmp/tnt/rejects/small/obuf.reject	Thu Nov  5 13:54:13 2020
<replication/ddl.test.lua reruns>
[097] +++ /tmp/tnt/rejects/vinyl/iterator.reject	Thu Nov  5 14:02:51 2020
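
The timestamps above can be cross-checked with a bit of arithmetic. The
sketch below assumes that the job start times are known only to the
minute (seconds are not given) and that the reject file timestamps are
in the same timezone as the job start times. All variable names are
mine.

```python
from datetime import datetime, timedelta

# Start times are known only to the minute, so the real start may be up
# to 59 seconds later than the value written here.
SLACK = timedelta(seconds=59)


def end_range(start, duration):
    """Return (earliest, latest) possible end time of a job."""
    return start + duration, start + SLACK + duration


job1_start = datetime(2020, 11, 5, 13, 35)
job2_start = datetime(2020, 11, 5, 13, 49)
job1_end = end_range(job1_start, timedelta(minutes=13, seconds=34))
job2_end = end_range(job2_start, timedelta(minutes=13, seconds=45))

# Timestamps of the two reject files quoted from the log.
rejects = [
    datetime(2020, 11, 5, 13, 54, 13),  # small/obuf.reject
    datetime(2020, 11, 5, 14, 2, 51),   # vinyl/iterator.reject
]

# A reject belongs to job [2] if it falls inside job [2]'s widest window
# and after the latest moment job [1] could have ended.
in_job2 = [job2_start <= t <= job2_end[1] and t > job1_end[1]
           for t in rejects]

print("job [1] ended between", job1_end[0].time(), "and", job1_end[1].time())
print("job [2] ended between", job2_end[0].time(), "and", job2_end[1].time())
print("rejects written during job [2]:", in_job2)
```

Job [1] ends between 13:48:34 and 13:49:33 and job [2] starts at 13:49,
so the two windows touch within the one-minute rounding, and both reject
files fall inside job [2] only.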

So the jobs did not run at the same time. But one follows the other and
the worker names are the same. Does that look suspicious? I don't know.
I also see that the ddl.test.lua replicas use the same listen socket
path as the tests that use autobootstrap.lua and autobootstrap_anon.lua.
So maybe we failed to stop a server, but the error was suppressed by one
of the test-run patches you made?

Really, the only thing we can do about such problems is to make guesses
and add logs that can confirm or refute those guesses. We should not
try to fix them blindly: that is a highway to ever more obscure problems.
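
As an example of such a log, here is a sketch of a check that could run
before a testing round and report PID files that still point at a live
process. It is an illustration only: the function names are
hypothetical, and the assumption that instances leave '*.pid' files
somewhere under the vardir would need to be verified against the real
layout.

```python
import glob
import logging
import os

log = logging.getLogger("orphan-check")


def pid_alive(pid):
    """Check whether a process exists by sending it signal 0."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists, but belongs to another user.
        return True
    return True


def find_live_pid_files(vardir):
    """Return [(path, pid)] for PID files that refer to a live process."""
    live = []
    pattern = os.path.join(vardir, "**", "*.pid")
    for path in sorted(glob.glob(pattern, recursive=True)):
        try:
            with open(path) as f:
                pid = int(f.read().strip())
        except (OSError, ValueError):
            log.warning("unreadable pid file: %s", path)
            continue
        if pid <= 0:
            # kill() treats 0 and negative values as process groups.
            log.warning("bogus pid %d in %s", pid, path)
            continue
        if pid_alive(pid):
            log.warning("pid file %s refers to live process %d", path, pid)
            live.append((path, pid))
        else:
            log.info("stale pid file %s (process %d is gone)", path, pid)
    return live
```

Calling find_live_pid_files('/tmp/tnt') at the start of a job and
writing the result to the job log would show whether a server from a
previous round was still running when the new round began.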

[1]: https://gitlab.com/tarantool/tarantool/-/jobs/831790167
[2]: https://gitlab.com/tarantool/tarantool/-/jobs/831873727

> 
> > 
> > Resolved the issue [1]:
> > 
> >   [096] replication/ddl.test.lua                        memtx
> >   [096]
> >   [096] [Instance "ddl2" returns with non-zero exit code: 1]
> >   [096]
> >   [096] Last 15 lines of Tarantool Log file [Instance "ddl2"][/tmp/tnt/096_replication/ddl2.log]:
> >   ...
> >   [096] 2020-11-05 13:56:59.838 [10538] main/103/ddl2 I> bootstrapping replica from f4f59bcd-54bb-4308-a43c-c8ede1c84701 at unix/:/private/tmp/tnt/096_replication/autobootstrap4.sock
> >   [096] 2020-11-05 13:56:59.838 [10538] main/115/applier/cluster@unix/:/private/tmp/tnt/096_replication/autobootstrap4.sock I> can't read row
> >   [096] 2020-11-05 13:56:59.838 [10538] main/115/applier/cluster@unix/:/private/tmp/tnt/096_replication/autobootstrap4.sock box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
> >   [096] 2020-11-05 13:56:59.838 [10538] main/103/ddl2 box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
> >   [096] 2020-11-05 13:56:59.838 [10538] main/103/ddl2 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
> >   [096] 2020-11-05 13:56:59.838 [10538] main/103/ddl2 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
> >   [096] [ fail ]
> >   [096] Test "replication/ddl.test.lua", conf: "memtx"
> >   [096] 	from "fragile" list failed with results file checksum: "a006d40205b9a67ddbbb8206b4e1764c", rerunning with server restart ...
> >   [096] replication/ddl.test.lua                        memtx           [ fail ]
> >   [096] Test "replication/ddl.test.lua", conf: "memtx"
> >   [096] 	from "fragile" list failed with results file checksum: "a3962e843889def7f61d6f1f71461bf1", rerunning with server restart ...
> >   [096] replication/ddl.test.lua                        memtx           [ fail ]
> >   ...
> >   [096] Worker "096_replication" got failed test; restarted the server
> >   [096] replication/ddl.test.lua                        vinyl
> >   [096]
> >   [096] [Instance "ddl1" returns with non-zero exit code: 1]
> >   [096]
> >   [096] Last 15 lines of Tarantool Log file [Instance "ddl1"][/tmp/tnt/096_replication/ddl1.log]:
> >   [096] Stopping instance ddl1...
> >   [096] Starting instance ddl1...
> >   [096] The daemon is already running: PID 10536
> >   [096] Stopping instance ddl1...
> >   [096] Starting instance ddl1...
> >   [096] The daemon is already running: PID 10536
> >   ...
> > 
> >   [1] - https://gitlab.com/tarantool/tarantool/-/jobs/831873727#L4683
> > ---
> > 
> > Github: https://github.com/tarantool/tarantool/tree/avtikhon/tarantoolctl-pid-file
> > 
> >  extra/dist/tarantoolctl.in | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/extra/dist/tarantoolctl.in b/extra/dist/tarantoolctl.in
> > index 0726e7f46..acdb613fa 100755
> > --- a/extra/dist/tarantoolctl.in
> > +++ b/extra/dist/tarantoolctl.in
> > @@ -595,12 +595,11 @@ local function stop()
> >          return 1
> >      end
> >  
> > +    fio.unlink(pid_file)
> >      if ffi.C.kill(pid, 15) < 0 then
> >          log.error("Can't kill process %d: %s", pid, errno.strerror())
> > -        fio.unlink(pid_file)
> >          return 1
> >      end
> > -
> >      return 0
> >  end
> >  
> > -- 
> > 2.25.1
> >