Tarantool development patches archive
* [Tarantool-patches] [PATCH v1] tarantoolctl: fix pid file removal
@ 2020-11-05 14:20 Alexander V. Tikhonov
  2020-11-23 22:15 ` Alexander Turenko
  0 siblings, 1 reply; 3+ messages in thread
From: Alexander V. Tikhonov @ 2020-11-05 14:20 UTC (permalink / raw)
  To: Kirill Yukhin; +Cc: tarantool-patches, Alexander Turenko

Found that the stop() routine didn't remove the pid file after the
instance was successfully killed. This caused a problem with restarting
the instance in the same temporary workdir of the test: because the
instance kept the same name, the start_check() routine found the old
pid file and broke the instance start. To fix it, the pid file is now
removed in all situations, just before the signal is sent to the
instance process.

Resolved the issue [1]:

  [096] replication/ddl.test.lua                        memtx
  [096]
  [096] [Instance "ddl2" returns with non-zero exit code: 1]
  [096]
  [096] Last 15 lines of Tarantool Log file [Instance "ddl2"][/tmp/tnt/096_replication/ddl2.log]:
  ...
  [096] 2020-11-05 13:56:59.838 [10538] main/103/ddl2 I> bootstrapping replica from f4f59bcd-54bb-4308-a43c-c8ede1c84701 at unix/:/private/tmp/tnt/096_replication/autobootstrap4.sock
  [096] 2020-11-05 13:56:59.838 [10538] main/115/applier/cluster@unix/:/private/tmp/tnt/096_replication/autobootstrap4.sock I> can't read row
  [096] 2020-11-05 13:56:59.838 [10538] main/115/applier/cluster@unix/:/private/tmp/tnt/096_replication/autobootstrap4.sock box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
  [096] 2020-11-05 13:56:59.838 [10538] main/103/ddl2 box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
  [096] 2020-11-05 13:56:59.838 [10538] main/103/ddl2 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
  [096] 2020-11-05 13:56:59.838 [10538] main/103/ddl2 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
  [096] [ fail ]
  [096] Test "replication/ddl.test.lua", conf: "memtx"
  [096] 	from "fragile" list failed with results file checksum: "a006d40205b9a67ddbbb8206b4e1764c", rerunning with server restart ...
  [096] replication/ddl.test.lua                        memtx           [ fail ]
  [096] Test "replication/ddl.test.lua", conf: "memtx"
  [096] 	from "fragile" list failed with results file checksum: "a3962e843889def7f61d6f1f71461bf1", rerunning with server restart ...
  [096] replication/ddl.test.lua                        memtx           [ fail ]
  ...
  [096] Worker "096_replication" got failed test; restarted the server
  [096] replication/ddl.test.lua                        vinyl
  [096]
  [096] [Instance "ddl1" returns with non-zero exit code: 1]
  [096]
  [096] Last 15 lines of Tarantool Log file [Instance "ddl1"][/tmp/tnt/096_replication/ddl1.log]:
  [096] Stopping instance ddl1...
  [096] Starting instance ddl1...
  [096] The daemon is already running: PID 10536
  [096] Stopping instance ddl1...
  [096] Starting instance ddl1...
  [096] The daemon is already running: PID 10536
  ...

  [1] - https://gitlab.com/tarantool/tarantool/-/jobs/831873727#L4683
---

Github: https://github.com/tarantool/tarantool/tree/avtikhon/tarantoolctl-pid-file

 extra/dist/tarantoolctl.in | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/extra/dist/tarantoolctl.in b/extra/dist/tarantoolctl.in
index 0726e7f46..acdb613fa 100755
--- a/extra/dist/tarantoolctl.in
+++ b/extra/dist/tarantoolctl.in
@@ -595,12 +595,11 @@ local function stop()
         return 1
     end
 
+    fio.unlink(pid_file)
     if ffi.C.kill(pid, 15) < 0 then
         log.error("Can't kill process %d: %s", pid, errno.strerror())
-        fio.unlink(pid_file)
         return 1
     end
-
     return 0
 end
 
-- 
2.25.1


* Re: [Tarantool-patches] [PATCH v1] tarantoolctl: fix pid file removal
  2020-11-05 14:20 [Tarantool-patches] [PATCH v1] tarantoolctl: fix pid file removal Alexander V. Tikhonov
@ 2020-11-23 22:15 ` Alexander Turenko
  2020-11-23 23:19   ` Alexander Turenko
  0 siblings, 1 reply; 3+ messages in thread
From: Alexander Turenko @ 2020-11-23 22:15 UTC (permalink / raw)
  To: Alexander V. Tikhonov; +Cc: tarantool-patches

Please stop producing patches that attempt to fix a problem you are
unable to reproduce locally. Those attempts constantly fail, as you can
see. As the reviewer, I am obligated to provide a reason why the
proposed changes cannot land in master. So you spent time writing a
patch (instead of reproducing the problem), and I spent time proving
that it would not work. It would be better to ask for ideas about where
the problem may be when you are unable to reproduce it locally.

On Thu, Nov 05, 2020 at 05:20:10PM +0300, Alexander V. Tikhonov wrote:
> Found that the stop() routine didn't remove the pid file after the
> instance was successfully killed. This caused a problem with restarting
> the instance in the same temporary workdir of the test: because the
> instance kept the same name, the start_check() routine found the old
> pid file and broke the instance start. To fix it, the pid file is now
> removed in all situations, just before the signal is sent to the
> instance process.

I see the following key points:

1. Tarantool removes the PID file at exit (even after SIGINT / SIGTERM).
2. Tarantool complains about an attempt to start against a PID file
   that is locked by another instance (the PID file is opened with
   flopen()).
3. tarantoolctl complains about an attempt to start an instance against
   a PID file that contains the PID of an existing process, including a
   zombie process (a liveness check of this kind is sketched right
   after this list).
4. When Tarantool is killed by SIGKILL, crashes or is terminated in
   some other abnormal way, it is unable to remove the PID file at
   termination. However, at the next start it sees that the PID file is
   not locked by any process, so it successfully writes the new PID and
   starts up.
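
A minimal sketch of the liveness check behind point 3 -- my own
illustration, not the actual tarantoolctl code, and the helper name
pid_is_alive() is made up. kill(pid, 0) sends no signal and only
probes whether the process exists:

 | local ffi = require('ffi')
 | local errno = require('errno')
 | ffi.cdef('int kill(int pid, int sig);')
 |
 | local function pid_is_alive(pid)
 |     if ffi.C.kill(pid, 0) == 0 then
 |         return true            -- the process exists (maybe a zombie)
 |     end
 |     -- ESRCH: no such process.  EPERM: the process exists but belongs
 |     -- to another user; treat it as alive to stay on the safe side.
 |     return errno() == errno.EPERM
 | end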

Let's look at the original tarantoolctl code:

 | if ffi.C.kill(pid, 15) < 0 then
 |     log.error("Can't kill process %d: %s", pid, errno.strerror())
 |     fio.unlink(pid_file)
 |     return 1
 | end

There are three possible situations in which kill() may fail:

* EINVAL -- incorrect signal number, not our case.
* EPERM -- no permission to send the signal; we should not hit this
           errno value, because we run tarantoolctl either as root or
           as the same user as tarantool.
* ESRCH -- no such process. Our case.

When there is no such process, we can leave the old PID file or remove
it: it does not matter much. The next attempt to start the instance will
correctly handle the old PID file as long as the same PID has not been
reused by some other process (there may be a race, yes, but AFAIS PIDs
are not typically reused right after the old process terminates, so it
is unlikely that you hit this case).
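
To make the ESRCH branch concrete, here is a minimal sketch -- my own,
not the code under review and not a proposal -- that handles "no such
process" explicitly. It reuses the locals pid and pid_file from stop()
and the ffi, fio, log and errno modules already required by
tarantoolctl:

 | if ffi.C.kill(pid, 15) < 0 then
 |     if errno() == errno.ESRCH then
 |         -- The process is already gone; only a stale PID file is left.
 |         log.info('Process %d does not exist, removing stale %s',
 |                  pid, pid_file)
 |         fio.unlink(pid_file)
 |         return 0
 |     end
 |     log.error("Can't kill process %d: %s", pid, errno.strerror())
 |     return 1
 | end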

Back to your change:

> +    fio.unlink(pid_file)
>      if ffi.C.kill(pid, 15) < 0 then
>          log.error("Can't kill process %d: %s", pid, errno.strerror())
> -        fio.unlink(pid_file)
>          return 1
>      end

You want to remove the PID file before sending the signal (not to
mention before the full termination of the instance). AFAIU, this way
we open room for a race condition: one process is not terminated yet,
but another one will attempt to start (and will likely fail with a
different error, maybe about an attempt to use the same wal_dir).
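
For illustration only -- this is my sketch, not something proposed in
the thread, and the name stop_and_wait() is invented -- a race-free
ordering would send the signal first, wait until kill(pid, 0) reports
ESRCH, and only then unlink the PID file. The ffi, fio, log and errno
modules from tarantoolctl's scope are assumed:

 | local fiber = require('fiber')
 |
 | local function stop_and_wait(pid, pid_file, timeout)
 |     timeout = timeout or 10
 |     if ffi.C.kill(pid, 15) < 0 then
 |         log.error("Can't kill process %d: %s", pid, errno.strerror())
 |         return 1
 |     end
 |     local deadline = fiber.clock() + timeout
 |     while fiber.clock() < deadline do
 |         if ffi.C.kill(pid, 0) < 0 and errno() == errno.ESRCH then
 |             fio.unlink(pid_file)  -- the process is gone; safe to clean up
 |             return 0
 |         end
 |         fiber.sleep(0.1)
 |     end
 |     log.error('Process %d did not terminate within %d seconds',
 |               pid, timeout)
 |     return 1
 | end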

Anyway, the problem with the patch is not even that it breaks
something. I guess that if you reproduced the problem locally and
investigated it thoroughly (log everything suspicious on the test-run,
test and tarantool side, run, analyze, add more logs and repeat), you
would not end up with a change of this kind.

I ran the test in parallel many times and didn't see any failures.
However, when I ran two test-run instances from two terminals with the
same --vardir value, I got various failures, including tarantoolctl's
'The daemon is already running'. The failure you linked is on Mac OS,
where a shared --vardir is used. I would look in this direction.

I don't know how the testing is run on those machines, but I would ask
myself the following questions:

1. If two jobs are run simultaneously using one var directory, how can
   I verify that this is happening? Some logs? If there are no such
   logs, can we add them?
2. How can we verify that no 'orphan' tarantools (ones left over from a
   previous testing round) are present when starting a new testing
   round?
3. Do we need some test-run support for detecting situations of this
   kind? Say, using flock() on the var directory (a sketch follows
   right after this list).
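
A rough sketch of the flock() idea from question 3 -- the mechanism
only; the helper name lock_vardir() and the place where it would live
are my assumptions, not an existing test-run feature. The point is to
take an exclusive non-blocking lock on the var directory itself, so a
second test-run sharing the same --vardir fails loudly instead of
racing silently:

 | local ffi = require('ffi')
 | ffi.cdef([[
 |     int open(const char *pathname, int flags, ...);
 |     int flock(int fd, int operation);
 | ]])
 | -- These constants have the same values on Linux and Mac OS.
 | local O_RDONLY = 0
 | local LOCK_EX, LOCK_NB = 2, 4
 |
 | -- Returns true if we are the only user of this var directory.
 | local function lock_vardir(vardir)
 |     local fd = ffi.C.open(vardir, O_RDONLY)
 |     if fd < 0 then
 |         return false
 |     end
 |     -- Keep the fd open for the whole run; the lock is released
 |     -- automatically when the process exits or crashes.
 |     return ffi.C.flock(fd, LOCK_EX + LOCK_NB) == 0
 | end

A nice property of flock() is that a crashed test-run cannot leave a
stale lock behind, so no extra cleanup is needed.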

> 
> Resolved the issue [1]:
> 
>   [096] replication/ddl.test.lua                        memtx
>   [096]
>   [096] [Instance "ddl2" returns with non-zero exit code: 1]
>   [096]
>   [096] Last 15 lines of Tarantool Log file [Instance "ddl2"][/tmp/tnt/096_replication/ddl2.log]:
>   ...
>   [096] 2020-11-05 13:56:59.838 [10538] main/103/ddl2 I> bootstrapping replica from f4f59bcd-54bb-4308-a43c-c8ede1c84701 at unix/:/private/tmp/tnt/096_replication/autobootstrap4.sock
>   [096] 2020-11-05 13:56:59.838 [10538] main/115/applier/cluster@unix/:/private/tmp/tnt/096_replication/autobootstrap4.sock I> can't read row
>   [096] 2020-11-05 13:56:59.838 [10538] main/115/applier/cluster@unix/:/private/tmp/tnt/096_replication/autobootstrap4.sock box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
>   [096] 2020-11-05 13:56:59.838 [10538] main/103/ddl2 box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
>   [096] 2020-11-05 13:56:59.838 [10538] main/103/ddl2 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
>   [096] 2020-11-05 13:56:59.838 [10538] main/103/ddl2 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
>   [096] [ fail ]
>   [096] Test "replication/ddl.test.lua", conf: "memtx"
>   [096] 	from "fragile" list failed with results file checksum: "a006d40205b9a67ddbbb8206b4e1764c", rerunning with server restart ...
>   [096] replication/ddl.test.lua                        memtx           [ fail ]
>   [096] Test "replication/ddl.test.lua", conf: "memtx"
>   [096] 	from "fragile" list failed with results file checksum: "a3962e843889def7f61d6f1f71461bf1", rerunning with server restart ...
>   [096] replication/ddl.test.lua                        memtx           [ fail ]
>   ...
>   [096] Worker "096_replication" got failed test; restarted the server
>   [096] replication/ddl.test.lua                        vinyl
>   [096]
>   [096] [Instance "ddl1" returns with non-zero exit code: 1]
>   [096]
>   [096] Last 15 lines of Tarantool Log file [Instance "ddl1"][/tmp/tnt/096_replication/ddl1.log]:
>   [096] Stopping instance ddl1...
>   [096] Starting instance ddl1...
>   [096] The daemon is already running: PID 10536
>   [096] Stopping instance ddl1...
>   [096] Starting instance ddl1...
>   [096] The daemon is already running: PID 10536
>   ...
> 
>   [1] - https://gitlab.com/tarantool/tarantool/-/jobs/831873727#L4683
> ---
> 
> Github: https://github.com/tarantool/tarantool/tree/avtikhon/tarantoolctl-pid-file
> 
>  extra/dist/tarantoolctl.in | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/extra/dist/tarantoolctl.in b/extra/dist/tarantoolctl.in
> index 0726e7f46..acdb613fa 100755
> --- a/extra/dist/tarantoolctl.in
> +++ b/extra/dist/tarantoolctl.in
> @@ -595,12 +595,11 @@ local function stop()
>          return 1
>      end
>  
> +    fio.unlink(pid_file)
>      if ffi.C.kill(pid, 15) < 0 then
>          log.error("Can't kill process %d: %s", pid, errno.strerror())
> -        fio.unlink(pid_file)
>          return 1
>      end
> -
>      return 0
>  end
>  
> -- 
> 2.25.1
> 


* Re: [Tarantool-patches] [PATCH v1] tarantoolctl: fix pid file removal
  2020-11-23 22:15 ` Alexander Turenko
@ 2020-11-23 23:19   ` Alexander Turenko
  0 siblings, 0 replies; 3+ messages in thread
From: Alexander Turenko @ 2020-11-23 23:19 UTC (permalink / raw)
  To: Alexander V. Tikhonov; +Cc: tarantool-patches

> I ran the test in parallel many times and didn't see any failures.
> However, when I ran two test-run instances from two terminals with the
> same --vardir value, I got various failures, including tarantoolctl's
> 'The daemon is already running'. The failure you linked is on Mac OS,
> where a shared --vardir is used. I would look in this direction.
> 
> I don't know how the testing is run on those machines, but I would ask
> myself the following questions:
> 
> 1. If two jobs are run simultaneously using one var directory, how can
>    I verify that this is happening? Some logs? If there are no such
>    logs, can we add them?
> 2. How can we verify that no 'orphan' tarantools (ones left over from a
>    previous testing round) are present when starting a new testing
>    round?
> 3. Do we need some test-run support for detecting situations of this
>    kind? Say, using flock() on the var directory.

See more:

The job [1] was started at Nov 5, 2020 1:35pm GMT+0300 on the
tarantool_mac4 runner and took 13 minutes 34 seconds, so it ended
somewhere around 1:48pm-1:49pm. The vardir is /tmp/tnt. The worker
that runs replication/ddl.test.lua is named 096_replication.

The job [2] (the one you refer to below) was started at Nov 5, 2020
1:49pm GMT+0300 on the tarantool_mac4 runner and took 13 minutes 45
seconds, so it ended somewhere around 2:02pm-2:03pm. The vardir is
/tmp/tnt. The worker that runs replication/ddl.test.lua is named
096_replication.

The logs give more timestamps:

[087] +++ /tmp/tnt/rejects/small/obuf.reject	Thu Nov  5 13:54:13 2020
<replication/ddl.test.lua reruns>
[097] +++ /tmp/tnt/rejects/vinyl/iterator.reject	Thu Nov  5 14:02:51 2020

So the jobs were not run at the same time, but one followed the other
and the worker names were the same. Looks suspicious? I don't know. I
also see that the ddl.test.lua replicas use the same listen socket path
as the tests that use autobootstrap.lua and autobootstrap_anon.lua. So
maybe we failed to stop a server, but the error was suppressed by one
of the test-run patches you made?

Really, the only thing we can do about such problems is to make
guesses and add logs that can confirm or refute those guesses. We
should not try to blindly fix them: that is a highway to ever more
obscure problems.

[1]: https://gitlab.com/tarantool/tarantool/-/jobs/831790167
[2]: https://gitlab.com/tarantool/tarantool/-/jobs/831873727

> 
> > 
> > Resolved the issue [1]:
> > 
> >   [096] replication/ddl.test.lua                        memtx
> >   [096]
> >   [096] [Instance "ddl2" returns with non-zero exit code: 1]
> >   [096]
> >   [096] Last 15 lines of Tarantool Log file [Instance "ddl2"][/tmp/tnt/096_replication/ddl2.log]:
> >   ...
> >   [096] 2020-11-05 13:56:59.838 [10538] main/103/ddl2 I> bootstrapping replica from f4f59bcd-54bb-4308-a43c-c8ede1c84701 at unix/:/private/tmp/tnt/096_replication/autobootstrap4.sock
> >   [096] 2020-11-05 13:56:59.838 [10538] main/115/applier/cluster@unix/:/private/tmp/tnt/096_replication/autobootstrap4.sock I> can't read row
> >   [096] 2020-11-05 13:56:59.838 [10538] main/115/applier/cluster@unix/:/private/tmp/tnt/096_replication/autobootstrap4.sock box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
> >   [096] 2020-11-05 13:56:59.838 [10538] main/103/ddl2 box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
> >   [096] 2020-11-05 13:56:59.838 [10538] main/103/ddl2 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
> >   [096] 2020-11-05 13:56:59.838 [10538] main/103/ddl2 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
> >   [096] [ fail ]
> >   [096] Test "replication/ddl.test.lua", conf: "memtx"
> >   [096] 	from "fragile" list failed with results file checksum: "a006d40205b9a67ddbbb8206b4e1764c", rerunning with server restart ...
> >   [096] replication/ddl.test.lua                        memtx           [ fail ]
> >   [096] Test "replication/ddl.test.lua", conf: "memtx"
> >   [096] 	from "fragile" list failed with results file checksum: "a3962e843889def7f61d6f1f71461bf1", rerunning with server restart ...
> >   [096] replication/ddl.test.lua                        memtx           [ fail ]
> >   ...
> >   [096] Worker "096_replication" got failed test; restarted the server
> >   [096] replication/ddl.test.lua                        vinyl
> >   [096]
> >   [096] [Instance "ddl1" returns with non-zero exit code: 1]
> >   [096]
> >   [096] Last 15 lines of Tarantool Log file [Instance "ddl1"][/tmp/tnt/096_replication/ddl1.log]:
> >   [096] Stopping instance ddl1...
> >   [096] Starting instance ddl1...
> >   [096] The daemon is already running: PID 10536
> >   [096] Stopping instance ddl1...
> >   [096] Starting instance ddl1...
> >   [096] The daemon is already running: PID 10536
> >   ...
> > 
> >   [1] - https://gitlab.com/tarantool/tarantool/-/jobs/831873727#L4683
> > ---
> > 
> > Github: https://github.com/tarantool/tarantool/tree/avtikhon/tarantoolctl-pid-file
> > 
> >  extra/dist/tarantoolctl.in | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/extra/dist/tarantoolctl.in b/extra/dist/tarantoolctl.in
> > index 0726e7f46..acdb613fa 100755
> > --- a/extra/dist/tarantoolctl.in
> > +++ b/extra/dist/tarantoolctl.in
> > @@ -595,12 +595,11 @@ local function stop()
> >          return 1
> >      end
> >  
> > +    fio.unlink(pid_file)
> >      if ffi.C.kill(pid, 15) < 0 then
> >          log.error("Can't kill process %d: %s", pid, errno.strerror())
> > -        fio.unlink(pid_file)
> >          return 1
> >      end
> > -
> >      return 0
> >  end
> >  
> > -- 
> > 2.25.1
> > 

