Tarantool development patches archive
 help / color / mirror / Atom feed
* [tarantool-patches] [PATCH v3] Use SIGKILL to stop server replica
@ 2019-05-01 18:15 avtikhon
  2019-05-02  1:39 ` Alexander Turenko
  0 siblings, 1 reply; 4+ messages in thread
From: avtikhon @ 2019-05-01 18:15 UTC (permalink / raw)
  To: Alexander Turenko; +Cc: avtikhon, tarantool-patches

Used the signal option set to SIGKILL to stop server replica
routine to be able to stop the replica immediately to imitate
the replica crash and, then, wake up.
The current case happened when we wanted to set ERRINJ_WAL_DELAY
for a tarantool instance and then stop it. By default the
SIGTERM was used and was not sufficient there, because the
main thread stil waited for the stuck WAL thread after the
signal. In that case the replica finished reading the *.xlog
file, but master server didn't know about it and saved the
previous *.xlog file for replica for its restart. When the
signal was changed from default to 9 (SIGKILL) replica didn't
have a chance to read all data from *.xlog file due to it was
killed immediately. So after replica restart it removed the previous
*.xlog file after its reading.
The logic of the replication was tried to change, but met
the new issues, so the suggested fix at commit:
b5b4809cf2e6d48230eb9e4301eac188b080e0f4
was reverted at commit:
766cd3e1015f6f76460a748c37212fb4c8791500

[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029] ...
[029] wait_xlog(1) or fio.listdir('./master')
[029] ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] + - 00000000000000000305.xlog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000310.vylog
[029] + - 00000000000000000310.snap
[029] ...
[029] -- Stop the replica.
[029] test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029] ...
[029] wait_xlog(2) or fio.listdir('./master')
[029] ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] + - 00000000000000000316.xlog
[029] + - 00000000000000000316.vylog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000317.vylog
[029] + - 00000000000000000317.snap
[029] ...
[029] -- The xlog should only be deleted after the replica
[029] -- is unregistered.
[029]

Close #4162
---

Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-4162-stop-kill
Issue: https://github.com/tarantool/tarantool/issues/4162

 test/replication/gc.result   | 16 +++++++---------
 test/replication/gc.test.lua | 12 ++++++------
 2 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/test/replication/gc.result b/test/replication/gc.result
index 65785f47b..396e7ff10 100644
--- a/test/replication/gc.result
+++ b/test/replication/gc.result
@@ -252,20 +252,18 @@ wait_xlog(2) or fio.listdir('./master')
 ---
 - true
 ...
-test_run:cmd("switch replica")
+-- Imitate the replica crash and, then, wake up.
+-- Just 'stop server replica' (SIGTERM) is not sufficient to stop
+-- a tarantool instance when ERRINJ_WAL_DELAY is set, because
+-- "tarantool" thread wait for paused "wal" thread infinitely.
+test_run:cmd("stop server replica with signal=9")
 ---
 - true
 ...
--- Unblock the replica and break replication.
-box.error.injection.set("ERRINJ_WAL_DELAY", false)
----
-- ok
-...
-box.cfg{replication = {}}
+test_run:cmd("start server replica")
 ---
+- true
 ...
--- Restart the replica to reestablish replication.
-test_run:cmd("restart server replica")
 -- Wait for the replica to catch up.
 test_run:cmd("switch replica")
 ---
diff --git a/test/replication/gc.test.lua b/test/replication/gc.test.lua
index 890fe29ae..1ebf32cc8 100644
--- a/test/replication/gc.test.lua
+++ b/test/replication/gc.test.lua
@@ -122,12 +122,12 @@ fiber.sleep(0.1) -- wait for master to relay data
 -- the old snapshot.
 wait_gc(1) or box.info.gc()
 wait_xlog(2) or fio.listdir('./master')
-test_run:cmd("switch replica")
--- Unblock the replica and break replication.
-box.error.injection.set("ERRINJ_WAL_DELAY", false)
-box.cfg{replication = {}}
--- Restart the replica to reestablish replication.
-test_run:cmd("restart server replica")
+-- Imitate the replica crash and, then, wake up.
+-- Just 'stop server replica' (SIGTERM) is not sufficient to stop
+-- a tarantool instance when ERRINJ_WAL_DELAY is set, because
+-- "tarantool" thread wait for paused "wal" thread infinitely.
+test_run:cmd("stop server replica with signal=9")
+test_run:cmd("start server replica")
 -- Wait for the replica to catch up.
 test_run:cmd("switch replica")
 test_run:wait_cond(function() return box.space.test:count() == 310 end, 10)
-- 
2.17.1

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v3] Use SIGKILL to stop server replica
  2019-05-01 18:15 [tarantool-patches] [PATCH v3] Use SIGKILL to stop server replica avtikhon
@ 2019-05-02  1:39 ` Alexander Turenko
  2019-05-02  9:51   ` Vladimir Davydov
  0 siblings, 1 reply; 4+ messages in thread
From: Alexander Turenko @ 2019-05-02  1:39 UTC (permalink / raw)
  To: avtikhon, Vladimir Davydov; +Cc: tarantool-patches

Alexander, look at the comments below. I force-pushed my fixes (most of
them are for the commit message). Also rebased the branch upon master to
catch the new test-run.

Vladimir, can you please verify my description of the fix? I have put it
at end of the email. I see that my terminology differs from your: I
describe things w/o terms 'advance gc', 'recovering xlog' and so on
because I don't sure I can use them appropriately. Instead I tried to
write it in terms of externally visible behaviour: 'remove xlog',
'reading xlog' and so on.

The branch is avtikhon/gh-4162-stop-kill, the needed test-run commit is
in tarantool already (master, 2.1 and 1.10).

WBR, Alexander Turenko.

Added 'test:' for the commit message header.

On Wed, May 01, 2019 at 09:15:59PM +0300, avtikhon wrote:
> Used the signal option set to SIGKILL to stop server replica
> routine to be able to stop the replica immediately to imitate
> the replica crash and, then, wake up.
> The current case happened when we wanted to set ERRINJ_WAL_DELAY
> for a tarantool instance and then stop it. By default the
> SIGTERM was used and was not sufficient there, because the
> main thread stil waited for the stuck WAL thread after the
> signal. In that case the replica finished reading the *.xlog
> file, but master server didn't know about it and saved the
> previous *.xlog file for replica for its restart. When the
> signal was changed from default to 9 (SIGKILL) replica didn't
> have a chance to read all data from *.xlog file due to it was
> killed immediately. So after replica restart it removed the previous
> *.xlog file after its reading.
> The logic of the replication was tried to change, but met
> the new issues, so the suggested fix at commit:
> b5b4809cf2e6d48230eb9e4301eac188b080e0f4
> was reverted at commit:
> 766cd3e1015f6f76460a748c37212fb4c8791500

Cite from the previous [review comments][1]:

> I would clarify that here you solves two problems: the incorrect test
> case and flaky failures. Also it worth to mention an issue about the
> flaky failures.

I took your effort, but it is still not clear with the new description.
Why not start the message with 'two problems are fixed here', describe
the first one and then describe the second one? Or you are not agree
with this comment? Or it is not clear what I asked to do?

One behaviour is observed in the 'good' (incorrect) case, another one in
the 'bad' (failed) one. It seems that you described only the 'bad' case.

[1]: https://www.freelists.org/post/tarantool-patches/PATCH-v2-Use-SIGKILL-to-stop-server-replica,1

> 
> [029] --- replication/gc.result Mon Apr 15 14:58:09 2019
> [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
> [029] @@ -290,7 +290,12 @@
> [029] ...
> [029] wait_xlog(1) or fio.listdir('./master')
> [029] ---
> [048] replication/gc.test.lua vinyl [ fail ]
> [048]
> [048] Test failed! Result content mismatch:

048 fail is unrelevant.

> [029] -- true
> [029] +- - 00000000000000000305.vylog
> [029] + - 00000000000000000305.xlog
> [029] + - '512'
> [029] + - 00000000000000000310.xlog
> [029] + - 00000000000000000310.vylog
> [029] + - 00000000000000000310.snap
> [029] ...
> [029] -- Stop the replica.
> [029] test_run:cmd("stop server replica")
> [029] @@ -326,7 +331,13 @@
> [029] ...
> [029] wait_xlog(2) or fio.listdir('./master')
> [029] ---
> [029] -- true
> [029] +- - 00000000000000000305.xlog
> [029] + - 00000000000000000316.xlog
> [029] + - 00000000000000000316.vylog
> [029] + - '512'
> [029] + - 00000000000000000310.xlog
> [029] + - 00000000000000000317.vylog
> [029] + - 00000000000000000317.snap
> [029] ...
> [029] -- The xlog should only be deleted after the replica
> [029] -- is unregistered.
> [029]

I have cropped the 2nd fail (to keep the listing as short as possible)
and stated that next cases can fail, because of an influence of this
one.

> 
> Close #4162
> ---
> 
> Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-4162-stop-kill
> Issue: https://github.com/tarantool/tarantool/issues/4162
> 
>  test/replication/gc.result   | 16 +++++++---------
>  test/replication/gc.test.lua | 12 ++++++------
>  2 files changed, 13 insertions(+), 15 deletions(-)
> 
> diff --git a/test/replication/gc.test.lua b/test/replication/gc.test.lua
> index 890fe29ae..1ebf32cc8 100644
> --- a/test/replication/gc.test.lua
> +++ b/test/replication/gc.test.lua
> @@ -122,12 +122,12 @@ fiber.sleep(0.1) -- wait for master to relay data
>  -- the old snapshot.
>  wait_gc(1) or box.info.gc()
>  wait_xlog(2) or fio.listdir('./master')
> -test_run:cmd("switch replica")
> --- Unblock the replica and break replication.
> -box.error.injection.set("ERRINJ_WAL_DELAY", false)
> -box.cfg{replication = {}}
> --- Restart the replica to reestablish replication.
> -test_run:cmd("restart server replica")
> +-- Imitate the replica crash and, then, wake up.
> +-- Just 'stop server replica' (SIGTERM) is not sufficient to stop
> +-- a tarantool instance when ERRINJ_WAL_DELAY is set, because
> +-- "tarantool" thread wait for paused "wal" thread infinitely.
> +test_run:cmd("stop server replica with signal=9")

Changed to signal=KILL.

> +test_run:cmd("start server replica")
>  -- Wait for the replica to catch up.
>  test_run:cmd("switch replica")
>  test_run:wait_cond(function() return box.space.test:count() == 310 end, 10)
> -- 
> 2.17.1

My variant of the description:

test: fix replication/gc flaky failures

Two problems are fixed here. The first one is about correctness of the
test case. The second is about flaky failures.

About correctness. The test case contains the following lines:

 | test_run:cmd("switch replica")
 | -- Unblock the replica and break replication.
 | box.error.injection.set("ERRINJ_WAL_DELAY", false)
 | box.cfg{replication = {}}

Usually rows are applied and the new vclock is sent to the master before
replication will be disabled. So the master removes old xlog before the
replica restart and the next case tests nothing.

This commit uses the new test-run's ability to stop a tarantool instance
with a custom signal and stops the replica with SIGKILL w/o dropping
ERRINJ_WAL_DELAY. This change fixes the race between applying rows and
disabling replication and so makes the test case correct.

About flaky failures. They were look like so:

 | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019
 | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
 | [029] @@ -290,7 +290,12 @@
 | [029] ...
 | [029] wait_xlog(1) or fio.listdir('./master')
 | [029] ---
 | [029] -- true
 | [029] +- - 00000000000000000305.vylog
 | [029] + - 00000000000000000305.xlog
 | [029] + - '512'
 | [029] + - 00000000000000000310.xlog
 | [029] + - 00000000000000000310.vylog
 | [029] + - 00000000000000000310.snap
 | [029] ...
 | [029] -- Stop the replica.
 | [029] test_run:cmd("stop server replica")
 | <...next cases could have induced mismathes too...>

The reason of the fail is that a replica applied all rows from the old
xlog, but didn't sent an ACK with a new vclock to a master, because the
replication was disabled before that. The master stops relay and keeps
the old xlog. When the replica starts again it subscribes with the
vclock value that instructs a relay to open the new xlog.

Tarantool can remove an old xlog just after a replica's ACK when
observes that the xlog was fully read by all replicas. But tarantool
does not remove xlogs when a replica is subscribed. This is not a big
problem, because such 'stuck' xlog file will be removed with a next xlog
removal.

There was the attempt to fix this behaviour and remove old xlogs at
subscribe, see the following commits:

* b5b4809cf2e6d48230eb9e4301eac188b080e0f4 ('replication: update replica
  gc state on subscribe');
* 766cd3e1015f6f76460a748c37212fb4c8791500 ('Revert "replication: update
  replica gc state on subscribe"').

Anyway, this commit fixes this flaky failures, because stops the replica
before applying rows from the old xlog. So when the replica starts it
continues reading from the old xlog and the xlog file will be removed
when will be fully read.

Closes #4162

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v3] Use SIGKILL to stop server replica
  2019-05-02  1:39 ` Alexander Turenko
@ 2019-05-02  9:51   ` Vladimir Davydov
  2019-05-02 22:41     ` Alexander Turenko
  0 siblings, 1 reply; 4+ messages in thread
From: Vladimir Davydov @ 2019-05-02  9:51 UTC (permalink / raw)
  To: Alexander Turenko; +Cc: avtikhon, tarantool-patches

On Thu, May 02, 2019 at 04:39:45AM +0300, Alexander Turenko wrote:
> Vladimir, can you please verify my description of the fix? I have put it
> at end of the email. I see that my terminology differs from your: I
> describe things w/o terms 'advance gc', 'recovering xlog' and so on
> because I don't sure I can use them appropriately. Instead I tried to
> write it in terms of externally visible behaviour: 'remove xlog',
> 'reading xlog' and so on.
> 
> My variant of the description:
> 
> test: fix replication/gc flaky failures
> 
> Two problems are fixed here. The first one is about correctness of the
> test case. The second is about flaky failures.
> 
> About correctness. The test case contains the following lines:
> 
>  | test_run:cmd("switch replica")
>  | -- Unblock the replica and break replication.
>  | box.error.injection.set("ERRINJ_WAL_DELAY", false)
>  | box.cfg{replication = {}}
> 
> Usually rows are applied and the new vclock is sent to the master before
> replication will be disabled. So the master removes old xlog before the
> replica restart and the next case tests nothing.
> 
> This commit uses the new test-run's ability to stop a tarantool instance
> with a custom signal and stops the replica with SIGKILL w/o dropping
> ERRINJ_WAL_DELAY. This change fixes the race between applying rows and
> disabling replication and so makes the test case correct.
> 
> About flaky failures. They were look like so:
> 
>  | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019
>  | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
>  | [029] @@ -290,7 +290,12 @@
>  | [029] ...
>  | [029] wait_xlog(1) or fio.listdir('./master')
>  | [029] ---
>  | [029] -- true
>  | [029] +- - 00000000000000000305.vylog
>  | [029] + - 00000000000000000305.xlog
>  | [029] + - '512'
>  | [029] + - 00000000000000000310.xlog
>  | [029] + - 00000000000000000310.vylog
>  | [029] + - 00000000000000000310.snap
>  | [029] ...
>  | [029] -- Stop the replica.
>  | [029] test_run:cmd("stop server replica")
>  | <...next cases could have induced mismathes too...>
> 
> The reason of the fail is that a replica applied all rows from the old
> xlog, but didn't sent an ACK with a new vclock to a master, because the
> replication was disabled before that. The master stops relay and keeps
> the old xlog. When the replica starts again it subscribes with the
> vclock value that instructs a relay to open the new xlog.
> 
> Tarantool can remove an old xlog just after a replica's ACK when
> observes that the xlog was fully read by all replicas. But tarantool
> does not remove xlogs when a replica is subscribed. This is not a big
> problem, because such 'stuck' xlog file will be removed with a next xlog
> removal.
> 
> There was the attempt to fix this behaviour and remove old xlogs at
> subscribe, see the following commits:
> 
> * b5b4809cf2e6d48230eb9e4301eac188b080e0f4 ('replication: update replica
>   gc state on subscribe');
> * 766cd3e1015f6f76460a748c37212fb4c8791500 ('Revert "replication: update
>   replica gc state on subscribe"').
> 
> Anyway, this commit fixes this flaky failures, because stops the replica
> before applying rows from the old xlog. So when the replica starts it
> continues reading from the old xlog and the xlog file will be removed
> when will be fully read.
> 
> Closes #4162

Looks good to me, thanks.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v3] Use SIGKILL to stop server replica
  2019-05-02  9:51   ` Vladimir Davydov
@ 2019-05-02 22:41     ` Alexander Turenko
  0 siblings, 0 replies; 4+ messages in thread
From: Alexander Turenko @ 2019-05-02 22:41 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: avtikhon, tarantool-patches

On Thu, May 02, 2019 at 12:51:29PM +0300, Vladimir Davydov wrote:
> On Thu, May 02, 2019 at 04:39:45AM +0300, Alexander Turenko wrote:
> > Vladimir, can you please verify my description of the fix? I have put it
> > at end of the email. I see that my terminology differs from your: I
> > describe things w/o terms 'advance gc', 'recovering xlog' and so on
> > because I don't sure I can use them appropriately. Instead I tried to
> > write it in terms of externally visible behaviour: 'remove xlog',
> > 'reading xlog' and so on.
> > 
> > My variant of the description:
> > 
> > test: fix replication/gc flaky failures
> > 
> > Two problems are fixed here. The first one is about correctness of the
> > test case. The second is about flaky failures.
> > 
> > About correctness. The test case contains the following lines:
> > 
> >  | test_run:cmd("switch replica")
> >  | -- Unblock the replica and break replication.
> >  | box.error.injection.set("ERRINJ_WAL_DELAY", false)
> >  | box.cfg{replication = {}}
> > 
> > Usually rows are applied and the new vclock is sent to the master before
> > replication will be disabled. So the master removes old xlog before the
> > replica restart and the next case tests nothing.
> > 
> > This commit uses the new test-run's ability to stop a tarantool instance
> > with a custom signal and stops the replica with SIGKILL w/o dropping
> > ERRINJ_WAL_DELAY. This change fixes the race between applying rows and
> > disabling replication and so makes the test case correct.
> > 
> > About flaky failures. They were look like so:
> > 
> >  | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019
> >  | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
> >  | [029] @@ -290,7 +290,12 @@
> >  | [029] ...
> >  | [029] wait_xlog(1) or fio.listdir('./master')
> >  | [029] ---
> >  | [029] -- true
> >  | [029] +- - 00000000000000000305.vylog
> >  | [029] + - 00000000000000000305.xlog
> >  | [029] + - '512'
> >  | [029] + - 00000000000000000310.xlog
> >  | [029] + - 00000000000000000310.vylog
> >  | [029] + - 00000000000000000310.snap
> >  | [029] ...
> >  | [029] -- Stop the replica.
> >  | [029] test_run:cmd("stop server replica")
> >  | <...next cases could have induced mismathes too...>
> > 
> > The reason of the fail is that a replica applied all rows from the old
> > xlog, but didn't sent an ACK with a new vclock to a master, because the
> > replication was disabled before that. The master stops relay and keeps
> > the old xlog. When the replica starts again it subscribes with the
> > vclock value that instructs a relay to open the new xlog.
> > 
> > Tarantool can remove an old xlog just after a replica's ACK when
> > observes that the xlog was fully read by all replicas. But tarantool
> > does not remove xlogs when a replica is subscribed. This is not a big
> > problem, because such 'stuck' xlog file will be removed with a next xlog
> > removal.
> > 
> > There was the attempt to fix this behaviour and remove old xlogs at
> > subscribe, see the following commits:
> > 
> > * b5b4809cf2e6d48230eb9e4301eac188b080e0f4 ('replication: update replica
> >   gc state on subscribe');
> > * 766cd3e1015f6f76460a748c37212fb4c8791500 ('Revert "replication: update
> >   replica gc state on subscribe"').
> > 
> > Anyway, this commit fixes this flaky failures, because stops the replica
> > before applying rows from the old xlog. So when the replica starts it
> > continues reading from the old xlog and the xlog file will be removed
> > when will be fully read.
> > 
> > Closes #4162
> 
> Looks good to me, thanks.

Pushed to master and 2.1.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2019-05-02 22:41 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-01 18:15 [tarantool-patches] [PATCH v3] Use SIGKILL to stop server replica avtikhon
2019-05-02  1:39 ` Alexander Turenko
2019-05-02  9:51   ` Vladimir Davydov
2019-05-02 22:41     ` Alexander Turenko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox