[PATCH v3] Use SIGKILL to stop server replica

Alexander Turenko alexander.turenko at tarantool.org
Fri May 3 01:41:49 MSK 2019


On Thu, May 02, 2019 at 12:51:29PM +0300, Vladimir Davydov wrote:
> On Thu, May 02, 2019 at 04:39:45AM +0300, Alexander Turenko wrote:
> > Vladimir, can you please verify my description of the fix? I have put it
> > at end of the email. I see that my terminology differs from your: I
> > describe things w/o terms 'advance gc', 'recovering xlog' and so on
> > because I don't sure I can use them appropriately. Instead I tried to
> > write it in terms of externally visible behaviour: 'remove xlog',
> > 'reading xlog' and so on.
> > 
> > My variant of the description:
> > 
> > test: fix replication/gc flaky failures
> > 
> > Two problems are fixed here. The first one is about correctness of the
> > test case. The second is about flaky failures.
> > 
> > About correctness. The test case contains the following lines:
> > 
> >  | test_run:cmd("switch replica")
> >  | -- Unblock the replica and break replication.
> >  | box.error.injection.set("ERRINJ_WAL_DELAY", false)
> >  | box.cfg{replication = {}}
> > 
> > Usually rows are applied and the new vclock is sent to the master before
> > replication will be disabled. So the master removes old xlog before the
> > replica restart and the next case tests nothing.
> > 
> > This commit uses the new test-run's ability to stop a tarantool instance
> > with a custom signal and stops the replica with SIGKILL w/o dropping
> > ERRINJ_WAL_DELAY. This change fixes the race between applying rows and
> > disabling replication and so makes the test case correct.
> > 
> > About flaky failures. They were look like so:
> > 
> >  | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019
> >  | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
> >  | [029] @@ -290,7 +290,12 @@
> >  | [029] ...
> >  | [029] wait_xlog(1) or fio.listdir('./master')
> >  | [029] ---
> >  | [029] -- true
> >  | [029] +- - 00000000000000000305.vylog
> >  | [029] + - 00000000000000000305.xlog
> >  | [029] + - '512'
> >  | [029] + - 00000000000000000310.xlog
> >  | [029] + - 00000000000000000310.vylog
> >  | [029] + - 00000000000000000310.snap
> >  | [029] ...
> >  | [029] -- Stop the replica.
> >  | [029] test_run:cmd("stop server replica")
> >  | <...next cases could have induced mismathes too...>
> > 
> > The reason of the fail is that a replica applied all rows from the old
> > xlog, but didn't sent an ACK with a new vclock to a master, because the
> > replication was disabled before that. The master stops relay and keeps
> > the old xlog. When the replica starts again it subscribes with the
> > vclock value that instructs a relay to open the new xlog.
> > 
> > Tarantool can remove an old xlog just after a replica's ACK when
> > observes that the xlog was fully read by all replicas. But tarantool
> > does not remove xlogs when a replica is subscribed. This is not a big
> > problem, because such 'stuck' xlog file will be removed with a next xlog
> > removal.
> > 
> > There was the attempt to fix this behaviour and remove old xlogs at
> > subscribe, see the following commits:
> > 
> > * b5b4809cf2e6d48230eb9e4301eac188b080e0f4 ('replication: update replica
> >   gc state on subscribe');
> > * 766cd3e1015f6f76460a748c37212fb4c8791500 ('Revert "replication: update
> >   replica gc state on subscribe"').
> > 
> > Anyway, this commit fixes this flaky failures, because stops the replica
> > before applying rows from the old xlog. So when the replica starts it
> > continues reading from the old xlog and the xlog file will be removed
> > when will be fully read.
> > 
> > Closes #4162
> 
> Looks good to me, thanks.

Pushed to master and 2.1.



More information about the Tarantool-patches mailing list