[PATCH v3] Use SIGKILL to stop server replica

Vladimir Davydov vdavydov.dev at gmail.com
Thu May 2 12:51:29 MSK 2019


On Thu, May 02, 2019 at 04:39:45AM +0300, Alexander Turenko wrote:
> Vladimir, can you please verify my description of the fix? I have put it
> at end of the email. I see that my terminology differs from your: I
> describe things w/o terms 'advance gc', 'recovering xlog' and so on
> because I don't sure I can use them appropriately. Instead I tried to
> write it in terms of externally visible behaviour: 'remove xlog',
> 'reading xlog' and so on.
> 
> My variant of the description:
> 
> test: fix replication/gc flaky failures
> 
> Two problems are fixed here. The first one is about correctness of the
> test case. The second is about flaky failures.
> 
> About correctness. The test case contains the following lines:
> 
>  | test_run:cmd("switch replica")
>  | -- Unblock the replica and break replication.
>  | box.error.injection.set("ERRINJ_WAL_DELAY", false)
>  | box.cfg{replication = {}}
> 
> Usually rows are applied and the new vclock is sent to the master before
> replication will be disabled. So the master removes old xlog before the
> replica restart and the next case tests nothing.
> 
> This commit uses the new test-run's ability to stop a tarantool instance
> with a custom signal and stops the replica with SIGKILL w/o dropping
> ERRINJ_WAL_DELAY. This change fixes the race between applying rows and
> disabling replication and so makes the test case correct.
> 
> About flaky failures. They were look like so:
> 
>  | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019
>  | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
>  | [029] @@ -290,7 +290,12 @@
>  | [029] ...
>  | [029] wait_xlog(1) or fio.listdir('./master')
>  | [029] ---
>  | [029] -- true
>  | [029] +- - 00000000000000000305.vylog
>  | [029] + - 00000000000000000305.xlog
>  | [029] + - '512'
>  | [029] + - 00000000000000000310.xlog
>  | [029] + - 00000000000000000310.vylog
>  | [029] + - 00000000000000000310.snap
>  | [029] ...
>  | [029] -- Stop the replica.
>  | [029] test_run:cmd("stop server replica")
>  | <...next cases could have induced mismathes too...>
> 
> The reason of the fail is that a replica applied all rows from the old
> xlog, but didn't sent an ACK with a new vclock to a master, because the
> replication was disabled before that. The master stops relay and keeps
> the old xlog. When the replica starts again it subscribes with the
> vclock value that instructs a relay to open the new xlog.
> 
> Tarantool can remove an old xlog just after a replica's ACK when
> observes that the xlog was fully read by all replicas. But tarantool
> does not remove xlogs when a replica is subscribed. This is not a big
> problem, because such 'stuck' xlog file will be removed with a next xlog
> removal.
> 
> There was the attempt to fix this behaviour and remove old xlogs at
> subscribe, see the following commits:
> 
> * b5b4809cf2e6d48230eb9e4301eac188b080e0f4 ('replication: update replica
>   gc state on subscribe');
> * 766cd3e1015f6f76460a748c37212fb4c8791500 ('Revert "replication: update
>   replica gc state on subscribe"').
> 
> Anyway, this commit fixes this flaky failures, because stops the replica
> before applying rows from the old xlog. So when the replica starts it
> continues reading from the old xlog and the xlog file will be removed
> when will be fully read.
> 
> Closes #4162

Looks good to me, thanks.



More information about the Tarantool-patches mailing list