From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Fri, 3 May 2019 01:41:49 +0300 From: Alexander Turenko Subject: Re: [PATCH v3] Use SIGKILL to stop server replica Message-ID: <20190502224149.avz4yt72eyc2txge@tkn_work_nb> References: <5e615fda2058511403a5520840f9fe8621fd1db1.1556734535.git.avtikhon@tarantool.org> <20190502013945.rgo5qjwgzv7p3wlg@tkn_work_nb> <20190502095129.dhufqnds2f3duwxa@esperanza> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20190502095129.dhufqnds2f3duwxa@esperanza> To: Vladimir Davydov Cc: avtikhon , tarantool-patches@freelists.org List-ID: On Thu, May 02, 2019 at 12:51:29PM +0300, Vladimir Davydov wrote: > On Thu, May 02, 2019 at 04:39:45AM +0300, Alexander Turenko wrote: > > Vladimir, can you please verify my description of the fix? I have put it > > at end of the email. I see that my terminology differs from your: I > > describe things w/o terms 'advance gc', 'recovering xlog' and so on > > because I don't sure I can use them appropriately. Instead I tried to > > write it in terms of externally visible behaviour: 'remove xlog', > > 'reading xlog' and so on. > > > > My variant of the description: > > > > test: fix replication/gc flaky failures > > > > Two problems are fixed here. The first one is about correctness of the > > test case. The second is about flaky failures. > > > > About correctness. The test case contains the following lines: > > > > | test_run:cmd("switch replica") > > | -- Unblock the replica and break replication. > > | box.error.injection.set("ERRINJ_WAL_DELAY", false) > > | box.cfg{replication = {}} > > > > Usually rows are applied and the new vclock is sent to the master before > > replication will be disabled. So the master removes old xlog before the > > replica restart and the next case tests nothing. > > > > This commit uses the new test-run's ability to stop a tarantool instance > > with a custom signal and stops the replica with SIGKILL w/o dropping > > ERRINJ_WAL_DELAY. This change fixes the race between applying rows and > > disabling replication and so makes the test case correct. > > > > About flaky failures. They were look like so: > > > > | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019 > > | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019 > > | [029] @@ -290,7 +290,12 @@ > > | [029] ... > > | [029] wait_xlog(1) or fio.listdir('./master') > > | [029] --- > > | [029] -- true > > | [029] +- - 00000000000000000305.vylog > > | [029] + - 00000000000000000305.xlog > > | [029] + - '512' > > | [029] + - 00000000000000000310.xlog > > | [029] + - 00000000000000000310.vylog > > | [029] + - 00000000000000000310.snap > > | [029] ... > > | [029] -- Stop the replica. > > | [029] test_run:cmd("stop server replica") > > | <...next cases could have induced mismathes too...> > > > > The reason of the fail is that a replica applied all rows from the old > > xlog, but didn't sent an ACK with a new vclock to a master, because the > > replication was disabled before that. The master stops relay and keeps > > the old xlog. When the replica starts again it subscribes with the > > vclock value that instructs a relay to open the new xlog. > > > > Tarantool can remove an old xlog just after a replica's ACK when > > observes that the xlog was fully read by all replicas. But tarantool > > does not remove xlogs when a replica is subscribed. This is not a big > > problem, because such 'stuck' xlog file will be removed with a next xlog > > removal. > > > > There was the attempt to fix this behaviour and remove old xlogs at > > subscribe, see the following commits: > > > > * b5b4809cf2e6d48230eb9e4301eac188b080e0f4 ('replication: update replica > > gc state on subscribe'); > > * 766cd3e1015f6f76460a748c37212fb4c8791500 ('Revert "replication: update > > replica gc state on subscribe"'). > > > > Anyway, this commit fixes this flaky failures, because stops the replica > > before applying rows from the old xlog. So when the replica starts it > > continues reading from the old xlog and the xlog file will be removed > > when will be fully read. > > > > Closes #4162 > > Looks good to me, thanks. Pushed to master and 2.1.