From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Thu, 2 May 2019 12:51:29 +0300 From: Vladimir Davydov Subject: Re: [PATCH v3] Use SIGKILL to stop server replica Message-ID: <20190502095129.dhufqnds2f3duwxa@esperanza> References: <5e615fda2058511403a5520840f9fe8621fd1db1.1556734535.git.avtikhon@tarantool.org> <20190502013945.rgo5qjwgzv7p3wlg@tkn_work_nb> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20190502013945.rgo5qjwgzv7p3wlg@tkn_work_nb> To: Alexander Turenko Cc: avtikhon , tarantool-patches@freelists.org List-ID: On Thu, May 02, 2019 at 04:39:45AM +0300, Alexander Turenko wrote: > Vladimir, can you please verify my description of the fix? I have put it > at end of the email. I see that my terminology differs from your: I > describe things w/o terms 'advance gc', 'recovering xlog' and so on > because I don't sure I can use them appropriately. Instead I tried to > write it in terms of externally visible behaviour: 'remove xlog', > 'reading xlog' and so on. > > My variant of the description: > > test: fix replication/gc flaky failures > > Two problems are fixed here. The first one is about correctness of the > test case. The second is about flaky failures. > > About correctness. The test case contains the following lines: > > | test_run:cmd("switch replica") > | -- Unblock the replica and break replication. > | box.error.injection.set("ERRINJ_WAL_DELAY", false) > | box.cfg{replication = {}} > > Usually rows are applied and the new vclock is sent to the master before > replication will be disabled. So the master removes old xlog before the > replica restart and the next case tests nothing. > > This commit uses the new test-run's ability to stop a tarantool instance > with a custom signal and stops the replica with SIGKILL w/o dropping > ERRINJ_WAL_DELAY. This change fixes the race between applying rows and > disabling replication and so makes the test case correct. > > About flaky failures. They were look like so: > > | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019 > | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019 > | [029] @@ -290,7 +290,12 @@ > | [029] ... > | [029] wait_xlog(1) or fio.listdir('./master') > | [029] --- > | [029] -- true > | [029] +- - 00000000000000000305.vylog > | [029] + - 00000000000000000305.xlog > | [029] + - '512' > | [029] + - 00000000000000000310.xlog > | [029] + - 00000000000000000310.vylog > | [029] + - 00000000000000000310.snap > | [029] ... > | [029] -- Stop the replica. > | [029] test_run:cmd("stop server replica") > | <...next cases could have induced mismathes too...> > > The reason of the fail is that a replica applied all rows from the old > xlog, but didn't sent an ACK with a new vclock to a master, because the > replication was disabled before that. The master stops relay and keeps > the old xlog. When the replica starts again it subscribes with the > vclock value that instructs a relay to open the new xlog. > > Tarantool can remove an old xlog just after a replica's ACK when > observes that the xlog was fully read by all replicas. But tarantool > does not remove xlogs when a replica is subscribed. This is not a big > problem, because such 'stuck' xlog file will be removed with a next xlog > removal. > > There was the attempt to fix this behaviour and remove old xlogs at > subscribe, see the following commits: > > * b5b4809cf2e6d48230eb9e4301eac188b080e0f4 ('replication: update replica > gc state on subscribe'); > * 766cd3e1015f6f76460a748c37212fb4c8791500 ('Revert "replication: update > replica gc state on subscribe"'). > > Anyway, this commit fixes this flaky failures, because stops the replica > before applying rows from the old xlog. So when the replica starts it > continues reading from the old xlog and the xlog file will be removed > when will be fully read. > > Closes #4162 Looks good to me, thanks.