Tarantool development patches archive
 help / color / mirror / Atom feed
From: Alexander Turenko <alexander.turenko@tarantool.org>
To: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: avtikhon <avtikhon@tarantool.org>, tarantool-patches@freelists.org
Subject: Re: [PATCH v3] Use SIGKILL to stop server replica
Date: Fri, 3 May 2019 01:41:49 +0300	[thread overview]
Message-ID: <20190502224149.avz4yt72eyc2txge@tkn_work_nb> (raw)
In-Reply-To: <20190502095129.dhufqnds2f3duwxa@esperanza>

On Thu, May 02, 2019 at 12:51:29PM +0300, Vladimir Davydov wrote:
> On Thu, May 02, 2019 at 04:39:45AM +0300, Alexander Turenko wrote:
> > Vladimir, can you please verify my description of the fix? I have put it
> > at end of the email. I see that my terminology differs from your: I
> > describe things w/o terms 'advance gc', 'recovering xlog' and so on
> > because I don't sure I can use them appropriately. Instead I tried to
> > write it in terms of externally visible behaviour: 'remove xlog',
> > 'reading xlog' and so on.
> > 
> > My variant of the description:
> > 
> > test: fix replication/gc flaky failures
> > 
> > Two problems are fixed here. The first one is about correctness of the
> > test case. The second is about flaky failures.
> > 
> > About correctness. The test case contains the following lines:
> > 
> >  | test_run:cmd("switch replica")
> >  | -- Unblock the replica and break replication.
> >  | box.error.injection.set("ERRINJ_WAL_DELAY", false)
> >  | box.cfg{replication = {}}
> > 
> > Usually rows are applied and the new vclock is sent to the master before
> > replication will be disabled. So the master removes old xlog before the
> > replica restart and the next case tests nothing.
> > 
> > This commit uses the new test-run's ability to stop a tarantool instance
> > with a custom signal and stops the replica with SIGKILL w/o dropping
> > ERRINJ_WAL_DELAY. This change fixes the race between applying rows and
> > disabling replication and so makes the test case correct.
> > 
> > About flaky failures. They were look like so:
> > 
> >  | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019
> >  | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
> >  | [029] @@ -290,7 +290,12 @@
> >  | [029] ...
> >  | [029] wait_xlog(1) or fio.listdir('./master')
> >  | [029] ---
> >  | [029] -- true
> >  | [029] +- - 00000000000000000305.vylog
> >  | [029] + - 00000000000000000305.xlog
> >  | [029] + - '512'
> >  | [029] + - 00000000000000000310.xlog
> >  | [029] + - 00000000000000000310.vylog
> >  | [029] + - 00000000000000000310.snap
> >  | [029] ...
> >  | [029] -- Stop the replica.
> >  | [029] test_run:cmd("stop server replica")
> >  | <...next cases could have induced mismathes too...>
> > 
> > The reason of the fail is that a replica applied all rows from the old
> > xlog, but didn't sent an ACK with a new vclock to a master, because the
> > replication was disabled before that. The master stops relay and keeps
> > the old xlog. When the replica starts again it subscribes with the
> > vclock value that instructs a relay to open the new xlog.
> > 
> > Tarantool can remove an old xlog just after a replica's ACK when
> > observes that the xlog was fully read by all replicas. But tarantool
> > does not remove xlogs when a replica is subscribed. This is not a big
> > problem, because such 'stuck' xlog file will be removed with a next xlog
> > removal.
> > 
> > There was the attempt to fix this behaviour and remove old xlogs at
> > subscribe, see the following commits:
> > 
> > * b5b4809cf2e6d48230eb9e4301eac188b080e0f4 ('replication: update replica
> >   gc state on subscribe');
> > * 766cd3e1015f6f76460a748c37212fb4c8791500 ('Revert "replication: update
> >   replica gc state on subscribe"').
> > 
> > Anyway, this commit fixes this flaky failures, because stops the replica
> > before applying rows from the old xlog. So when the replica starts it
> > continues reading from the old xlog and the xlog file will be removed
> > when will be fully read.
> > 
> > Closes #4162
> 
> Looks good to me, thanks.

Pushed to master and 2.1.

      reply	other threads:[~2019-05-02 22:41 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-01 18:15 [tarantool-patches] " avtikhon
2019-05-02  1:39 ` Alexander Turenko
2019-05-02  9:51   ` Vladimir Davydov
2019-05-02 22:41     ` Alexander Turenko [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190502224149.avz4yt72eyc2txge@tkn_work_nb \
    --to=alexander.turenko@tarantool.org \
    --cc=avtikhon@tarantool.org \
    --cc=tarantool-patches@freelists.org \
    --cc=vdavydov.dev@gmail.com \
    --subject='Re: [PATCH v3] Use SIGKILL to stop server replica' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox