From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Sat, 29 Dec 2018 12:53:26 +0300
From: Vladimir Davydov
Subject: Re: [tarantool-patches] Re: [PATCH 2/5] relay: do not try to scan xlog if exiting
Message-ID: <20181229095325.oojkito2hmr25wkl@esperanza>
References: <20181229091450.GE17043@chai>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20181229091450.GE17043@chai>
To: Konstantin Osipov
Cc: tarantool-patches@freelists.org
List-ID:

On Sat, Dec 29, 2018 at 12:14:50PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov [18/12/29 10:00]:
> > relay_process_wal_event() may be called if the relay fiber is already
> > exiting, e.g. by wal_clear_watcher(). We must not try to scan xlogs in
> > this case, because we could have written an incomplete packet fragment
> > to the replication socket, as described in the previous commit message,
> > so that writing another row would lead to corrupted replication stream
> > and, as a result, permanent replication breakdown.
> >
> > Actually, there was a check for this case in relay_process_wal_event(),
> > but it was broken by commit adc28591f77f ("replication: do not delete
> > relay on applier disconnect"), which replaced it with a relay->status
> > check, which is completely wrong, because relay->status is reset only
> > after the relay thread exits.
> >
> > Part of #3910
> > ---
> >  src/box/relay.cc | 11 ++++++++---
> >  1 file changed, 8 insertions(+), 3 deletions(-)
> >
> > diff --git a/src/box/relay.cc b/src/box/relay.cc
> > index a01c2a2e..3d9703ea 100644
> > --- a/src/box/relay.cc
> > +++ b/src/box/relay.cc
> > @@ -409,10 +409,15 @@ static void
> >  relay_process_wal_event(struct wal_watcher *watcher, unsigned events)
> >  {
> >      struct relay *relay = container_of(watcher, struct relay, wal_watcher);
> > -    if (relay->state != RELAY_FOLLOW) {
> > +    if (fiber_is_cancelled()) {
>
> When a relay is exiting, its state is changed. Why would you
> need to look at fiber_is_cancelled() *instead of* a more explicit
> RELAY_FOLLOW state change? Why not fix the invariant that
> whenever relay is exiting its state is not RELAY_FOLLOW?

For the record. Discussed f2f. relay->state isn't used by the relay
thread, only by the tx thread for reporting box.info. Relay thread
uses fiber_is_cancelled() instead. This looks ugly, but this particular
fix doesn't make things worse so it's OK to push it as is for now.
In the future we should rework relay machinery to make it more
straightforward and use fewer callbacks.

>
> >          /*
> > -         * Do not try to send anything to the replica
> > -         * if it already closed its socket.
> > +         * The relay is exiting. Rescanning the WAL at this
> > +         * point would be pointless and even dangerous,
> > +         * because the relay could have written a packet
> > +         * fragment to the socket before being cancelled
> > +         * so that writing another row to the socket would
> > +         * lead to corrupted replication stream and, as
> > +         * a result, permanent replication breakdown.
> >           */
> >          return;
> >      }
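
To make the point about the two guards concrete, here is a minimal
standalone sketch. It is not Tarantool code: relay_model, MODEL_FOLLOW,
on_wal_event_by_state() and on_wal_event_by_cancel() are made-up names,
the `cancelled` flag merely stands in for fiber_is_cancelled(), and the
model assumes (as the commit message says) that the state field is
reset only after the relay thread has exited.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; not Tarantool's real types.
enum relay_state_model { MODEL_OFF, MODEL_FOLLOW, MODEL_STOPPED };

struct relay_model {
    // Assumption: owned by the tx thread and reset only after the
    // relay thread has exited, so it still reads MODEL_FOLLOW while
    // the relay fiber is in the middle of exiting.
    relay_state_model state = MODEL_FOLLOW;
    // Stand-in for fiber_is_cancelled() as seen by the relay thread.
    bool cancelled = false;
    // Rows "written to the socket".
    std::vector<std::string> sent;
};

// Guard based on the state field: it lets a row through while the
// relay is exiting, because the state has not been reset yet.
inline bool on_wal_event_by_state(relay_model &r, const std::string &row)
{
    if (r.state != MODEL_FOLLOW)
        return false;
    r.sent.push_back(row);
    return true;
}

// Guard based on the cancellation flag: it refuses to write anything
// once the relay has been cancelled, whatever the state field says.
inline bool on_wal_event_by_cancel(relay_model &r, const std::string &row)
{
    if (r.cancelled)
        return false;
    r.sent.push_back(row);
    return true;
}
```

With a relay_model that has cancelled set to true and state still equal
to MODEL_FOLLOW, the state-based guard appends a row after what may be
a half-written packet, while the cancellation-based guard writes
nothing.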