From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: from localhost (localhost [127.0.0.1])
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 19C14225A7
	for ; Sat, 29 Dec 2018 04:14:53 -0500 (EST)
Received: from turing.freelists.org ([127.0.0.1])
	by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id biBsJxm_-zge for ;
	Sat, 29 Dec 2018 04:14:53 -0500 (EST)
Received: from smtp14.mail.ru (smtp14.mail.ru [94.100.181.95])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id CA59022581
	for ; Sat, 29 Dec 2018 04:14:52 -0500 (EST)
Received: by smtp14.mail.ru with esmtpa (envelope-from )
	id 1gdAhm-0003d8-Qu
	for tarantool-patches@freelists.org; Sat, 29 Dec 2018 12:14:51 +0300
Date: Sat, 29 Dec 2018 12:14:50 +0300
From: Konstantin Osipov
Subject: [tarantool-patches] Re: [PATCH 2/5] relay: do not try to scan xlog if exiting
Message-ID: <20181229091450.GE17043@chai>
References:
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Sender: tarantool-patches-bounce@freelists.org
Errors-to: tarantool-patches-bounce@freelists.org
Reply-To: tarantool-patches@freelists.org
List-help:
List-unsubscribe:
List-software: Ecartis version 1.0.0
List-Id: tarantool-patches
List-subscribe:
List-owner:
List-post:
List-archive:
To: tarantool-patches@freelists.org

* Vladimir Davydov [18/12/29 10:00]:
> relay_process_wal_event() may be called if the relay fiber is already
> exiting, e.g. by wal_clear_watcher(). We must not try to scan xlogs in
> this case, because we could have written an incomplete packet fragment
> to the replication socket, as described in the previous commit message,
> so that writing another row would lead to corrupted replication stream
> and, as a result, permanent replication breakdown.
>
> Actually, there was a check for this case in relay_process_wal_event(),
> but it was broken by commit adc28591f77f ("replication: do not delete
> relay on applier disconnect"), which replaced it with a relay->status
> check, which is completely wrong, because relay->status is reset only
> after the relay thread exits.
>
> Part of #3910
> ---
>  src/box/relay.cc | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index a01c2a2e..3d9703ea 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -409,10 +409,15 @@ static void
>  relay_process_wal_event(struct wal_watcher *watcher, unsigned events)
>  {
>  	struct relay *relay = container_of(watcher, struct relay, wal_watcher);
> -	if (relay->state != RELAY_FOLLOW) {
> +	if (fiber_is_cancelled()) {

When a relay is exiting, its state is changed. Why would you need
to look at fiber_is_cancelled() *instead of* a more explicit
RELAY_FOLLOW state change? Why not fix the invariant that whenever
the relay is exiting, its state is not RELAY_FOLLOW?

>  		/*
> -		 * Do not try to send anything to the replica
> -		 * if it already closed its socket.
> +		 * The relay is exiting. Rescanning the WAL at this
> +		 * point would be pointless and even dangerous,
> +		 * because the relay could have written a packet
> +		 * fragment to the socket before being cancelled
> +		 * so that writing another row to the socket would
> +		 * lead to corrupted replication stream and, as
> +		 * a result, permanent replication breakdown.
>  		 */
>  		return;
>  	}
> --
> 2.11.0
>

--
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov