[tarantool-patches] Re: [PATCH 2/5] relay: do not try to scan xlog if exiting

Konstantin Osipov kostja at tarantool.org
Sat Dec 29 12:14:50 MSK 2018


* Vladimir Davydov <vdavydov.dev at gmail.com> [18/12/29 10:00]:
> relay_process_wal_event() may be called if the relay fiber is already
> exiting, e.g. by wal_clear_watcher(). We must not try to scan xlogs in
> this case, because we could have written an incomplete packet fragment
> to the replication socket, as described in the previous commit message,
> so that writing another row would lead to corrupted replication stream
> and, as a result, permanent replication breakdown.
> 
> Actually, there was a check for this case in relay_process_wal_event(),
> but it was broken by commit adc28591f77f ("replication: do not delete
> relay on applier disconnect"), which replaced it with a relay->state
> check, which is completely wrong, because relay->state is reset only
> after the relay thread exits.
> 
> Part of #3910
> ---
>  src/box/relay.cc | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index a01c2a2e..3d9703ea 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -409,10 +409,15 @@ static void
>  relay_process_wal_event(struct wal_watcher *watcher, unsigned events)
>  {
>  	struct relay *relay = container_of(watcher, struct relay, wal_watcher);
> -	if (relay->state != RELAY_FOLLOW) {
> +	if (fiber_is_cancelled()) {

 When a relay is exiting, its state is changed. Why would you
 need to look at fiber_is_cancelled() *instead of* the more explicit
 RELAY_FOLLOW state check? Why not fix the invariant that
 whenever a relay is exiting, its state is not RELAY_FOLLOW?
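
 A minimal sketch of what such an invariant could look like. This is
 not the actual relay code: relay_sketch, relay_sketch_cancel(),
 RELAY_OFF, RELAY_STOPPED, and the rows_sent counter are hypothetical
 names introduced for illustration, and the `cancelled` flag only
 stands in for the fiber's cancellation state. The idea is that a
 single exit path both cancels the relay and moves it out of
 RELAY_FOLLOW, so the WAL event handler can keep the explicit state
 check.

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for the real relay structures.
// Only RELAY_FOLLOW appears in the patch; the other names are assumptions.
enum relay_state { RELAY_OFF, RELAY_FOLLOW, RELAY_STOPPED };

struct relay_sketch {
	relay_state state = RELAY_OFF;
	bool cancelled = false;	// stands in for the fiber cancellation flag
	int rows_sent = 0;	// counts simulated WAL rescans
};

// Single exit path: cancelling the relay also leaves RELAY_FOLLOW,
// so "exiting implies state != RELAY_FOLLOW" holds by construction.
void
relay_sketch_cancel(relay_sketch *relay)
{
	relay->cancelled = true;
	relay->state = RELAY_STOPPED;
}

// Simplified WAL event handler relying on the state check alone.
void
relay_sketch_process_wal_event(relay_sketch *relay)
{
	// The invariant under discussion: a cancelled relay never follows.
	assert(!relay->cancelled || relay->state != RELAY_FOLLOW);
	if (relay->state != RELAY_FOLLOW) {
		// The relay is exiting or not started: do not rescan.
		return;
	}
	relay->rows_sent++;	// placeholder for the real xlog rescan
}
```

 With this arrangement, an event delivered after cancellation is
 dropped by the state check, without consulting the cancellation flag
 in the handler. Whether the real exit paths can all be routed through
 one such function is exactly the open question.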

>  		/*
> -		 * Do not try to send anything to the replica
> -		 * if it already closed its socket.
> +		 * The relay is exiting. Rescanning the WAL at this
> +		 * point would be pointless and even dangerous,
> +		 * because the relay could have written a packet
> +		 * fragment to the socket before being cancelled
> +		 * so that writing another row to the socket would
> +		 * lead to corrupted replication stream and, as
> +		 * a result, permanent replication breakdown.
>  		 */
>  		return;
>  	}
> -- 
> 2.11.0
> 

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov



