[tarantool-patches] Re: [PATCH 2/5] relay: do not try to scan xlog if exiting
Konstantin Osipov
kostja at tarantool.org
Sat Dec 29 12:14:50 MSK 2018
* Vladimir Davydov <vdavydov.dev at gmail.com> [18/12/29 10:00]:
> relay_process_wal_event() may be called if the relay fiber is already
> exiting, e.g. by wal_clear_watcher(). We must not try to scan xlogs in
> this case, because we could have written an incomplete packet fragment
> to the replication socket, as described in the previous commit message,
> so that writing another row would lead to corrupted replication stream
> and, as a result, permanent replication breakdown.
>
> Actually, there was a check for this case in relay_process_wal_event(),
> but it was broken by commit adc28591f77f ("replication: do not delete
> relay on applier disconnect"), which replaced it with a relay->status
> check, which is completely wrong, because relay->status is reset only
> after the relay thread exits.
>
> Part of #3910
> ---
> src/box/relay.cc | 11 ++++++++---
> 1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index a01c2a2e..3d9703ea 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -409,10 +409,15 @@ static void
>  relay_process_wal_event(struct wal_watcher *watcher, unsigned events)
>  {
>  	struct relay *relay = container_of(watcher, struct relay, wal_watcher);
> -	if (relay->state != RELAY_FOLLOW) {
> +	if (fiber_is_cancelled()) {
When a relay is exiting, its state is changed. Why would you
need to look at fiber_is_cancelled() *instead of* a more explicit
RELAY_FOLLOW state change? Why not fix the invariant that
whenever the relay is exiting its state is not RELAY_FOLLOW?
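
To make the question concrete, here is a minimal sketch of the
invariant-based alternative. It is a simplified, hypothetical model and
not the actual relay.cc code: the RELAY_STOPPED value, the rows_sent
counter, and the helpers relay_begin_exit() and relay_on_wal_event() are
made up for illustration. The idea is that the exit path leaves
RELAY_FOLLOW before anything else happens, so the WAL event callback
only has to check the state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model; not the real Tarantool structures. */
enum relay_state { RELAY_OFF, RELAY_FOLLOW, RELAY_STOPPED };

struct relay {
	enum relay_state state;
	/* Stand-in for rows written to the replication socket. */
	int rows_sent;
};

/*
 * Exit path (assumed): drop out of RELAY_FOLLOW as the very first
 * step, so that no later callback can observe an exiting relay
 * that still claims to be following.
 */
static void
relay_begin_exit(struct relay *relay)
{
	assert(relay != NULL);
	relay->state = RELAY_STOPPED;
}

/*
 * WAL event callback (assumed): relies only on the state invariant.
 * Returns true if a row was "sent", false if the event was ignored.
 */
static bool
relay_on_wal_event(struct relay *relay)
{
	assert(relay != NULL);
	if (relay->state != RELAY_FOLLOW) {
		/* Exiting or not started: do not touch the socket. */
		return false;
	}
	relay->rows_sent++;
	return true;
}
```

In this model, an event delivered after relay_begin_exit() is ignored
and rows_sent stays unchanged. Whether the real exit path can update the
state early enough is exactly what the question above is asking.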
>  		/*
> -		 * Do not try to send anything to the replica
> -		 * if it already closed its socket.
> +		 * The relay is exiting. Rescanning the WAL at this
> +		 * point would be pointless and even dangerous,
> +		 * because the relay could have written a packet
> +		 * fragment to the socket before being cancelled
> +		 * so that writing another row to the socket would
> +		 * lead to corrupted replication stream and, as
> +		 * a result, permanent replication breakdown.
>  		 */
>  		return;
>  	}
> --
> 2.11.0
>
--
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov