[PATCH 2/5] relay: do not try to scan xlog if exiting

Vladimir Davydov vdavydov.dev at gmail.com
Sat Dec 29 00:21:48 MSK 2018


relay_process_wal_event() may be called when the relay fiber is already
exiting, e.g. from wal_clear_watcher(). We must not scan xlogs in this
case: the relay could have written an incomplete packet fragment to the
replication socket, as described in the previous commit message, so
writing another row would corrupt the replication stream and, as a
result, break replication permanently.

Actually, there used to be a check for this case in
relay_process_wal_event(), but it was broken by commit adc28591f77f
("replication: do not delete relay on applier disconnect"), which
replaced it with a relay->state check. That check is wrong, because
relay->state is reset only after the relay thread exits.

Part of #3910
---
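Note: a minimal, self-contained sketch of the failure mode described
above, not Tarantool code. Everything named toy_* is hypothetical; the
'cancelled' flag merely stands in for fiber_is_cancelled(), and the
in-memory buffer stands in for the replication socket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the relay; not Tarantool's struct relay. */
struct toy_relay {
	char stream[64];	/* bytes "sent" to the replica so far */
	size_t len;		/* number of bytes in stream */
	bool cancelled;		/* models fiber_is_cancelled() */
};

/*
 * Append at most 'limit' bytes of a packet to the stream. A limit
 * shorter than the packet emulates a write interrupted midway.
 */
static void
toy_write(struct toy_relay *r, const char *pkt, size_t limit)
{
	size_t n = strlen(pkt);
	if (n > limit)
		n = limit;
	assert(r->len + n < sizeof(r->stream));
	memcpy(r->stream + r->len, pkt, n);
	r->len += n;
	r->stream[r->len] = '\0';
}

/*
 * Models the guarded callback: once the relay is cancelled, nothing
 * more is written, so a half-sent packet is never followed by
 * another row.
 */
static void
toy_process_wal_event(struct toy_relay *r, const char *next_row)
{
	if (r->cancelled)
		return;
	toy_write(r, next_row, strlen(next_row));
}
```

With the guard, a relay cancelled after sending "[ro" out of "[row1]"
leaves the stream at "[ro". Without it (flag left unset), a subsequent
event would append "[row2]" right after the fragment, which the
receiving side could not parse as a sequence of whole packets.
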
 src/box/relay.cc | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/src/box/relay.cc b/src/box/relay.cc
index a01c2a2e..3d9703ea 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -409,10 +409,15 @@ static void
 relay_process_wal_event(struct wal_watcher *watcher, unsigned events)
 {
 	struct relay *relay = container_of(watcher, struct relay, wal_watcher);
-	if (relay->state != RELAY_FOLLOW) {
+	if (fiber_is_cancelled()) {
 		/*
-		 * Do not try to send anything to the replica
-		 * if it already closed its socket.
+		 * The relay is exiting. Rescanning the WAL at this
+		 * point would be pointless and even dangerous,
+		 * because the relay could have written a packet
+		 * fragment to the socket before being cancelled
+		 * so that writing another row to the socket would
+		 * lead to corrupted replication stream and, as
+		 * a result, permanent replication breakdown.
 		 */
 		return;
 	}
-- 
2.11.0



