From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <vdavydov.dev@gmail.com>
From: Vladimir Davydov <vdavydov.dev@gmail.com>
Subject: [PATCH 1/5] recovery: stop writing to xstream on system error
Date: Sat, 29 Dec 2018 00:21:47 +0300
Message-Id: <dc22e4cc81ce87fc0d852dafa96464f135bebe0a.1546030880.git.vdavydov.dev@gmail.com>
In-Reply-To: <cover.1546030880.git.vdavydov.dev@gmail.com>
References: <cover.1546030880.git.vdavydov.dev@gmail.com>
To: tarantool-patches@freelists.org
List-ID: <tarantool-patches.dev.tarantool.org>

If the force_recovery flag is set, recover_xlog() ignores any error
returned by xstream_write(), even SocketError or FiberIsCancelled. This
may result in a permanent replication breakdown, as described in the
next paragraph.

Suppose there's a master and a replica, and the master has the
force_recovery flag set. The replica stalls on a WAL write while
applying a row fetched from the master. As a result, it stops sending
ACKs. In the meantime, the master writes a lot of new rows to its WAL,
so the relay thread sending changes to the replica fills up all the
space available in the network buffer and blocks on the replication
socket. Note that at this moment a packet fragment may already have
been written to the socket. The WAL delay on the replica lasts long
enough for replication to break on timeout: the relay reader fiber on
the master doesn't receive an ACK from the replica in time and cancels
the relay writer fiber. The relay writer fiber wakes up and returns to
recover_xlog(), which happily continues to scan the xlog, attempting to
send more rows (force_recovery is set), failing, and complaining to the
log. While the relay thread is still scanning the log, the replica
finishes the long WAL write and reads more data from the socket,
freeing up some space in the network buffer for the relay to write more
rows. The relay thread, which happens to still be in recover_xlog(),
writes a new row to the socket after the packet fragment it had written
when it was cancelled, effectively corrupting the stream and breaking
replication with an unrecoverable error, e.g.

  xrow.c:99 E> ER_INVALID_MSGPACK: Invalid MsgPack - packet header
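
To illustrate why a row written after a packet fragment corrupts the
stream, here is a minimal sketch. It assumes a made-up framing (one
length byte followed by the payload), not the real MsgPack-based
protocol; encode_packet() and decode_stream() are hypothetical helpers
introduced only for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical framing, NOT the real replication protocol: one length
// byte followed by that many payload bytes (payload must be < 256 bytes).
std::string encode_packet(const std::string &payload)
{
	std::string out;
	out.push_back(static_cast<char>(payload.size()));
	out += payload;
	return out;
}

// Decode every complete packet found in the stream. The decoder trusts
// the length byte, just like a real receiver trusts the packet header.
std::vector<std::string> decode_stream(const std::string &stream)
{
	std::vector<std::string> packets;
	std::size_t pos = 0;
	while (pos < stream.size()) {
		std::size_t len = static_cast<unsigned char>(stream[pos]);
		if (pos + 1 + len > stream.size())
			break; // incomplete tail, wait for more data
		packets.push_back(stream.substr(pos + 1, len));
		pos += 1 + len;
	}
	return packets;
}
```

If the writer emits "row-1" in full, then only the first three bytes of
"row-2" (the cancelled write), then "row-3" in full, the decoder reads
the header of "row-3" as payload of the truncated packet and returns a
garbage row. With this toy framing the damage is silent; a real decoder
would presumably reject the malformed header, as in the log line above.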

Actually, it's pointless to continue scanning an xlog if xstream_write()
returned any error other than ClientError: this means that the xlog is
being scanned by a relay thread (not local recovery) and the connection
is broken, in which case there isn't much we can do but stop the relay
and wait for the replica to reconnect. So let's fix this issue by
ignoring the force_recovery option for any error that isn't a
ClientError.
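
The resulting decision rule can be sketched in isolation as follows.
This is a simplified model, not the code from recovery.cc: WriteResult,
ScanStats and scan_rows() are hypothetical names, rows are plain ints,
and every non-client error is lumped into a single SystemError value.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical outcome of writing one row to the stream.
enum class WriteResult { Ok, ClientError, SystemError };

struct ScanStats {
	std::size_t applied = 0; // rows written successfully
	std::size_t skipped = 0; // rows skipped due to force_recovery
	bool aborted = false;    // scan stopped before the end
};

// Simplified model of the row loop: a failed row is skipped only if
// force_recovery is set AND the failure is a client error. Any other
// failure stops the scan, so nothing more is written to the stream.
ScanStats scan_rows(const std::vector<int> &rows,
		    const std::function<WriteResult(int)> &write,
		    bool force_recovery)
{
	ScanStats stats;
	for (int row : rows) {
		WriteResult rc = write(row);
		if (rc == WriteResult::Ok) {
			++stats.applied;
			continue;
		}
		if (!force_recovery || rc != WriteResult::ClientError) {
			stats.aborted = true;
			break;
		}
		++stats.skipped;
	}
	return stats;
}
```

With force_recovery set, a row failing with a client error is skipped
and the scan goes on, while the first system error ends the scan.
Without force_recovery, any failure ends the scan, as before.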

It's difficult to write a test for this case, since too many conditions
have to be satisfied simultaneously for the issue to occur. Injecting
errors doesn't really help here and would look artificial, because it'd
rely too much on the implementation. So I'm committing this one without
a test case.

Part of #3910
---
 src/box/recovery.cc | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/src/box/recovery.cc b/src/box/recovery.cc
index 64d50989..c3cc7454 100644
--- a/src/box/recovery.cc
+++ b/src/box/recovery.cc
@@ -279,7 +279,17 @@ recover_xlog(struct recovery *r, struct xstream *stream,
 		} else {
 			say_error("can't apply row: ");
 			diag_log();
-			if (!r->wal_dir.force_recovery)
+			/*
+			 * Stop recovery if a system error occurred,
+			 * no matter if force_recovery is set or not,
+			 * because in this case we could have written
+			 * a packet fragment to the stream so that
+			 * the next write would corrupt data at the
+			 * receiving end.
+			 */
+			struct error *e = diag_last_error(diag_get());
+			if (!r->wal_dir.force_recovery ||
+			    !type_assignable(&type_ClientError, e->type))
 				diag_raise();
 		}
 	}
-- 
2.11.0