Tarantool development patches archive
From: Cyrill Gorcunov <gorcunov@gmail.com>
To: tml <tarantool-patches@dev.tarantool.org>
Subject: [Tarantool-patches] [PATCH v7 3/5] box/applier: fix nil dereference in applier rollback
Date: Tue, 28 Jan 2020 22:22:47 +0300
Message-ID: <20200128192249.10023-4-gorcunov@gmail.com>
In-Reply-To: <20200128192249.10023-1-gorcunov@gmail.com>

Currently, when a transaction rollback happens, we drop the existing
error and set a ClientError in replicaset.applier.diag. This leaves the
current fiber with diag=nil, which in turn leads to SIGSEGV once
diag_raise() is called right after applier_apply_tx():

 | applier_f
 |   try {
 |   applier_subscribe
 |     applier_apply_tx
 |       // error happens
 |       txn_rollback
 |         diag_set(ClientError, ER_WAL_IO)
 |         diag_move(&fiber()->diag, &replicaset.applier.diag)
 |         // fiber->diag = nil
 |       applier_on_rollback
 |         diag_add_error(&applier->diag, diag_last_error(&replicaset.applier.diag))
 |         fiber_cancel(applier->reader);
 |     diag_raise() -> NULL dereference
 |   } catch { ... }

applier_f runs in a try/catch loop and handles errors depending on
what exactly happened while applying the transaction. In some cases it
reconnects the applier, in others the applier is simply cancelled and
reaped.

The problem is that the shared replicaset.applier.diag is handled only
on the FiberIsCancelled exception (although it is set inside the
transaction rollback action), and we never trigger this specific
exception. But even if we did, the original error which caused the
applier to abort is overwritten by ClientError, which is too general.

Thus:

 - on transaction rollback, save the original error which caused
   the transaction abort to replicaset.applier.diag;

 - there are cases (such as xlog error injection) where the diag
   is explicitly cleared on the error path; in that case we set up
   a ClientError instead;

 - trigger the FiberIsCancelled exception, which will log the
   problem and zap the applier;

 - put a FIXME mark into the code: we need to figure out whether
   the underlying error is really a critical one; maybe we
   could retry the applier iteration instead.
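
The difference between the old and the new rollback handling can be
sketched with a small standalone C++ model. This is only an
illustration under simplifying assumptions: ToyError, ToyDiag,
toy_diag_move(), rollback_old() and rollback_new() are hypothetical
names made up here. They mimic "a diag area holds at most one
ref-counted error" and are not the Tarantool diag API.

```cpp
// Toy model of the diag handling discussed above. All types and
// functions here are hypothetical stand-ins, not Tarantool code.
#include <cassert>
#include <memory>
#include <string>
#include <utility>

struct ToyError {
    std::string type;
    std::string msg;
};

// A diagnostics area holds at most one shared (ref-counted) error.
struct ToyDiag {
    std::shared_ptr<ToyError> last;

    bool is_empty() const { return last == nullptr; }

    void set(std::string type, std::string msg) {
        last = std::make_shared<ToyError>(
            ToyError{std::move(type), std::move(msg)});
    }

    // Share an error with this area; the caller keeps its reference.
    void add_error(const std::shared_ptr<ToyError> &e) { last = e; }
};

// Move semantics: destination takes the error, source becomes empty.
inline void toy_diag_move(ToyDiag &from, ToyDiag &to) {
    to.last = std::move(from.last);
    from.last.reset();
}

// Old behaviour: overwrite whatever error was there with a general
// ClientError and move it out of the fiber's diag area.
inline void rollback_old(ToyDiag &fiber_diag, ToyDiag &shared_diag) {
    fiber_diag.set("ClientError", "ER_WAL_IO");
    toy_diag_move(fiber_diag, shared_diag);
    // fiber_diag is now empty; raising from it would touch a null error.
}

// Patched behaviour: keep the original error if there is one, fall
// back to ClientError only when the diag was already cleared, and
// share the error instead of moving it.
inline void rollback_new(ToyDiag &fiber_diag, ToyDiag &shared_diag) {
    if (fiber_diag.is_empty())
        fiber_diag.set("ClientError", "ER_WAL_IO");
    shared_diag.add_error(fiber_diag.last);
}
```

In this model rollback_old() always leaves the fiber's area empty,
which corresponds to the state that made diag_raise() crash, while
rollback_new() leaves the fiber's area set and puts the original
error (or the ClientError fallback) into the shared area.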

Part-of #4730

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/applier.cc | 43 +++++++++++++++++++++++++++++++++++++++----
 1 file changed, 39 insertions(+), 4 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 2ed5125d0..967dc91de 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -692,9 +692,31 @@ static int
 applier_txn_rollback_cb(struct trigger *trigger, void *event)
 {
 	(void) trigger;
-	/* Setup shared applier diagnostic area. */
-	diag_set(ClientError, ER_WAL_IO);
-	diag_move(&fiber()->diag, &replicaset.applier.diag);
+
+	/*
+	 * We must not lose the original error, instead
+	 * let's keep it in the replicaset diag instance.
+	 *
+	 * FIXME: We need to revisit this code and
+	 * figure out if we can reconnect and retry
+	 * the replication process instead of cancelling
+	 * the applier with FiberIsCancelled.
+	 */
+	struct error *e = diag_last_error(diag_get());
+	if (!e) {
+		/*
+		 * If the information is already lost
+		 * (say xlog cleared the diag instance)
+		 * set up a general ClientError. Seriously,
+		 * we need to unweave this mess: if an error
+		 * happened it must never be cleared
+		 * until error handling in rollback.
+		 */
+		diag_set(ClientError, ER_WAL_IO);
+		e = diag_last_error(diag_get());
+	}
+	diag_add_error(&replicaset.applier.diag, e);
+
 	/* Broadcast the rollback event across all appliers. */
 	trigger_run(&replicaset.applier.on_rollback, event);
 	/* Rollback applier vclock to the committed one. */
@@ -849,8 +871,20 @@ applier_on_rollback(struct trigger *trigger, void *event)
 		diag_add_error(&applier->diag,
 			       diag_last_error(&replicaset.applier.diag));
 	}
-	/* Stop the applier fiber. */
+
+	/*
+	 * Something really bad happened, we can't proceed,
+	 * thus stop the applier and throw the FiberIsCancelled
+	 * exception, which will be caught by the caller
+	 * so that the fiber finishes gracefully.
+	 *
+	 * FIXME: Need to make sure that this is really a
+	 * final error where we can no longer proceed and should
+	 * zap the applier; probably we could reconnect and
+	 * retry instead?
+	 */
 	fiber_cancel(applier->reader);
+	diag_set(FiberIsCancelled);
 	return 0;
 }
 
@@ -1098,6 +1132,7 @@ applier_f(va_list ap)
 		} catch (FiberIsCancelled *e) {
 			if (!diag_is_empty(&applier->diag)) {
 				diag_move(&applier->diag, &fiber()->diag);
+				diag_log();
 				applier_disconnect(applier, APPLIER_STOPPED);
 				break;
 			}
-- 
2.20.1
