From mboxrd@z Thu Jan 1 00:00:00 1970
From: Cyrill Gorcunov
Date: Fri, 14 Feb 2020 17:03:38 +0300
Message-Id: <20200214140339.4085-4-gorcunov@gmail.com>
In-Reply-To: <20200214140339.4085-1-gorcunov@gmail.com>
References: <20200214140339.4085-1-gorcunov@gmail.com>
Subject: [Tarantool-patches] [PATCH v10 3/4] box/applier: prevent nil dereference on applier rollback
List-Id: Tarantool development patches
To: tml

Currently, when a transaction rollback happens, we just drop the
existing error by setting ClientError and moving it to
replicaset.applier.diag. This leaves the current fiber with diag=nil,
which in turn leads to a SIGSEGV once diag_raise() is called right
after applier_apply_tx():

 | applier_f
 |   try {
 |   applier_subscribe
 |     applier_apply_tx
 |       // error happens
 |       txn_rollback
 |         diag_set(ClientError, ER_WAL_IO)
 |         diag_move(&fiber()->diag, &replicaset.applier.diag)
 |         // fiber->diag = nil
 |         applier_on_rollback
 |           diag_add_error(&applier->diag, diag_last_error(&replicaset.applier.diag))
 |           fiber_cancel(applier->reader);
 |     diag_raise() -> NULL dereference
 |   } catch { ... }

applier_f works in a try/catch loop and handles errors depending on
what exactly happened while applying the transaction. In some cases it
reconnects the applier; in others the applier is simply cancelled and
reaped.

The problem is that the shared replicaset.applier.diag is handled only
on the FiberIsCancelled exception (while it is set inside the
transaction rollback action), and we never trigger this specific
exception. But even if we did, the original error that caused the
applier to abort is overwritten by ClientError, which is too general.

Thus:
 - log the original error, but keep ClientError as the new one to be
   preserved in the replicaset diag instance (I don't want to make an
   intrusive patch that changes the logic, since I'm far from being an
   expert in this code);
 - put a FIXME mark into the code: we need to rework it in a more
   sensible way.

Part-of #4730

Signed-off-by: Cyrill Gorcunov
---
 src/box/applier.cc | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 2ed5125d0..a4a73d383 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -692,9 +692,33 @@ static int
 applier_txn_rollback_cb(struct trigger *trigger, void *event)
 {
 	(void) trigger;
+	/*
+	 * FIXME: Do not clear fiber()->diag since it
+	 * cause nil dereference
+	 *
+	 * applier_subscribe
+	 *   applier_apply_tx
+	 *     diag_raise
+	 *
+	 * In turn we need to redesign this code:
+	 * - preserve original error or log it somewhere
+	 * - make the error path more clear
+	 *
+	 * We must never reach this point with clean diag
+	 * area, if we do it means we're simply screwed
+	 * somewhere and there is a bug.
+	 */
+
+	if (!diag_is_empty(diag_get()))
+		diag_log();
+	else
+		say_warn_ratelimited("applier_txn_rollback_cb: empty diag");
+
 	/* Setup shared applier diagnostic area. */
 	diag_set(ClientError, ER_WAL_IO);
-	diag_move(&fiber()->diag, &replicaset.applier.diag);
+	diag_add_error(&replicaset.applier.diag,
+		       diag_last_error(&fiber()->diag));
+
 	/* Broadcast the rollback event across all appliers. */
 	trigger_run(&replicaset.applier.on_rollback, event);
 	/* Rollback applier vclock to the committed one. */
-- 
2.20.1
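
For readers who want to see the difference between the two calls in
isolation, here is a minimal standalone sketch. It does not use the
Tarantool sources: Error, Diag, and every model_* and rollback_* name
below are hypothetical stand-ins invented for illustration, and the
real diag API is only assumed to behave along these lines (move empties
the source, add shares the error).

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration only; NOT the Tarantool types.
// A Diag here holds at most one shared error object.
struct Error {
    std::string msg;
};

struct Diag {
    std::shared_ptr<Error> last;
    bool empty() const { return last == nullptr; }
};

// Models the old behaviour: the error is transferred to the destination
// and the source diag is left empty.
inline void model_diag_move(Diag &from, Diag &to)
{
    to.last = std::move(from.last);
    from.last.reset();
}

// Models the new behaviour: the destination shares the error while the
// source diag keeps its own reference.
inline void model_diag_add_error(Diag &to, const std::shared_ptr<Error> &e)
{
    assert(e != nullptr);
    to.last = e;
}

// Models the precondition of a raise: there must be a pending error.
// Returns false instead of dereferencing a null pointer.
inline bool model_can_raise(const Diag &d)
{
    return !d.empty();
}

// Rollback path before the patch: the fiber diag is emptied by the move.
inline bool rollback_with_move()
{
    Diag fiber_diag, shared_diag;
    fiber_diag.last = std::make_shared<Error>(Error{"ER_WAL_IO"});
    model_diag_move(fiber_diag, shared_diag);
    return model_can_raise(fiber_diag);
}

// Rollback path after the patch: both diags refer to the same error.
inline bool rollback_with_add()
{
    Diag fiber_diag, shared_diag;
    fiber_diag.last = std::make_shared<Error>(Error{"ER_WAL_IO"});
    model_diag_add_error(shared_diag, fiber_diag.last);
    return model_can_raise(fiber_diag);
}
```

In this model rollback_with_move() reports that nothing is left to
raise, which corresponds to the NULL dereference in the call chain,
while rollback_with_add() leaves the fiber diag populated.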