From: Cyrill Gorcunov <gorcunov@gmail.com>
To: tml <tarantool-patches@dev.tarantool.org>
Subject: [Tarantool-patches] [PATCH 3/3] box/applier: fix nil dereference in applier rollback
Date: Mon, 27 Jan 2020 01:30:23 +0300
Message-ID: <20200126223023.10197-4-gorcunov@gmail.com>
In-Reply-To: <20200126223023.10197-1-gorcunov@gmail.com>

Currently, when a transaction rollback happens, we drop the existing
error and set ClientError in replicaset.applier.diag instead. This
leaves the current fiber with diag=nil, which in turn leads to a
SIGSEGV once diag_raise() is called right after applier_apply_tx():

 | applier_f
 |   try {
 |     applier_subscribe
 |       applier_apply_tx
 |         // error happens
 |         txn_rollback
 |           diag_set(ClientError, ER_WAL_IO)
 |           diag_move(&fiber()->diag, &replicaset.applier.diag)
 |           // fiber->diag = nil
 |           applier_on_rollback
 |             diag_add_error(&applier->diag, diag_last_error(&replicaset.applier.diag))
 |             fiber_cancel(applier->reader);
 |         diag_raise() -> NULL dereference
 |   } catch { ... }

applier_f runs in a try/catch loop and handles errors depending on
what exactly happened while the transaction was being applied. In some
cases it reconnects the applier; in others the applier is simply
cancelled and reaped.

The problem is that the shared replicaset.applier.diag is handled only
on the FiberIsCancelled exception (while it is set inside the
transaction rollback action), and we never trigger this specific
exception. But even if we did, the original error that caused the
applier to abort would be replaced by ClientError, which is too
general.

Thus:

 - on transaction rollback, save the original error that caused the
   transaction abort to replicaset.applier.diag;

 - trigger the FiberIsCancelled exception, which logs the problem and
   zaps the applier;

 - put a FIXME mark into the code: we need to figure out whether the
   underlying error is really critical; maybe we could retry the
   applier iteration instead.

Part-of #4730

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/applier.cc | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 2ed5125d0..8c6b00f76 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -692,9 +692,19 @@ static int
 applier_txn_rollback_cb(struct trigger *trigger, void *event)
 {
 	(void) trigger;
-	/* Setup shared applier diagnostic area. */
-	diag_set(ClientError, ER_WAL_IO);
-	diag_move(&fiber()->diag, &replicaset.applier.diag);
+
+	/*
+	 * We must not lose the original error; instead
+	 * let's keep it in the replicaset diag instance.
+	 *
+	 * FIXME: We need to revisit this code and
+	 * figure out if we can reconnect and retry
+	 * the replication process instead of cancelling
+	 * the applier with FiberIsCancelled.
+	 */
+	struct error *e = diag_last_error(diag_get());
+	diag_add_error(&replicaset.applier.diag, e);
+
 	/* Broadcast the rollback event across all appliers. */
 	trigger_run(&replicaset.applier.on_rollback, event);
 	/* Rollback applier vclock to the committed one. */
@@ -849,8 +859,15 @@ applier_on_rollback(struct trigger *trigger, void *event)
 		diag_add_error(&applier->diag,
 			       diag_last_error(&replicaset.applier.diag));
 	}
-	/* Stop the applier fiber. */
+
+	/*
+	 * Something really bad happened and we can't proceed,
+	 * thus stop the applier and throw the FiberIsCancelled
+	 * exception, which will be caught by the caller
+	 * so that the fiber finishes gracefully.
+	 */
 	fiber_cancel(applier->reader);
+	diag_set(FiberIsCancelled);
 	return 0;
 }
 
@@ -1098,6 +1115,7 @@ applier_f(va_list ap)
 		} catch (FiberIsCancelled *e) {
 			if (!diag_is_empty(&applier->diag)) {
 				diag_move(&applier->diag, &fiber()->diag);
+				diag_log();
 				applier_disconnect(applier, APPLIER_STOPPED);
 				break;
 			}
-- 
2.20.1
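
For readers unfamiliar with the diag API, the difference between the old and the new rollback behaviour can be sketched with a toy model. This is only an illustration under stated assumptions: ToyError, ToyDiag and every toy_* function below are hypothetical stand-ins, not Tarantool's real types or implementation. The sketch models only the "move leaves the source empty" versus "add keeps the source intact" distinction described in the commit message, and it leaves out the later diag_set(FiberIsCancelled) step.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for an error object (not Tarantool's struct error).
struct ToyError {
    std::string message;
};

// Hypothetical stand-in for a diagnostics area; an empty pointer models diag == nil.
struct ToyDiag {
    std::shared_ptr<ToyError> last;
};

// Models diag_move(): the destination takes the error, the source becomes empty.
inline void toy_diag_move(ToyDiag &from, ToyDiag &to) {
    to.last = std::move(from.last);
    from.last.reset();
}

// Models diag_add_error(): the destination shares the error, the source keeps it.
inline void toy_diag_add_error(ToyDiag &to, const std::shared_ptr<ToyError> &e) {
    to.last = e;
}

// Models diag_raise(), but reports an empty area instead of dereferencing it.
inline void toy_diag_raise(const ToyDiag &d) {
    if (!d.last)
        throw std::logic_error("diag is empty; the real code would dereference NULL");
    throw std::runtime_error(d.last->message);
}

// Old behaviour: a generic error replaces the original one and is then moved
// away, so the fiber's own area ends up empty.
inline void toy_rollback_old(ToyDiag &fiber_diag, ToyDiag &shared_diag) {
    fiber_diag.last = std::make_shared<ToyError>(ToyError{"ER_WAL_IO"});
    toy_diag_move(fiber_diag, shared_diag);
}

// Patched behaviour: the original error is shared with the common area and
// the fiber's own area is left untouched.
inline void toy_rollback_new(ToyDiag &fiber_diag, ToyDiag &shared_diag) {
    toy_diag_add_error(shared_diag, fiber_diag.last);
}
```

In this model, calling toy_rollback_old() and then toy_diag_raise() on the fiber's area hits the empty-area branch, which corresponds to the NULL dereference in the trace. With toy_rollback_new() both areas still refer to the original error.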