From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lj1-f176.google.com (mail-lj1-f176.google.com [209.85.208.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 96AED469719 for ; Mon, 17 Feb 2020 19:00:42 +0300 (MSK) Received: by mail-lj1-f176.google.com with SMTP id r19so19441967ljg.3 for ; Mon, 17 Feb 2020 08:00:42 -0800 (PST) From: Cyrill Gorcunov Date: Mon, 17 Feb 2020 18:59:53 +0300 Message-Id: <20200217155953.25803-5-gorcunov@gmail.com> In-Reply-To: <20200217155953.25803-1-gorcunov@gmail.com> References: <20200217155953.25803-1-gorcunov@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Subject: [Tarantool-patches] [PATCH 4/4] box/txn: fix nil dereference in txn_rollback List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: tml In case if we get failed in allocation of new journal entry the txn_rollback will try to derefernce nil pointer | txn_write | fiber_set_txn(fiber(), NULL); // zap fiber's storage.txn | txn_write_to_wal(txn); | journal_entry_new(..., txn_entry_done_cb, ...) | if (req == NULL) | txn_rollback(txn); | assert(txn == in_txn()); // in_txn()=nil, triggers This is because there are two call site: - when transaction is complete the wal engine will call txn_entry_complete_cb completion handler and since it is async in terms of threads (wal is a separate thread) it setup the txn it processes into a fiber's storage thus it expects the current storage is nil - in turn error may happen and we need to run a rollback procedure, there fiber's storage keeps current txn; Taking this into account we clean txn inside txn_write_to_wal right before journal_write is called and restore it back on error. This allows the caller code to run txn_rollback in a more consistent way. https://github.com/tarantool/tarantool/issues/4776 Signed-off-by: Cyrill Gorcunov --- src/box/txn.c | 27 +++++++++++++++++++++------ 1 file changed, 21 insertions(+), 6 deletions(-) diff --git a/src/box/txn.c b/src/box/txn.c index a4ca48224..68365ea0b 100644 --- a/src/box/txn.c +++ b/src/box/txn.c @@ -495,10 +495,8 @@ txn_write_to_wal(struct txn *txn) &txn->region, txn_entry_complete_cb, txn); - if (req == NULL) { - txn_rollback(txn); + if (req == NULL) return -1; - } struct txn_stmt *stmt; struct xrow_header **remote_row = req->rows; @@ -519,8 +517,20 @@ txn_write_to_wal(struct txn *txn) assert(remote_row == req->rows + txn->n_applier_rows); assert(local_row == remote_row + txn->n_new_rows); - /* Send the entry to the journal. */ + /* + * Queue the entry for processing in journal + * engine. The semantics of complete_cb implies + * that fiber's txn (kept in storage) is nil + * becase WAL is a separet thread, for this + * sake we zap it here. + * + * Still this is messy since the caller runs + * txn_rollback if something bad happened. Thus + * restore the former txn on error path. + */ + fiber_set_txn(fiber(), NULL); if (journal_write(req) < 0) { + fiber_set_txn(fiber(), txn); diag_set(ClientError, ER_WAL_IO); diag_log(); return -1; @@ -583,8 +593,13 @@ txn_write(struct txn *txn) fiber_set_txn(fiber(), NULL); return 0; } - fiber_set_txn(fiber(), NULL); - return txn_write_to_wal(txn); + + if (txn_write_to_wal(txn) != 0) { + txn_rollback(txn); + return -1; + } + + return 0; } int -- 2.20.1