From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf1-f67.google.com (mail-lf1-f67.google.com [209.85.167.67]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id AE0C246970E for ; Wed, 5 Feb 2020 11:18:16 +0300 (MSK) Received: by mail-lf1-f67.google.com with SMTP id z26so800067lfg.13 for ; Wed, 05 Feb 2020 00:18:16 -0800 (PST) Date: Wed, 5 Feb 2020 11:18:14 +0300 From: Cyrill Gorcunov Message-ID: <20200205081814.GI12445@uranus> References: <20200126223023.10197-1-gorcunov@gmail.com> <20200126223023.10197-4-gorcunov@gmail.com> <20200204220430.GB20146@atlas> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200204220430.GB20146@atlas> Subject: Re: [Tarantool-patches] [PATCH 3/3] box/applier: fix nil dereference in applier rollback List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Konstantin Osipov Cc: tml On Wed, Feb 05, 2020 at 01:04:30AM +0300, Konstantin Osipov wrote: > * Cyrill Gorcunov [20/01/27 10:14]: > > Currently when transaction rollback happens we just drop an existing > > error setting ClientError to the replicaset.applier.diag. This action > > leaves current fiber with diag=nil, which in turn leads to sigsegv once > > diag_raise() called right after applier_apply_tx(): > > > > | applier_f > > | try { > > | applier_subscribe > > | applier_apply_tx > > | // error happens > > | txn_rollback > > | diag_set(ClientError, ER_WAL_IO) > > | diag_move(&fiber()->diag, &replicaset.applier.diag) > > | // fiber->diag = nil > > > | applier_on_rollback > > | diag_add_error(&applier->diag, diag_last_error(&replicaset.applier.diag) > > | fiber_cancel(applier->reader); > > | diag_raise() -> NULL dereference > > | } catch { ... } > > Where exactly does the error happen in applier_apply_tx? The reporter pointed somwhere into a deep dive into vynil, the problem is that its been runnin release build first time it triggered. Actually it doesn't matter where exactly it failed, the only important thing is that it failed the way we need to run a rollback procedure. > > Looks like this: > > > | diag_set(ClientError, ER_WAL_IO) > > | diag_move(&fiber()->diag, &replicaset.applier.diag) > > > overwrites the original error. True > > Instead, the original error should be preserved and copied to the > shared diagnostics area (replicaset.applier.error). Sounds reasonable