From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lj1-f194.google.com (mail-lj1-f194.google.com [209.85.208.194]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id F3E1446970E for ; Wed, 5 Feb 2020 01:11:12 +0300 (MSK) Received: by mail-lj1-f194.google.com with SMTP id y6so309467lji.0 for ; Tue, 04 Feb 2020 14:11:12 -0800 (PST) Date: Wed, 5 Feb 2020 01:11:10 +0300 From: Konstantin Osipov Message-ID: <20200204221110.GC20146@atlas> References: <20200127215306.31681-1-gorcunov@gmail.com> <20200127215306.31681-4-gorcunov@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200127215306.31681-4-gorcunov@gmail.com> Subject: Re: [Tarantool-patches] [PATCH v5 3/5] box/applier: fix nil dereference in applier rollback List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Cyrill Gorcunov Cc: tml * Cyrill Gorcunov [20/01/28 10:16]: > + > + /* > + * We must not loose the origin error, instead > + * lets keep it in replicaset diag instance. > + * > + * FIXME: We need to revisit this code and > + * figure out if we can reconnect and retry > + * the prelication process instead of cancelling > + * applier with FiberIsCancelled. First of all, we're dealing with a regression introduced by the parallel applier patch. Could you please describe what is triggering the error? > + /* > + * If information is already lost > + * (say xlog cleared diag instance) I don't understand this comment. How can it be lost exactly? > + * setup general ClientError, seriously > + * we need to unweave this mess, if error > + * happened it must never been cleared > + * until error handling in rollback. :-/ > + */ > + diag_set(ClientError, ER_WAL_IO); > + e = diag_last_error(diag_get()); > + } > + diag_add_error(&replicaset.applier.diag, e); > + > /* Broadcast the rollback event across all appliers. */ > trigger_run(&replicaset.applier.on_rollback, event); > /* Rollback applier vclock to the committed one. */ > @@ -849,8 +871,20 @@ applier_on_rollback(struct trigger *trigger, void *event) > diag_add_error(&applier->diag, > diag_last_error(&replicaset.applier.diag)); > } > - /* Stop the applier fiber. */ > + > + /* > + * Something really bad happened, we can't proceed > + * thus stop the applier and throw FiberIsCancelled > + * exception which will be catched by the caller > + * and the fiber gracefully finish. > + * > + * FIXME: Need to make sure that this is a really > + * final error where we can't longer proceed and should > + * zap the applier, probably we could reconnect and > + * retry instead? > + */ > fiber_cancel(applier->reader); Let's begin by explaining why we need to cancel the reader fiber here. > + diag_set(FiberIsCancelled); This is clearly a clutch: first you make an effort to set replicaset.applier.diag and then it is not used by diag_raise(). -- Konstantin Osipov, Moscow, Russia