From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 5 Feb 2020 13:48:17 +0300
From: Cyrill Gorcunov
To: Konstantin Osipov
Cc: tml
Message-ID: <20200205104817.GM12445@uranus>
In-Reply-To: <20200205095524.GD4624@atlas>
References: <20200127215306.31681-1-gorcunov@gmail.com>
 <20200127215306.31681-4-gorcunov@gmail.com>
 <20200204221110.GC20146@atlas>
 <20200205082721.GJ12445@uranus>
 <20200205095524.GD4624@atlas>
Subject: Re: [Tarantool-patches] [PATCH v5 3/5] box/applier: fix nil dereference in applier rollback
List-Id: Tarantool development patches

On Wed, Feb 05, 2020 at 12:55:24PM +0300, Konstantin Osipov wrote:
> * Cyrill Gorcunov [20/02/05 11:31]:
> > > I don't understand this comment. How can it be lost exactly?
> >
> > Hmmm, I think you're right. Actually, unweaving all the possible
> > call traces by hand (which I had to do) is quite an exhausting
> > task, so I might be wrong here.
>
> It was added by the parallel applier patch, so the most likely cause
> of the bug is that this way of cancelling parallel appliers on
> conflict was not tested well enough. Previously we only had a
> single applier per peer, so we did not need to coordinate.

I see, thanks!

> > > Let's begin by explaining why we need to cancel the reader fiber here.
> >
> > This fiber_cancel has already been here; I only added
> > diag_set(FiberIsCancelled) to throw an exception so that the
> > caller would zap this applier fiber. Actually, I think we could
> > retry instead of reaping off the fiber completely, but that
> > requires a deeper understanding of how the applier works, so I
> > left it as a comment.
>
> No, we can't and shouldn't retry. Retry handling is done elsewhere
> already - see replication_skip_conflict.

Sure, will do. Thanks for pointing that out.

> If replication is stopped by an apply error, it's most likely a
> transaction conflict, indicating that the active-active setup is
> broken, so it has to be resolved by a DBA (who can set
> replication_skip_conflict). This is why it's critical to preserve
> the original error.
>
> > Not exactly. If I understand the initial logic of this applier
> > try/catch branch, we need to set up replicaset.applier.diag and
> > then, on FiberIsCancelled, move it from replicaset.applier.diag
> > back to the current fiber->diag.
>
> Please dig into what "current" is here. Which fiber is current if
> there are many fibers handling a single peer?

Will do, thanks!
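
To make the flow concrete, here is a condensed sketch of the
error-preservation pattern under discussion. It is modelled on
src/box/applier.cc and src/box/replication.h from around this series
and is NOT the literal patch: the trigger wiring, the function names
ending in _sketch, and the exact call sites are my assumptions, kept
only to show where the original error can get lost.

/*
 * Sketch only -- assumes the Tarantool core headers:
 *   src/lib/core/fiber.h, src/lib/core/diag.h,
 *   src/lib/core/exception.h, src/box/replication.h,
 *   src/box/applier.h
 */

/* 1. Transaction rollback: stash the real error in the shared,
 *    replicaset-wide diag area before any fiber gets cancelled. */
static int
applier_txn_rollback_sketch(struct trigger *trigger, void *event)
{
	(void)trigger;
	(void)event;
	/* Move the rollback reason out of the current fiber into
	 * storage shared by all appliers. */
	diag_move(&fiber()->diag, &replicaset.applier.diag);
	return 0;
}

/* 2. Per-applier rollback trigger: request that the reader fiber
 *    stop; the trigger itself carries no error information. */
static int
applier_on_rollback_sketch(struct trigger *trigger, void *event)
{
	(void)event;
	struct applier *applier = (struct applier *)trigger->data;
	fiber_cancel(applier->reader);
	return 0;
}

/* 3. Inside the applier loop: once cancelled, raise an explicit
 *    FiberIsCancelled so the caller unwinds into its catch branch
 *    instead of silently continuing. */
static void
applier_check_cancel_sketch(void)
{
	if (fiber_is_cancelled()) {
		diag_set(FiberIsCancelled);
		diag_raise();
	}
}

/* 4. In the caller's FiberIsCancelled handler the stashed error is
 *    restored, which is what keeps it from being lost:
 *
 *        if (!diag_is_empty(&replicaset.applier.diag))
 *                diag_move(&replicaset.applier.diag, &fiber()->diag);
 */

Note that step 4 is exactly where Kostja's question bites: with
several fibers serving one peer, "current fiber" means whichever
fiber happens to run that catch branch, so which diag receives the
restored error depends on who unwinds first.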