From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf1-f52.google.com (mail-lf1-f52.google.com [209.85.167.52]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id BC4BB46970E for ; Wed, 5 Feb 2020 12:55:26 +0300 (MSK) Received: by mail-lf1-f52.google.com with SMTP id r14so1032806lfm.5 for ; Wed, 05 Feb 2020 01:55:26 -0800 (PST) Date: Wed, 5 Feb 2020 12:55:24 +0300 From: Konstantin Osipov Message-ID: <20200205095524.GD4624@atlas> References: <20200127215306.31681-1-gorcunov@gmail.com> <20200127215306.31681-4-gorcunov@gmail.com> <20200204221110.GC20146@atlas> <20200205082721.GJ12445@uranus> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200205082721.GJ12445@uranus> Subject: Re: [Tarantool-patches] [PATCH v5 3/5] box/applier: fix nil dereference in applier rollback List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Cyrill Gorcunov Cc: tml * Cyrill Gorcunov [20/02/05 11:31]: > > I don't understand this comment. How can it be lost exactly? > > Hmmm, I think you're right. Actually unweaving the all possible > call traces by hands (which I had to do) is quite exhausting task > so I might be wrong here. It was added by parallel applier patch, so the most likely cause of the bug is that this way of cancelling parallel appliers on conflict was not tested well enough. Previously we only had a single applier per peer so did not need to coordinate. > > Let's begin by explaining why we need to cancel the reader fiber here. > > This fiber_cancel has been already here, I only added diag_set(FiberIsCancelled) > to throw an exception thus the caller would zap this applier fiber. > Actually I think we could retry instead of reaping off the fiber > completely but it requies more deep understanding of how applier > works. So I left it in comment. No, we can't and shouldn't retry. Retry handling is done elsewhere already - see replication_skip_conflict. If replication is stopped by an apply error, it's most likely a transaction conflict, indicating that active-active setup is broken, so it has to be resolved by a DBA (which can set replication_skip_conflict). This is why it's critical to preserve the original error. > > Not exactly, if I understand the initial logic of this applier > try/cath branch we need to setup replicaset.applier.diag and > then on FiberIsCancelled we should move it from replicaset.applier.diag > back to current fiber->diag. Please dig into what is "current" here. Which fiber is current if there are many fibers handling a single peer? -- Konstantin Osipov, Moscow, Russia https://scylladb.com