From: Konstantin Osipov
Date: Fri, 24 Apr 2020 01:28:37 +0300
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
To: Vladislav Shpilevoy
Cc: tarantool-patches@dev.tarantool.org

* Vladislav Shpilevoy [20/04/24 00:42]:
> It says that once the quorum is collected and 'confirm' is written
> to the local leader's WAL, the transaction is considered committed
> and is reported to the client as successful.
>
> On the other hand it is said that in case of a leader change the
> new leader will roll back all unconfirmed transactions. That leads
> to the following bug:
>
> Assume we have 4 instances: i1, i2, i3, i4. Leader is i1. It
> writes a transaction with LSN1. The LSN1 is sent to the other nodes,
> they apply it ok, and send acks to the leader. The leader sees
> i2-i4 all applied the transaction (propagated their LSNs to LSN1).
> It writes 'confirm' to its local WAL, reports it to the client as
> success, the client's request is over, it is returned back to
> some remote node, etc. The transaction is officially synchronously
> committed.
>
> Then the leader's machine dies - the disk is dead. The confirm was
> not sent to any of the other nodes. For example, it started having
> problems with the network connection to the replicas shortly before
> the death. Or it just didn't manage to hand the confirm out.
>
> From now on, if any of the other nodes i2-i4 becomes a leader, it
> will roll back the officially confirmed transaction, even if it
> has it, and all the other nodes do too.
>
> That basically means this sync replication gives exactly the same
> guarantees as the async replication - 'confirm' on the leader tells
> nothing about the replicas except that they *are able to apply the
> transaction*, but still may not apply it.
>
> Am I missing something?

This video explains what the leader has to do after it's been elected:
https://www.youtube.com/watch?v=YbZ3zDzDnrw

In short, the transactions in the leader's WAL have to be committed,
not rolled back. The Raft paper (https://raft.github.io/raft.pdf) has
the answers in a concise single-page summary.

Why have this discussion at all? Any ambiguity or discrepancy between
this document and the Raft paper should be treated as a mistake in
this document. Or do you actually think it's possible to invent a new
consensus algorithm this way?

> Note for those who are concerned: this has nothing to do with the
> in-memory relay. It has the same problems, which are in the protocol,
> not in the implementation.

No, the issues are distinct:

1) There may be cases where this paper doesn't follow Raft. It should
be obvious to everyone that, with the exception of external leader
election and failure detection, it has to if correctness is of any
concern, so it's simply a matter of fixing this doc to match Raft.
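To make the referenced rule concrete, here is a minimal sketch of how a
newly elected Raft leader treats the tail of its own log. This is
illustrative Python only, with made-up names (Entry, Leader, on_elected,
on_ack); it is not Tarantool code or a proposed API:

    # A new leader never rolls back entries already in its log; it
    # re-replicates them and advances the commit point once an entry of
    # its *own* term is acknowledged by a quorum (sec. 5.4.2 of the
    # paper cited above).
    from dataclasses import dataclass

    @dataclass
    class Entry:
        term: int       # term in which the entry was created
        lsn: int        # position in the log, starting from 1
        payload: bytes

    class Leader:
        def __init__(self, term, log, quorum):
            self.term = term        # the new, higher term
            self.log = log          # includes entries inherited from the old leader
            self.quorum = quorum    # e.g. N // 2 + 1
            self.commit_lsn = 0

        def on_elected(self):
            # Keep the inherited tail as-is, append a no-op entry of the
            # new term, and replicate the whole tail to the followers.
            self.log.append(Entry(self.term, len(self.log) + 1, b'nop'))

        def on_ack(self, acked_lsns):
            # acked_lsns: highest LSN acknowledged by each follower.
            for entry in reversed(self.log):
                copies = 1 + sum(1 for lsn in acked_lsns if lsn >= entry.lsn)
                if copies >= self.quorum and entry.term == self.term:
                    # Committing an entry of the current term transitively
                    # commits all older entries, including inherited ones.
                    self.commit_lsn = max(self.commit_lsn, entry.lsn)
                    break

Under these rules, in the scenario quoted above, whichever of i2-i4 is
elected keeps LSN1 in its log and commits it as soon as its own no-op
entry reaches a quorum, instead of rolling LSN1 back.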
As to the leader election, there are two alternatives: either spec out
in this paper how the external election interacts with the cluster,
including finishing up old transactions and neutralizing old leaders
(a rough sketch of the latter is at the end of this mail), or allow
multi-master, and so forget about consistency for now.

2) An implementation based on triggers will be complicated and will
have performance/stability implications. This is what I hope I was
able to convey, and in this case we can put the matter to rest.

-- 
Konstantin Osipov, Moscow, Russia
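For what "neutralizing old leaders" could involve, a similarly minimal
sketch, again with made-up names and not a proposed Tarantool API: every
replication or confirm request carries the term it was issued in, and a
replica that has already seen a newer term rejects it, so a deposed
leader can neither replicate nor confirm anything behind the new
leader's back:

    class Replica:
        def __init__(self):
            self.term = 0                  # highest term seen so far

        def on_request(self, request_term, apply_cb):
            if request_term < self.term:
                return 'rejected'          # stale leader is fenced off
            self.term = request_term       # adopt the newer term, if any
            apply_cb()                     # apply the replicated data
            return 'ok'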