From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <kostja.osipov@gmail.com>
Received: from mail-lf1-f67.google.com (mail-lf1-f67.google.com
 [209.85.167.67])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested)
 by dev.tarantool.org (Postfix) with ESMTPS id 2BBB2469710
 for <tarantool-patches@dev.tarantool.org>;
 Wed,  6 May 2020 11:52:52 +0300 (MSK)
Received: by mail-lf1-f67.google.com with SMTP id b26so657746lfa.5
 for <tarantool-patches@dev.tarantool.org>;
 Wed, 06 May 2020 01:52:52 -0700 (PDT)
Date: Wed, 6 May 2020 11:52:49 +0300
From: Konstantin Osipov <kostja.osipov@gmail.com>
Message-ID: <20200506085249.GA2842@atlas>
References: <20200403210836.GB18283@tarantool.org>
 <c86ef610-f54e-524e-103a-324e7e572d2d@tarantool.org>
 <20200430145033.GF112@tarantool.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20200430145033.GF112@tarantool.org>
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
List-Id: Tarantool development patches <tarantool-patches.dev.tarantool.org>
List-Unsubscribe: <https://lists.tarantool.org/mailman/options/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=unsubscribe>
List-Archive: <https://lists.tarantool.org/pipermail/tarantool-patches/>
List-Post: <mailto:tarantool-patches@dev.tarantool.org>
List-Help: <mailto:tarantool-patches-request@dev.tarantool.org?subject=help>
List-Subscribe: <https://lists.tarantool.org/mailman/listinfo/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=subscribe>
To: Sergey Ostanevich <sergos@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org, Vladislav Shpilevoy <v.shpilevoy@tarantool.org>

* Sergey Ostanevich <sergos@tarantool.org> [20/04/30 17:51]:
> Hi!
> 
> Thanks for the review!
> 
> After a long discussion we agreed to rework the RFC. 
> 
> On 23 Apr 23:38, Vladislav Shpilevoy wrote:
> > It says, that once the quorum is collected, and 'confirm' is written
> > to local leader's WAL, it is considered committed and is reported
> > to the client as successful.
> > 
> > On the other hand it is said, that in case of leader change the
> > new leader will rollback all not confirmed transactions. That leads
> 
> This is no longer right: we decided to follow Raft's approach, in which the
> leader rules the world and hence commits all changes in its WAL.
> 
> > 
> > Another issue is with failure detection. Let's assume that we wait
> > for 'confirm' to be propagated on quorum of replicas too. Assume
> > some replicas responded with an error. So they first said they can
> > apply the transaction, and saved it into their WALs, and then they
> > couldn't apply confirm. That could happen because of 2 reasons:
> > replica has problems with WAL, or the replica becomes unreachable
> > from the master.
> > 
> > WAL-problematic replicas can be disconnected forcefully, since they
> > are clearly not able to work properly anymore. But what to do with
> > disconnected replicas? 'Confirm' can't wait for them forever - we
> > will run out of fibers, if we have even just hundreds of RPS of
> > sync transactions, and wait for, let's say, a few minutes. On the
> > other hand we can't roll them back, because 'confirm' has been
> > written to the local WAL already.
> 
> Here we agreed that the replica will be kicked out of the cluster and
> wait for human intervention to fix the problems - probably with a
> rejoin. If the available replicas are not enough to achieve the quorum,
> the leader also reports the problem and stops cluster operation until
> the cluster is reconfigured or the number of replicas becomes sufficient.
> 
> Below is the new RFC, available at
> https://github.com/tarantool/tarantool/blob/sergos/quorum-based-synchro/doc/rfc/quorum-based-synchro.md
> 
> ---
> * **Status**: In progress
> * **Start date**: 31-03-2020
> * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\>
> * **Issues**: https://github.com/tarantool/tarantool/issues/4842
> 
> ## Summary
> 
> The aim of this RFC is to address the following list of problems
> formulated at MRG planning meeting:
>   - protocol backward compatibility to enable cluster upgrade w/o
>     downtime
>   - consistency of data on replica and leader
>   - switch from leader to replica without data loss
>   - up to date replicas to run read-only requests
>   - ability to switch async replicas into sync ones and vice versa
>   - guarantee of rollback on leader and sync replicas
>   - simplicity of cluster orchestration
> 
> What this RFC is not:
> 
>   - high availability (HA) solution with automated failover, roles
>     assignments and so on
>   - master-master configuration support
> 
> ## Background and motivation
> 
> There are a number of known implementations of consistent data presence
> in a Tarantool cluster. They can be commonly named the "wait for LSN"
> technique. The biggest issue with this technique is the absence of
> rollback guarantees on a replica in case of transaction failure on one
> master or some of the replicas in the cluster.
> 
> To provide such capabilities a new functionality should be introduced in
> Tarantool core, with requirements mentioned before - backward
> compatibility and ease of cluster orchestration.
> 
> ## Detailed design
> 
> ### Quorum commit
> 
> The main idea behind the proposal is to reuse existing machinery as much
> as possible. This ensures that functionality well-tested and proven
> across many instances in MRG and beyond is used. The transaction rollback
> mechanism is in place and works for WAL write failure. If we substitute
> the WAL success with a new situation, which is named 'quorum' later in
> this document, then no changes to the machinery are needed. The same is
> true for the snapshot machinery, which allows creating a copy of the
> database in memory for the whole period of the snapshot file write.
> Adding quorum here also minimizes changes.
> 
> Currently replication is represented by the following scheme:
> ```
> Customer        Leader          WAL(L)        Replica        WAL(R)
>    |------TXN----->|              |             |              |
>    |               |              |             |              |
>    |         [TXN undo log        |             |              |
>    |            created]          |             |              |
>    |               |              |             |              |
>    |               |-----TXN----->|             |              |
>    |               |              |             |              |
>    |               |<---WAL Ok----|             |              |
>    |               |              |             |              |
>    |         [TXN undo log        |             |              |
>    |           destroyed]         |             |              |
>    |               |              |             |              |
>    |<----TXN Ok----|              |             |              |
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              |
>    |               |              |       [TXN undo log        |
>    |               |              |          created]          |
>    |               |              |             |              |
>    |               |              |             |-----TXN----->|
>    |               |              |             |              |
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              |
>    |               |              |       [TXN undo log        |
>    |               |              |         destroyed]         |
>    |               |              |             |              |
> ```
> 
> To introduce the 'quorum' we have to receive confirmation from replicas
> to make a decision on whether the quorum is actually present. The leader
> collects the necessary number of replica confirmations plus its own WAL
> success. This state is named 'quorum' and gives the leader the right to
> complete the customer's request. So the picture will change to:
> ```
> Customer        Leader          WAL(L)        Replica        WAL(R)
>    |------TXN----->|              |             |              |
>    |               |              |             |              |
>    |         [TXN undo log        |             |              |
>    |            created]          |             |              |
>    |               |              |             |              |
>    |               |-----TXN----->|             |              |
>    |               |              |             |              |
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              |
>    |               |              |       [TXN undo log        |
>    |               |<---WAL Ok----|          created]          |
>    |               |              |             |              |
>    |           [Waiting           |             |-----TXN----->|
>    |         of a quorum]         |             |              |
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              |
>    |               |<------Replication Ok-------|              |
>    |               |              |             |              |
>    |            [Quorum           |             |              |
>    |           achieved]          |             |              |
>    |               |              |             |              |
>    |         [TXN undo log        |             |              |
>    |           destroyed]         |             |              |
>    |               |              |             |              |
>    |               |---Confirm--->|             |              |
>    |               |              |             |              |

What happens if writing Confirm to WAL fails? The TXN undo log record
has already been destroyed. Will the server panic now on WAL failure,
even if it is intermittent?

>    |               |----------Confirm---------->|              |

What happens if peers receive and maybe even write Confirm to their WALs
but the local WAL write is lost after a restart? The WAL is not synced,
so we can easily lose the tail of the WAL. Tarantool will sync up
with all replicas on restart, but there will be no "Replication
OK" messages from them, so it wouldn't know that the transaction
is committed on them. How is this handled? We may end up with some
replicas confirming the transaction while the leader rolls it
back on restart. Do you suggest human intervention on restart as well?


>    |               |              |             |              |
>    |<---TXN Ok-----|              |       [TXN undo log        |
>    |               |              |         destroyed]         |
>    |               |              |             |              |
>    |               |              |             |---Confirm--->|
>    |               |              |             |              |
> ```
> 
> The quorum should be collected as a table for a list of transactions
> waiting for quorum. The latest transaction that collects the quorum is
> considered as complete, as well as all transactions prior to it, since
> all transactions should be applied in order. Leader writes a 'confirm'
> message to the WAL that refers to the transaction's [LEADER_ID, LSN] and
> the confirm has its own LSN. This confirm message is delivered to all
> replicas through the existing replication mechanism.
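
To check that I read the paragraph above correctly, here is a minimal
Python sketch of the leader-side bookkeeping. It is not Tarantool code:
QuorumTracker, ack() and the node names are made up for illustration, and
it assumes that an ack for LSN n implies all earlier LSNs were written too.

```python
class QuorumTracker:
    """Hypothetical leader-side table of transactions waiting for quorum."""

    def __init__(self, quorum):
        self.quorum = quorum     # acks needed, counting the leader's own WAL write
        self.acked = {}          # node id -> highest LSN that node wrote to its WAL
        self.confirmed_lsn = 0   # highest LSN already covered by a 'confirm'

    def ack(self, node_id, lsn):
        """Record an ack; return the LSN for a new 'confirm', or None."""
        # Assumption: transactions are applied in order, so an ack for
        # LSN n also covers every LSN below n.
        self.acked[node_id] = max(self.acked.get(node_id, 0), lsn)
        ranked = sorted(self.acked.values(), reverse=True)
        if len(ranked) < self.quorum:
            return None
        # The quorum-th largest acked LSN is present on at least `quorum` nodes.
        candidate = ranked[self.quorum - 1]
        if candidate <= self.confirmed_lsn:
            return None
        self.confirmed_lsn = candidate
        return candidate


tracker = QuorumTracker(quorum=2)
first = tracker.ack("leader", 3)   # only one node so far: no confirm
second = tracker.ack("r1", 2)      # LSN 2 is now on two nodes: confirm LSN 2
third = tracker.ack("r1", 3)       # LSN 3 is now on two nodes: confirm LSN 3
```

A single 'confirm' for the returned LSN covers every pending transaction
at or below it, which is how I understand "as well as all transactions
prior to it".
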
> 
> Replica should report a TXN application success to the leader via the
> IPROTO explicitly to allow leader to collect the quorum for the TXN.
> In case of application failure the replica has to disconnect from the
> replication the same way as it is done now. The replica also has to
> report its disconnection to the orchestrator. Further actions require
> human intervention, since failure means either technical problem (such
> as not enough space for WAL) that has to be resolved or an inconsistent
> state that requires rejoin.

> As soon as the leader finds itself without enough replicas
> to achieve quorum, the cluster should stop accepting any requests - both
> write and read.

How does *the cluster* know the state of the leader, and if it
doesn't, how can it possibly implement this? Did you mean
the leader should stop accepting transactions here? But how can
the leader know that it does not have enough replicas during a read
transaction, if it doesn't contact any replica to serve a read?

> The reason for this is that replication of transactions
> can achieve quorum on replicas not visible to the leader. On the other
> hand, the leader can't achieve quorum with the available minority. The
> leader has to report the state and wait for human intervention. There's
> an option to ask the leader to roll back to the latest transaction that
> has quorum: the leader issues a 'rollback' message referring to the
> [LEADER_ID, LSN] where LSN is that of the first transaction in the
> leader's undo log. The rollback message, replicated to the available
> cluster, will put it in a consistent state. After that the configuration
> of the cluster can be updated to the available quorum and the leader can
> be switched back to write mode.

As you should be able to conclude from the restart scenario, it is
possible for a replica to have the record in *confirmed* state while the
leader has it in pending state. The replica will not be able to
roll back then. Do you suggest the replica should abort if it
can't roll back? This may lead to an avalanche of rejoins on leader
restart, bringing performance to a halt.
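
To make the divergence concrete, here is a toy Python model of the
restart scenario. It is only a sketch, not Tarantool code: Node, write(),
crash_and_restart() and state_of() are invented names, a WAL is a plain
list, and a 'confirm' record is assumed to carry the LSN of the
transaction it confirms.

```python
class Node:
    """Toy node: the WAL is a list of records, only a prefix of which is fsynced."""

    def __init__(self, name):
        self.name = name
        self.wal = []      # records: ("txn", lsn) or ("confirm", lsn)
        self.synced = 0    # number of WAL records guaranteed to survive a crash

    def write(self, record, sync=False):
        self.wal.append(record)
        if sync:
            self.synced = len(self.wal)

    def crash_and_restart(self):
        # Everything written after the last fsync is lost.
        self.wal = self.wal[:self.synced]

    def state_of(self, lsn):
        if ("confirm", lsn) in self.wal:
            return "confirmed"
        if ("txn", lsn) in self.wal:
            return "pending"
        return "unknown"


def restart_scenario():
    leader, replica = Node("leader"), Node("replica")
    leader.write(("txn", 1), sync=True)
    replica.write(("txn", 1), sync=True)
    leader.write(("confirm", 1))               # written, but not yet synced
    replica.write(("confirm", 1), sync=True)   # replicated and durable on the replica
    leader.crash_and_restart()                 # leader loses the unsynced WAL tail
    return leader.state_of(1), replica.state_of(1)


leader_state, replica_state = restart_scenario()
```

After the restart the leader sees the transaction as pending while the
replica has it confirmed, so a 'rollback' issued by the leader would ask
the replica to undo something it has already committed.
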


-- 
Konstantin Osipov, Moscow, Russia