From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <sergos@tarantool.org>
Received: from smtp41.i.mail.ru (smtp41.i.mail.ru [94.100.177.101])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by dev.tarantool.org (Postfix) with ESMTPS id 9E831469710
 for <tarantool-patches@dev.tarantool.org>;
 Wed,  6 May 2020 19:39:02 +0300 (MSK)
Date: Wed, 6 May 2020 19:39:01 +0300
From: Sergey Ostanevich <sergos@tarantool.org>
Message-ID: <20200506163901.GH112@tarantool.org>
References: <20200403210836.GB18283@tarantool.org>
 <c86ef610-f54e-524e-103a-324e7e572d2d@tarantool.org>
 <20200430145033.GF112@tarantool.org> <20200506085249.GA2842@atlas>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20200506085249.GA2842@atlas>
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
List-Id: Tarantool development patches <tarantool-patches.dev.tarantool.org>
List-Unsubscribe: <https://lists.tarantool.org/mailman/options/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=unsubscribe>
List-Archive: <https://lists.tarantool.org/pipermail/tarantool-patches/>
List-Post: <mailto:tarantool-patches@dev.tarantool.org>
List-Help: <mailto:tarantool-patches-request@dev.tarantool.org?subject=help>
List-Subscribe: <https://lists.tarantool.org/mailman/listinfo/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=subscribe>
To: Konstantin Osipov <kostja.osipov@gmail.com>, Vladislav Shpilevoy <v.shpilevoy@tarantool.org>, tarantool-patches@dev.tarantool.org

Hi!

Thanks for review!

> >    |               |              |             |              |
> >    |            [Quorum           |             |              |
> >    |           achieved]          |             |              |
> >    |               |              |             |              |
> >    |         [TXN undo log        |             |              |
> >    |           destroyed]         |             |              |
> >    |               |              |             |              |
> >    |               |---Confirm--->|             |              |
> >    |               |              |             |              |
> 
> What happens if writing Confirm to WAL fails? TXN und log record
> is destroyed already. Will the server panic now on WAL failure,
> even if it is intermittent?

I would like to have an example of intermittent WAL failure. Can it be
other than problem with disc - be it space/availability/malfunction?

For all of those it should be resolved outside the DBMS anyways. So,
leader should stop and report its problems to orchestrator/admins.

I would agree that undo log can be destroyed *after* the Confirm is
landed to WAL - same is for replica.

> 
> >    |               |----------Confirm---------->|              |
> 
> What happens if peers receive and maybe even write Confirm to their WALs
> but local WAL write is lost after a restart?

Did you mean WAL write on leader as a local? Then we have a replica with
a bigger LSN for the leader ID. 

> WAL is not synced, 
> so we can easily lose the tail of the WAL. Tarantool will sync up
> with all replicas on restart,

But at this point a new leader will be appointed - the old one is
restarted. Then the Confirm message will arrive to the restarted leader 
through a regular replication.

> but there will be no "Replication
> OK" messages from them, so it wouldn't know that the transaction
> is committed on them. How is this handled? We may end up with some
> replicas confirming the transaction while the leader will roll it
> back on restart. Do you suggest there is a human intervention on
> restart as well?
> 
> 
> >    |               |              |             |              |
> >    |<---TXN Ok-----|              |       [TXN undo log        |
> >    |               |              |         destroyed]         |
> >    |               |              |             |              |
> >    |               |              |             |---Confirm--->|
> >    |               |              |             |              |
> > ```
> > 
> > The quorum should be collected as a table for a list of transactions
> > waiting for quorum. The latest transaction that collects the quorum is
> > considered as complete, as well as all transactions prior to it, since
> > all transactions should be applied in order. Leader writes a 'confirm'
> > message to the WAL that refers to the transaction's [LEADER_ID, LSN] and
> > the confirm has its own LSN. This confirm message is delivered to all
> > replicas through the existing replication mechanism.
> > 
> > Replica should report a TXN application success to the leader via the
> > IPROTO explicitly to allow leader to collect the quorum for the TXN.
> > In case of application failure the replica has to disconnect from the
> > replication the same way as it is done now. The replica also has to
> > report its disconnection to the orchestrator. Further actions require
> > human intervention, since failure means either technical problem (such
> > as not enough space for WAL) that has to be resovled or an inconsistent
> > state that requires rejoin.
> 
> > As soon as leader appears in a situation it has not enough replicas
> > to achieve quorum, the cluster should stop accepting any requests - both
> > write and read.
> 
> How does *the cluster* know the state of the leader and if it
> doesn't, how it can possibly implement this? Did you mean
> the leader should stop accepting transactions here? But how can
> the leader know if it has not enough replicas during a read
> transaction, if it doesn't contact any replica to serve a read?

I expect to have a disconnection trigger assigned to all relays so that
disconnection will cause the number of replicas decrease. The quorum
size is static, so we can stop at the very moment the number dives below.

> 
> > The reason for this is that replication of transactions
> > can achieve quorum on replicas not visible to the leader. On the other
> > hand, leader can't achieve quorum with available minority. Leader has to
> > report the state and wait for human intervention. There's an option to
> > ask leader to rollback to the latest transaction that has quorum: leader
> > issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN
> > is of the first transaction in the leader's undo log. The rollback
> > message replicated to the available cluster will put it in a consistent
> > state. After that configuration of the cluster can be updated to
> > available quorum and leader can be switched back to write mode.
> 
> As you should be able to conclude from restart scenario, it is
> possible a replica has the record in *confirmed* state but the
> leader has it in pending state. The replica will not be able to
> roll back then. Do you suggest the replica should abort if it
> can't rollback? This may lead to an avalanche of rejoins on leader
> restart, bringing performance to a halt.

No, I declare replica with biggest LSN as a new shining leader. More
than that, new leader can (so far it will be by default) finalize the
former leader life's work by replicating txns and appropriate confirms.

Sergos.