From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <kostja.osipov@gmail.com>
Received: from mail-lf1-f68.google.com (mail-lf1-f68.google.com
 [209.85.167.68])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested)
 by dev.tarantool.org (Postfix) with ESMTPS id 27E86469710
 for <tarantool-patches@dev.tarantool.org>;
 Wed,  6 May 2020 21:44:48 +0300 (MSK)
Received: by mail-lf1-f68.google.com with SMTP id u4so2225186lfm.7
 for <tarantool-patches@dev.tarantool.org>;
 Wed, 06 May 2020 11:44:48 -0700 (PDT)
Date: Wed, 6 May 2020 21:44:45 +0300
From: Konstantin Osipov <kostja.osipov@gmail.com>
Message-ID: <20200506184445.GB24913@atlas>
References: <20200403210836.GB18283@tarantool.org>
 <c86ef610-f54e-524e-103a-324e7e572d2d@tarantool.org>
 <20200430145033.GF112@tarantool.org> <20200506085249.GA2842@atlas>
 <20200506163901.GH112@tarantool.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20200506163901.GH112@tarantool.org>
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
List-Id: Tarantool development patches <tarantool-patches.dev.tarantool.org>
List-Unsubscribe: <https://lists.tarantool.org/mailman/options/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=unsubscribe>
List-Archive: <https://lists.tarantool.org/pipermail/tarantool-patches/>
List-Post: <mailto:tarantool-patches@dev.tarantool.org>
List-Help: <mailto:tarantool-patches-request@dev.tarantool.org?subject=help>
List-Subscribe: <https://lists.tarantool.org/mailman/listinfo/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=subscribe>
To: Sergey Ostanevich <sergos@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org, Vladislav Shpilevoy <v.shpilevoy@tarantool.org>

* Sergey Ostanevich <sergos@tarantool.org> [20/05/06 19:41]:
> > >    |               |              |             |              |
> > >    |            [Quorum           |             |              |
> > >    |           achieved]          |             |              |
> > >    |               |              |             |              |
> > >    |         [TXN undo log        |             |              |
> > >    |           destroyed]         |             |              |
> > >    |               |              |             |              |
> > >    |               |---Confirm--->|             |              |
> > >    |               |              |             |              |
> > 
> > What happens if writing Confirm to WAL fails? TXN und log record
> > is destroyed already. Will the server panic now on WAL failure,
> > even if it is intermittent?
> 
> I would like to have an example of intermittent WAL failure. Can it be
> other than problem with disc - be it space/availability/malfunction?

For SAN disks it can simply be a networking issue. The same is
true for any virtual filesystem in the cloud. For local disks it
is most often out of space, but this is not an impossible event.

> For all of those it should be resolved outside the DBMS anyways. So,
> leader should stop and report its problems to orchestrator/admins.

Sergey, I understand that RAFT spec is big and with this spec you
try to split it into manageable parts. The question is how useful
is this particular piece. I'm trying to point out that "the leader
should stop" is not a silver bullet - especially since each such
stop may mean a rejoin of some other node. The purpose of sync
replication is to provide consistency without reducing
availability (i.e. make progress as long as the quorum
of nodes make progress). 

The current spec, suggesting there should be a leader stop in case
of most errors, reduces availability significantly, and doesn't
make external coordinator job any easier - it still has to follow to
the letter the prescriptions of RAFT. 

> landed to WAL - same is for replica.

> 
> > 
> > >    |               |----------Confirm---------->|              |
> > 
> > What happens if peers receive and maybe even write Confirm to their WALs
> > but local WAL write is lost after a restart?
> 
> Did you mean WAL write on leader as a local? Then we have a replica with
> a bigger LSN for the leader ID. 

> > WAL is not synced, 
> > so we can easily lose the tail of the WAL. Tarantool will sync up
> > with all replicas on restart,
> 
> But at this point a new leader will be appointed - the old one is
> restarted. Then the Confirm message will arrive to the restarted leader 
> through a regular replication.

This assumes that restart is guaranteed to be noticed by the
external coordinator and there is an election on every restart.

> > but there will be no "Replication
> > OK" messages from them, so it wouldn't know that the transaction
> > is committed on them. How is this handled? We may end up with some
> > replicas confirming the transaction while the leader will roll it
> > back on restart. Do you suggest there is a human intervention on
> > restart as well?
> > 
> > 
> > >    |               |              |             |              |
> > >    |<---TXN Ok-----|              |       [TXN undo log        |
> > >    |               |              |         destroyed]         |
> > >    |               |              |             |              |
> > >    |               |              |             |---Confirm--->|
> > >    |               |              |             |              |
> > > ```
> > > 
> > > The quorum should be collected as a table for a list of transactions
> > > waiting for quorum. The latest transaction that collects the quorum is
> > > considered as complete, as well as all transactions prior to it, since
> > > all transactions should be applied in order. Leader writes a 'confirm'
> > > message to the WAL that refers to the transaction's [LEADER_ID, LSN] and
> > > the confirm has its own LSN. This confirm message is delivered to all
> > > replicas through the existing replication mechanism.
> > > 
> > > Replica should report a TXN application success to the leader via the
> > > IPROTO explicitly to allow leader to collect the quorum for the TXN.
> > > In case of application failure the replica has to disconnect from the
> > > replication the same way as it is done now. The replica also has to
> > > report its disconnection to the orchestrator. Further actions require
> > > human intervention, since failure means either technical problem (such
> > > as not enough space for WAL) that has to be resovled or an inconsistent
> > > state that requires rejoin.
> > 
> > > As soon as leader appears in a situation it has not enough replicas
> > > to achieve quorum, the cluster should stop accepting any requests - both
> > > write and read.
> > 
> > How does *the cluster* know the state of the leader and if it
> > doesn't, how it can possibly implement this? Did you mean
> > the leader should stop accepting transactions here? But how can
> > the leader know if it has not enough replicas during a read
> > transaction, if it doesn't contact any replica to serve a read?
> 
> I expect to have a disconnection trigger assigned to all relays so that
> disconnection will cause the number of replicas decrease. The quorum
> size is static, so we can stop at the very moment the number dives below.

What happens between the event the leader is partitioned away and
a new leader is elected?

The leader may be unaware of the events and serve a read just
fine.

So at least you can't say the leader shouldn't be serving reads
without quorum - because the only way to achieve it is to collect
a quorum of responses to reads as well.

> > > The reason for this is that replication of transactions
> > > can achieve quorum on replicas not visible to the leader. On the other
> > > hand, leader can't achieve quorum with available minority. Leader has to
> > > report the state and wait for human intervention. There's an option to
> > > ask leader to rollback to the latest transaction that has quorum: leader
> > > issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN
> > > is of the first transaction in the leader's undo log. The rollback
> > > message replicated to the available cluster will put it in a consistent
> > > state. After that configuration of the cluster can be updated to
> > > available quorum and leader can be switched back to write mode.
> > 
> > As you should be able to conclude from restart scenario, it is
> > possible a replica has the record in *confirmed* state but the
> > leader has it in pending state. The replica will not be able to
> > roll back then. Do you suggest the replica should abort if it
> > can't rollback? This may lead to an avalanche of rejoins on leader
> > restart, bringing performance to a halt.
> 
> No, I declare replica with biggest LSN as a new shining leader. More
> than that, new leader can (so far it will be by default) finalize the
> former leader life's work by replicating txns and appropriate confirms.

Right, this also assumes the restart is noticed, so it follows the
same logic.

-- 
Konstantin Osipov, Moscow, Russia
https://scylladb.com