Date: Tue, 12 May 2020 19:42:02 +0300
From: Konstantin Osipov
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
To: Sergey Ostanevich
Cc: tarantool-patches@dev.tarantool.org, Vladislav Shpilevoy
Message-ID: <20200512164202.GA2049@atlas>
In-Reply-To: <20200512155508.GJ112@tarantool.org>

* Sergey Ostanevich [20/05/12 18:56]:
> On 06 May 21:44, Konstantin Osipov wrote:
> > * Sergey Ostanevich [20/05/06 19:41]:
> > > > > |              |              |       |       |
> > > > > |           [Quorum           |       |       |
> > > > > |            achieved]        |       |       |
> > > > > |              |              |       |       |
> > > > > |           [TXN undo log     |       |       |
> > > > > |            destroyed]       |       |       |
> > > > > |              |              |       |       |
> > > > > |              |---Confirm--->|       |       |
> > > > > |              |              |       |       |
> > > > What happens if writing Confirm to WAL fails? TXN undo log record
> > > > is destroyed already. Will the server panic now on WAL failure,
> > > > even if it is intermittent?
> > > I would like to have an example of intermittent WAL failure. Can it be
> > > other than a problem with the disk - be it space/availability/malfunction?
> > For SAN disks it can simply be a networking issue. The same is
> > true for any virtual filesystem in the cloud. For local disks it
> > is most often out of space, but this is not an impossible event.
> The SANDisk is an SSD vendor. I bet you mean NAS - network array
> storage, isn't it? Then I see no difference in WAL write into NAS in
> the current schema - you will catch a timeout, WAL will report failure,
> the replica stops.

SAN stands for storage area network. There is no timeout in the WAL
tx bus and no timeout in WAL I/O. A replica doesn't stop on an
intermittent failure.

Stopping a replica on an intermittent failure reduces the availability
of non-sync writes.

It seems you have some assumptions in mind which are not in the
document - e.g. that some timeouts are added. They are not in the POC
either. I suppose the document is expected to explain quite accurately
what has to be done, e.g. how these new timeouts would work?

> > > For all of those it should be resolved outside the DBMS anyways. So,
> > > the leader should stop and report its problems to orchestrator/admins.
> > Sergey, I understand that the RAFT spec is big and with this spec you
> > try to split it into manageable parts. The question is how useful
> > this particular piece is. I'm trying to point out that "the leader
> > should stop" is not a silver bullet - especially since each such
> > stop may mean a rejoin of some other node. The purpose of sync
> > replication is to provide consistency without reducing
> > availability (i.e. make progress as long as the quorum
> > of nodes make progress).
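To make the window I'm asking about concrete, here is a toy C sketch of
the ordering the diagram above implies: quorum reached, then the undo
log is destroyed, and only then Confirm is written. It is purely
illustrative - none of these names exist in the actual code or in the
POC:

#include <stdbool.h>
#include <stdio.h>

struct txn {
    long lsn;
    bool undo_log_present;
};

/* Pretend the quorum has already been collected for this transaction. */
static bool quorum_reached(const struct txn *txn) { (void)txn; return true; }

/* Simulate an intermittent error while writing the Confirm record. */
static int wal_write_confirm(long lsn) { (void)lsn; return -1; }

static const char *
leader_commit(struct txn *txn)
{
    if (!quorum_reached(txn))
        return "waiting for quorum";
    /* Step 1, per the diagram: the TXN undo log is destroyed. */
    txn->undo_log_present = false;
    /* Step 2: only now the Confirm record goes to the local WAL. */
    if (wal_write_confirm(txn->lsn) != 0) {
        /*
         * The window in question: quorum is reached, the undo log is
         * already gone, yet Confirm is not durable.  The transaction
         * can be neither rolled back (nothing to undo) nor treated as
         * confirmed (no WAL record).
         */
        return "confirm lost after undo log was destroyed";
    }
    return "confirmed";
}

int main(void)
{
    struct txn txn = { .lsn = 42, .undo_log_present = true };
    printf("%s\n", leader_commit(&txn));
    return 0;
}

If wal_write_confirm() fails in this sketch, there is neither an undo
log to roll back nor a Confirm on disk - that is exactly the state I'm
asking about.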
>
> I'm not sure if we're talking about the same RAFT - mine is "In Search
> of an Understandable Consensus Algorithm (Extended Version)" from
> Stanford as of May 2014. And it is 15 pages - including references,
> conclusions and intro. Seems not that big.
>
> Although, most of it is dedicated to the leader election itself, which
> we intentionally put aside from this RFC. It is written in the very
> beginning and I emphasized this by explicitly mentioning it.

I conclude that it is big from the state of this document. It provides
some coverage of the normal operation. Leader election, failure
detection, recovery/restart, replication configuration changes are
either barely mentioned or not covered at all. I find no other reason
to not cover them except to be able to come up with an MVP quicker.
Do you?

> > The current spec, suggesting there should be a leader stop in case
> > of most errors, reduces availability significantly, and doesn't
> > make the external coordinator's job any easier - it still has to
> > follow to the letter the prescriptions of RAFT.
>
> So, the postponing of a commit until quorum collection is the most
> useful part of this RFC, also to some point I'm trying to address the
> WAL inconsistency.
> Although, it can be covered only partly: if a leader's log diverges
> in unconfirmed transactions only, then they can be rolled back easily.
> Technically, it should be enough if the leader changed for a replica
> from the cluster majority at the moment of failure. Otherwise it will
> require pre-parsing of the WAL and it can well happen that the WAL is
> not long enough, hence the ex-leader still needs a complete bootstrap.

I don't understand what pre-parsing is, or how what you write is
relevant to the fact that reduced availability of non-raft writes is
bad.

> > > But at this point a new leader will be appointed - the old one is
> > > restarted. Then the Confirm message will arrive to the restarted leader
> > > through regular replication.
> >
> > This assumes that the restart is guaranteed to be noticed by the
> > external coordinator and there is an election on every restart.
>
> Sure yes, if it restarted - then the connection loss can't go unnoticed
> by anyone, be it coordinator or cluster.

Well, the spec doesn't say anywhere that the external coordinator has
to establish a TCP connection to every participant. Could you please
add a chapter where this is clarified? It seems you have a specific
coordinator in mind?

> > > > > The quorum should be collected as a table for a list of transactions
> > > > > waiting for quorum. The latest transaction that collects the quorum is
> > > > > considered as complete, as well as all transactions prior to it, since
> > > > > all transactions should be applied in order. The leader writes a
> > > > > 'confirm' message to the WAL that refers to the transaction's
> > > > > [LEADER_ID, LSN] and the confirm has its own LSN. This confirm message
> > > > > is delivered to all replicas through the existing replication mechanism.
> > > > >
> > > > > A replica should report a TXN application success to the leader via
> > > > > IPROTO explicitly to allow the leader to collect the quorum for the TXN.
> > > > > In case of application failure the replica has to disconnect from the
> > > > > replication the same way as it is done now. The replica also has to
> > > > > report its disconnection to the orchestrator. Further actions require
> > > > > human intervention, since failure means either a technical problem (such
> > > > > as not enough space for WAL) that has to be resolved or an inconsistent
> > > > > state that requires rejoin.
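As an aside for anyone following the thread: the bookkeeping described
in the quoted paragraph boils down to roughly the following. This is an
invented, simplified model (fixed-size array, no real replication), not
the actual patch:

#include <stdio.h>

#define MAX_PENDING 16

struct pending_txn {
    long lsn;       /* LSN assigned by the leader, in ascending order */
    int ack_count;  /* nodes (leader included) that reported success */
};

struct quorum_table {
    struct pending_txn txns[MAX_PENDING];
    int size;       /* transactions still waiting for the quorum */
    int quorum;     /* required number of acknowledgements */
};

/* A replica reported success for everything up to `lsn`; acks are
 * cumulative because transactions are applied in order. */
static void
quorum_table_ack(struct quorum_table *qt, long lsn)
{
    for (int i = 0; i < qt->size; i++)
        if (qt->txns[i].lsn <= lsn)
            qt->txns[i].ack_count++;
}

/* The latest transaction that collected the quorum is complete, and so
 * is everything before it: return its LSN (the one to be put into the
 * Confirm record as [LEADER_ID, LSN]) and drop the confirmed prefix. */
static long
quorum_table_collect(struct quorum_table *qt)
{
    int confirmed = 0;
    while (confirmed < qt->size &&
           qt->txns[confirmed].ack_count >= qt->quorum)
        confirmed++;
    if (confirmed == 0)
        return -1; /* nothing has reached the quorum yet */
    long confirm_lsn = qt->txns[confirmed - 1].lsn;
    for (int i = confirmed; i < qt->size; i++)
        qt->txns[i - confirmed] = qt->txns[i];
    qt->size -= confirmed;
    return confirm_lsn;
}

int main(void)
{
    struct quorum_table qt = {
        .size = 3, .quorum = 2,
        .txns = { { 10, 1 }, { 11, 1 }, { 12, 1 } }, /* leader's own acks */
    };
    quorum_table_ack(&qt, 11);   /* one replica acked everything up to 11 */
    printf("confirm up to LSN %ld\n", quorum_table_collect(&qt)); /* 11 */
    return 0;
}

The only rule it demonstrates is the one from the quote: once some LSN
collects the quorum, everything up to and including it is covered by a
single Confirm record.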
> > > > >
> > > > > As soon as the leader appears in a situation where it has not enough
> > > > > replicas to achieve quorum, the cluster should stop accepting any
> > > > > requests - both write and read.
> > > > How does *the cluster* know the state of the leader and if it
> > > > doesn't, how can it possibly implement this? Did you mean
> > > > the leader should stop accepting transactions here? But how can
> > > > the leader know if it has not enough replicas during a read
> > > > transaction, if it doesn't contact any replica to serve a read?
> > >
> > > I expect to have a disconnection trigger assigned to all relays so that
> > > a disconnection will cause the number of replicas to decrease. The quorum
> > > size is static, so we can stop at the very moment the number dives below.
> >
> > What happens between the event the leader is partitioned away and
> > a new leader is elected?
> >
> > The leader may be unaware of the events and serve a read just
> > fine.
>
> As it is stated 20 lines above:
> > > > > As soon as the leader appears in a situation where it has not enough
> > > > > replicas to achieve quorum, the cluster should stop accepting any
> > > > > requests - both write and read.
>
> So it will not serve.

Sergey, this is recursion. I'm asking you to clarify exactly this
point.

Do you assume that replicas perform some kind of failure detection?
What kind? Is it *in addition* to the failure detection performed by
the external coordinator?

Any failure detector imaginable would be asynchronous. What happens
between the failure and the time it's detected?

> > So at least you can't say the leader shouldn't be serving reads
> > without quorum - because the only way to achieve it is to collect
> > a quorum of responses to reads as well.
>
> The leader lost connection to (N-Q)+1 replicas out of the N in the
> cluster with a quorum of Q == it stops serving anything. So the quorum
> criterion is there: no quorum - no reads.

OK, so you assume that a TCP connection *is* the failure detector?
Failure detection in TCP is optional, asynchronous, and worst of all,
unreliable. Why do you think it can be used?

> > > > > The reason for this is that replication of transactions can achieve
> > > > > quorum on replicas not visible to the leader. On the other hand, the
> > > > > leader can't achieve quorum with the available minority. The leader
> > > > > has to report the state and wait for human intervention. There's an
> > > > > option to ask the leader to roll back to the latest transaction that
> > > > > has a quorum: the leader issues a 'rollback' message referring to the
> > > > > [LEADER_ID, LSN] where LSN is of the first transaction in the leader's
> > > > > undo log. The rollback message replicated to the available cluster
> > > > > will put it in a consistent state. After that the configuration of the
> > > > > cluster can be updated to the available quorum and the leader can be
> > > > > switched back to write mode.
> > > >
> > > > As you should be able to conclude from the restart scenario, it is
> > > > possible a replica has the record in *confirmed* state but the
> > > > leader has it in pending state. The replica will not be able to
> > > > roll back then. Do you suggest the replica should abort if it
> > > > can't roll back? This may lead to an avalanche of rejoins on leader
> > > > restart, bringing performance to a halt.
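To spell out the conflict I mean, here is a toy model of the
replica-side check - again, invented names, not real code:

#include <stdio.h>

struct replica_state {
    long confirmed_lsn;  /* highest LSN covered by a received Confirm */
    long applied_lsn;    /* highest LSN applied from the leader */
};

/* A Rollback asks to undo every pending txn with lsn >= rollback_lsn. */
static int
replica_apply_rollback(struct replica_state *r, long rollback_lsn)
{
    if (rollback_lsn <= r->confirmed_lsn) {
        /*
         * The record is already confirmed here: its undo data is gone
         * and clients may have read it.  There is nothing left to roll
         * back - the replica can only ignore the Rollback or stop and
         * rejoin, which is the "avalanche" concern above.
         */
        return -1;
    }
    r->applied_lsn = rollback_lsn - 1; /* drop the pending tail */
    return 0;
}

int main(void)
{
    struct replica_state r = { .confirmed_lsn = 42, .applied_lsn = 45 };
    /* A restarted leader still has LSN 40..45 pending and asks to roll
     * back from 40, but 40..42 are already confirmed on this replica. */
    printf("rollback from 40: %s\n",
           replica_apply_rollback(&r, 40) == 0 ? "ok" : "conflict");
    return 0;
}

The question above is what the replica is supposed to do when this
check fails.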
> > >
> > > No, I declare the replica with the biggest LSN as a new shining leader.
> > > More than that, the new leader can (so far it will be by default)
> > > finalize the former leader's life's work by replicating txns and
> > > appropriate confirms.
> >
> > Right, this also assumes the restart is noticed, so it follows the
> > same logic.
>
> How can a restart be unnoticed, if it causes a disconnection?

Honestly, I'm baffled. It's like we speak different languages. I can't
imagine you are unaware of the fallacies of distributed computing, but
I see no other explanation for your question.

-- 
Konstantin Osipov, Moscow, Russia