[Tarantool-patches] [RFC] Quorum-based synchronous replication

Tue May 12 18:55:08 MSK 2020

On 06 May 21:44, Konstantin Osipov wrote:
> * Sergey Ostanevich <sergos at tarantool.org> [20/05/06 19:41]:
> > > >    |               |              |             |              |
> > > >    |            [Quorum           |             |              |
> > > >    |           achieved]          |             |              |
> > > >    |               |              |             |              |
> > > >    |         [TXN undo log        |             |              |
> > > >    |           destroyed]         |             |              |
> > > >    |               |              |             |              |
> > > >    |               |---Confirm--->|             |              |
> > > >    |               |              |             |              |
> > > 
> > > What happens if writing Confirm to WAL fails? TXN undo log record
> > > is destroyed already. Will the server panic now on WAL failure,
> > > even if it is intermittent?
> > 
> > I would like to have an example of intermittent WAL failure. Can it be
> > anything other than a problem with the disk, be it space, availability
> > or malfunction?
> 
> For SAN disks it can simply be a networking issue. The same is
> true for any virtual filesystem in the cloud. For local disks it
> is most often out of space, but this is not an impossible event.

SanDisk is an SSD vendor. I bet you mean NAS (network-attached
storage), don't you? Then I see no difference for a WAL write to NAS in
the current scheme: you will catch a timeout, the WAL will report a
failure, and the replica stops.

> 
> > For all of those it should be resolved outside the DBMS anyways. So,
> > leader should stop and report its problems to orchestrator/admins.
> 
> Sergey, I understand that RAFT spec is big and with this spec you
> try to split it into manageable parts. The question is how useful
> is this particular piece. I'm trying to point out that "the leader
> should stop" is not a silver bullet - especially since each such
> stop may mean a rejoin of some other node. The purpose of sync
> replication is to provide consistency without reducing
> availability (i.e. make progress as long as the quorum
> of nodes make progress). 

I'm not sure if we're talking about the same RAFT - mine is "In Search
of an Understandable Consensus Algorithm (Extended Version)" from
Stanford as of May 2014. And it is 15 pages - including references,
conclusions and intro. Seems not that big.

That said, most of it is dedicated to the leader election itself, which
we intentionally left out of this RFC. That is written at the very
beginning, and I emphasized it by mentioning it explicitly.

> 
> The current spec, suggesting there should be a leader stop in case
> of most errors, reduces availability significantly, and doesn't
> make external coordinator job any easier - it still has to follow to
> the letter the prescriptions of RAFT. 

So, postponing a commit until the quorum is collected is the most
useful part of this RFC. To some extent I'm also trying to address
WAL inconsistency, although it can be covered only partly: if a
leader's log diverges in unconfirmed transactions only, then they can be
rolled back easily. Technically, that should be enough if the leader is
replaced by a replica from the cluster majority at the moment of failure.
Otherwise it will require pre-parsing of the WAL, and it may well happen
that the WAL is not long enough, so the ex-leader still needs a complete
bootstrap.
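
To make the rollback case concrete, here is a minimal sketch in Python.
It is an illustration only, not Tarantool code: the function name, its
arguments and the idea of comparing plain LSN numbers are assumptions,
and the WAL pre-parsing path is collapsed into "bootstrap" for brevity.

```python
def plan_ex_leader_recovery(ex_leader_lsns, confirmed_lsn, new_leader_lsn):
    """Decide how a former leader could rejoin (hypothetical helper).

    ex_leader_lsns -- LSNs of the ex-leader's own transactions in its WAL
    confirmed_lsn  -- last LSN the ex-leader wrote a confirm for
    new_leader_lsn -- last LSN of the ex-leader's ID that the new leader has
    """
    if new_leader_lsn < confirmed_lsn:
        # The new leader lacks transactions the ex-leader already confirmed,
        # so the logs diverge in more than unconfirmed transactions.
        # Simplification: treat this as "needs a complete bootstrap".
        return ("bootstrap", [])
    # Everything past new_leader_lsn is unconfirmed on the ex-leader
    # and unknown to the new leader, so it can simply be rolled back.
    to_rollback = [lsn for lsn in ex_leader_lsns if lsn > new_leader_lsn]
    return ("rollback", to_rollback)
```

For example, with transactions 1..5 in the ex-leader's WAL, a confirm
written for LSN 3 and a new leader that has LSN 4, only LSN 5 has to be
rolled back.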

> 
> > landed to WAL - same is for replica.
> 
> > 
> > > 
> > > >    |               |----------Confirm---------->|              |
> > > 
> > > What happens if peers receive and maybe even write Confirm to their WALs
> > > but local WAL write is lost after a restart?
> > 
> > By local, did you mean the WAL write on the leader? Then we have a
> > replica with a bigger LSN for the leader ID.
> 
> > > WAL is not synced, 
> > > so we can easily lose the tail of the WAL. Tarantool will sync up
> > > with all replicas on restart,
> > 
> > But at this point a new leader will be appointed - the old one is
> > restarted. Then the Confirm message will arrive to the restarted leader 
> > through a regular replication.
> 
> This assumes that restart is guaranteed to be noticed by the
> external coordinator and there is an election on every restart.

Sure, yes: if it restarted, then the lost connection can't go unnoticed
by anyone, be it the coordinator or the cluster.

> 
> > > but there will be no "Replication
> > > OK" messages from them, so it wouldn't know that the transaction
> > > is committed on them. How is this handled? We may end up with some
> > > replicas confirming the transaction while the leader will roll it
> > > back on restart. Do you suggest there is a human intervention on
> > > restart as well?
> > > 
> > > 
> > > >    |               |              |             |              |
> > > >    |<---TXN Ok-----|              |       [TXN undo log        |
> > > >    |               |              |         destroyed]         |
> > > >    |               |              |             |              |
> > > >    |               |              |             |---Confirm--->|
> > > >    |               |              |             |              |
> > > > ```
> > > > 
> > > > The quorum should be collected as a table for a list of transactions
> > > > waiting for quorum. The latest transaction that collects the quorum is
> > > > considered as complete, as well as all transactions prior to it, since
> > > > all transactions should be applied in order. Leader writes a 'confirm'
> > > > message to the WAL that refers to the transaction's [LEADER_ID, LSN] and
> > > > the confirm has its own LSN. This confirm message is delivered to all
> > > > replicas through the existing replication mechanism.
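
One way to picture the quorum table described above, as a minimal Python
sketch. The class and method names are hypothetical, and whether the
leader's own WAL write counts toward the quorum is left to the caller;
this is not the actual implementation.

```python
class QuorumTable:
    """Tracks acks for transactions waiting for quorum (illustrative only)."""

    def __init__(self, quorum):
        self.quorum = quorum
        self.acks = {}          # lsn -> set of node ids that acked it
        self.confirmed_lsn = 0  # LSN referred to by the latest confirm

    def add_txn(self, lsn):
        # A new transaction starts waiting for its quorum.
        self.acks[lsn] = set()

    def ack(self, lsn, node_id):
        """Record an ack; return the LSN to confirm, or None if no quorum yet."""
        if lsn not in self.acks:
            return None  # already confirmed or unknown
        self.acks[lsn].add(node_id)
        if len(self.acks[lsn]) < self.quorum:
            return None
        # Quorum reached: this transaction and all prior ones are complete,
        # since transactions are applied in order.
        for pending in [l for l in self.acks if l <= lsn]:
            del self.acks[pending]
        self.confirmed_lsn = lsn
        return lsn
```

A caller would write the 'confirm' record to the WAL whenever `ack()`
returns an LSN, and that record would then reach replicas through
regular replication.
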
> > > > 
> > > > Replica should report a TXN application success to the leader via the
> > > > IPROTO explicitly to allow leader to collect the quorum for the TXN.
> > > > In case of application failure the replica has to disconnect from the
> > > > replication the same way as it is done now. The replica also has to
> > > > report its disconnection to the orchestrator. Further actions require
> > > > human intervention, since failure means either technical problem (such
> > > > as not enough space for WAL) that has to be resolved or an inconsistent
> > > > state that requires rejoin.
> > > 
> > > > As soon as leader appears in a situation it has not enough replicas
> > > > to achieve quorum, the cluster should stop accepting any requests - both
> > > > write and read.
> > > 
> > > How does *the cluster* know the state of the leader and if it
> > > doesn't, how it can possibly implement this? Did you mean
> > > the leader should stop accepting transactions here? But how can
> > > the leader know if it has not enough replicas during a read
> > > transaction, if it doesn't contact any replica to serve a read?
> > 
> > I expect to have a disconnection trigger assigned to all relays, so that
> > a disconnection will decrease the number of replicas. The quorum size is
> > static, so we can stop at the very moment the number drops below it.
> 
> What happens between the event the leader is partitioned away and
> a new leader is elected?
> 
> The leader may be unaware of the events and serve a read just
> fine.

As it is stated 20 lines above: 
> > > > As soon as leader appears in a situation it has not enough replicas
> > > > to achieve quorum, the cluster should stop accepting any requests - both
> > > > write and read.

So it will not serve. 

> 
> So at least you can't say the leader shouldn't be serving reads
> without quorum - because the only way to achieve it is to collect
> a quorum of responses to reads as well.

Once the leader has lost connection to (N-Q)+1 replicas out of the N in
a cluster with a quorum of Q, it stops serving anything. So the quorum
criterion is there: no quorum, no reads.
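
As a minimal sketch of that check in Python (the function name and the
way nodes are counted are assumptions for illustration; the real
counting would live in the relay disconnection triggers):

```python
def leader_may_serve(cluster_size, quorum, lost_replicas):
    """Return True while the leader can still reach a quorum.

    cluster_size  -- N, number of nodes in the cluster
    quorum        -- Q, static quorum size
    lost_replicas -- how many replicas the leader has lost connection to
    """
    reachable = cluster_size - lost_replicas
    return reachable >= quorum


# N = 5, Q = 3: losing 2 replicas still leaves a quorum,
# losing (N - Q) + 1 = 3 does not, so reads and writes both stop.
still_serving = leader_may_serve(5, 3, 2)
after_third_loss = leader_may_serve(5, 3, 3)
```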

> 
> > > > The reason for this is that replication of transactions
> > > > can achieve quorum on replicas not visible to the leader. On the other
> > > > hand, leader can't achieve quorum with available minority. Leader has to
> > > > report the state and wait for human intervention. There's an option to
> > > > ask leader to rollback to the latest transaction that has quorum: leader
> > > > issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN
> > > > is of the first transaction in the leader's undo log. The rollback
> > > > message replicated to the available cluster will put it in a consistent
> > > > state. After that configuration of the cluster can be updated to
> > > > available quorum and leader can be switched back to write mode.
> > > 
> > > As you should be able to conclude from restart scenario, it is
> > > possible a replica has the record in *confirmed* state but the
> > > leader has it in pending state. The replica will not be able to
> > > roll back then. Do you suggest the replica should abort if it
> > > can't rollback? This may lead to an avalanche of rejoins on leader
> > > restart, bringing performance to a halt.
> > 
> > No, I declare replica with biggest LSN as a new shining leader. More
> > than that, new leader can (so far it will be by default) finalize the
> > former leader life's work by replicating txns and appropriate confirms.
> 
> Right, this also assumes the restart is noticed, so it follows the
> same logic.

How can a restart go unnoticed if it causes a disconnection?

> 
> -- 
> Konstantin Osipov, Moscow, Russia
> https://scylladb.com