[Tarantool-patches] [RFC] Quorum-based synchronous replication

Thu May 14 00:39:01 MSK 2020

Hi! Thanks for the discussion!

On 12/05/2020 17:55, Sergey Ostanevich wrote:
> On 06 May 21:44, Konstantin Osipov wrote:
>> * Sergey Ostanevich <sergos at tarantool.org> [20/05/06 19:41]:
>>>>>    |               |              |             |              |
>>>>>    |            [Quorum           |             |              |
>>>>>    |           achieved]          |             |              |
>>>>>    |               |              |             |              |
>>>>>    |         [TXN undo log        |             |              |
>>>>>    |           destroyed]         |             |              |
>>>>>    |               |              |             |              |
>>>>>    |               |---Confirm--->|             |              |
>>>>>    |               |              |             |              |
>>>>
>>>> What happens if writing Confirm to WAL fails? TXN undo log record
>>>> is destroyed already. Will the server panic now on WAL failure,
>>>> even if it is intermittent?
>>>
>>> I would like to have an example of intermittent WAL failure. Can it be
>>> other than problem with disc - be it space/availability/malfunction?
>>
>> For SAN disks it can simply be a networking issue. The same is
>> true for any virtual filesystem in the cloud. For local disks it
>> is most often out of space, but this is not an impossible event.
> 
> The SANDisk is an SSD vendor. I bet you mean NAS - network array
> storage, isn't it? Then I see no difference in WAL write into NAS in
> current schema - you will catch a timeout, WAL will report failure,
> replica stops.
> 
>>
>>> For all of those it should be resolved outside the DBMS anyways. So,
>>> leader should stop and report its problems to orchestrator/admins.
>>
>> Sergey, I understand that RAFT spec is big and with this spec you
>> try to split it into manageable parts. The question is how useful
>> is this particular piece. I'm trying to point out that "the leader
>> should stop" is not a silver bullet - especially since each such
>> stop may mean a rejoin of some other node. The purpose of sync
>> replication is to provide consistency without reducing
>> availability (i.e. make progress as long as the quorum
>> of nodes make progress). 
> 
> I'm not sure if we're talking about the same RAFT - mine is "In Search
> of an Understandable Consensus Algorithm (Extended Version)" from
> Stanford as of May 2014. And it is 15 pages - including references,
> conclusions and intro. Seems not that big.

15 pages of tightly packed theory is a lot of material. And it is
especially big when it comes to applying it to a real project with
existing infrastructure. Just IMHO. I remember implementing SWIM - it
is smaller than RAFT. Much smaller and simpler, and yet it took a year to
implement it and cover everything described in the paper.

This is not as simple as it looks when it comes to edge cases. This is
why the whole sync replication frustrates me more than anything else
before, and why I am so reluctant to do anything with it.

The RFC mostly covers the normal operation, here I agree with Kostja. But
the normal operation is not that interesting. Failures are much more
important.

> Although, most of it is dedicated to the leader election itself, which
> we intentionally put aside from this RFC. It is written in the very
> beginning and I emphasized this by explicit mentioning of it.

And still there will be a leader election, even though it is not ours
for now. Tarantool should provide an API and instructions so that
external applications can follow them and do the election.

Usually in RFCs we describe the API, with arguments, behaviour, and all.

>> The current spec, suggesting there should be a leader stop in case
>> of most errors, reduces availability significantly, and doesn't
>> make external coordinator job any easier - it still has to follow to
>> the letter the prescriptions of RAFT. 
>>
>>>
>>>>
>>>>>    |               |----------Confirm---------->|              |
>>>>
>>>> What happens if peers receive and maybe even write Confirm to their WALs
>>>> but local WAL write is lost after a restart?
>>>
>>> Did you mean WAL write on leader as a local? Then we have a replica with
>>> a bigger LSN for the leader ID. 
>>
>>>> WAL is not synced, 
>>>> so we can easily lose the tail of the WAL. Tarantool will sync up
>>>> with all replicas on restart,
>>>
>>> But at this point a new leader will be appointed - the old one is
>>> restarted. Then the Confirm message will arrive to the restarted leader 
>>> through a regular replication.
>>
>> This assumes that restart is guaranteed to be noticed by the
>> external coordinator and there is an election on every restart.
> 
> Sure yes, if it restarted - then connection lost can't be unnoticed by
> anyone, be it coordinator or cluster.

Here comes another problem. Disconnect and restart have nothing to do with
each other. The coordinator can lose connection without the peer leader
restarting, just because it is a network. Anything can happen. Moreover,
while the coordinator does not have a connection, the leader can restart
multiple times.

We can't tell the coordinator to rely on connectivity as a restart signal.

>>>> but there will be no "Replication
>>>> OK" messages from them, so it wouldn't know that the transaction
>>>> is committed on them. How is this handled? We may end up with some
>>>> replicas confirming the transaction while the leader will roll it
>>>> back on restart. Do you suggest there is a human intervention on
>>>> restart as well?
>>>>
>>>>
>>>>>    |               |              |             |              |
>>>>>    |<---TXN Ok-----|              |       [TXN undo log        |
>>>>>    |               |              |         destroyed]         |
>>>>>    |               |              |             |              |
>>>>>    |               |              |             |---Confirm--->|
>>>>>    |               |              |             |              |
>>>>> ```
>>>>>
>>>>> The quorum should be collected as a table for a list of transactions
>>>>> waiting for quorum. The latest transaction that collects the quorum is
>>>>> considered as complete, as well as all transactions prior to it, since
>>>>> all transactions should be applied in order. Leader writes a 'confirm'
>>>>> message to the WAL that refers to the transaction's [LEADER_ID, LSN] and
>>>>> the confirm has its own LSN. This confirm message is delivered to all
>>>>> replicas through the existing replication mechanism.
>>>>>
>>>>> Replica should report a TXN application success to the leader via the
>>>>> IPROTO explicitly to allow leader to collect the quorum for the TXN.
>>>>> In case of application failure the replica has to disconnect from the
>>>>> replication the same way as it is done now. The replica also has to
>>>>> report its disconnection to the orchestrator. Further actions require
>>>>> human intervention, since failure means either technical problem (such
>>>>> as not enough space for WAL) that has to be resolved or an inconsistent
>>>>> state that requires rejoin.
>>>>
>>>>> As soon as leader appears in a situation it has not enough replicas
>>>>> to achieve quorum, the cluster should stop accepting any requests - both
>>>>> write and read.
>>>>
>>>> How does *the cluster* know the state of the leader and if it
>>>> doesn't, how it can possibly implement this? Did you mean
>>>> the leader should stop accepting transactions here? But how can
>>>> the leader know if it has not enough replicas during a read
>>>> transaction, if it doesn't contact any replica to serve a read?
>>>
>>> I expect to have a disconnection trigger assigned to all relays so that
>>> disconnection will cause the number of replicas decrease. The quorum
>>> size is static, so we can stop at the very moment the number dives below.
>>
>> What happens between the event the leader is partitioned away and
>> a new leader is elected?
>>
>> The leader may be unaware of the events and serve a read just
>> fine.
> 
> As it is stated 20 lines above: 
>>>>> As soon as leader appears in a situation it has not enough
>>>>> replicas
>>>>> to achieve quorum, the cluster should stop accepting any
>>>>> requests - both
>>>>> write and read.
> 
> So it will not serve. 

This breaks compatibility, since now an orphan node is perfectly able
to serve reads. The cluster can't just stop doing everything, if the
quorum is lost. Stop writes - yes, since the quorum is lost anyway. But
reads do not need a quorum.

If you say reads need a quorum, then they would need to go through WAL,
collect confirmations, and all.
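
To make the difference concrete, here is a rough sketch of the behaviour
I mean. It is Python only for brevity, and every name in it (Node,
QuorumLostError, acks_received) is made up for illustration. It is not
Tarantool API and not a proposal for one.

```python
class QuorumLostError(Exception):
    """Raised when a synchronous write did not collect a quorum."""


class Node:
    """Toy model of a leader. Hypothetical, not a Tarantool API."""

    def __init__(self, quorum):
        # Number of acks needed, the leader's own WAL write included.
        self.quorum = quorum
        # Locally visible committed state.
        self.data = {}

    def read(self, key):
        # Served from local state. Nothing goes through the WAL and no
        # confirmations are collected, so the quorum is not consulted.
        return self.data.get(key)

    def write(self, key, value, acks_received):
        # acks_received is the number of confirmations collected for
        # this particular transaction. Without a quorum it is refused.
        if acks_received < self.quorum:
            raise QuorumLostError(
                "got %d acks, need %d" % (acks_received, self.quorum))
        self.data[key] = value
```

In this model a node that lost the quorum keeps answering read() exactly
like an orphan does today, and only write() starts failing.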

>> So at least you can't say the leader shouldn't be serving reads
>> without quorum - because the only way to achieve it is to collect
>> a quorum of responses to reads as well.
> 
> The leader lost connection to the (N-Q)+1 replicas out of the N in
> cluster with a quorum of Q == it stops serving anything. So the quorum
> criteria is there: no quorum - no reads. 

Connection count tells nothing. Network connectivity is not a reliable
source of information. Only messages and persistent data are reliable
(to a certain extent).
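
As an illustration of deciding by messages instead of connections, below
is a minimal sketch that derives the confirmed LSN only from the LSNs the
nodes have explicitly acknowledged. It is Python for brevity. The
function name and the shape of the acked_lsn table are my assumptions,
the RFC does not define them.

```python
def confirmed_lsn(acked_lsn, quorum):
    """Return the highest LSN acknowledged by at least `quorum` nodes.

    acked_lsn maps a node id to the last LSN that node reported as
    written to its WAL (the leader's own entry included). Whether a
    node is currently connected is deliberately not an input.
    Returns 0 if fewer than `quorum` nodes have reported anything.
    """
    lsns = sorted(acked_lsn.values(), reverse=True)
    if len(lsns) < quorum:
        return 0
    # The quorum-th largest ack is reached by at least `quorum` nodes,
    # and so is every transaction before it, since they apply in order.
    return lsns[quorum - 1]
```

For example, with acks {"leader": 10, "r1": 10, "r2": 7} a quorum of 2
confirms LSN 10, and a quorum of 3 confirms only LSN 7.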

>>>>> The reason for this is that replication of transactions
>>>>> can achieve quorum on replicas not visible to the leader. On the other
>>>>> hand, leader can't achieve quorum with available minority. Leader has to
>>>>> report the state and wait for human intervention. There's an option to
>>>>> ask leader to rollback to the latest transaction that has quorum: leader
>>>>> issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN
>>>>> is of the first transaction in the leader's undo log. The rollback
>>>>> message replicated to the available cluster will put it in a consistent
>>>>> state. After that configuration of the cluster can be updated to
>>>>> available quorum and leader can be switched back to write mode.
>>>>
>>>> As you should be able to conclude from restart scenario, it is
>>>> possible a replica has the record in *confirmed* state but the
>>>> leader has it in pending state. The replica will not be able to
>>>> roll back then. Do you suggest the replica should abort if it
>>>> can't rollback? This may lead to an avalanche of rejoins on leader
>>>> restart, bringing performance to a halt.
>>>
>>> No, I declare replica with biggest LSN as a new shining leader. More
>>> than that, new leader can (so far it will be by default) finalize the
>>> former leader life's work by replicating txns and appropriate confirms.
>>
>> Right, this also assumes the restart is noticed, so it follows the
>> same logic.
> 
> How a restart can be unnoticed, if it causes disconnection?

Disconnection has nothing to do with restart. The coordinator itself may
restart. Or it may lose connection to the leader temporarily. Or the
leader may lose it without any restarts.