[Tarantool-patches] [RFC] Quorum-based synchronous replication

Thu May 14 00:36:34 MSK 2020

Thanks for the discussion!

On 06/05/2020 18:39, Sergey Ostanevich wrote:
> Hi!
> 
> Thanks for review!
> 
>>>    |               |              |             |              |
>>>    |            [Quorum           |             |              |
>>>    |           achieved]          |             |              |
>>>    |               |              |             |              |
>>>    |         [TXN undo log        |             |              |
>>>    |           destroyed]         |             |              |
>>>    |               |              |             |              |
>>>    |               |---Confirm--->|             |              |
>>>    |               |              |             |              |
>>
>> What happens if writing Confirm to WAL fails? TXN und log record
>> is destroyed already. Will the server panic now on WAL failure,
>> even if it is intermittent?
> 
> I would like to have an example of intermittent WAL failure. Can it be
> other than problem with disc - be it space/availability/malfunction?
> 
> For all of those it should be resolved outside the DBMS anyways. So,
> leader should stop and report its problems to orchestrator/admins.
> 
> I would agree that undo log can be destroyed *after* the Confirm is
> landed to WAL - same is for replica.

Well, in fact you can't (or can you?). Because it won't help. Once you
tried to write 'Confirm', it means you got the quorum. So now in case
you will fail, a new leader will write 'Confirm' for you, when will see
a quorum too. So the current leader has no right to write 'Rollback'
from this moment, from what I understand. Because it still can be
confirmed by a new leader later, if you fail before 'Rollback' is
replicated to all.

However the same problem appears, if you write 'Confirm' *successfully*.
Still the leader can fail, and a newer leader will write 'Rollback' if
won't collect the quorum again. Don't know what to do with that really.
Probably nothing.

>>
>>>    |               |----------Confirm---------->|              |
>>
>> What happens if peers receive and maybe even write Confirm to their WALs
>> but local WAL write is lost after a restart?
> 
> Did you mean WAL write on leader as a local? Then we have a replica with
> a bigger LSN for the leader ID. 
> 
>> WAL is not synced, 
>> so we can easily lose the tail of the WAL. Tarantool will sync up
>> with all replicas on restart,
> 
> But at this point a new leader will be appointed - the old one is
> restarted. Then the Confirm message will arrive to the restarted leader 
> through a regular replication.
> 
>> but there will be no "Replication
>> OK" messages from them, so it wouldn't know that the transaction
>> is committed on them. How is this handled? We may end up with some
>> replicas confirming the transaction while the leader will roll it
>> back on restart. Do you suggest there is a human intervention on
>> restart as well?
>>
>>
>>>    |               |              |             |              |
>>>    |<---TXN Ok-----|              |       [TXN undo log        |
>>>    |               |              |         destroyed]         |
>>>    |               |              |             |              |
>>>    |               |              |             |---Confirm--->|
>>>    |               |              |             |              |
>>> ```
>>>
>>> The quorum should be collected as a table for a list of transactions
>>> waiting for quorum. The latest transaction that collects the quorum is
>>> considered as complete, as well as all transactions prior to it, since
>>> all transactions should be applied in order. Leader writes a 'confirm'
>>> message to the WAL that refers to the transaction's [LEADER_ID, LSN] and
>>> the confirm has its own LSN. This confirm message is delivered to all
>>> replicas through the existing replication mechanism.
>>>
>>> Replica should report a TXN application success to the leader via the
>>> IPROTO explicitly to allow leader to collect the quorum for the TXN.
>>> In case of application failure the replica has to disconnect from the
>>> replication the same way as it is done now. The replica also has to
>>> report its disconnection to the orchestrator. Further actions require
>>> human intervention, since failure means either technical problem (such
>>> as not enough space for WAL) that has to be resovled or an inconsistent
>>> state that requires rejoin.
>>
>>> As soon as leader appears in a situation it has not enough replicas
>>> to achieve quorum, the cluster should stop accepting any requests - both
>>> write and read.
>>
>> How does *the cluster* know the state of the leader and if it
>> doesn't, how it can possibly implement this? Did you mean
>> the leader should stop accepting transactions here? But how can
>> the leader know if it has not enough replicas during a read
>> transaction, if it doesn't contact any replica to serve a read?
> 
> I expect to have a disconnection trigger assigned to all relays so that
> disconnection will cause the number of replicas decrease. The quorum
> size is static, so we can stop at the very moment the number dives below.

This is a very dubious statement. In TCP disconnect may be detected much
later, than it happened. So to collect a quorum on something you need to
literally collect this quorum, with special WAL records, via network, and
all. A disconnect trigger does not help at all here.

Talking of the whole 'read-quorum' idea, I don't like it. Because this
really makes things unbearably harder to implement, the nodes become
much slower and less available in terms of any problems.

I think reads should be allowed always, and from any node (except during
bootstrap, of course). After all, you have transactions for consistency. So
as far as replication respects transaction boundaries, every node is in a
consistent state. Maybe not all of them are in the same state, but every one
is consistent.

Honestly, I can't even imagine, how is it possible to implement a completely
synchronous simultaneous cluster progression. It is impossible even in theory.
There always will be a time period, when some nodes are further than the
others. At least because of network delays.

So either we allows reads from master only, or we allow reads from everywhere,
and in that case nothing will save from a possibility of seeing different
data on different nodes.