[Tarantool-patches] [RFC] Quorum-based synchronous replication
Sergey Ostanevich
sergos at tarantool.org
Thu May 14 23:38:12 MSK 2020
Hi!
> >> Sergey, I understand that RAFT spec is big and with this spec you
> >> try to split it into manageable parts. The question is how useful
> >> is this particular piece. I'm trying to point out that "the leader
> >> should stop" is not a silver bullet - especially since each such
> >> stop may mean a rejoin of some other node. The purpose of sync
> >> replication is to provide consistency without reducing
> >> availability (i.e. make progress as long as the quorum
> >> of nodes make progress).
> >
> > I'm not sure if we're talking about the same RAFT - mine is "In Search
> > of an Understandable Consensus Algorithm (Extended Version)" from
> > Stanford as of May 2014. And it is 15 pages - including references,
> > conclusions and intro. Seems not that big.
>
> 15 pages of tightly packed theory is a big piece of data. And especially
> big, when it comes to application to a real project, with existing
> infrastructure, and all. Just my IMHO. I remember implementing SWIM - it
> is smaller than RAFT. Much smaller and simpler, and yet it took year to
> implement it, and cover all things described in the paper.
That I won't object and this was the reason not to take the RAFT as is
and implement it in full for next 2-3 years. That's why we had the very
first part of RFC describing what it tries to address and what's not.
>
> This is not as simple as it looks, when it comes to edge cases. This is
> why the whole sync replication frustrates me more than anything else
> before, and why I am so reluctant to doing anything with it.
>
> The RFC mostly covers the normal operation, here I agree with Kostja. But
> the normal operation is not that interesting. Failures are much more
> important.
Definitely and I expect to follow with more functionality on top of it.
I believe it will be easier to do if the start will be as small as
possible change to the existent code base, which I also try to follow.
>
> > Although, most of it is dedicated to the leader election itself, which
> > we intentionally put aside from this RFC. It is written in the very
> > beginning and I empasized this by explicit mentioning of it.
>
> And still there will be leader election. Even though not ours for now.
> And Tarantool should provide API and instructions so as external
> applications could follow them and do the election.
>
> Usually in RFCs we describe API. With arguments, behaviour, and all.
That is something I believe should be done after we agree on the whole
idea, such as confirm entry in WAL for sync transactions that appeared
there earlier. Otherwise we can get very deep into the details, spending
time for API definition while the idea itself can appear wrong.
I believe that was a common ground to start, but we immediately went to
discussion of so many details I tried to keep away before we agree on the
key parts, such as WAL consistency or quorum collection.
>
> >> The current spec, suggesting there should be a leader stop in case
> >> of most errors, reduces availability significantly, and doesn't
> >> make external coordinator job any easier - it still has to follow to
> >> the letter the prescriptions of RAFT.
> >>
> >>>
> >>>>
> >>>>> | |----------Confirm---------->| |
> >>>>
> >>>> What happens if peers receive and maybe even write Confirm to their WALs
> >>>> but local WAL write is lost after a restart?
> >>>
> >>> Did you mean WAL write on leader as a local? Then we have a replica with
> >>> a bigger LSN for the leader ID.
> >>
> >>>> WAL is not synced,
> >>>> so we can easily lose the tail of the WAL. Tarantool will sync up
> >>>> with all replicas on restart,
> >>>
> >>> But at this point a new leader will be appointed - the old one is
> >>> restarted. Then the Confirm message will arrive to the restarted leader
> >>> through a regular replication.
> >>
> >> This assumes that restart is guaranteed to be noticed by the
> >> external coordinator and there is an election on every restart.
> >
> > Sure yes, if it restarted - then connection lost can't be unnoticed by
> > anyone, be it coordinator or cluster.
>
> Here comes another problem. Disconnect and restart have nothing to do with
> each other. The coordinator can loose connection without the peer leader
> restart. Just because it is network. Anything can happen. Moreover, while
> the coordinator does not have a connection, the leader can restart multiple
> times.
Definitely there should be a higher level functionality to support some
sort of membership protocol, such as SWIM or RAFT itself. But
introduction of it should not affect the basic priciples we have to
agree upon.
>
> We can't tell the coordinator rely on connectivity as a restart signal.
>
> >>>> but there will be no "Replication
> >>>> OK" messages from them, so it wouldn't know that the transaction
> >>>> is committed on them. How is this handled? We may end up with some
> >>>> replicas confirming the transaction while the leader will roll it
> >>>> back on restart. Do you suggest there is a human intervention on
> >>>> restart as well?
> >>>>
> >>>>
> >>>>> | | | | |
> >>>>> |<---TXN Ok-----| | [TXN undo log |
> >>>>> | | | destroyed] |
> >>>>> | | | | |
> >>>>> | | | |---Confirm--->|
> >>>>> | | | | |
> >>>>> ```
> >>>>>
> >>>>> The quorum should be collected as a table for a list of transactions
> >>>>> waiting for quorum. The latest transaction that collects the quorum is
> >>>>> considered as complete, as well as all transactions prior to it, since
> >>>>> all transactions should be applied in order. Leader writes a 'confirm'
> >>>>> message to the WAL that refers to the transaction's [LEADER_ID, LSN] and
> >>>>> the confirm has its own LSN. This confirm message is delivered to all
> >>>>> replicas through the existing replication mechanism.
> >>>>>
> >>>>> Replica should report a TXN application success to the leader via the
> >>>>> IPROTO explicitly to allow leader to collect the quorum for the TXN.
> >>>>> In case of application failure the replica has to disconnect from the
> >>>>> replication the same way as it is done now. The replica also has to
> >>>>> report its disconnection to the orchestrator. Further actions require
> >>>>> human intervention, since failure means either technical problem (such
> >>>>> as not enough space for WAL) that has to be resovled or an inconsistent
> >>>>> state that requires rejoin.
> >>>>
> >>>>> As soon as leader appears in a situation it has not enough replicas
> >>>>> to achieve quorum, the cluster should stop accepting any requests - both
> >>>>> write and read.
> >>>>
> >>>> How does *the cluster* know the state of the leader and if it
> >>>> doesn't, how it can possibly implement this? Did you mean
> >>>> the leader should stop accepting transactions here? But how can
> >>>> the leader know if it has not enough replicas during a read
> >>>> transaction, if it doesn't contact any replica to serve a read?
> >>>
> >>> I expect to have a disconnection trigger assigned to all relays so that
> >>> disconnection will cause the number of replicas decrease. The quorum
> >>> size is static, so we can stop at the very moment the number dives below.
> >>
> >> What happens between the event the leader is partitioned away and
> >> a new leader is elected?
> >>
> >> The leader may be unaware of the events and serve a read just
> >> fine.
> >
> > As it is stated 20 lines above:
> >>>>> As soon as leader appears in a situation it has not enough
> >>>>> replicas
> >>>>> to achieve quorum, the cluster should stop accepting any
> >>>>> requests - both
> >>>>> write and read.
> >
> > So it will not serve.
>
> This breaks compatibility, since now an orphan node is perfectly able
> to serve reads. The cluster can't just stop doing everything, if the
> quorum is lost. Stop writes - yes, since the quorum is lost anyway. But
> reads do not need a quorum.
>
> If you say reads need a quorum, then they would need to go through WAL,
> collect confirmations, and all.
The reads should not be inconsistent - so that cluster will keep
answering A or B for the same request. And in case we lost quorum we
can't say for sure that all instances will answer the same.
As we discussed it before, if leader appears in minor part of the
cluster it can't issue rollback for all unconfirmed txns, since the
majority will re-elect leader who will collect quorum for them. Means,
we will appear is a state that cluster split in two. So the minor part
should stop. Am I wrong here?
>
> >> So at least you can't say the leader shouldn't be serving reads
> >> without quorum - because the only way to achieve it is to collect
> >> a quorum of responses to reads as well.
> >
> > The leader lost connection to the (N-Q)+1 repllicas out of the N in
> > cluster with a quorum of Q == it stops serving anything. So the quorum
> > criteria is there: no quorum - no reads.
>
> Connection count tells nothing. Network connectivity is not a reliable
> source of information. Only messages and persistent data are reliable
> (to certain extent).
Well, persistent data can't help obtain quorum if there's no connection
to the replicas who should contribute to quorum.
Correct me, if I'm wrong: in case no quorum available we can't garantee
that the data is stored on at least <quorum> number of servers. Means -
cluster is not operable.
>
> >>>>> The reason for this is that replication of transactions
> >>>>> can achieve quorum on replicas not visible to the leader. On the other
> >>>>> hand, leader can't achieve quorum with available minority. Leader has to
> >>>>> report the state and wait for human intervention. There's an option to
> >>>>> ask leader to rollback to the latest transaction that has quorum: leader
> >>>>> issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN
> >>>>> is of the first transaction in the leader's undo log. The rollback
> >>>>> message replicated to the available cluster will put it in a consistent
> >>>>> state. After that configuration of the cluster can be updated to
> >>>>> available quorum and leader can be switched back to write mode.
> >>>>
> >>>> As you should be able to conclude from restart scenario, it is
> >>>> possible a replica has the record in *confirmed* state but the
> >>>> leader has it in pending state. The replica will not be able to
> >>>> roll back then. Do you suggest the replica should abort if it
> >>>> can't rollback? This may lead to an avalanche of rejoins on leader
> >>>> restart, bringing performance to a halt.
> >>>
> >>> No, I declare replica with biggest LSN as a new shining leader. More
> >>> than that, new leader can (so far it will be by default) finalize the
> >>> former leader life's work by replicating txns and appropriate confirms.
> >>
> >> Right, this also assumes the restart is noticed, so it follows the
> >> same logic.
> >
> > How a restart can be unnoticed, if it causes disconnection?
>
> Disconnection has nothing to do with restart. The coordinator itself may
> restart. Or it may loose connection to the leader temporarily. Or the
> leader may loose it without any restarts.
But how we detect it right now in Tarantool? Is there any machinery?
I suppose we can simply rely on the same at least to test the minimal -
and 'normally operating' - first approach to the problem.
So, thank you for all comments and please, find my updated RFC below.
Sergos.
---
* **Status**: In progress
* **Start date**: 31-03-2020
* **Authors**: Sergey Ostanevich @sergos \<sergos at tarantool.org\>
* **Issues**: https://github.com/tarantool/tarantool/issues/4842
## Summary
The aim of this RFC is to address the following list of problems
formulated at MRG planning meeting:
- protocol backward compatibility to enable cluster upgrade w/o
downtime
- consistency of data on replica and leader
- switch from leader to replica without data loss
- up to date replicas to run read-only requests
- ability to switch async replicas into sync ones and vice versa
- guarantee of rollback on leader and sync replicas
- simplicity of cluster orchestration
What this RFC is not:
- high availability (HA) solution with automated failover, roles
assignments an so on
- master-master configuration support
## Background and motivation
There are number of known implementation of consistent data presence in
a Tarantool cluster. They can be commonly named as "wait for LSN"
technique. The biggest issue with this technique is the absence of
rollback guarantees at replica in case of transaction failure on one
master or some of the replicas in the cluster.
To provide such capabilities a new functionality should be introduced in
Tarantool core, with requirements mentioned before - backward
compatibility and ease of cluster orchestration.
## Detailed design
### Quorum commit
The main idea behind the proposal is to reuse existent machinery as much
as possible. It will ensure the well-tested and proven functionality
across many instances in MRG and beyond is used. The transaction rollback
mechanism is in place and works for WAL write failure. If we substitute
the WAL success with a new situation which is named 'quorum' later in
this document then no changes to the machinery is needed. The same is
true for snapshot machinery that allows to create a copy of the database
in memory for the whole period of snapshot file write. Adding quorum here
also minimizes changes.
Currently replication represented by the following scheme:
```
Customer Leader WAL(L) Replica WAL(R)
|------TXN----->| | | |
| | | | |
| [TXN undo log | | |
| created] | | |
| | | | |
| |-----TXN----->| | |
| | | | |
| |<---WAL Ok----| | |
| | | | |
| [TXN undo log | | |
| destroyed] | | |
| | | | |
|<----TXN Ok----| | | |
| |-------Replicate TXN------->| |
| | | | |
| | | [TXN undo log |
| | | created] |
| | | | |
| | | |-----TXN----->|
| | | | |
| | | |<---WAL Ok----|
| | | | |
| | | [TXN undo log |
| | | destroyed] |
| | | | |
```
To introduce the 'quorum' we have to receive confirmation from replicas
to make a decision on whether the quorum is actually present. Leader
collects necessary amount of replicas confirmation plus its own WAL
success. This state is named 'quorum' and gives leader the right to
complete the customers' request. So the picture will change to:
```
Customer Leader WAL(L) Replica WAL(R)
|------TXN----->| | | |
| | | | |
| [TXN undo log | | |
| created] | | |
| | | | |
| |-----TXN----->| | |
| | | | |
| |-------Replicate TXN------->| |
| | | | |
| | | [TXN undo log |
| |<---WAL Ok----| created] |
| | | | |
| [Waiting | |-----TXN----->|
| of a quorum] | | |
| | | |<---WAL Ok----|
| | | | |
| |<------Replication Ok-------| |
| | | | |
| [Quorum | | |
| achieved] | | |
| | | | |
| |---Confirm--->| | |
| | | | |
| |----------Confirm---------->| |
| | | | |
|<---TXN Ok-----| | |---Confirm--->|
| | | | |
| [TXN undo log | [TXN undo log |
| destroyed] | destroyed] |
| | | | |
```
The quorum should be collected as a table for a list of transactions
waiting for quorum. The latest transaction that collects the quorum is
considered as complete, as well as all transactions prior to it, since
all transactions should be applied in order. Leader writes a 'confirm'
message to the WAL that refers to the transaction's [LEADER_ID, LSN] and
the confirm has its own LSN. This confirm message is delivered to all
replicas through the existing replication mechanism.
Replica should report a TXN application success to the leader via the
IPROTO explicitly to allow leader to collect the quorum for the TXN.
In case of application failure the replica has to disconnect from the
replication the same way as it is done now. The replica also has to
report its disconnection to the orchestrator. Further actions require
human intervention, since failure means either technical problem (such
as not enough space for WAL) that has to be resolved or an inconsistent
state that requires rejoin.
As soon as leader appears in a situation it has not enough replicas
to achieve quorum, the cluster should stop accepting any requests - both
write and read. The reason for this is that replication of transactions
can achieve quorum on replicas not visible to the leader. On the other
hand, leader can't achieve quorum with available minority. Leader has to
report the state and wait for human intervention. There's an option to
ask leader to rollback to the latest transaction that has quorum: leader
issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN
is of the first transaction in the leader's undo log. The rollback
message replicated to the available cluster will put it in a consistent
state. After that configuration of the cluster can be updated to
available quorum and leader can be switched back to write mode.
### Leader role assignment.
To assign a leader role to an instance the following should be performed:
1. among all available instances pick the one that has the biggest
vclock element of the former leader ID; an arbitrary istance can be
selected in case it is first time the leader is assigned
2. the leader should assure that number of available instances in the
cluster is enough to achieve the quorum and proceed to step 3,
otherwise the leader should report the situation of incomplete quorum,
as in the last paragraph of previous section
3. the selected instance has to take the responsibility to replicate
former leader entries from its WAL, obtainig quorum and commit
confirm messages referring to [FORMER_LEADER_ID, LSN] in its WAL,
replicating to the cluster, after that it can start adding its own
entries into the WAL
### Recovery and failover.
Tarantool instance during reading WAL should postpone the undo log
deletion until the 'confirm' is read. In case the WAL eof is achieved,
the instance should keep undo log for all transactions that are waiting
for a confirm entry until the role of the instance is set.
If this instance will be assigned a leader role then all transactions
that have no corresponding confirm message should be confirmed (see the
leader role assignment).
In case there's not enough replicas to set up a quorum the cluster can
be switched into a read-only mode. Note, this can't be done by default
since some of transactions can have confirmed state. It is up to human
intervention to force rollback of all transactions that have no confirm
and to put the cluster into a consistent state.
In case the instance will be assigned a replica role, it may appear in
a state that it has conflicting WAL entries, in case it recovered from a
leader role and some of transactions didn't replicated to the current
leader. This situation should be resolved through rejoin of the instance.
Consider an example below. Originally instance with ID1 was assigned a
Leader role and the cluster had 2 replicas with quorum set to 2.
```
+---------------------+---------------------+---------------------+
| ID1 | ID2 | ID3 |
| Leader | Replica 1 | Replica 2 |
+---------------------+---------------------+---------------------+
| ID1 Tx1 | ID1 Tx1 | ID1 Tx1 |
+---------------------+---------------------+---------------------+
| ID1 Tx2 | ID1 Tx2 | ID1 Tx2 |
+---------------------+---------------------+---------------------+
| ID1 Tx3 | ID1 Tx3 | ID1 Tx3 |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | |
+---------------------+---------------------+---------------------+
| ID1 Tx4 | ID1 Tx4 | |
+---------------------+---------------------+---------------------+
| ID1 Tx5 | ID1 Tx5 | |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx2] | | |
+---------------------+---------------------+---------------------+
| Tx6 | | |
+---------------------+---------------------+---------------------+
| Tx7 | | |
+---------------------+---------------------+---------------------+
```
Suppose at this moment the ID1 instance crashes. Then the ID2 instance
should be assigned a leader role since its ID1 LSN is the biggest.
Then this new leader will deliver its WAL to all replicas.
As soon as quorum for Tx4 and Tx5 will be obtained, it should write the
corresponding Confirms to its WAL. Note that Tx are still uses ID1.
```
+---------------------+---------------------+---------------------+
| ID1 | ID2 | ID3 |
| (dead) | Leader | Replica 2 |
+---------------------+---------------------+---------------------+
| ID1 Tx1 | ID1 Tx1 | ID1 Tx1 |
+---------------------+---------------------+---------------------+
| ID1 Tx2 | ID1 Tx2 | ID1 Tx2 |
+---------------------+---------------------+---------------------+
| ID1 Tx3 | ID1 Tx3 | ID1 Tx3 |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
+---------------------+---------------------+---------------------+
| ID1 Tx4 | ID1 Tx4 | ID1 Tx4 |
+---------------------+---------------------+---------------------+
| ID1 Tx5 | ID1 Tx5 | ID1 Tx5 |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx2] | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] |
+---------------------+---------------------+---------------------+
| ID1 Tx6 | | |
+---------------------+---------------------+---------------------+
| ID1 Tx7 | | |
+---------------------+---------------------+---------------------+
```
After rejoining ID1 will figure out the inconsistency of its WAL: the
last WAL entry it has is corresponding to Tx7, while in Leader's log the
last entry with ID1 is Tx5.
In case the ID1's WAL contains corresponding entry then Replica 1 can
stop reading WAL as soon as it hits the vclock[ID1] obtained from the
current Leader. It will put the ID1 into a consistent state and it can
obtain latest data via replication. The WAL should be rotated after a
snapshot creation. The old WAL should be renamed so it will not be
reused in the future and kept for postmortem.
```
+---------------------+---------------------+---------------------+
| ID1 | ID2 | ID3 |
| Replica 1 | Leader | Replica 2 |
+---------------------+---------------------+---------------------+
| ID1 Tx1 | ID1 Tx1 | ID1 Tx1 |
+---------------------+---------------------+---------------------+
| ID1 Tx2 | ID1 Tx2 | ID1 Tx2 |
+---------------------+---------------------+---------------------+
| ID1 Tx3 | ID1 Tx3 | ID1 Tx3 |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
+---------------------+---------------------+---------------------+
| ID1 Tx4 | ID1 Tx4 | ID1 Tx4 |
+---------------------+---------------------+---------------------+
| ID1 Tx5 | ID1 Tx5 | ID1 Tx5 |
+---------------------+---------------------+---------------------+
| | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] |
+---------------------+---------------------+---------------------+
| | ID2 Tx1 | ID2 Tx1 |
+---------------------+---------------------+---------------------+
| | ID2 Tx2 | ID2 Tx2 |
+---------------------+---------------------+---------------------+
```
Although, there could be a situation that ID1's WAL begins with an LSN
after the biggest available in the Leader's WAL. Either, for vinyl
part of WAL can be referenced in .run files, hence can't be evicted by
a simple WAL ignore. In such a case the ID1 needs a complete rejoin.
### Snapshot generation.
We also can reuse current machinery of snapshot generation. Upon
receiving a request to create a snapshot an instance should request a
readview for the current commit operation. Although start of the
snapshot generation should be postponed until this commit operation
receives its confirmation. In case operation is rolled back, the snapshot
generation should be aborted and restarted using current transaction
after rollback is complete.
After snapshot is created the WAL should start from the first operation
that follows the commit operation snapshot is generated for. That means
WAL will contain 'confirm' messages that refer to transactions that are
not present in the WAL. Apparently, we have to allow this for the case
'confirm' refers to a transaction with LSN less than the first entry in
the WAL.
In case master appears unavailable a replica still have to be able to
create a snapshot. Replica can perform rollback for all transactions that
are not confirmed and claim its LSN as the latest confirmed txn. Then it
can create a snapshot in a regular way and start with blank xlog file.
All rolled back transactions will appear through the regular replication
in case master reappears later on.
### Asynchronous replication.
Along with synchronous replicas the cluster can contain asynchronous
replicas. That means async replica doesn't reply to the leader with
errors since they're not contributing into quorum. Still, async
replicas have to follow the new WAL operation, such as keep rollback
info until 'quorum' message is received. This is essential for the case
of 'rollback' message appearance in the WAL. This message assumes
replica is able to perform all necessary rollback by itself. Cluster
information should contain explicit notification of each replica
operation mode.
### Synchronous replication enabling.
Synchronous operation can be required for a set of spaces in the data
scheme. That means only transactions that contain data modification for
these spaces should require quorum. Such transactions named synchronous.
As soon as last operation of synchronous transaction appeared in leader's
WAL, it will cause all following transactions - no matter if they are
synchronous or not - wait for the quorum. In case quorum is not achieved
the 'rollback' operation will cause rollback of all transactions after
the synchronous one. It will ensure the consistent state of the data both
on leader and replicas. In case user doesn't require synchronous operation
for any space then no changes to the WAL generation and replication will
appear.
Cluster description should contain explicit attribute for each replica
to denote it participates in synchronous activities. Also the description
should contain criterion on how many replicas responses are needed to
achieve the quorum.
## Rationale and alternatives
There is an implementation of synchronous replication as part of gh-980
activities, still it is not in a state to get into the product. More
than that it intentionally breaks backward compatibility which is a
prerequisite for this proposal.
More information about the Tarantool-patches
mailing list