[Tarantool-patches] [RFC] Quorum-based synchronous replication

Sergey Ostanevich sergos at tarantool.org
Tue Apr 21 13:49:18 MSK 2020


Hi!

Thanks for review!

> >   - high availability (HA) solution with automated failover, roles
> >     assignments an so on
> 
> 1. So no leader election? That essentially makes single failure point
> for RW requests, is it correct?
> 
> On the other hand I see section 'Recovery and failover.' below. And
> it seems to be automated, with selecting a replica with the biggest
> LSN. Where is the truth?
> 

The failover can be manual or implemented independently. That by no means
implies we should not explain how it should be done according to the
replication schema discussed.

And yes, the SPOF is the leader of the cluster. This is expected and is
Ok according to all MRG planning meeting participants.

> >   - master-master configuration support
> > 
> > ## Background and motivation
> > 
> > There are number of known implementation of consistent data presence in
> > a cluster. They can be commonly named as "wait for LSN" technique. The
> > biggest issue with this technique is the absence of rollback guarantees
> > at replica in case of transaction failure on one master or some of the
> > replicas in the cluster.
> > 
> > To provide such capabilities a new functionality should be introduced in
> > Tarantool core, with requirements mentioned before - backward
> > compatibility and ease of cluster orchestration.
> > 
> > ## Detailed design
> > 
> > ### Quorum commit
> > 
> > The main idea behind the proposal is to reuse existent machinery as much
> > as possible. It will ensure the well-tested and proven functionality
> > across many instances in MRG and beyond is used. The transaction rollback
> > mechanism is in place and works for WAL write failure. If we substitute
> > the WAL success with a new situation which is named 'quorum' later in
> > this document then no changes to the machinery is needed.
> 
> 2. The problem here is that you create dependency on WAL. According to
> your words, replication is inside WAL, and if WAL gave ok, then all is

'Replication is inside WAL' - what do you mean by that? The replication in
its current state works from the WAL, although it's an exaggeration to say
it is 'inside WAL'. Why does that create a new dependency?

> replicated and applied. But that makes current code structure even worse
> than it is. Now WAL, GC, and replication code is spaghetti, basically.
> All depends on all. I was rather thinking, that we should fix that first.
> Not aggravate.
> 
> WAL should provide API for writing to disk. Replication should not bother
> about WAL. GC should not bother about replication. All should be independent,
> and linked in one place by some kind of a manager, which would just use their

So you want to introduce a single point that will translate all messages
between all participants? I believe the current state was introduced exactly
to avoid this situation. Each participant can subscribe to a particular
trigger inside another participant and take it into account in its own
activities - at the right time for itself.
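
To illustrate the pattern I mean, here is a schematic Lua sketch only - the
real machinery is C triggers, and all names below are made up:

```lua
-- Schematic sketch of the trigger/subscription pattern, names are made up.
local wal = { on_write = {} }

function wal.subscribe(fn)
    table.insert(wal.on_write, fn)
end

function wal.write(entry)
    -- ... write the entry to disk ...
    for _, fn in ipairs(wal.on_write) do
        fn(entry)  -- each subscriber reacts at the point WAL fires the trigger
    end
end

-- Replication and GC subscribe on their own; WAL knows nothing about them.
wal.subscribe(function(entry) --[[ relay the entry to replicas ]] end)
wal.subscribe(function(entry) --[[ advance the GC consumer position ]] end)
```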

> APIs. I believe Cyrill G. would agree with me here, I remember him
> complaining about replication-wal-gc code inter-dependencies too. Please,
> request his review on this, if you didn't yet.
> 
I personally have the same problem when trying to implement a trivial test -
just figuring out the layers and dependencies of the participants. In my
understanding this is about poor documentation, not poor design.

> > The same is
> > true for snapshot machinery that allows to create a copy of the database
> > in memory for the whole period of snapshot file write. Adding quorum here
> > also minimizes changes.
> > Currently replication represented by the following scheme:
> > ```
> > Customer        Leader          WAL(L)        Replica        WAL(R)
> 
> 3. Are you saying 'leader' === 'master'?

Yes - according to international polite naming conventions.
Mark Twain nowadays reads as obscene with his 'Mars Tom'.

> 
> >    |------TXN----->|              |             |              |
> >    |               |              |             |              |
> >    |         [TXN Rollback        |             |              |
> >    |            created]          |             |              |
> 
> 4. What is 'txn rollback', and why is it created before even a transaction
> is started? At least, rollback is a verb. Maybe you meant 'undo log'?

No, 'rollback' has both a verb and a noun meaning. Nevertheless, if you
tripped over it, it should be fixed.

> 
> >    |               |              |             |              |
> >    |               |-----TXN----->|             |              |
> >    |               |              |             |              |
> >    |               |<---WAL Ok----|             |              |
> >    |               |              |             |              |
> >    |         [TXN Rollback        |             |              |
> >    |           destroyed]         |             |              |
> >    |               |              |             |              |
> >    |<----TXN Ok----|              |             |              |
> >    |               |-------Replicate TXN------->|              |
> >    |               |              |             |              |
> >    |               |              |       [TXN Rollback        |
> >    |               |              |          created]          |
> >    |               |              |             |              |
> >    |               |              |             |-----TXN----->|
> >    |               |              |             |              |
> >    |               |              |             |<---WAL Ok----|
> >    |               |              |             |              |
> >    |               |              |       [TXN Rollback        |
> >    |               |              |         destroyed]         |
> >    |               |              |             |              |
> > ```
> > 
> > To introduce the 'quorum' we have to receive confirmation from replicas
> > to make a decision on whether the quorum is actually present. Leader
> > collects necessary amount of replicas confirmation plus its own WAL
> 
> 5. Please, define 'necessary amount'?

Apparently, this is resolved by comment #11 below.

> 
> > success. This state is named 'quorum' and gives leader the right to
> > complete the customers' request. So the picture will change to:
> > ```
> > Customer        Leader          WAL(L)        Replica        WAL(R)
> >    |------TXN----->|              |             |              |
> >    |               |              |             |              |
> >    |         [TXN Rollback        |             |              |
> >    |            created]          |             |              |
> >    |               |              |             |              |
> >    |               |-----TXN----->|             |              |
> 
> 6. Are we going to replicate transaction after user writes commit()?

Does your 'user' mean the customer from the picture? In such a case, do you
expect to have an interactive transaction? We definitely do not consider it
here in any form, since replication happens only for a complete transaction.

> Or will we replicate it while it is in progress? So called 'presumed
> commit'. I remember I read some papers explaining how it significantly
> speeds up synchronous transactions. Probably that was a paper about
> 2-phase commit, can't remember already. But the idea is still applicable
> for the replication too.

This can be considered only after MVCC is introduced - currently running
as a separate activity. Then we can replicate a transaction 'on the fly'
into a separate blob/readview/better_name. For now it would mean we are
too interwoven to roll back correctly afterwards, at the moment the
quorum fails.

> > In case of a leader failure a replica with the biggest LSN with former
> > leader's ID is elected as a new leader. The replica should record
> > 'rollback' in its WAL which effectively means that all transactions
> > without quorum should be rolled back. This rollback will be delivered to
> > all replicas and they will perform rollbacks of all transactions waiting
> > for quorum.
> 
> 7. Please, elaborate leader election. It is not as trivial as just 'elect'.
> What if the replica with the biggest LSN is temporary not available, but
> it knows that it has the biggest LSN? Will it become a leader without
> asking other nodes? What will do the other nodes? Will they wait for the
> new leader node to become available? Do they have a timeout on that?
> 
For now I do not plan any activities on HA - including automated failover
and leader re-election. In case the leader sees an insufficient number of
replicas to achieve the quorum, it stops and reports the problem to the
external orchestrator.

> Basically, it would be nice to see the split-brain problem description here,
> and its solution for us.
> 
I believe the split-brain is also under orchestrator control - we should
provide an API to switch the leader in the cluster, so that when a former
leader comes back it will not get a quorum for any txn it has and will
reply to customers with a failure as a result.
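
Something along these lines - a purely hypothetical orchestrator-facing
call, the name and signature are not part of this RFC:

```lua
-- Hypothetical sketch: the orchestrator appoints the new leader explicitly;
-- the call name is illustrative only and not part of this RFC.
local new_leader_uuid = '6e3f00b2-...'  -- UUID of the elected replica
box.ctl.switch_leader(new_leader_uuid)
-- A former leader that comes back is no longer the appointed one: it cannot
-- collect a quorum for its pending txns and fails them back to customers.
```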

> How leader failure is detected? Do you rely on our heartbeat messages?
> Are you going to adapt SWIM for this?
> 
> Raft has a dedicated subsystem for election, it is not that simple. It
> involves voting, randomized algorithms. Am I missing something obvious in
> this RFC, which makes the leader election much simpler specifically for
> Tarantool?
> 
All of these I consider HA features, for when Tarantool can automate the
failover and leader re-election. Out of scope for now.

> > An interface to force apply pending transactions by issuing a confirm
> > entry for them have to be introduced for manual recovery.
> > 
> > ### Snapshot generation.
> > 
> > We also can reuse current machinery of snapshot generation. Upon
> > receiving a request to create a snapshot an instance should request a
> > readview for the current commit operation. Although start of the
> > snapshot generation should be postponed until this commit operation
> > receives its confirmation. In case operation is rolled back, the snapshot
> > generation should be aborted and restarted using current transaction
> > after rollback is complete.
> 
> 8. This section highly depends on transaction manager for memtx. If you
> have a transaction manager, you always have a ready-to-use read-view
> of the latest committed data. At least this is my understanding.
> 
> After all, the manager should provide transaction isolation. And it means,
> that all non-committed transactions are not visible. And for that we need
> a read-view. Therefore, it could be used to make a snapshot.
> 
Currently there's no such manager for memtx, so I proposed this workaround
with minimal impact on our current machinery.
Alexander Lyapunov is working on the manager in parallel; he reviewed
and blessed this RFC, so apparently there's no contradiction with his
plans.

> > After snapshot is created the WAL should start from the first operation
> > that follows the commit operation snapshot is generated for. That means
> > WAL will contain 'confirm' messages that refer to transactions that are
> > not present in the WAL. Apparently, we have to allow this for the case
> > 'confirm' refers to a transaction with LSN less than the first entry in
> > the WAL.
> 
> 9. I couldn't understand that. Why confirm is in WAL for data stored in
> the snap? I thought you said above, that snapshot should be done for all
> confirmed data. Besides, having confirm out of snap means the snap is
> not self-sufficient anymore.
>
The snap waits for the confirm message to start. During this wait the WAL
keeps growing. At the moment the confirm arrives the snap will be created -
say, for txn #10. The new WAL will start with lsn #11, and the confirm can
sit somewhere around lsn #30.
So, starting from this snap the data is consistent up to lsn #10 - this is
guaranteed by waiting for the confirm message. Then a replay of the WAL will
come to a confirm message at lsn #30 - referring to lsn #10 - which is simply
ignored, since it points before the WAL start. There could be confirm
messages for even earlier txns if the wait takes long enough - all of them
will refer to lsns before the WAL start. And that is Ok.
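
Schematically, with the numbers from the example above:

```
snap        : consistent state up to lsn #10
              (created once the confirm for txn #10 arrives)
new xlog    : starts at lsn #11
              ...
              lsn #30: confirm for lsn #10  <- points before the xlog start,
                                               ignored on replay
```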

> > In case master appears unavailable a replica still have to be able to
> > create a snapshot. Replica can perform rollback for all transactions that
> > are not confirmed and claim its LSN as the latest confirmed txn. Then it
> > can create a snapshot in a regular way and start with blank xlog file.
> > All rolled back transactions will appear through the regular replication
> > in case master reappears later on.
> 
> 10. You should be able to make a snapshot without rollback. Read-views are
> available anyway. At least it is so in Vinyl, from what I remember. And this
> is going to be similar in memtx.
> 
You have to make a snapshot of a consistent data state. Until we have a
transaction manager in memtx, this is the way to do so. And as I mentioned,
that is a separate activity.

> > Cluster description should contain explicit attribute for each replica
> > to denote it participates in synchronous activities. Also the description
> > should contain criterion on how many replicas responses are needed to
> > achieve the quorum.
> 
> 11. Aha, I see 'necessary amount' from above is a manually set value. Ok.
> 
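
For illustration, a hypothetical box.cfg-style sketch of such a cluster
description - the option names are made up here and are not part of this RFC:

```lua
-- Hypothetical sketch, the option names are illustrative only:
box.cfg{
    replication = {
        'replica1:3301',
        'replica2:3301',
        'replica3:3301',
    },
    -- explicit attribute: this instance participates in synchronous activities
    replication_sync_participation = true,
    -- criterion: how many replica confirmations are needed for the quorum
    replication_sync_quorum = 2,
}
```
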
> > 
> > ## Rationale and alternatives
> > 
> > There is an implementation of synchronous replication as part of gh-980
> > activities, still it is not in a state to get into the product. More
> > than that it intentionally breaks backward compatibility which is a
> > prerequisite for this proposal.
> 
> 12. How are we going to deal with fsync()? Will it be forcefully enabled
> on sync replicas and the leader?

To my understanding, it's up to the user. I was considering a cluster that
has no WAL at all - relying on synchronous replication and a sufficient
number of replicas. Everyone I asked about it told me I'm nuts. To my great
surprise, Alexander Lyapunov brought exactly the same idea to discuss.

All of this leads to one resolution: I would leave it for the user to decide.
Obviously, to speed up processing the leader can disable the WAL completely,
but to do so we have to rework the relay to work from memory. Replicas can
use the WAL in whatever way the user wants: two replicas with slow HDDs
shouldn't wait for fsync(), while a super-fast Intel DCPMM one can enable it.
The balancing is up to the user.
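
For example, with the existing wal_mode knob the per-node choice could look
like this - just an illustration of the point, not a recommendation:

```lua
-- Leader: rely on the synchronous quorum instead of the local disk
-- (needs the relay reworked to serve from memory, as noted above).
box.cfg{wal_mode = 'none'}

-- Replica on a slow HDD: write the WAL, but do not wait for fsync().
box.cfg{wal_mode = 'write'}

-- Replica on fast storage (e.g. Intel DCPMM): fsync() on every write.
box.cfg{wal_mode = 'fsync'}
```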



