[Tarantool-patches] [RFC] Quorum-based synchronous replication

Wed Apr 22 19:50:29 MSK 2020

Hi!

On 22 апр 00:17, Vladislav Shpilevoy wrote:
> >>>   - high availability (HA) solution with automated failover, roles
> >>>     assignments an so on
> >>
> >> 1. So no leader election? That essentially makes single failure point
> >> for RW requests, is it correct?
> >>
> >> On the other hand I see section 'Recovery and failover.' below. And
> >> it seems to be automated, with selecting a replica with the biggest
> >> LSN. Where is the truth?
> >>
> > 
> > The failover can be manual or implemented independnetnly. By no means
> > this means we should not explain how this should be done according to
> > the replication schema discussed.
> 
> But it is not explained. You just said 'the biggest LSN owner is chosen'.
> This looks very good in theory and on the paper, but it is not that simple.
> If you are going to explain how it works. You said 'by no means we should not
> explain'.
> 
I expect this to be a reference to what is currently implemented by
Tarantool users in many ways. I think I have to rephrase that one 'can
keep current election apporach using biggest LSN' since proposed
solution does not change current semantics of WAL generation, just
'confirm' and 'rollback' operations that are regular entries in the WAL.

> Talking of the election in scope of our replication schema, I don't see
> where it is discussed. Is there a separate RFC I am missing? I am asking exactly
> about that - where are pings, healthchecks, timeouts, voting on top of our
> replication schema? If you don't want to make the election a part of this RFC
> at all, then why is there a section, which literally says that the election
> is present and it is 'the biggest LSN owner is chosen'?
> 
> In case the election is out of this for now, did you think about how a possible
> future leader election algorithm could be implemented on top of this sync
> replication? Just to be sure we are not ruining some things which will be
> necessary for auto election later on.
> 
My answer will be the same - no changes to WAL from this point of view,
just replicas has their respective undo logs and can rollback to the
consistent state.

> > And yes, the SPOF is the leader of the cluster. This is expected and is
> > Ok according to all MRG planning meeting participants.
> 
> That is not about MRG only, Tarantool is not a closed-source MRG-only DB.
> I am not against making it non-automated for now, but I want to be sure it
> will be possible to implement this as an enhancement.
> 
Sure we want to - using the existing SWIM module for membership and to
elaborate something close to RAFT, still it is not in our immediate plans.

> >>>   - master-master configuration support
> >>>
> >>> ## Background and motivation
> >>>
> >>> There are number of known implementation of consistent data presence in
> >>> a cluster. They can be commonly named as "wait for LSN" technique. The
> >>> biggest issue with this technique is the absence of rollback guarantees
> >>> at replica in case of transaction failure on one master or some of the
> >>> replicas in the cluster.
> >>>
> >>> To provide such capabilities a new functionality should be introduced in
> >>> Tarantool core, with requirements mentioned before - backward
> >>> compatibility and ease of cluster orchestration.
> >>>
> >>> ## Detailed design
> >>>
> >>> ### Quorum commit
> >>>
> >>> The main idea behind the proposal is to reuse existent machinery as much
> >>> as possible. It will ensure the well-tested and proven functionality
> >>> across many instances in MRG and beyond is used. The transaction rollback
> >>> mechanism is in place and works for WAL write failure. If we substitute
> >>> the WAL success with a new situation which is named 'quorum' later in
> >>> this document then no changes to the machinery is needed.
> >>
> >> 2. The problem here is that you create dependency on WAL. According to
> >> your words, replication is inside WAL, and if WAL gave ok, then all is
> > 
> > 'Replication is inside WAL' - what do you mean by that? The replication in
> > its current state works from WAL, although it's an exaggregation to say
> > it is 'inside WAL'. Why it means a new dependency after that?
> 
> You said 'WAL success' is substituted with a new situation 'quorum'. It
> means, strictly interpreting your words, that 'wal_write()' function
> won't return 0 until quorum is collected.
> 
> This is what I mean by moving replication into WAL subsystem.
> 
The wal_write() should report the result of WAL operation. It should not
return quorum - the WAL result should be used along with quorum messages
from replicas to claim txn is complete. This shouldn't be part of WAL.

> >> replicated and applied. But that makes current code structure even worse
> >> than it is. Now WAL, GC, and replication code is spaghetti, basically.
> >> All depends on all. I was rather thinking, that we should fix that first.
> >> Not aggravate.
> >>
> >> WAL should provide API for writing to disk. Replication should not bother
> >> about WAL. GC should not bother about replication. All should be independent,
> >> and linked in one place by some kind of a manager, which would just use their
> > 
> > So you want to introduce a single point that will translate all messages
> > between all participants?
> 
> Well, this is called 'cbus', we already have it. This is a separate headache,
> which no one understands except Georgy. However I was not talking about it.
> I am wrong, it should not be a single manager. But all the subsystems should
> be as independent as possible still.
> 
> > I believe current state was introduced exactly
> > to avoid this situation. Each participant can be subscribed for a
> > particular trigger inside another participant and take it into account
> > in its activities - at the right time for itself. 
> 
> Whatever. I don't know the code good enough, so I am probably wrong
> somewhere here. But every time looking at these numerous triggers depending
> on each other and called at arbitrary moments was enough. I tried to fix
> these trigger-dependencies in scope of different tasks, but cleaning this
> code appears to be a huge task by itself.
> 
> This can be done without obscure triggers called from arbitrary threads at
> atribtrary moments of time. That in fact was the main blocker for the in-memory
> WAL, when I tried to finish it after Georgy in January. We have fibers exactly
> to avoid triggers. To be able to write linear and simple code. The triggers
> can be replaced by dedicated fibers, and fiber condition variables can be used
> to wait for exact moments of time of needed events where functionality is
> event based.
> 
I totaly agree to you that it is a big task itself. I believe we won't
introduce too much extra dependencies between the parties - just tweak
some of them.
So far I want to start an ativity - Cyrill Gorcunov supports me - to
draw a mutual dependency map of all participants: their threads, fibers,
triggers and how they are connected. I believe it will help us to
prepare a first step to redesign the system - or make a thoughtful
decision to keep it as is. 

> >> APIs. I believe Cyrill G. would agree with me here, I remember him
> >> complaining about replication-wal-gc code inter-dependencies too. Please,
> >> request his review on this, if you didn't yet.
> >>
> > I personally have the same problem trying to implement a trivial test,
> > by just figuring out the layers and dependencies of participants. This
> > is about poor documentation im my understanding, not poor design. 
> 
> There is almost more documentation than the code. Just look at the number
> of comments and level of their details. And still it does not help. So
> looks like just a bad API and dependency design. Does not matter how hard
> it is documented, it just becomes more interdepending and harder to wrap a
> mind around it.
> 
You can document every line of code with explanation of what it does,
but there' no such line that will require to draw the 'big picture'. I
believe it is the problem. Design is not the code, rather guideline for
the code. To decypher it back from (perhaps not-so good sometimes) code
is a big task itself. This should help to understand - and only then
improve - the implementation.

> >>> In case of a leader failure a replica with the biggest LSN with former
> >>> leader's ID is elected as a new leader. The replica should record
> >>> 'rollback' in its WAL which effectively means that all transactions
> >>> without quorum should be rolled back. This rollback will be delivered to
> >>> all replicas and they will perform rollbacks of all transactions waiting
> >>> for quorum.
> >>
> >> 7. Please, elaborate leader election. It is not as trivial as just 'elect'.
> >> What if the replica with the biggest LSN is temporary not available, but
> >> it knows that it has the biggest LSN? Will it become a leader without
> >> asking other nodes? What will do the other nodes? Will they wait for the
> >> new leader node to become available? Do they have a timeout on that?
> >>
> > By now I do not plan any activities on HA - including automated failover
> > and leader re-election. In case leader sees insufficient number of
> > replicas to achieve quorum - it stops, reporting the problem to the
> > external orchestrator. 
> 
> From the text in this section it does not look like a not planned activity,
> but like an already made decision. It is not even in the 'Plans' section.
> You just said 'is elected'. By whom? How?
> 
Again, I want to address this to current users - I would rephrase.

> If the election is not a part of the RFC, I would suggest moving this out,
> or into a separate section 'Plans' or something. Or reformulate this by
> saying like 'the cluster stops serving write requests until external
> intervention sets a new leader'. And 'it is *advised* to use the one with
> the biggest LSN in the old leader's vclock component'. Something like that.
> 
I don't think the automated election should be even a plan for SR. It is
a feature on top of it, shouldn't be a prerequisite in any form.

> >> Basically, it would be nice to see the split-brain problem description here,
> >> and its solution for us.
> >>
> > I believe the split-brain is under orchestrator control either - we
> > should provide API to switch leader in the cluster, so that when a
> > former leader came back it will not get quorum for any txn it has,
> > replying to customers with failure as a result.
> 
> Exactly. We should provide something for this from inside. But are there
> any details? How should that work? Should all the healthy replicas reject
> everything from the false-leader? Should the false-leader somehow realize,
> that it is not considered a leader anymore, and should stop itself? If we
> choose the former way, how a replica defines who is the true leader? For
> example, some replicas still may consider the old leader as a true master.
> If we choose the latter way, what is the algorithm of determining that we
> are not a leader anymore?
> 
It is all about external orchestration - if replica can't get ping from
leader it stops, reporting its status to orchestrator. 
If leader lost number of replicas that makes quorum impossible - it
stops replication, reporting to the orchestrator. 
Will it be sufficient to cover the question?

> >>> After snapshot is created the WAL should start from the first operation
> >>> that follows the commit operation snapshot is generated for. That means
> >>> WAL will contain 'confirm' messages that refer to transactions that are
> >>> not present in the WAL. Apparently, we have to allow this for the case
> >>> 'confirm' refers to a transaction with LSN less than the first entry in
> >>> the WAL.
> >>
> >> 9. I couldn't understand that. Why confirm is in WAL for data stored in
> >> the snap? I thought you said above, that snapshot should be done for all
> >> confirmed data. Besides, having confirm out of snap means the snap is
> >> not self-sufficient anymore.
> >>
> > Snap waits for confirm message to start. During this wait the WAL keep
> > growing. At the moment confirm arrived the snap will be created - say,
> > for txn #10. The WAL will be started with lsn #11 and commit can be
> > somewhere lsn #30. 
> > So, starting with this snap data appears consistent for lsn #10 - it is
> > guaranteed by the wait of commit message. Then replay of WAL will come 
> > to a confirm message lsn #30 - referring to lsn #10 - that actually
> > ignored, since it looks beyond the WAL start. There could be confirm
> > messages for even earlier txns if wait takes sufficient time - all of
> > them will refer to lsn beyond the WAL. And it is Ok.
> 
> What is 'commit message'? I don't see it on the schema above. I see only
> confirms.
> 
Sorry, it is a misprint - I meant 'confirm'.

> So the problem is that some data may be written to WAL after we started
> committing our transactions going to the snap, but before we received a
> quorum. And we can't truncate the WAL by the quorum, because there is
> already newer data, which was not included into the snap. Because WAL is
> not stopped, it still accepts new transactions. Now I understand.
> 
> Would be good to have this example in the RFC.
>
Ok, I will try to elaborate on this.

> >>> Cluster description should contain explicit attribute for each replica
> >>> to denote it participates in synchronous activities. Also the description
> >>> should contain criterion on how many replicas responses are needed to
> >>> achieve the quorum.
> >>
> >> 11. Aha, I see 'necessary amount' from above is a manually set value. Ok.
> >>
> >>>
> >>> ## Rationale and alternatives
> >>>
> >>> There is an implementation of synchronous replication as part of gh-980
> >>> activities, still it is not in a state to get into the product. More
> >>> than that it intentionally breaks backward compatibility which is a
> >>> prerequisite for this proposal.
> >>
> >> 12. How are we going to deal with fsync()? Will it be forcefully enabled
> >> on sync replicas and the leader?
> > 
> > To my understanding - it's up to user. I was considering a cluster that
> > has no WAL at all - relying on sychro replication and sufficient number
> > of replicas. Everyone who I asked about it told me I'm nuts. To my great
> > surprise Alexander Lyapunov brought exactly the same idea to discuss. 
> 
> I didn't see an RFC on that, and this can become easily possible, when
> in-memory relay is implemented. If it is implemented in a clean way. We
> just can turn off the disk backoff, and it will work from memory-only.
> 
It is not in RFC and we had no support from the customers in question.

> > All of these is for one resolution: I would keep it for user to decide.
> > Obviously, to speed up the processing leader can disable wal completely,
> > but to do so we have to re-work the relay to work from memory. Replicas
> > can use WAL in a way user wants: 2 replicas with slow HDD should'n wait
> > for fsync(), while super-fast Intel DCPMM one can enable it. Balancing
> > is up to user.
> 
> Possibility of omitting fsync means that it is possible, that all nodes
> write confirm, which is reported to the client, then the nodes restart,
> and the data is lost. I would say it somewhere.

The data will not be lost, unless _all_ nodes fail at the same time -
including leader. Otherwise the data will be propagated from the
survivor through the regular replication. No changes here to what we
have currently.