[Tarantool-patches] [RFC] Quorum-based synchronous replication

Wed Apr 22 01:17:44 MSK 2020

>>>   - high availability (HA) solution with automated failover, roles
>>>     assignments an so on
>>
>> 1. So no leader election? That essentially makes single failure point
>> for RW requests, is it correct?
>>
>> On the other hand I see section 'Recovery and failover.' below. And
>> it seems to be automated, with selecting a replica with the biggest
>> LSN. Where is the truth?
>>
> 
> The failover can be manual or implemented independnetnly. By no means
> this means we should not explain how this should be done according to
> the replication schema discussed.

But it is not explained. You just said 'the biggest LSN owner is chosen'.
This looks very good in theory and on the paper, but it is not that simple.
If you are going to explain how it works. You said 'by no means we should not
explain'.

Talking of the election in scope of our replication schema, I don't see
where it is discussed. Is there a separate RFC I am missing? I am asking exactly
about that - where are pings, healthchecks, timeouts, voting on top of our
replication schema? If you don't want to make the election a part of this RFC
at all, then why is there a section, which literally says that the election
is present and it is 'the biggest LSN owner is chosen'?

In case the election is out of this for now, did you think about how a possible
future leader election algorithm could be implemented on top of this sync
replication? Just to be sure we are not ruining some things which will be
necessary for auto election later on.

> And yes, the SPOF is the leader of the cluster. This is expected and is
> Ok according to all MRG planning meeting participants.

That is not about MRG only, Tarantool is not a closed-source MRG-only DB.
I am not against making it non-automated for now, but I want to be sure it
will be possible to implement this as an enhancement.

>>>   - master-master configuration support
>>>
>>> ## Background and motivation
>>>
>>> There are number of known implementation of consistent data presence in
>>> a cluster. They can be commonly named as "wait for LSN" technique. The
>>> biggest issue with this technique is the absence of rollback guarantees
>>> at replica in case of transaction failure on one master or some of the
>>> replicas in the cluster.
>>>
>>> To provide such capabilities a new functionality should be introduced in
>>> Tarantool core, with requirements mentioned before - backward
>>> compatibility and ease of cluster orchestration.
>>>
>>> ## Detailed design
>>>
>>> ### Quorum commit
>>>
>>> The main idea behind the proposal is to reuse existent machinery as much
>>> as possible. It will ensure the well-tested and proven functionality
>>> across many instances in MRG and beyond is used. The transaction rollback
>>> mechanism is in place and works for WAL write failure. If we substitute
>>> the WAL success with a new situation which is named 'quorum' later in
>>> this document then no changes to the machinery is needed.
>>
>> 2. The problem here is that you create dependency on WAL. According to
>> your words, replication is inside WAL, and if WAL gave ok, then all is
> 
> 'Replication is inside WAL' - what do you mean by that? The replication in
> its current state works from WAL, although it's an exaggregation to say
> it is 'inside WAL'. Why it means a new dependency after that?

You said 'WAL success' is substituted with a new situation 'quorum'. It
means, strictly interpreting your words, that 'wal_write()' function
won't return 0 until quorum is collected.

This is what I mean by moving replication into WAL subsystem.

>> replicated and applied. But that makes current code structure even worse
>> than it is. Now WAL, GC, and replication code is spaghetti, basically.
>> All depends on all. I was rather thinking, that we should fix that first.
>> Not aggravate.
>>
>> WAL should provide API for writing to disk. Replication should not bother
>> about WAL. GC should not bother about replication. All should be independent,
>> and linked in one place by some kind of a manager, which would just use their
> 
> So you want to introduce a single point that will translate all messages
> between all participants?

Well, this is called 'cbus', we already have it. This is a separate headache,
which no one understands except Georgy. However I was not talking about it.
I am wrong, it should not be a single manager. But all the subsystems should
be as independent as possible still.

> I believe current state was introduced exactly
> to avoid this situation. Each participant can be subscribed for a
> particular trigger inside another participant and take it into account
> in its activities - at the right time for itself. 

Whatever. I don't know the code good enough, so I am probably wrong
somewhere here. But every time looking at these numerous triggers depending
on each other and called at arbitrary moments was enough. I tried to fix
these trigger-dependencies in scope of different tasks, but cleaning this
code appears to be a huge task by itself.

This can be done without obscure triggers called from arbitrary threads at
atribtrary moments of time. That in fact was the main blocker for the in-memory
WAL, when I tried to finish it after Georgy in January. We have fibers exactly
to avoid triggers. To be able to write linear and simple code. The triggers
can be replaced by dedicated fibers, and fiber condition variables can be used
to wait for exact moments of time of needed events where functionality is
event based.

>> APIs. I believe Cyrill G. would agree with me here, I remember him
>> complaining about replication-wal-gc code inter-dependencies too. Please,
>> request his review on this, if you didn't yet.
>>
> I personally have the same problem trying to implement a trivial test,
> by just figuring out the layers and dependencies of participants. This
> is about poor documentation im my understanding, not poor design. 

There is almost more documentation than the code. Just look at the number
of comments and level of their details. And still it does not help. So
looks like just a bad API and dependency design. Does not matter how hard
it is documented, it just becomes more interdepending and harder to wrap a
mind around it.

>>> In case of a leader failure a replica with the biggest LSN with former
>>> leader's ID is elected as a new leader. The replica should record
>>> 'rollback' in its WAL which effectively means that all transactions
>>> without quorum should be rolled back. This rollback will be delivered to
>>> all replicas and they will perform rollbacks of all transactions waiting
>>> for quorum.
>>
>> 7. Please, elaborate leader election. It is not as trivial as just 'elect'.
>> What if the replica with the biggest LSN is temporary not available, but
>> it knows that it has the biggest LSN? Will it become a leader without
>> asking other nodes? What will do the other nodes? Will they wait for the
>> new leader node to become available? Do they have a timeout on that?
>>
> By now I do not plan any activities on HA - including automated failover
> and leader re-election. In case leader sees insufficient number of
> replicas to achieve quorum - it stops, reporting the problem to the
> external orchestrator. 

>From the text in this section it does not look like a not planned activity,
but like an already made decision. It is not even in the 'Plans' section.
You just said 'is elected'. By whom? How?

If the election is not a part of the RFC, I would suggest moving this out,
or into a separate section 'Plans' or something. Or reformulate this by
saying like 'the cluster stops serving write requests until external
intervention sets a new leader'. And 'it is *advised* to use the one with
the biggest LSN in the old leader's vclock component'. Something like that.

>> Basically, it would be nice to see the split-brain problem description here,
>> and its solution for us.
>>
> I believe the split-brain is under orchestrator control either - we
> should provide API to switch leader in the cluster, so that when a
> former leader came back it will not get quorum for any txn it has,
> replying to customers with failure as a result.

Exactly. We should provide something for this from inside. But are there
any details? How should that work? Should all the healthy replicas reject
everything from the false-leader? Should the false-leader somehow realize,
that it is not considered a leader anymore, and should stop itself? If we
choose the former way, how a replica defines who is the true leader? For
example, some replicas still may consider the old leader as a true master.
If we choose the latter way, what is the algorithm of determining that we
are not a leader anymore?

>>> After snapshot is created the WAL should start from the first operation
>>> that follows the commit operation snapshot is generated for. That means
>>> WAL will contain 'confirm' messages that refer to transactions that are
>>> not present in the WAL. Apparently, we have to allow this for the case
>>> 'confirm' refers to a transaction with LSN less than the first entry in
>>> the WAL.
>>
>> 9. I couldn't understand that. Why confirm is in WAL for data stored in
>> the snap? I thought you said above, that snapshot should be done for all
>> confirmed data. Besides, having confirm out of snap means the snap is
>> not self-sufficient anymore.
>>
> Snap waits for confirm message to start. During this wait the WAL keep
> growing. At the moment confirm arrived the snap will be created - say,
> for txn #10. The WAL will be started with lsn #11 and commit can be
> somewhere lsn #30. 
> So, starting with this snap data appears consistent for lsn #10 - it is
> guaranteed by the wait of commit message. Then replay of WAL will come 
> to a confirm message lsn #30 - referring to lsn #10 - that actually
> ignored, since it looks beyond the WAL start. There could be confirm
> messages for even earlier txns if wait takes sufficient time - all of
> them will refer to lsn beyond the WAL. And it is Ok.

What is 'commit message'? I don't see it on the schema above. I see only
confirms.

So the problem is that some data may be written to WAL after we started
committing our transactions going to the snap, but before we received a
quorum. And we can't truncate the WAL by the quorum, because there is
already newer data, which was not included into the snap. Because WAL is
not stopped, it still accepts new transactions. Now I understand.

Would be good to have this example in the RFC.

>>> Cluster description should contain explicit attribute for each replica
>>> to denote it participates in synchronous activities. Also the description
>>> should contain criterion on how many replicas responses are needed to
>>> achieve the quorum.
>>
>> 11. Aha, I see 'necessary amount' from above is a manually set value. Ok.
>>
>>>
>>> ## Rationale and alternatives
>>>
>>> There is an implementation of synchronous replication as part of gh-980
>>> activities, still it is not in a state to get into the product. More
>>> than that it intentionally breaks backward compatibility which is a
>>> prerequisite for this proposal.
>>
>> 12. How are we going to deal with fsync()? Will it be forcefully enabled
>> on sync replicas and the leader?
> 
> To my understanding - it's up to user. I was considering a cluster that
> has no WAL at all - relying on sychro replication and sufficient number
> of replicas. Everyone who I asked about it told me I'm nuts. To my great
> surprise Alexander Lyapunov brought exactly the same idea to discuss. 

I didn't see an RFC on that, and this can become easily possible, when
in-memory relay is implemented. If it is implemented in a clean way. We
just can turn off the disk backoff, and it will work from memory-only.

> All of these is for one resolution: I would keep it for user to decide.
> Obviously, to speed up the processing leader can disable wal completely,
> but to do so we have to re-work the relay to work from memory. Replicas
> can use WAL in a way user wants: 2 replicas with slow HDD should'n wait
> for fsync(), while super-fast Intel DCPMM one can enable it. Balancing
> is up to user.

Possibility of omitting fsync means that it is possible, that all nodes
write confirm, which is reported to the client, then the nodes restart,
and the data is lost. I would say it somewhere.