[Tarantool-patches] [RFC] Quorum-based synchronous replication

Thu Apr 23 09:58:09 MSK 2020

* Vladislav Shpilevoy <v.shpilevoy at tarantool.org> [20/04/22 01:21]:
> > To my understanding - it's up to user. I was considering a cluster that
> > has no WAL at all - relying on synchro replication and sufficient number
> > of replicas. Everyone who I asked about it told me I'm nuts. To my great
> > surprise Alexander Lyapunov brought exactly the same idea to discuss. 
> 
> I didn't see an RFC on that, and this can become easily possible, when
> in-memory relay is implemented. If it is implemented in a clean way. We
> just can turn off the disk backoff, and it will work from memory-only.

Sync replication must work from in-memory relay only. It works as
a natural failure detector: a replica which is slow or unavailable
is first removed from the subscribers of in-memory relay, and only 
then (possibly much much later) is marked as down.

By looking at the in-memory relay you have a clear idea which peers
are available and can abort a transaction right away if the cluster
is in a degraded state. You never wait for impossible events.
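
A minimal sketch of that check, in Python rather than Tarantool code.
The names can_reach_quorum, begin_sync_txn, subscribers and quorum are
hypothetical, and the sketch assumes the leader counts itself toward
the quorum and that the set of in-memory relay subscribers is the set
of peers considered available.

```python
def can_reach_quorum(subscribers, quorum, leader_counts=True):
    """Return True if enough peers are attached to the in-memory relay."""
    # Assumption: the leader's own copy counts as one vote.
    available = len(subscribers) + (1 if leader_counts else 0)
    return available >= quorum


def begin_sync_txn(subscribers, quorum):
    """Decide the fate of a sync transaction before waiting on anything."""
    # Abort right away instead of waiting for acks that cannot arrive.
    if not can_reach_quorum(subscribers, quorum):
        return "aborted"
    return "waiting_for_acks"
```

With a quorum of 3, a leader with two subscribed replicas proceeds,
while a leader with one subscribed replica aborts immediately and
never parks a fiber on the transaction.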

If you do have to wait, and say your wait timeout is 1 second, you
quickly run out of any fibers in the fiber pool for any work,
because all of them will be waiting on the sync transactions they
picked up from iproto to finish. The system will lose its
throttling capability.
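
To put rough numbers on this (a back-of-the-envelope sketch, not a
measurement): if every fiber in the pool is parked on a sync
transaction for the full timeout, the pool completes at most
pool_size / timeout transactions per second. The pool size of 768
below is an assumed figure for illustration, and max_txn_rate is a
hypothetical helper.

```python
def max_txn_rate(pool_size, wait_seconds):
    """Upper bound on transactions per second when each fiber blocks
    for wait_seconds on every transaction it picks up."""
    if wait_seconds <= 0:
        raise ValueError("wait_seconds must be positive")
    return pool_size / wait_seconds


# Assumed pool of 768 fibers, each waiting out the full 1 second timeout.
stalled = max_txn_rate(768, 1.0)
# Same pool when a transaction is confirmed or aborted within 1 ms.
healthy = max_txn_rate(768, 0.001)
```

The ratio between the two cases is the ratio of the wait times, so a
one second wait costs three orders of magnitude against a one
millisecond one.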

There are other reasons, too: the protocol will eventually be
quite tricky and the logic has to reside in a single place and not
require inter-thread communication. 
Committing a transaction anywhere outside WAL will require
inter-thread communication, which is costly and should be avoided.

I am surprised I have to explain this again and again - I never
assumed this spec was a precursor to a half-baked implementation,
only a high-level description of the next steps after in-memory
relay is in.

> > All of this is for one resolution: I would keep it for user to decide.
> > Obviously, to speed up the processing leader can disable wal completely,
> > but to do so we have to re-work the relay to work from memory. Replicas
> > can use WAL in a way user wants: 2 replicas with slow HDD shouldn't wait
> > for fsync(), while super-fast Intel DCPMM one can enable it. Balancing
> > is up to user.
> 
> Possibility of omitting fsync means that it is possible, that all nodes
> write confirm, which is reported to the client, then the nodes restart,
> and the data is lost. I would say it somewhere.

Worse yet, you can elect a leader "based on WAL length" and then it
is no longer the leader, because it lost its long WAL after
restart. fsync() is mandatory during election; in other cases it
shouldn't impact durability in our case.
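
A toy model of that failure, in Python and not tied to any real
election code. Node, written_lsn, synced_lsn and elect are all
hypothetical names; the sketch assumes the worst case, where a restart
drops everything that was written but not yet fsync()ed.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    written_lsn: int  # last LSN written to WAL, possibly only in page cache
    synced_lsn: int   # last LSN known to be fsync()ed

    def restart(self):
        # Worst case assumption: unsynced WAL tail is lost on restart.
        self.written_lsn = self.synced_lsn


def elect(nodes):
    """Pick the node with the longest WAL, as seen at this moment."""
    return max(nodes, key=lambda n: n.written_lsn)


a = Node("a", written_lsn=100, synced_lsn=40)
b = Node("b", written_lsn=90, synced_lsn=90)

before = elect([a, b]).name   # "a" wins on WAL length
a.restart()
after = elect([a, b]).name    # "a" has lost its unsynced tail
```

Node a wins the first election on WAL length, restarts, and comes back
shorter than b, so the election result no longer matches the actual
state. If a had fsync()ed before voting, its written and synced LSNs
would be equal and the restart would not change the outcome.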

-- 
Konstantin Osipov, Moscow, Russia