[Tarantool-patches] [RFC] Quorum-based synchronous replication
Konstantin Osipov
kostja.osipov at gmail.com
Thu Apr 23 23:39:32 MSK 2020
* Sergey Ostanevich <sergos at tarantool.org> [20/04/23 18:11]:
> > The spec should demonstrate that consistency is guaranteed: right
> > now it can easily be violated during a leader change, and this is
> > left out of the scope of the spec.
> >
> > My take is that any implementation which is not close enough to a
> > TLA+-proven spec is not trustworthy, so I would neither claim myself
> > nor trust anyone else's claim that it is consistent. At best this
> > RFC could achieve durability, by ensuring that no transaction is
> > committed unless it is delivered to a majority of replicas.
>
> Which is exactly what is stated in the RFC goals.
This is durability, though, not consistency. My point is: if
consistency cannot be guaranteed anyway, why assume a single leader?
Let's consider what happens if all replicas are allowed to collect
acks, define for that the same semantics as we have today for async
multi-master, and then add the remaining bits of RAFT.
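To make the durability criterion concrete, here is a minimal sketch
(Python, with purely illustrative names - this is not Tarantool code)
of counting acks against a majority quorum:

    class Txn:
        """A write whose durability is decided by a majority quorum."""
        def __init__(self, lsn, cluster_size):
            self.lsn = lsn
            self.cluster_size = cluster_size
            self.acks = set()    # ids of replicas that persisted it

        def ack(self, replica_id):
            self.acks.add(replica_id)

        def is_durable(self):
            quorum = self.cluster_size // 2 + 1
            # +1 for the node that originated and logged the write.
            return len(self.acks) + 1 >= quorum

In the multi-master variant, any node that originates a write would
run this counting, not just a single leader.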
>
> > Consistency requires implementing the RAFT spec in full and showing
> > that leader changes preserve write-ahead log linearizability.
> >
> So the leader should stop accepting transactions and wait for every
> txn in the queue to resolve into a confirm, or else issue a rollback -
> after a timeout, as a last resort.
> Since there is no automation in leader election, the cluster will be
> in a consistent state after this. Now a new leader can be appointed
> with all circumstances taken into account - node availability, ping
> from the proxy, LSN, etc.
> Again, this RFC is not about any HA features, such as auto-failover.
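A minimal sketch of that step-down procedure (Python, with made-up
names - not actual Tarantool code) could look like this:

    import time
    from collections import deque

    class Leader:
        """Illustrative stand-in for the node being demoted."""
        def __init__(self):
            self.accepting = True
            self.queue = deque()     # txns awaiting quorum acks

        def process_acks(self):
            # Placeholder: consume relay acks, confirm fully
            # acknowledged txns, and drop them from the queue.
            pass

        def rollback(self, txn):
            pass                     # placeholder

    def step_down(leader, timeout=5.0):
        leader.accepting = False     # stop taking new transactions
        deadline = time.monotonic() + timeout
        while leader.queue and time.monotonic() < deadline:
            leader.process_acks()    # queued txns may get confirmed
            time.sleep(0.01)
        while leader.queue:          # left over after the timeout:
            leader.rollback(leader.queue.pop())   # last resort
        # Only now is it safe to appoint a new leader, judging by
        # node availability, LSN, ping from the proxy, etc.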
>
> > > > The other issue is that if your replicas are alive but
> > > > slow/lagging behind, you can't let too many undo records
> > > > pile up unacknowledged in the tx thread.
> > > > The in-memory relay solves this nicely too, because it kicks out
> > > > replicas from memory to file mode if they are unable to keep up
> > > > with the speed of change.
> > > >
> > > That is the same problem - the leader's resources, hence a natural
> > > limit on throughput. I bet Tarantool faces similar limitations even
> > > now, although different ones.
> > >
> > > The in-memory relay is supposed to keep the same interface, so we
> > > expect to hop easily onto this new shiny express as soon as it
> > > appears. It will be an optimization: we are trying to implement
> > > something first and then speed it up.
> >
> > It is pretty clear that the implementation will be different.
> >
> Which contradicts preserving the interface, right?
I don't believe the internals and the API can be so disconnected. I
think the in-memory relay is such a significant change that the
implementation has to build upon it.
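To illustrate that kick-out behavior (a sketch with made-up names,
not the actual relay code):

    class InMemoryRelay:
        """Serves a replica from a bounded window of recent WAL rows."""
        def __init__(self, window=1024):
            self.rows = {}       # lsn -> row, at most `window` entries
            self.window = window
            self.max_lsn = 0

        def append(self, lsn, row):
            self.rows[lsn] = row
            self.max_lsn = lsn
            self.rows.pop(lsn - self.window, None)  # evict oldest

        def next_row(self, replica_lsn):
            if replica_lsn < self.max_lsn - self.window:
                # The replica lagged out of the window: demote it to
                # file mode, where it reads xlog files from disk.
                return ("FILE_MODE", None)
            return ("MEMORY", self.rows.get(replica_lsn + 1))

A lagging replica thus stops consuming the leader's memory and
catches up from files at its own pace.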
The trigger-based implementation was contributed back in 2015 and
went nowhere; in fact, it was the inspiration for creating a backlog
of such items as parallel applier, applier in iproto, in-memory
relay, and so on - all of these are "review items" for the
trigger-based syncrep:
https://github.com/Alexey-Ivanensky/tarantool/tree/bsync
--
Konstantin Osipov, Moscow, Russia