Date: Thu, 23 Apr 2020 23:39:32 +0300
From: Konstantin Osipov
Message-ID: <20200423203932.GA22011@atlas>
References: <20200403210836.GB18283@tarantool.org>
 <20200421104918.GA112@tarantool.org>
 <20200423065809.GA4528@atlas>
 <20200423091436.GA14576@atlas>
 <20200423112702.GC112@tarantool.org>
 <20200423114325.GA19129@atlas>
 <20200423151134.GD112@tarantool.org>
In-Reply-To: <20200423151134.GD112@tarantool.org>
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
List-Id: Tarantool development patches
To: Sergey Ostanevich
Cc: tarantool-patches@dev.tarantool.org, Vladislav Shpilevoy

* Sergey Ostanevich [20/04/23 18:11]:
> > The spec should demonstrate the consistency is guaranteed: right
> > now it can easily be violated during a leader change, and this is
> > left out of scope of the spec.
> >
> > My take is that any implementation which is not close enough to a
> > TLA+ proven spec is not trustworthy, so I would not claim myself
> > or trust any one elses claims that it is consistent. At best this
> > RFC could achieve durability, by ensuring that no transaction is
> > committed unless it is delivered to a majority of replicas.
>
> What is exactly mentioned in RFC goals.

This is durability, though, not consistency. My point is: if
consistency cannot be guaranteed anyway, why assume a single leader?
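For illustration, the durability rule discussed above (no transaction
is committed unless it is delivered to a majority of replicas) can be
sketched in a few lines of Python. This is only a sketch under stated
assumptions: the class QuorumTracker and its methods are hypothetical
names, not Tarantool code, and the replica set is assumed to include
the node that collects the acks. Note that it says nothing about what
happens to the log on a leader change, which is exactly the
consistency gap in question.

```python
class QuorumTracker:
    """Hypothetical helper: tracks the highest LSN stored on a majority."""

    def __init__(self, replica_ids):
        self.replica_ids = set(replica_ids)
        # Majority of the full replica set (assumed to include the collector).
        self.quorum = len(self.replica_ids) // 2 + 1
        # Highest LSN each replica has acknowledged so far.
        self.acked = {rid: 0 for rid in self.replica_ids}

    def ack(self, replica_id, lsn):
        """Record that replica_id has written everything up to lsn."""
        if replica_id not in self.acked:
            raise KeyError(f"unknown replica: {replica_id}")
        # Acks may arrive out of order; never move backwards.
        self.acked[replica_id] = max(self.acked[replica_id], lsn)

    def confirmed_lsn(self):
        """Highest LSN that at least `quorum` replicas have acknowledged."""
        ordered = sorted(self.acked.values(), reverse=True)
        return ordered[self.quorum - 1]

    def is_committed(self, lsn):
        """A transaction may be reported committed only at or below this LSN."""
        return lsn <= self.confirmed_lsn()


tracker = QuorumTracker(["a", "b", "c"])
tracker.ack("a", 5)           # only one replica has LSN 5
tracker.ack("b", 3)           # two replicas now have LSN 3
print(tracker.confirmed_lsn())   # 3
print(tracker.is_committed(4))   # False
```

Nothing in this bookkeeping depends on which node runs it, so the same
tracker could in principle be kept by every replica for its own
transactions.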
Let's consider what happens if all replicas are allowed to collect
acks, and define for that the same semantics as we have today for
async multi-master. Then add the remaining bits of RAFT.

>
> > Consistency requires implementing RAFT spec in full and showing
> > that leader changes preserve the write ahead log linearizability.
> >
> So the leader should stop accepting transactions, wait for all txn in
> queue resolved into confirmed either issue a rollback - after a
> timeout as a last resort.
> Since no automation in leader election the cluster will appear in a
> consistent state after this. Now a new leader can be appointed with
> all circumstances taken into account - nodes availability, ping from
> the proxy, lsn, etc.
> Again, this RFC is not about any HA features, such as auto-failover.
>
> > > > The other issue is that if your replicas are alive but
> > > > slow/lagging behind, you can't let too many undo records to
> > > > pile up unacknowledged in tx thread.
> > > > The in-memory relay solves this nicely too, because it kicks out
> > > > replicas from memory to file mode if they are unable to keep up
> > > > with the speed of change.
> > > >
> > > That is the same problem - resources of leader, so natural limit for
> > > throughput. I bet Tarantool faces similar limitations even now,
> > > although different ones.
> > >
> > > The in-memory relay supposed to keep the same interface, so we expect to
> > > hop easily to this new shiny express as soon as it appears. This will be
> > > an optimization and we're trying to implement something and then speed
> > > it up.
> >
> > It is pretty clear that the implementation will be different.
> >
> Which contradicts to the interface preservance, right?

I don't believe internals and API can be so disconnected. I think
in-memory relay is such a significant change that the
implementation has to build upon it.
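To show why this shapes the implementation, here is a minimal sketch
of the behaviour quoted above: replicas are served from a bounded
in-memory buffer and are kicked out to file mode once they fall
behind it. Everything here is an assumption made for illustration:
the class InMemoryRelay, its methods, the fixed capacity, and
consecutive integer LSNs starting at 1 are hypothetical and do not
describe the actual in-memory relay design.

```python
from collections import deque


class InMemoryRelay:
    """Hypothetical bounded buffer of recent rows shared by all replicas."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = deque()   # (lsn, payload) pairs, oldest first
        self.sent = {}        # replica_id -> last LSN sent from memory
        self.mode = {}        # replica_id -> "memory" or "file"

    def subscribe(self, replica_id, from_lsn=0):
        self.sent[replica_id] = from_lsn
        self.mode[replica_id] = "memory"

    def append(self, lsn, payload):
        """Add a new row, evicting the oldest one if the buffer is full."""
        self.rows.append((lsn, payload))
        if len(self.rows) > self.capacity:
            self.rows.popleft()
        oldest = self.rows[0][0]
        for rid, last in self.sent.items():
            # The next row this replica needs is no longer in memory:
            # it cannot keep up, so it has to read from files instead.
            if last + 1 < oldest:
                self.mode[rid] = "file"

    def fetch(self, replica_id):
        """Return the rows the replica has not seen yet (memory mode only)."""
        if self.mode[replica_id] != "memory":
            return []
        last = self.sent[replica_id]
        batch = [row for row in self.rows if row[0] > last]
        if batch:
            self.sent[replica_id] = batch[-1][0]
        return batch
```

With a capacity of 2, a replica that fetches after every append stays
in memory mode, while one that never fetches is switched to file mode
as soon as the third row evicts the first. The amount of
unacknowledged state held in memory is bounded by the buffer size
rather than by the slowest replica.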
The trigger-based implementation was contributed back in 2015 and
went nowhere. In fact, it was the inspiration for a backlog of
items such as parallel applier, applier in iproto, in-memory relay,
and so on - all of these are "review items" for the trigger-based
syncrep:

https://github.com/Alexey-Ivanensky/tarantool/tree/bsync

-- 
Konstantin Osipov, Moscow, Russia