Date: Thu, 23 Apr 2020 09:58:09 +0300
From: Konstantin Osipov
To: Vladislav Shpilevoy
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication

* Vladislav Shpilevoy [20/04/22 01:21]:

> > To my understanding - it's up to the user. I was considering a cluster
> > that has no WAL at all, relying on synchronous replication and a
> > sufficient number of replicas. Everyone I asked about it told me I'm
> > nuts. To my great surprise, Alexander Lyapunov brought exactly the
> > same idea to discuss.
>
> I didn't see an RFC on that, and this can easily become possible when
> the in-memory relay is implemented. If it is implemented in a clean
> way, we can just turn off the disk backoff, and it will work from
> memory only.

Sync replication must work from the in-memory relay only. It serves as a
natural failure detector: a replica which is slow or unavailable is first
removed from the subscribers of the in-memory relay, and only then
(possibly much later) marked as down. By looking at the in-memory relay
you have a clear idea which peers are available, and you can abort a
transaction right away if the cluster is in a degraded state. You never
wait for impossible events.

If you do have to wait, and, say, your wait timeout is 1 second, you
quickly run out of fibers in the fiber pool for any other work, because
all of them will be waiting on the sync transactions they picked up from
iproto to finish. The system will lose its throttling capability.

There are other reasons, too: the protocol will eventually be quite
tricky, and its logic has to reside in a single place, without requiring
inter-thread communication. Committing a transaction anywhere outside the
WAL thread will require inter-thread communication, which is costly and
should be avoided.

I am surprised I have to explain this again and again - I never assumed
this spec was a precursor to a half-baked implementation, only a
high-level description of the next steps after the in-memory relay is in.

> > All of this is for one resolution: I would leave it to the user to
> > decide. Obviously, to speed up processing, the leader can disable the
> > WAL completely, but to do so we have to rework the relay to work from
> > memory. Replicas can use the WAL in whatever way the user wants: 2
> > replicas with a slow HDD shouldn't wait for fsync(), while a
> > super-fast Intel DCPMM one can enable it. Balancing is up to the user.
>
> The possibility of omitting fsync means it is possible that all nodes
> write confirm, which is reported to the client, then the nodes restart,
> and the data is lost. I would say it somewhere.

Worse yet, you can elect a leader "based on WAL length", and then it is
no longer a valid leader, because it lost its long WAL after restart.
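As an illustration of the per-node durability choice discussed above (a
sketch only, not part of the RFC; the address and ports are hypothetical,
and each snippet belongs to a separate instance's init file), every node
picks its own wal_mode via Tarantool's existing box.cfg options:

    -- leader.lua: the leader relies on the quorum for durability,
    -- so it appends to the WAL without fsync ('write' mode).
    -- wal_mode = 'none' would drop the WAL entirely, but today the
    -- relay reads from WAL files on disk, so a leader still needs a
    -- WAL to replicate; the in-memory relay is what would lift that.
    box.cfg{
        listen = 3301,
        wal_mode = 'write',
    }

    -- replica_hdd.lua: a replica on a slow HDD also skips fsync.
    box.cfg{
        listen = 3302,
        replication = 'leader.example.com:3301',  -- hypothetical host
        wal_mode = 'write',
    }

    -- replica_pmem.lua: a replica on fast persistent memory can
    -- afford an fsync on every WAL write.
    box.cfg{
        listen = 3303,
        replication = 'leader.example.com:3301',
        wal_mode = 'fsync',
    }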
fsync() is mandatory during election; in all other cases skipping it
shouldn't impact durability in our design.

-- 
Konstantin Osipov, Moscow, Russia