From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <kostja.osipov@gmail.com>
Received: from mail-lj1-f196.google.com (mail-lj1-f196.google.com
 [209.85.208.196])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested)
 by dev.tarantool.org (Postfix) with ESMTPS id 7BDB84696C3
 for <tarantool-patches@dev.tarantool.org>;
 Thu, 23 Apr 2020 09:58:11 +0300 (MSK)
Received: by mail-lj1-f196.google.com with SMTP id b2so5040387ljp.4
 for <tarantool-patches@dev.tarantool.org>;
 Wed, 22 Apr 2020 23:58:11 -0700 (PDT)
Date: Thu, 23 Apr 2020 09:58:09 +0300
From: Konstantin Osipov <kostja.osipov@gmail.com>
Message-ID: <20200423065809.GA4528@atlas>
References: <20200403210836.GB18283@tarantool.org>
 <ab849382-feb5-b906-84a8-402124e1c0a8@tarantool.org>
 <20200421104918.GA112@tarantool.org>
 <dd4d703e-6918-ccd9-ac5e-76fb54fff0f9@tarantool.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <dd4d703e-6918-ccd9-ac5e-76fb54fff0f9@tarantool.org>
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
List-Id: Tarantool development patches <tarantool-patches.dev.tarantool.org>
List-Unsubscribe: <https://lists.tarantool.org/mailman/options/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=unsubscribe>
List-Archive: <https://lists.tarantool.org/pipermail/tarantool-patches/>
List-Post: <mailto:tarantool-patches@dev.tarantool.org>
List-Help: <mailto:tarantool-patches-request@dev.tarantool.org?subject=help>
List-Subscribe: <https://lists.tarantool.org/mailman/listinfo/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=subscribe>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org

* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/04/22 01:21]:
> > To my understanding - it's up to the user. I was considering a cluster
> > that has no WAL at all - relying on synchronous replication and a
> > sufficient number of replicas. Everyone I asked about it told me I'm
> > nuts. To my great surprise Alexander Lyapunov brought up exactly the
> > same idea for discussion.
> 
> I didn't see an RFC on that, but it could easily become possible once
> the in-memory relay is implemented - if it is implemented in a clean
> way. We can just turn off the disk backoff, and it will work from
> memory only.

Sync replication must work from the in-memory relay only. It works
as a natural failure detector: a replica which is slow or
unavailable is first removed from the subscribers of the in-memory
relay, and only then (possibly much, much later) is marked as down.

By looking at the in-memory relay you have a clear idea of which
peers are available, and can abort a transaction right away if the
cluster is in a degraded state. You never wait for impossible
events.
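
To make this concrete, here is a minimal sketch of such an early
abort. relay_subscriber_count() and the quorum bookkeeping are
hypothetical names, not an existing Tarantool API:

/*
 * Minimal sketch: refuse a sync transaction immediately when the
 * in-memory relay shows a quorum is unreachable, instead of
 * waiting out a timeout. relay_subscriber_count() is assumed to
 * return the number of replicas currently attached to the
 * in-memory relay.
 */
static int
sync_tx_check_quorum(int quorum)
{
        /* +1 accounts for the leader itself. */
        int alive = relay_subscriber_count() + 1;
        if (alive < quorum)
                return -1;      /* abort right away, don't wait */
        return 0;
}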

If you do have to wait, and say your wait timeout is 1 second, you
quickly run out of fibers in the fiber pool for any other work,
because all of them will be waiting for the sync transactions they
picked up from iproto to finish. The system will lose its
throttling capability.
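
Roughly, every request fiber would be parked like this (a sketch
built on the public fiber_cond API; the "confirmed" condition
variable and its signalling are assumptions):

#include "fiber.h"

/*
 * Sketch: an iproto request fiber blocks here until the sync
 * transaction is confirmed or the timeout expires. With a pool of
 * N fibers, N concurrent sync transactions leave no fiber free
 * for any other request.
 */
static int
sync_tx_wait_confirm(struct fiber_cond *confirmed)
{
        if (fiber_cond_wait_timeout(confirmed, 1.0) != 0)
                return -1;      /* timed out, quorum not reached */
        return 0;
}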

There are other reasons, too: the protocol will eventually be
quite tricky, and the logic has to reside in a single place and
not require inter-thread communication. Committing a transaction
anywhere outside the WAL thread will require inter-thread
communication, which is costly and should be avoided.
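
For scale, here is what a single cross-thread confirmation hop
looks like with Tarantool's cbus; the callbacks and message layout
are assumptions, but each hop costs at least a queue operation and
often a wakeup of the peer thread:

#include "cbus.h"

/* Runs in the destination (tx) thread. Assumed callback. */
static void confirm_f(struct cmsg *msg);
/* Runs back in the originating thread. Assumed callback. */
static void confirm_done_f(struct cmsg *msg);

/* Pipe back to the originating thread; assumed to be set up. */
static struct cpipe back_pipe;

static const struct cmsg_hop confirm_route[] = {
        { confirm_f, &back_pipe },      /* hop 1: to tx thread */
        { confirm_done_f, NULL },       /* hop 2: back home */
};

static void
sync_tx_send_confirm(struct cpipe *tx_pipe, struct cmsg *msg)
{
        cmsg_init(msg, confirm_route);
        cpipe_push(tx_pipe, msg);
}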

I am surprised I have to explain this again and again - I never
saw this spec as a precursor to a half-baked implementation, only
as a high-level description of the next steps after the in-memory
relay is in.

> > All of this comes to one resolution: I would leave it for the user to
> > decide. Obviously, to speed up processing the leader can disable the
> > WAL completely, but to do so we have to rework the relay to work from
> > memory. Replicas can use the WAL the way the user wants: 2 replicas
> > with a slow HDD shouldn't wait for fsync(), while a super-fast Intel
> > DCPMM one can enable it. Balancing is up to the user.
> 
> The possibility of omitting fsync means it is possible that all nodes
> write a confirm, which is reported to the client, then the nodes
> restart, and the data is lost. I would state that somewhere.

Worse yet, you can elect a leader "based on WAL length" and then
it is no longer the leader, because it lost its long WAL after a
restart. fsync() is mandatory during an election; in other cases
it shouldn't impact durability.
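
A sketch of that invariant; election_prepare_vote() and the raw
WAL descriptor are illustrative, not an existing API:

#include <unistd.h>

/*
 * Sketch: before reporting WAL length to an election, make the
 * reported tail durable. Without this a node could claim the
 * longest WAL, win the election, and then lose the tail after a
 * restart. wal_fd is an illustrative parameter.
 */
static int
election_prepare_vote(int wal_fd)
{
        if (fdatasync(wal_fd) != 0)
                return -1;      /* cannot vote with a non-durable WAL */
        return 0;
}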

-- 
Konstantin Osipov, Moscow, Russia