Date: Thu, 23 Apr 2020 14:43:25 +0300
From: Konstantin Osipov
Message-ID: <20200423114325.GA19129@atlas>
In-Reply-To: <20200423112702.GC112@tarantool.org>
References: <20200403210836.GB18283@tarantool.org>
 <20200421104918.GA112@tarantool.org>
 <20200423065809.GA4528@atlas>
 <20200423091436.GA14576@atlas>
 <20200423112702.GC112@tarantool.org>
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
To: Sergey Ostanevich
Cc: tarantool-patches@dev.tarantool.org, Vladislav Shpilevoy

* Sergey Ostanevich [20/04/23 14:29]:
> Hi!
>
> Thanks for review!
>
> On 23 Apr 12:14, Konstantin Osipov wrote:
> > * Konstantin Osipov [20/04/23 09:58]:
> > > > > To my understanding - it's up to the user. I was considering a
> > > > > cluster that has no WAL at all - relying on synchronous
> > > > > replication and a sufficient number of replicas. Everyone I
> > > > > asked about it told me I'm nuts. To my great surprise, Alexander
> > > > > Lyapunov brought exactly the same idea up for discussion.
> > > >
> > > > I didn't see an RFC on that, and it can easily become possible
> > > > once the in-memory relay is implemented - if it is implemented in
> > > > a clean way. We can just turn off the disk backoff, and it will
> > > > work from memory only.
> > >
> > > Sync replication must work from the in-memory relay only. It works
> > > as a natural failure detector: a replica which is slow or
> > > unavailable is first removed from the subscribers of the in-memory
> > > relay, and only then (possibly much, much later) is marked as down.
> > >
> > > By looking at the in-memory relay you have a clear idea which peers
> > > are available, and you can abort a transaction right away if the
> > > cluster is in a degraded state. You never wait for impossible
> > > events.
> > >
> > > If you do have to wait, and say your wait timeout is 1 second, you
> > > quickly run out of fibers in the fiber pool for any other work,
> > > because all of them will be waiting on the sync transactions they
> > > picked up from iproto to finish. The system will lose its
> > > throttling capability.
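A minimal sketch of the fail-fast idea, for illustration only - the names
(struct cluster, wait_for_acks, sync_commit) are made up for this mail and
are not actual Tarantool internals:

#include <stdbool.h>
#include <stdio.h>

struct cluster {
	int replica_count;    /* total cluster members, leader included */
	int live_subscribers; /* replicas attached to the in-memory relay */
};

/* Stub for the sketch: pretend every live subscriber acks in time. */
static bool
wait_for_acks(struct cluster *c, int need, double timeout)
{
	(void)timeout;
	return c->live_subscribers >= need;
}

/* Returns 0 on commit, -1 on abort. */
static int
sync_commit(struct cluster *c, double timeout)
{
	int quorum = c->replica_count / 2 + 1;
	/*
	 * Fail fast: when the relay already shows the quorum is
	 * unreachable, abort at once instead of parking a fiber for
	 * `timeout` seconds on an impossible event.
	 */
	if (c->live_subscribers + 1 < quorum) /* +1: the leader itself */
		return -1;
	/* The quorum looks reachable; a bounded wait is acceptable now. */
	return wait_for_acks(c, quorum - 1, timeout) ? 0 : -1;
}

int
main(void)
{
	struct cluster degraded = { .replica_count = 5, .live_subscribers = 1 };
	printf("%s\n", sync_commit(&degraded, 1.0) == 0 ? "commit" : "abort");
	return 0;
}

Here only 2 of 5 members (one subscriber plus the leader) are reachable,
so the transaction is aborted immediately and no fiber is parked on the
wait.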
> There's no need to explain it to the customer: sync replication is not
> expected to be as fast as pure in-memory. By no means. We have network
> communication, disk operations, a quorum of multiple instances - all of
> these mean it can't be as fast. No need to try to cram more than the
> network can push through, obviously.

This expected performance overhead is not a license to run out of memory
or available fibers on a node failure or network partitioning.

> The quality one buys for this price: consistency of data on multiple
> instances distributed across different locations.

The spec should demonstrate that this consistency is guaranteed: right
now it can easily be violated during a leader change, and this is left
out of the scope of the spec.

My take is that any implementation which is not close enough to a
TLA+-proven spec is not trustworthy, so I would neither claim myself nor
trust anyone else's claim that it is consistent. At best this RFC could
achieve durability, by ensuring that no transaction is committed unless
it is delivered to a majority of replicas. Consistency requires
implementing the Raft spec in full and showing that leader changes
preserve write-ahead log linearizability.

> > The other issue is that if your replicas are alive but slow/lagging
> > behind, you can't let too many undo records pile up unacknowledged
> > in the tx thread.
> >
> > The in-memory relay solves this nicely too, because it kicks out
> > replicas from memory to file mode if they are unable to keep up with
> > the speed of change.

> That is the same problem - the resources of the leader, hence a natural
> limit on throughput. I bet Tarantool faces similar limitations even
> now, albeit different ones.
>
> The in-memory relay is supposed to keep the same interface, so we
> expect to hop onto this shiny new express easily as soon as it appears.
> This will be an optimization: we're trying to implement something first
> and then speed it up.

It is pretty clear that the implementation will be different.

--
Konstantin Osipov, Moscow, Russia
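P.S. To pin down the durability bar mentioned above - commit only after a
majority acknowledges the transaction's LSN - here is a minimal sketch;
the names (struct sync_txn, txn_on_ack) are illustrative and are not the
RFC's actual API:

#include <stdbool.h>
#include <stdio.h>

struct sync_txn {
	long lsn;       /* position of the transaction in the WAL */
	int acks;       /* members that confirmed this LSN, leader included */
	int quorum;     /* replica_count / 2 + 1 */
	bool committed;
};

/* Hypothetically called from the relay when a replica confirms the LSN. */
static void
txn_on_ack(struct sync_txn *t)
{
	if (++t->acks >= t->quorum && !t->committed) {
		t->committed = true;
		printf("lsn %ld is durable on a majority: commit\n", t->lsn);
	}
}

int
main(void)
{
	/* A 5-member cluster: quorum is 3; the leader's own WAL write
	 * counts as the first ack. */
	struct sync_txn t = { .lsn = 42, .acks = 1, .quorum = 3 };
	txn_on_ack(&t); /* first replica ack: 2 of 3, still pending */
	txn_on_ack(&t); /* second replica ack: quorum reached, commit */
	return 0;
}

Note this is exactly durability, not consistency: nothing here says what
happens to unacknowledged LSNs when the leader changes.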