From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <sergos@tarantool.org>
Received: from smtp3.mail.ru (smtp3.mail.ru [94.100.179.58])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by dev.tarantool.org (Postfix) with ESMTPS id 85A2C4696C3
 for <tarantool-patches@dev.tarantool.org>;
 Thu, 23 Apr 2020 14:27:03 +0300 (MSK)
Date: Thu, 23 Apr 2020 14:27:02 +0300
From: Sergey Ostanevich <sergos@tarantool.org>
Message-ID: <20200423112702.GC112@tarantool.org>
References: <20200403210836.GB18283@tarantool.org>
 <ab849382-feb5-b906-84a8-402124e1c0a8@tarantool.org>
 <20200421104918.GA112@tarantool.org>
 <dd4d703e-6918-ccd9-ac5e-76fb54fff0f9@tarantool.org>
 <20200423065809.GA4528@atlas> <20200423091436.GA14576@atlas>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20200423091436.GA14576@atlas>
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
List-Id: Tarantool development patches <tarantool-patches.dev.tarantool.org>
List-Unsubscribe: <https://lists.tarantool.org/mailman/options/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=unsubscribe>
List-Archive: <https://lists.tarantool.org/pipermail/tarantool-patches/>
List-Post: <mailto:tarantool-patches@dev.tarantool.org>
List-Help: <mailto:tarantool-patches-request@dev.tarantool.org?subject=help>
List-Subscribe: <https://lists.tarantool.org/mailman/listinfo/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=subscribe>
To: Konstantin Osipov <kostja.osipov@gmail.com>, Vladislav Shpilevoy <v.shpilevoy@tarantool.org>, tarantool-patches@dev.tarantool.org

Hi!

Thanks for the review!

On 23 Apr 12:14, Konstantin Osipov wrote:
> * Konstantin Osipov <kostja.osipov@gmail.com> [20/04/23 09:58]:
> > > > To my understanding - it's up to user. I was considering a cluster that
> > > > has no WAL at all - relying on synchro replication and sufficient number
> > > > of replicas. Everyone who I asked about it told me I'm nuts. To my great
> > > > surprise Alexander Lyapunov brought exactly the same idea to discuss. 
> > > 
> > > I didn't see an RFC on that, and this can become easily possible, when
> > > in-memory relay is implemented. If it is implemented in a clean way. We
> > > just can turn off the disk backoff, and it will work from memory-only.
> > 
> > Sync replication must work from in-memory relay only. It works as
> > a natural failure detector: a replica which is slow or unavailable
> > is first removed from the subscribers of in-memory relay, and only 
> > then (possibly much much later) is marked as down.
> > 
> > By looking at the in-memory relay you have a clear idea what peers
> > are available and can abort a transaction if a cluster is in the
> > downgraded state right away. You never wait for impossible events. 
> > 
> > If you do have to wait, and say your wait timeout is 1 second, you
> > quickly run out of any fibers in the fiber pool for any work,
> > because all of them will be waiting on the sync transactions they
> > picked up from iproto to finish. The system will lose its
> > throttling capability. 
> 
There's no need to explain it to the customer: sync replication is not
expected to be as fast as pure in-memory operation. By no means. We have
network communication, disk operations and a quorum of multiple entities -
none of these can be as fast. No need to try to cram in more than the
network can push through, obviously.

What one buys for this price is consistency of data across multiple
instances distributed over different locations.

> The other issue is that if your replicas are alive but
> slow/lagging behind, you can't let too many undo records to
> pile up unacknowledged in tx thread.
> The in-memory relay solves this nicely too, because it kicks out
> replicas from memory to file mode if they are unable to keep up
> with the speed of change.
> 
That is the same problem - the leader's resources, and hence a natural
limit on throughput. I bet Tarantool already faces limitations of this
kind today, just different ones.
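
To illustrate what a leader-side limit could look like, here is a minimal
sketch in Python. It is not Tarantool code: the class, the max_pending cap
and the cumulative-ack rule (an ack for an LSN confirms everything up to
it) are all assumptions made for the example.

```python
from collections import OrderedDict


class PendingLimitExceeded(Exception):
    """Raised when the leader refuses a new sync transaction."""


class SyncLeader:
    """Toy model: tracks sync transactions awaiting a quorum of acks."""

    def __init__(self, quorum, max_pending):
        self.quorum = quorum            # acks needed, leader's own write included
        self.max_pending = max_pending  # hypothetical cap on unacknowledged txns
        self.pending = OrderedDict()    # lsn -> set of instances that acked
        self.next_lsn = 1

    def submit(self):
        # Throttle at the entry point instead of piling up undo records.
        if len(self.pending) >= self.max_pending:
            raise PendingLimitExceeded("too many unacknowledged transactions")
        lsn = self.next_lsn
        self.next_lsn += 1
        self.pending[lsn] = {"leader"}
        return lsn

    def ack(self, replica, lsn):
        # Assumption: an ack for lsn confirms every pending lsn up to it.
        committed = []
        for pending_lsn in list(self.pending):
            if pending_lsn > lsn:
                break
            self.pending[pending_lsn].add(replica)
            if len(self.pending[pending_lsn]) >= self.quorum:
                del self.pending[pending_lsn]
                committed.append(pending_lsn)
        return committed
```

With quorum=2 and max_pending=2, a third submit() is rejected until some
replica acks, so the amount of uncommitted state on the leader stays
bounded no matter how slow the replicas are.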

The in-memory relay is supposed to keep the same interface, so we expect
to hop easily onto this new shiny express as soon as it appears. It will
be an optimization: we're trying to implement something first and then
speed it up.
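
A rough sketch of the "same interface" idea, in Python and with entirely
hypothetical names (Relay, FileRelay, MemoryRelay and sync_commit are not
Tarantool APIs): the quorum logic is written against the interface only,
so swapping the relay implementation later should not touch it.

```python
from abc import ABC, abstractmethod


class Relay(ABC):
    """Hypothetical interface: ship a row to replicas, return who acked."""

    @abstractmethod
    def send(self, row):
        ...


class FileRelay(Relay):
    """Stand-in for a relay that feeds replicas from the WAL file."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.wal = []

    def send(self, row):
        self.wal.append(row)        # the row goes to "disk" first
        return set(self.replicas)   # toy model: every replica acks


class MemoryRelay(Relay):
    """Stand-in for a future in-memory relay; slow replicas are dropped."""

    def __init__(self, replicas, slow=()):
        self.subscribers = set(replicas) - set(slow)

    def send(self, row):
        return set(self.subscribers)


def sync_commit(relay, row, quorum):
    # The leader itself counts as one ack.
    acks = relay.send(row) | {"leader"}
    return len(acks) >= quorum
```

Here sync_commit() returns the same kind of answer for either relay; with
the in-memory variant a transaction simply fails the quorum check at once
when too many subscribers have been dropped.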