Date: Thu, 14 May 2020 02:45:53 +0300
From: Konstantin Osipov
Message-ID: <20200513234553.GB5698@atlas>
In-Reply-To: <10920620-b911-360f-f075-7718a40bf4af@tarantool.org>
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
To: Vladislav Shpilevoy
Cc: tarantool-patches@dev.tarantool.org

* Vladislav Shpilevoy [20/05/14 00:37]:
> > Thanks for review!
> >
> >>> |              |              |              |              |
> >>> |           [Quorum           |              |              |
> >>> |          achieved]          |              |              |
> >>> |              |              |              |              |
> >>> |        [TXN undo log        |              |              |
> >>> |         destroyed]          |              |              |
> >>> |              |              |              |              |
> >>> |              |---Confirm--->|              |              |
> >>> |              |              |              |              |
> >>
> >> What happens if writing Confirm to WAL fails? The TXN undo log
> >> record is destroyed already. Will the server panic now on WAL
> >> failure, even if it is intermittent?
> >
> > I would like to have an example of an intermittent WAL failure. Can
> > it be anything other than a problem with the disk - be it
> > space/availability/malfunction?
> >
> > For all of those it should be resolved outside the DBMS anyway. So
> > the leader should stop and report its problems to the
> > orchestrator/admins.
> >
> > I would agree that the undo log can be destroyed *after* the Confirm
> > has landed in the WAL - the same goes for the replica.
>
> Well, in fact you can't (or can you?). Because it won't help. Once you
> have tried to write 'Confirm', it means you got the quorum. So if you
> fail now, a new leader will write 'Confirm' for you when it sees the
> quorum too. So the current leader has no right to write 'Rollback'
> from this moment, from what I understand, because the transaction can
> still be confirmed by a new leader later, if you fail before
> 'Rollback' is replicated to all.
>
> However, the same problem appears if you write 'Confirm'
> *successfully*. The leader can still fail, and a newer leader will
> write 'Rollback' if it doesn't collect the quorum again. I don't know
> what to do with that, really. Probably nothing.

Maybe consult the Raft spec? The new leader is guaranteed to see the
transaction, since it has reached the majority of replicas, so it will
definitely write 'confirm' for it.

The reason I asked the question is that I want the case of intermittent
failures to be described in the spec. For example, if 'confirm' is a
cbus message, it can be rolled back when there is a cascading rollback
of the batch it is a part of.

I would like to see all these scenarios covered in the spec. If one of
them ends with a panic, I would like to understand how the external
coordinator is going to resolve the new election. Raft has answers for
all of it.
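
To make the ordering concrete, here is a minimal sketch of the commit
path as I understand it should look. Every name in it
(txn_quorum_reached, wal_write_confirm, txn_undo_log_destroy) is
hypothetical, not the actual Tarantool API:

    struct txn;

    /* Hypothetical helpers -- not the real Tarantool functions. */
    int txn_quorum_reached(struct txn *txn);
    int wal_write_confirm(struct txn *txn);
    void txn_undo_log_destroy(struct txn *txn);

    static int
    leader_commit(struct txn *txn)
    {
            /* Block until a majority of replicas ack the transaction. */
            if (txn_quorum_reached(txn) != 0)
                    return -1;
            /*
             * From this moment a new leader may write Confirm on our
             * behalf, so writing Rollback is no longer an option. An
             * intermittent WAL failure here has to be retried or
             * escalated to the coordinator, never silently turned
             * into a rollback.
             */
            if (wal_write_confirm(txn) != 0)
                    return -1;
            /* Destroy the undo log only after Confirm is in the WAL. */
            txn_undo_log_destroy(txn);
            return 0;
    }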
> >>> As soon as the leader appears in a situation where it has not
> >>> enough replicas to achieve quorum, the cluster should stop
> >>> accepting any requests - both write and read.
> >>
> >> How does *the cluster* know the state of the leader, and if it
> >> doesn't, how can it possibly implement this? Did you mean that
> >> the leader should stop accepting transactions here? But how can
> >> the leader know that it does not have enough replicas during a read
> >> transaction, if it doesn't contact any replica to serve a read?
> >
> > I expect to have a disconnection trigger assigned to all relays, so
> > that a disconnection will cause the number of replicas to decrease.
> > The quorum size is static, so we can stop at the very moment the
> > number dives below it.
>
> This is a very dubious statement. In TCP, a disconnect may be detected
> much later than it happened. So to collect a quorum on something you
> need to literally collect this quorum, with special WAL records, via
> the network, and all. A disconnect trigger does not help at all here.

Erhm, thanks.

> Talking of the whole 'read-quorum' idea, I don't like it, because it
> really makes things unbearably harder to implement, and the nodes
> become much slower and less available in the face of any problems.
>
> I think reads should always be allowed, and from any node (except
> during bootstrap, of course). After all, you have transactions for
> consistency. So as long as replication respects transaction
> boundaries, every node is in a consistent state. Maybe not all of them
> are in the same state, but every one is consistent.

In memtx, you read dirty, uncommitted data by default. That was OK for
single-node transactions, since the only chance of a transaction being
rolled back was running out of space or a disk failure, which were
extremely rare. Now you really do read dirty stuff, because a
transaction can easily be rolled back for lack of quorum or because of
a re-election. So it's a much bigger deal.

> Honestly, I can't even imagine how it is possible to implement a
> completely synchronous simultaneous cluster progression. It is
> impossible even in theory. There will always be a time period when
> some nodes are further ahead than the others, at least because of
> network delays.
>
> So either we allow reads from the master only, or we allow reads from
> everywhere, and in that case nothing will save us from the possibility
> of seeing different data on different nodes.

This is why there are many consistency models out there (just google
"consistency models in distributed systems"), and the minor details are
important. It is indeed hard to implement the strictest model
(serializable), but it is also often unnecessary, and there is a
consensus among relational databases about which anomalies are
acceptable and which are not. More specifically, I think Tarantool
synchronous replication should aim at read committed. The spec should
say so in no uncertain terms and explain how it is achieved.
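
To make the target concrete, here is a toy sketch of the difference
between today's dirty reads and read committed, using a confirmed-LSN
watermark. All the types and names below are made up for the
illustration and are not the actual memtx internals:

    #include <stdint.h>
    #include <stddef.h>

    /* One version of a tuple; hypothetical, not a memtx structure. */
    struct tuple {
            int64_t lsn;         /* LSN of the txn that wrote it. */
            struct tuple *prev;  /* Previous, older version, if any. */
    };

    /* Highest LSN for which a quorum was collected and Confirm written. */
    static int64_t confirmed_lsn;

    /* Today's behaviour: return the latest version, confirmed or not. */
    static struct tuple *
    read_dirty(struct tuple *t)
    {
            return t;
    }

    /* Read committed: skip versions that are not yet confirmed. */
    static struct tuple *
    read_committed(struct tuple *t)
    {
            while (t != NULL && t->lsn > confirmed_lsn)
                    t = t->prev;
            return t;
    }

-- 
Konstantin Osipov, Moscow, Russia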