From: Vladislav Shpilevoy
Date: Thu, 23 Apr 2020 23:38:34 +0200
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
In-Reply-To: <20200403210836.GB18283@tarantool.org>
To: Sergey Ostanevich, tarantool-patches@dev.tarantool.org, Timur Safin, Mons Anderson

Hi!

Here is a short summary of our late-night discussion and the questions it
brought up, while I was trying to design a draft plan of an implementation.
Since the RFC is quite far from the code, I needed a more 'pedestrian' and
detailed plan.

The question is about the 'confirm' message and quorum collection. Here is
the schema presented in the RFC:

> Customer         Leader         WAL(L)       Replica         WAL(R)
>    |------TXN----->|              |             |              |
>    |               |              |             |              |
>    |         [TXN undo log        |             |              |
>    |            created]          |             |              |
>    |               |              |             |              |
>    |               |-----TXN----->|             |              |
>    |               |              |             |              |
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              |
>    |               |              |       [TXN undo log        |
>    |               |<---WAL Ok----|          created]          |
>    |               |              |             |              |
>    |           [Waiting           |             |-----TXN----->|
>    |         of a quorum]         |             |              |
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              |
>    |               |<------Replication Ok-------|              |
>    |               |              |             |              |
>    |            [Quorum           |             |              |
>    |           achieved]          |             |              |
>    |               |              |             |              |
>    |         [TXN undo log        |             |              |
>    |           destroyed]         |             |              |
>    |               |              |             |              |
>    |               |---Confirm--->|             |              |
>    |               |              |             |              |
>    |               |----------Confirm---------->|              |
>    |               |              |             |              |
>    |<---TXN Ok-----|              |       [TXN undo log        |
>    |               |              |         destroyed]         |
>    |               |              |             |              |
>    |               |              |             |---Confirm--->|
>    |               |              |             |              |

It says that once the quorum is collected and 'confirm' is written to the
leader's local WAL, the transaction is considered committed and is reported
to the client as successful. On the other hand, it is said that in case of a
leader change the new leader will roll back all non-confirmed transactions.
That leads to the following bug:

Assume we have 4 instances: i1, i2, i3, i4. The leader is i1. It writes a
transaction with LSN1. LSN1 is sent to the other nodes, they apply it fine
and send acks to the leader. The leader sees that i2-i4 have all applied the
transaction (propagated their LSNs to LSN1). It writes 'confirm' to its local
WAL and reports success to the client; the client's request is over, the
result is already passed on to some remote node, etc. The transaction is
officially synchronously committed.

Then the leader's machine dies - the disk is dead. The confirm was not sent
to any of the other nodes. For example, it started having problems with its
network connection to the replicas shortly before the death. Or it simply
didn't manage to hand the confirm out.

From now on, if any of the other nodes i2-i4 becomes a leader, it will roll
back the officially confirmed transaction, even though it has it, and so do
all the other nodes.

That basically means this sync replication gives exactly the same guarantees
as async replication - 'confirm' on the leader says nothing about the
replicas except that they *are able to apply the transaction*, but they still
may not apply it.

Am I missing something?
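To make the scenario concrete, here is a minimal toy model in Python of the
flow as the RFC draws it. It is purely illustrative - the class and function
names are mine, not Tarantool's - but it shows the gap: the leader answers
the client as soon as the quorum of WAL acks is in and its *local* 'confirm'
is written, so if the machine dies before that record leaves the box, every
surviving node is, by the rules above, obliged to roll the transaction back.

# Toy model of the commit flow from the RFC (not Tarantool code,
# all names are made up for illustration).
class Replica:
    def __init__(self, name):
        self.name = name
        self.wal = []                      # records: ('txn', lsn) or ('confirm', lsn)

    def apply_txn(self, lsn):
        self.wal.append(('txn', lsn))      # undo log created, WAL ok
        return True                        # ack sent back to the leader

    def become_leader(self):
        confirmed = {lsn for kind, lsn in self.wal if kind == 'confirm'}
        # Per the RFC: everything the new leader has no 'confirm' for is rolled back.
        return [lsn for kind, lsn in self.wal
                if kind == 'txn' and lsn not in confirmed]

def leader_commit(leader_wal, replicas, lsn, quorum):
    leader_wal.append(('txn', lsn))
    acks = 1 + sum(r.apply_txn(lsn) for r in replicas)   # the leader counts itself
    assert acks >= quorum
    leader_wal.append(('confirm', lsn))    # 'confirm' exists only in the local WAL so far
    return 'TXN Ok'                        # reported to the client right here

replicas = [Replica('i2'), Replica('i3'), Replica('i4')]
i1_wal = []
print(leader_commit(i1_wal, replicas, lsn=1, quorum=3))   # client sees success
# i1's disk dies before the 'confirm' record is replicated anywhere.
print(replicas[0].become_leader())         # [1] - the "committed" txn is rolled back

The sketch only restates the argument: nothing in the protocol forces the
'confirm' record off the leader's machine before 'TXN Ok' is sent.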
Another issue is with failure detection.

Let's assume that we wait for 'confirm' to be propagated to a quorum of
replicas too. Assume some replicas responded with an error: they first said
they could apply the transaction and saved it into their WALs, and then they
could not apply the confirm. That can happen for two reasons: the replica has
problems with its WAL, or the replica became unreachable from the master.

WAL-problematic replicas can be disconnected forcefully, since they are
clearly unable to work properly anymore. But what to do with disconnected
replicas? 'Confirm' can't wait for them forever - we will run out of fibers
if we have even just hundreds of RPS of sync transactions and wait for, let's
say, a few minutes (rough arithmetic below). On the other hand, we can't roll
them back, because 'confirm' has already been written to the local WAL.

Note for those who are concerned: this has nothing to do with the in-memory
relay. It has the same problems, which are in the protocol, not in the
implementation.
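Just to put a number on the "run out of fibers" point - a back-of-the-envelope
estimate, with the "hundreds of RPS" and "a few minutes" above as illustrative
figures, nothing measured:

# Fibers stuck waiting for 'confirm' acks from an unreachable replica.
sync_rps = 300               # synchronous transactions per second (assumed)
wait_seconds = 3 * 60        # how long we keep waiting for the ack (assumed)
blocked_fibers = sync_rps * wait_seconds
print(blocked_fibers)        # 54000 fibers sitting in the wait at any moment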