From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <v.shpilevoy@tarantool.org>
Received: from smtp38.i.mail.ru (smtp38.i.mail.ru [94.100.177.98])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by dev.tarantool.org (Postfix) with ESMTPS id C447C4696C3
 for <tarantool-patches@dev.tarantool.org>;
 Tue, 21 Apr 2020 02:32:26 +0300 (MSK)
References: <20200403210836.GB18283@tarantool.org>
From: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Message-ID: <ab849382-feb5-b906-84a8-402124e1c0a8@tarantool.org>
Date: Tue, 21 Apr 2020 01:32:24 +0200
MIME-Version: 1.0
In-Reply-To: <20200403210836.GB18283@tarantool.org>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
List-Id: Tarantool development patches <tarantool-patches.dev.tarantool.org>
List-Unsubscribe: <https://lists.tarantool.org/mailman/options/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=unsubscribe>
List-Archive: <https://lists.tarantool.org/pipermail/tarantool-patches/>
List-Post: <mailto:tarantool-patches@dev.tarantool.org>
List-Help: <mailto:tarantool-patches-request@dev.tarantool.org?subject=help>
List-Subscribe: <https://lists.tarantool.org/mailman/listinfo/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=subscribe>
To: Sergey Ostanevich <sergos@tarantool.org>, tarantool-patches@dev.tarantool.org

Hi!

This is the latest version I found on the branch. I give my
comments for it.

Keep in mind I didn't read other reviews before writing my own,
assuming that all questions were fixed, and by idea I should
have understood everything after reading this now.

Nonetheless see 12 comments below.

> * **Status**: In progress
> * **Start date**: 31-03-2020
> * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\>
> * **Issues**: https://github.com/tarantool/tarantool/issues/4842
> 
> ## Summary
> 
> The aim of this RFC is to address the following list of problems
> formulated at MRG planning meeting:
>   - protocol backward compatibility to enable cluster upgrade w/o
>     downtime
>   - consistency of data on replica and leader
>   - switch from leader to replica without data loss
>   - up to date replicas to run read-only requests
>   - ability to switch async replicas into sync ones and vice versa
>   - guarantee of rollback on leader and sync replicas
>   - simplicity of cluster orchestration
> 
> What this RFC is not:
> 
>   - high availability (HA) solution with automated failover, roles
>     assignments an so on

1. So no leader election? That essentially makes single failure point
for RW requests, is it correct?

On the other hand I see section 'Recovery and failover.' below. And
it seems to be automated, with selecting a replica with the biggest
LSN. Where is the truth?

>   - master-master configuration support
> 
> ## Background and motivation
> 
> There are number of known implementation of consistent data presence in
> a cluster. They can be commonly named as "wait for LSN" technique. The
> biggest issue with this technique is the absence of rollback guarantees
> at replica in case of transaction failure on one master or some of the
> replicas in the cluster.
> 
> To provide such capabilities a new functionality should be introduced in
> Tarantool core, with requirements mentioned before - backward
> compatibility and ease of cluster orchestration.
> 
> ## Detailed design
> 
> ### Quorum commit
> 
> The main idea behind the proposal is to reuse existent machinery as much
> as possible. It will ensure the well-tested and proven functionality
> across many instances in MRG and beyond is used. The transaction rollback
> mechanism is in place and works for WAL write failure. If we substitute
> the WAL success with a new situation which is named 'quorum' later in
> this document then no changes to the machinery is needed.

2. The problem here is that you create dependency on WAL. According to
your words, replication is inside WAL, and if WAL gave ok, then all is
replicated and applied. But that makes current code structure even worse
than it is. Now WAL, GC, and replication code is spaghetti, basically.
All depends on all. I was rather thinking, that we should fix that first.
Not aggravate.

WAL should provide API for writing to disk. Replication should not bother
about WAL. GC should not bother about replication. All should be independent,
and linked in one place by some kind of a manager, which would just use their
APIs. I believe Cyrill G. would agree with me here, I remember him
complaining about replication-wal-gc code inter-dependencies too. Please,
request his review on this, if you didn't yet.

> The same is
> true for snapshot machinery that allows to create a copy of the database
> in memory for the whole period of snapshot file write. Adding quorum here
> also minimizes changes.
> Currently replication represented by the following scheme:
> ```
> Customer        Leader          WAL(L)        Replica        WAL(R)

3. Are you saying 'leader' === 'master'?

>    |------TXN----->|              |             |              |
>    |               |              |             |              |
>    |         [TXN Rollback        |             |              |
>    |            created]          |             |              |

4. What is 'txn rollback', and why is it created before even a transaction
is started? At least, rollback is a verb. Maybe you meant 'undo log'?

>    |               |              |             |              |
>    |               |-----TXN----->|             |              |
>    |               |              |             |              |
>    |               |<---WAL Ok----|             |              |
>    |               |              |             |              |
>    |         [TXN Rollback        |             |              |
>    |           destroyed]         |             |              |
>    |               |              |             |              |
>    |<----TXN Ok----|              |             |              |
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              |
>    |               |              |       [TXN Rollback        |
>    |               |              |          created]          |
>    |               |              |             |              |
>    |               |              |             |-----TXN----->|
>    |               |              |             |              |
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              |
>    |               |              |       [TXN Rollback        |
>    |               |              |         destroyed]         |
>    |               |              |             |              |
> ```
> 
> To introduce the 'quorum' we have to receive confirmation from replicas
> to make a decision on whether the quorum is actually present. Leader
> collects necessary amount of replicas confirmation plus its own WAL

5. Please, define 'necessary amount'?

> success. This state is named 'quorum' and gives leader the right to
> complete the customers' request. So the picture will change to:
> ```
> Customer        Leader          WAL(L)        Replica        WAL(R)
>    |------TXN----->|              |             |              |
>    |               |              |             |              |
>    |         [TXN Rollback        |             |              |
>    |            created]          |             |              |
>    |               |              |             |              |
>    |               |-----TXN----->|             |              |

6. Are we going to replicate transaction after user writes commit()?
Or will we replicate it while it is in progress? So called 'presumed
commit'. I remember I read some papers explaining how it significantly
speeds up synchronous transactions. Probably that was a paper about
2-phase commit, can't remember already. But the idea is still applicable
for the replication too.

>    |               |              |             |              |
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              |
>    |               |              |       [TXN Rollback        |
>    |               |<---WAL Ok----|          created]          |
>    |               |              |             |              |
>    |           [Waiting           |             |-----TXN----->|
>    |         of a quorum]         |             |              |
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              |
>    |               |<------Replication Ok-------|              |
>    |               |              |             |              |
>    |            [Quorum           |             |              |
>    |           achieved]          |             |              |
>    |               |              |             |              |
>    |         [TXN Rollback        |             |              |
>    |           destroyed]         |             |              |
>    |               |              |             |              |
>    |               |---Confirm--->|             |              |
>    |               |              |             |              |
>    |               |----------Confirm---------->|              |
>    |               |              |             |              |
>    |<---TXN Ok-----|              |       [TXN Rollback        |
>    |               |              |         destroyed]         |
>    |               |              |             |              |
>    |               |              |             |---Confirm--->|
>    |               |              |             |              |
> ```
> 
> The quorum should be collected as a table for a list of transactions
> waiting for quorum. The latest transaction that collects the quorum is
> considered as complete, as well as all transactions prior to it, since
> all transactions should be applied in order. Leader writes a 'confirm'
> message to the WAL that refers to the transaction's LSN and it has its
> own LSN. This confirm message is delivered to all replicas through the
> existing replication mechanism.
> 
> Replica should report a positive or a negative result of the TXN to the
> leader via the IPROTO explicitly to allow leader to collect the quorum
> or anti-quorum for the TXN. In case a negative result for the TXN is
> received from minor number of replicas, then leader has to send an error
> message to the replicas, which in turn have to disconnect from the
> replication the same way as it is done now in case of conflict.
> In case leader receives enough error messages to do not achieve the
> quorum it should write the 'rollback' message in the WAL. After that
> leader and replicas will perform the rollback for all TXN that didn't
> receive quorum.
> 
> ### Recovery and failover.
> 
> Tarantool instance during reading WAL should postpone the commit until
> the 'confirm' is read. In case the WAL eof is achieved, the instance
> should keep rollback for all transactions that are waiting for a confirm
> entry until the role of the instance is set. In case this instance
> become a replica there are no additional actions needed, since all info
> about quorum/rollback will arrive via replication. In case this instance
> is assigned a leader role, it should write 'rollback' in its WAL and
> perform rollback for all transactions waiting for a quorum.
> 
> In case of a leader failure a replica with the biggest LSN with former
> leader's ID is elected as a new leader. The replica should record
> 'rollback' in its WAL which effectively means that all transactions
> without quorum should be rolled back. This rollback will be delivered to
> all replicas and they will perform rollbacks of all transactions waiting
> for quorum.

7. Please, elaborate leader election. It is not as trivial as just 'elect'.
What if the replica with the biggest LSN is temporary not available, but
it knows that it has the biggest LSN? Will it become a leader without
asking other nodes? What will do the other nodes? Will they wait for the
new leader node to become available? Do they have a timeout on that?

Basically, it would be nice to see the split-brain problem description here,
and its solution for us.

How leader failure is detected? Do you rely on our heartbeat messages?
Are you going to adapt SWIM for this?

Raft has a dedicated subsystem for election, it is not that simple. It
involves voting, randomized algorithms. Am I missing something obvious in
this RFC, which makes the leader election much simpler specifically for
Tarantool?

> An interface to force apply pending transactions by issuing a confirm
> entry for them have to be introduced for manual recovery.
> 
> ### Snapshot generation.
> 
> We also can reuse current machinery of snapshot generation. Upon
> receiving a request to create a snapshot an instance should request a
> readview for the current commit operation. Although start of the
> snapshot generation should be postponed until this commit operation
> receives its confirmation. In case operation is rolled back, the snapshot
> generation should be aborted and restarted using current transaction
> after rollback is complete.

8. This section highly depends on transaction manager for memtx. If you
have a transaction manager, you always have a ready-to-use read-view
of the latest committed data. At least this is my understanding.

After all, the manager should provide transaction isolation. And it means,
that all non-committed transactions are not visible. And for that we need
a read-view. Therefore, it could be used to make a snapshot.

> After snapshot is created the WAL should start from the first operation
> that follows the commit operation snapshot is generated for. That means
> WAL will contain 'confirm' messages that refer to transactions that are
> not present in the WAL. Apparently, we have to allow this for the case
> 'confirm' refers to a transaction with LSN less than the first entry in
> the WAL.

9. I couldn't understand that. Why confirm is in WAL for data stored in
the snap? I thought you said above, that snapshot should be done for all
confirmed data. Besides, having confirm out of snap means the snap is
not self-sufficient anymore.

> In case master appears unavailable a replica still have to be able to
> create a snapshot. Replica can perform rollback for all transactions that
> are not confirmed and claim its LSN as the latest confirmed txn. Then it
> can create a snapshot in a regular way and start with blank xlog file.
> All rolled back transactions will appear through the regular replication
> in case master reappears later on.

10. You should be able to make a snapshot without rollback. Read-views are
available anyway. At least it is so in Vinyl, from what I remember. And this
is going to be similar in memtx.

> 
> ### Asynchronous replication.
> 
> Along with synchronous replicas the cluster can contain asynchronous
> replicas. That means async replica doesn't reply to the leader with
> errors since they're not contributing into quorum. Still, async
> replicas have to follow the new WAL operation, such as keep rollback
> info until 'quorum' message is received. This is essential for the case
> of 'rollback' message appearance in the WAL. This message assumes
> replica is able to perform all necessary rollback by itself. Cluster
> information should contain explicit notification of each replica
> operation mode.
> 
> ### Synchronous replication enabling.
> 
> Synchronous operation can be required for a set of spaces in the data
> scheme. That means only transactions that contain data modification for
> these spaces should require quorum. Such transactions named synchronous.
> As soon as last operation of synchronous transaction appeared in leader's
> WAL, it will cause all following transactions - matter if they are
> synchronous or not - wait for the quorum. In case quorum is not achieved
> the 'rollback' operation will cause rollback of all transactions after
> the synchronous one. It will ensure the consistent state of the data both
> on leader and replicas. In case user doesn't require synchronous operation
> for any space then no changes to the WAL generation and replication will
> appear.
> 
> Cluster description should contain explicit attribute for each replica
> to denote it participates in synchronous activities. Also the description
> should contain criterion on how many replicas responses are needed to
> achieve the quorum.

11. Aha, I see 'necessary amount' from above is a manually set value. Ok.

> 
> ## Rationale and alternatives
> 
> There is an implementation of synchronous replication as part of gh-980
> activities, still it is not in a state to get into the product. More
> than that it intentionally breaks backward compatibility which is a
> prerequisite for this proposal.

12. How are we going to deal with fsync()? Will it be forcefully enabled
on sync replicas and the leader?