Tarantool development patches archive
 help / color / mirror / Atom feed
From: Sergey Bronnikov <sergeyb@tarantool.org>
To: Sergey Ostanevich <sergos@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
Date: Tue, 14 Apr 2020 15:58:48 +0300	[thread overview]
Message-ID: <20200414125848.GA1249@pony.bronevichok.ru> (raw)
In-Reply-To: <20200403210836.GB18283@tarantool.org>

Hi,

see 5 comments inline

On 00:08 Sat 04 Apr , Sergey Ostanevich wrote:
> 
> * **Status**: In progress
> * **Start date**: 31-03-2020
> * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\>
> * **Issues**:

1. Just for convenience, please add https://github.com/tarantool/tarantool/issues/4842

> ## Summary
> 
> The aim of this RFC is to address the following list of problems
> formulated at MRG planning meeting:
>   - protocol backward compatibility to enable cluster upgrade w/o
>     downtime
>   - consistency of data on replica and leader
>   - switch from leader to replica without data loss
>   - up to date replicas to run read-only requests
>   - ability to switch async replicas into sync ones

2. Ability to switch async replicas into sync ones and vice-versa? Or not?

>   - guarantee of rollback on leader and sync replicas
>   - simplicity of cluster orchestration
>  
> What this RFC is not:
>  
>   - high availability (HA) solution with automated failover, roles
>     assignments an so on
>   - master-master configuration support
> 
> 
> ## Background and motivation
> 
> There are number of known implemenatation of consistent data presence in
> a cluster. They can be commonly named as "wait for LSN" technique. The
> biggest issue with this technique is the abscence of rollback gauarantees

3. typo: gauarantees -> guarantees

> at replica in case of transaction failure on one master or some of the 
> replics in the cluster. 

4. typo: replics -> replicas
> 
> To provide such capabilities a new functionality should be introduced in
> Tarantool core, with limitation mentioned before - backward compatilibity
> and ease of cluster orchestration.

5. but there is nothing mentioned before about these limitations.

> ## Detailed design
> 
> ### Quorum commit
> The main idea behind the proposal is to reuse existent machinery as much
> as possible. It will ensure the well-tested and proven functionality
> across many instances in MRG and beyond is used. The transaction rollback
> mechanism is in place and works for WAL write failure. If we substitute 
> the WAL success with a new situation which is named 'quorum' later in
> this document then no changes to the machinery is needed. The same is
> true for snapshot machinery that allows to create a copy of the database
> in memory for the whole period of snapshot file write. Adding quorum here
> also minimizes changes.
> 
> Currently replication represented by the following scheme:
> ```
> Customer        Leader          WAL(L)        Replica        WAL(R)
>    |------TXN----->|              |             |              |
>    |               |              |             |              | 
>    |         [TXN Rollback        |             |              |
>    |            created]          |             |              |
>    |               |              |             |              | 
>    |               |-----TXN----->|             |              |
>    |               |              |             |              | 
>    |               |<---WAL Ok----|             |              |
>    |               |              |             |              | 
>    |         [TXN Rollback        |             |              |
>    |           destroyed]         |             |              |
>    |               |              |             |              | 
>    |<----TXN Ok----|              |             |              |
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              | 
>    |               |              |       [TXN Rollback        |
>    |               |              |          created]          |
>    |               |              |             |              | 
>    |               |              |             |-----TXN----->|
>    |               |              |             |              | 
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              |
>    |               |              |       [TXN Rollback        |
>    |               |              |         destroyed]         |
>    |               |              |             |              | 
> ```
> 
> 
> To introduce the 'quorum' we have to receive confirmation from replicas
> to make a decision on whether the quorum is actually present. Leader
> collects necessary amount of replicas confirmation plus its own WAL
> success. This state is named 'quorum' and gives leader the right to
> complete the customers' request. So the picture will change to:
> ```
> Customer        Leader          WAL(L)        Replica        WAL(R)
>    |------TXN----->|              |             |              |
>    |               |              |             |              | 
>    |         [TXN Rollback        |             |              |
>    |            created]          |             |              |
>    |               |              |             |              | 
>    |               |-----TXN----->|             |              |
>    |               |              |             |              | 
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              | 
>    |               |              |       [TXN Rollback        |
>    |               |<---WAL Ok----|          created]          |
>    |               |              |             |              | 
>    |           [Waiting           |             |-----TXN----->|
>    |         of a quorum]         |             |              | 
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              | 
>    |               |<------Replication Ok-------|              | 
>    |               |              |             |              | 
>    |            [Quorum           |             |              |
>    |           achieved]          |             |              |
>    |               |              |             |              | 
>    |         [TXN Rollback        |             |              |
>    |           destroyed]         |             |              |
>    |               |              |             |              | 
>    |               |----Quorum--->|             |              | 
>    |               |              |             |              | 
>    |               |-----------Quorum---------->|              | 
>    |               |              |             |              | 
>    |<---TXN Ok-----|              |       [TXN Rollback        |
>    |               |              |         destroyed]         |
>    |               |              |             |              | 
>    |               |              |             |----Quorum--->| 
>    |               |              |             |              | 
> ```
> 
> The quorum should be collected as a table for a list of transactions 
> waiting for quorum. The latest transaction that collects the quorum is
> considered as complete, as well as all transactions prior to it, since
> all transactions should be applied in order. Leader writes a 'quorum'
> message to the WAL and it is delivered to Replicas.
>  
> Replica should report a positive or a negative result of the TXN to the 
> Leader via the IPROTO explicitly to allow Leader to collect the quorum
> or anti-quorum for the TXN. In case negative result for the TXN received
> from minor number of Replicas, then Leader has to send an error message
> to each Replica, which in turn has to disconnect from the replication
> the same way as it is done now in case of conflict.
>  
> In case Leader receives enough error messages to do not achieve the
> quorum it should write the 'rollback' message in the WAL. After that
> Leader and Replicas will perform the rollback for all TXN that didn't
> receive quorum.
>  
> ### Recovery and failover.
>  
> Tarantool instance during reading WAL should postpone the commit until
> the quorum is read. In case the WAL eof is achieved, the instance should
> keep rollback for all transactions that are waiting for a quorum entry
> until the role of the instance is set. In case this instance become a 
> Replica there are no additional actions needed, sine all info about 
> quorum/rollback will arrive via replication. In case this instance is 
> assigned a Leader role, it should write 'rollback' in its WAL and
> perform rollback for all transactions waiting for a quorum.
>  
> In case of a Leader failure a Replica with the biggest LSN with former
> leader's ID is elected as a new leader. The replica should record
> 'rollback' in its WAL which effectively means that all transactions
> without quorum should be rolled back. This rollback will be delivered to
> all replicas and they will perform rollbacks of all transactions waiting
> for quorum.
>  
> ### Snapshot generation.
>  
> We also can reuse current machinery of snapshot generation. Upon
> receiving a request to create a snapshot an instance should request a
> readview for the current commit operation. Although start of the
> snapshot generation should be postponed until this commit operation
> receives its quorum. In case operation is rolled back, the snapshot
> generation should be aborted and restarted using current transaction
> after rollback is complete.
>  
> After snapshot is created the WAL should start from the first operation
> that follows the commit operation snapshot is generated for. That means
> WAL will contain a quorum message that refers to a transaction that is
> not present in the WAL. Apparently, we have to allow this for the case
> quorum refers to a transaction with LSN less than the first entry in the
> WAL and only once.
>  
> ### Asynchronous replication.
>  
> Along with synchronous Replicas the cluster can contain asynchronous
> Replicas. That means async Replica doesn't reply to the Leader with
> errors since they're not contributing into quorum. Still, async
> Replicas have to follow the new WAL operation, such as keep rollback
> info until 'quorum' message is received. This is essential for the case
> of 'rollback' message appearance in the WAL. This message assumes
> Replica is able to perform all necessary rollback by itself. Cluster
> information should contain explicit notification of each Replica
> operation mode. 
>  
> ### Synchronous replication enabling.
>  
> Synchronous operation can be required for a set of spaces in the data
> scheme. That means only transactions that contain data modification for
> these spaces should require quorum. Such transactions named synchronous.
> As soon as last operation of synchronous transaction appeared in Leader's
> WAL, it will cause all following transactions - matter if they are
> synchronous or not - wait for the quorum. In case quorum is not achieved
> the 'rollback' operation will cause rollback of all transactions after
> the synchronous one. It will ensure the consistent state of the data both
> on Leader and Replicas. In case user doesn't require synchronous operation
> for any space then no changes to the WAL generation and replication will
> appear.
>  
> Cluster description should contain explicit attribute for each Replica
> to denote it participates in synchronous activities. Also the description
> should contain criterion on how many Replicas responses are needed to
> achieve the quorum.  
>  
> 
> ## Rationale and alternatives
> 
> There is an implementation of synchronous replication as part of gh-980 
> activities, still it is not in a state to get into the product. More 
> than that it intentionally breaks backward compatibility which is a
> prerequisite for this proposal.

-- 
sergeyb@

  parent reply	other threads:[~2020-04-14 12:58 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-03 21:08 Sergey Ostanevich
2020-04-07 13:02 ` Aleksandr Lyapunov
2020-04-08  9:18   ` Sergey Ostanevich
2020-04-08 14:05     ` Konstantin Osipov
2020-04-08 15:06       ` Sergey Ostanevich
2020-04-14 12:58 ` Sergey Bronnikov [this message]
2020-04-14 14:43   ` Sergey Ostanevich
2020-04-15 11:09     ` sergos
2020-04-15 14:50       ` sergos
2020-04-16  7:13         ` Aleksandr Lyapunov
2020-04-17 10:10         ` Konstantin Osipov
2020-04-17 13:45           ` Sergey Ostanevich
2020-04-20 11:20         ` Serge Petrenko
2020-04-20 23:32 ` Vladislav Shpilevoy
2020-04-21 10:49   ` Sergey Ostanevich
2020-04-21 22:17     ` Vladislav Shpilevoy
2020-04-22 16:50       ` Sergey Ostanevich
2020-04-22 20:28         ` Vladislav Shpilevoy
2020-04-23  6:58       ` Konstantin Osipov
2020-04-23  9:14         ` Konstantin Osipov
2020-04-23 11:27           ` Sergey Ostanevich
2020-04-23 11:43             ` Konstantin Osipov
2020-04-23 15:11               ` Sergey Ostanevich
2020-04-23 20:39                 ` Konstantin Osipov
2020-04-23 21:38 ` Vladislav Shpilevoy
2020-04-23 22:28   ` Konstantin Osipov
2020-04-30 14:50   ` Sergey Ostanevich
2020-05-06  8:52     ` Konstantin Osipov
2020-05-06 16:39       ` Sergey Ostanevich
2020-05-06 18:44         ` Konstantin Osipov
2020-05-12 15:55           ` Sergey Ostanevich
2020-05-12 16:42             ` Konstantin Osipov
2020-05-13 21:39             ` Vladislav Shpilevoy
2020-05-13 23:54               ` Konstantin Osipov
2020-05-14 20:38               ` Sergey Ostanevich
2020-05-20 20:59                 ` Sergey Ostanevich
2020-05-25 23:41                   ` Vladislav Shpilevoy
2020-05-27 21:17                     ` Sergey Ostanevich
2020-06-09 16:19                       ` Sergey Ostanevich
2020-06-11 15:17                         ` Vladislav Shpilevoy
2020-06-12 20:31                           ` Sergey Ostanevich
2020-05-13 21:36         ` Vladislav Shpilevoy
2020-05-13 23:45           ` Konstantin Osipov
2020-05-06 18:55     ` Konstantin Osipov
2020-05-06 19:10       ` Konstantin Osipov
2020-05-12 16:03         ` Sergey Ostanevich
2020-05-13 21:42       ` Vladislav Shpilevoy
2020-05-14  0:05         ` Konstantin Osipov
2020-05-07 23:01     ` Konstantin Osipov
2020-05-12 16:40       ` Sergey Ostanevich
2020-05-12 17:47         ` Konstantin Osipov
2020-05-13 21:34           ` Vladislav Shpilevoy
2020-05-13 23:31             ` Konstantin Osipov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200414125848.GA1249@pony.bronevichok.ru \
    --to=sergeyb@tarantool.org \
    --cc=sergos@tarantool.org \
    --cc=tarantool-patches@dev.tarantool.org \
    --subject='Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox