Tarantool development patches archive
 help / color / mirror / Atom feed
From: Sergey Ostanevich <sergos@tarantool.org>
To: tarantool-patches@dev.tarantool.org
Subject: [Tarantool-patches]  [RFC] Quorum-based synchronous replication
Date: Sat, 4 Apr 2020 00:08:36 +0300	[thread overview]
Message-ID: <20200403210836.GB18283@tarantool.org> (raw)


* **Status**: In progress
* **Start date**: 31-03-2020
* **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\>
* **Issues**:

## Summary

The aim of this RFC is to address the following list of problems
formulated at MRG planning meeting:
  - protocol backward compatibility to enable cluster upgrade w/o
    downtime
  - consistency of data on replica and leader
  - switch from leader to replica without data loss
  - up to date replicas to run read-only requests
  - ability to switch async replicas into sync ones
  - guarantee of rollback on leader and sync replicas
  - simplicity of cluster orchestration
 
What this RFC is not:
 
  - high availability (HA) solution with automated failover, roles
    assignments an so on
  - master-master configuration support


## Background and motivation

There are number of known implemenatation of consistent data presence in
a cluster. They can be commonly named as "wait for LSN" technique. The
biggest issue with this technique is the abscence of rollback gauarantees
at replica in case of transaction failure on one master or some of the 
replics in the cluster. 

To provide such capabilities a new functionality should be introduced in
Tarantool core, with limitation mentioned before - backward compatilibity
and ease of cluster orchestration.

## Detailed design

### Quorum commit
The main idea behind the proposal is to reuse existent machinery as much
as possible. It will ensure the well-tested and proven functionality
across many instances in MRG and beyond is used. The transaction rollback
mechanism is in place and works for WAL write failure. If we substitute 
the WAL success with a new situation which is named 'quorum' later in
this document then no changes to the machinery is needed. The same is
true for snapshot machinery that allows to create a copy of the database
in memory for the whole period of snapshot file write. Adding quorum here
also minimizes changes.

Currently replication represented by the following scheme:
```
Customer        Leader          WAL(L)        Replica        WAL(R)
   |------TXN----->|              |             |              |
   |               |              |             |              | 
   |         [TXN Rollback        |             |              |
   |            created]          |             |              |
   |               |              |             |              | 
   |               |-----TXN----->|             |              |
   |               |              |             |              | 
   |               |<---WAL Ok----|             |              |
   |               |              |             |              | 
   |         [TXN Rollback        |             |              |
   |           destroyed]         |             |              |
   |               |              |             |              | 
   |<----TXN Ok----|              |             |              |
   |               |-------Replicate TXN------->|              |
   |               |              |             |              | 
   |               |              |       [TXN Rollback        |
   |               |              |          created]          |
   |               |              |             |              | 
   |               |              |             |-----TXN----->|
   |               |              |             |              | 
   |               |              |             |<---WAL Ok----|
   |               |              |             |              |
   |               |              |       [TXN Rollback        |
   |               |              |         destroyed]         |
   |               |              |             |              | 
```


To introduce the 'quorum' we have to receive confirmation from replicas
to make a decision on whether the quorum is actually present. Leader
collects necessary amount of replicas confirmation plus its own WAL
success. This state is named 'quorum' and gives leader the right to
complete the customers' request. So the picture will change to:
```
Customer        Leader          WAL(L)        Replica        WAL(R)
   |------TXN----->|              |             |              |
   |               |              |             |              | 
   |         [TXN Rollback        |             |              |
   |            created]          |             |              |
   |               |              |             |              | 
   |               |-----TXN----->|             |              |
   |               |              |             |              | 
   |               |-------Replicate TXN------->|              |
   |               |              |             |              | 
   |               |              |       [TXN Rollback        |
   |               |<---WAL Ok----|          created]          |
   |               |              |             |              | 
   |           [Waiting           |             |-----TXN----->|
   |         of a quorum]         |             |              | 
   |               |              |             |<---WAL Ok----|
   |               |              |             |              | 
   |               |<------Replication Ok-------|              | 
   |               |              |             |              | 
   |            [Quorum           |             |              |
   |           achieved]          |             |              |
   |               |              |             |              | 
   |         [TXN Rollback        |             |              |
   |           destroyed]         |             |              |
   |               |              |             |              | 
   |               |----Quorum--->|             |              | 
   |               |              |             |              | 
   |               |-----------Quorum---------->|              | 
   |               |              |             |              | 
   |<---TXN Ok-----|              |       [TXN Rollback        |
   |               |              |         destroyed]         |
   |               |              |             |              | 
   |               |              |             |----Quorum--->| 
   |               |              |             |              | 
```

The quorum should be collected as a table for a list of transactions 
waiting for quorum. The latest transaction that collects the quorum is
considered as complete, as well as all transactions prior to it, since
all transactions should be applied in order. Leader writes a 'quorum'
message to the WAL and it is delivered to Replicas.
 
Replica should report a positive or a negative result of the TXN to the 
Leader via the IPROTO explicitly to allow Leader to collect the quorum
or anti-quorum for the TXN. In case negative result for the TXN received
from minor number of Replicas, then Leader has to send an error message
to each Replica, which in turn has to disconnect from the replication
the same way as it is done now in case of conflict.
 
In case Leader receives enough error messages to do not achieve the
quorum it should write the 'rollback' message in the WAL. After that
Leader and Replicas will perform the rollback for all TXN that didn't
receive quorum.
 
### Recovery and failover.
 
Tarantool instance during reading WAL should postpone the commit until
the quorum is read. In case the WAL eof is achieved, the instance should
keep rollback for all transactions that are waiting for a quorum entry
until the role of the instance is set. In case this instance become a 
Replica there are no additional actions needed, sine all info about 
quorum/rollback will arrive via replication. In case this instance is 
assigned a Leader role, it should write 'rollback' in its WAL and
perform rollback for all transactions waiting for a quorum.
 
In case of a Leader failure a Replica with the biggest LSN with former
leader's ID is elected as a new leader. The replica should record
'rollback' in its WAL which effectively means that all transactions
without quorum should be rolled back. This rollback will be delivered to
all replicas and they will perform rollbacks of all transactions waiting
for quorum.
 
### Snapshot generation.
 
We also can reuse current machinery of snapshot generation. Upon
receiving a request to create a snapshot an instance should request a
readview for the current commit operation. Although start of the
snapshot generation should be postponed until this commit operation
receives its quorum. In case operation is rolled back, the snapshot
generation should be aborted and restarted using current transaction
after rollback is complete.
 
After snapshot is created the WAL should start from the first operation
that follows the commit operation snapshot is generated for. That means
WAL will contain a quorum message that refers to a transaction that is
not present in the WAL. Apparently, we have to allow this for the case
quorum refers to a transaction with LSN less than the first entry in the
WAL and only once.
 
### Asynchronous replication.
 
Along with synchronous Replicas the cluster can contain asynchronous
Replicas. That means async Replica doesn't reply to the Leader with
errors since they're not contributing into quorum. Still, async
Replicas have to follow the new WAL operation, such as keep rollback
info until 'quorum' message is received. This is essential for the case
of 'rollback' message appearance in the WAL. This message assumes
Replica is able to perform all necessary rollback by itself. Cluster
information should contain explicit notification of each Replica
operation mode. 
 
### Synchronous replication enabling.
 
Synchronous operation can be required for a set of spaces in the data
scheme. That means only transactions that contain data modification for
these spaces should require quorum. Such transactions named synchronous.
As soon as last operation of synchronous transaction appeared in Leader's
WAL, it will cause all following transactions - matter if they are
synchronous or not - wait for the quorum. In case quorum is not achieved
the 'rollback' operation will cause rollback of all transactions after
the synchronous one. It will ensure the consistent state of the data both
on Leader and Replicas. In case user doesn't require synchronous operation
for any space then no changes to the WAL generation and replication will
appear.
 
Cluster description should contain explicit attribute for each Replica
to denote it participates in synchronous activities. Also the description
should contain criterion on how many Replicas responses are needed to
achieve the quorum.  
 

## Rationale and alternatives

There is an implementation of synchronous replication as part of gh-980 
activities, still it is not in a state to get into the product. More 
than that it intentionally breaks backward compatibility which is a
prerequisite for this proposal.

             reply	other threads:[~2020-04-03 21:08 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-03 21:08 Sergey Ostanevich [this message]
2020-04-07 13:02 ` Aleksandr Lyapunov
2020-04-08  9:18   ` Sergey Ostanevich
2020-04-08 14:05     ` Konstantin Osipov
2020-04-08 15:06       ` Sergey Ostanevich
2020-04-14 12:58 ` Sergey Bronnikov
2020-04-14 14:43   ` Sergey Ostanevich
2020-04-15 11:09     ` sergos
2020-04-15 14:50       ` sergos
2020-04-16  7:13         ` Aleksandr Lyapunov
2020-04-17 10:10         ` Konstantin Osipov
2020-04-17 13:45           ` Sergey Ostanevich
2020-04-20 11:20         ` Serge Petrenko
2020-04-20 23:32 ` Vladislav Shpilevoy
2020-04-21 10:49   ` Sergey Ostanevich
2020-04-21 22:17     ` Vladislav Shpilevoy
2020-04-22 16:50       ` Sergey Ostanevich
2020-04-22 20:28         ` Vladislav Shpilevoy
2020-04-23  6:58       ` Konstantin Osipov
2020-04-23  9:14         ` Konstantin Osipov
2020-04-23 11:27           ` Sergey Ostanevich
2020-04-23 11:43             ` Konstantin Osipov
2020-04-23 15:11               ` Sergey Ostanevich
2020-04-23 20:39                 ` Konstantin Osipov
2020-04-23 21:38 ` Vladislav Shpilevoy
2020-04-23 22:28   ` Konstantin Osipov
2020-04-30 14:50   ` Sergey Ostanevich
2020-05-06  8:52     ` Konstantin Osipov
2020-05-06 16:39       ` Sergey Ostanevich
2020-05-06 18:44         ` Konstantin Osipov
2020-05-12 15:55           ` Sergey Ostanevich
2020-05-12 16:42             ` Konstantin Osipov
2020-05-13 21:39             ` Vladislav Shpilevoy
2020-05-13 23:54               ` Konstantin Osipov
2020-05-14 20:38               ` Sergey Ostanevich
2020-05-20 20:59                 ` Sergey Ostanevich
2020-05-25 23:41                   ` Vladislav Shpilevoy
2020-05-27 21:17                     ` Sergey Ostanevich
2020-06-09 16:19                       ` Sergey Ostanevich
2020-06-11 15:17                         ` Vladislav Shpilevoy
2020-06-12 20:31                           ` Sergey Ostanevich
2020-05-13 21:36         ` Vladislav Shpilevoy
2020-05-13 23:45           ` Konstantin Osipov
2020-05-06 18:55     ` Konstantin Osipov
2020-05-06 19:10       ` Konstantin Osipov
2020-05-12 16:03         ` Sergey Ostanevich
2020-05-13 21:42       ` Vladislav Shpilevoy
2020-05-14  0:05         ` Konstantin Osipov
2020-05-07 23:01     ` Konstantin Osipov
2020-05-12 16:40       ` Sergey Ostanevich
2020-05-12 17:47         ` Konstantin Osipov
2020-05-13 21:34           ` Vladislav Shpilevoy
2020-05-13 23:31             ` Konstantin Osipov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200403210836.GB18283@tarantool.org \
    --to=sergos@tarantool.org \
    --cc=tarantool-patches@dev.tarantool.org \
    --subject='Re: [Tarantool-patches]  [RFC] Quorum-based synchronous replication' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox