Tarantool development patches archive
* [Tarantool-patches]  [RFC] Quorum-based synchronous replication
@ 2020-04-03 21:08 Sergey Ostanevich
  2020-04-07 13:02 ` Aleksandr Lyapunov
                   ` (3 more replies)
  0 siblings, 4 replies; 53+ messages in thread
From: Sergey Ostanevich @ 2020-04-03 21:08 UTC (permalink / raw)
  To: tarantool-patches


* **Status**: In progress
* **Start date**: 31-03-2020
* **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\>
* **Issues**:

## Summary

The aim of this RFC is to address the following list of problems
formulated at MRG planning meeting:
  - protocol backward compatibility to enable cluster upgrade w/o
    downtime
  - consistency of data on replica and leader
  - switch from leader to replica without data loss
  - up to date replicas to run read-only requests
  - ability to switch async replicas into sync ones
  - guarantee of rollback on leader and sync replicas
  - simplicity of cluster orchestration
 
What this RFC is not:
 
  - high availability (HA) solution with automated failover, role
    assignments and so on
  - master-master configuration support


## Background and motivation

There are a number of known implementations of consistent data presence
in a cluster. They can be commonly described as the "wait for LSN"
technique. The biggest issue with this technique is the absence of
rollback guarantees at a replica in case of transaction failure on the
master or on some of the replicas in the cluster.

To provide such capabilities, new functionality should be introduced in
the Tarantool core, with the limitations mentioned before - backward
compatibility and ease of cluster orchestration.

## Detailed design

### Quorum commit
The main idea behind the proposal is to reuse the existing machinery as
much as possible. This ensures that functionality which is well tested
and proven across many instances in MRG and beyond is used. The
transaction rollback mechanism is in place and works for WAL write
failures. If we substitute the WAL success with a new condition, named
'quorum' later in this document, then no changes to this machinery are
needed. The same is true for the snapshot machinery, which allows keeping
a copy of the database in memory for the whole period of the snapshot
file write. Adding the quorum here also minimizes changes.

Currently, replication is represented by the following scheme:
```
Customer        Leader          WAL(L)        Replica        WAL(R)
   |------TXN----->|              |             |              |
   |               |              |             |              | 
   |         [TXN Rollback        |             |              |
   |            created]          |             |              |
   |               |              |             |              | 
   |               |-----TXN----->|             |              |
   |               |              |             |              | 
   |               |<---WAL Ok----|             |              |
   |               |              |             |              | 
   |         [TXN Rollback        |             |              |
   |           destroyed]         |             |              |
   |               |              |             |              | 
   |<----TXN Ok----|              |             |              |
   |               |-------Replicate TXN------->|              |
   |               |              |             |              | 
   |               |              |       [TXN Rollback        |
   |               |              |          created]          |
   |               |              |             |              | 
   |               |              |             |-----TXN----->|
   |               |              |             |              | 
   |               |              |             |<---WAL Ok----|
   |               |              |             |              |
   |               |              |       [TXN Rollback        |
   |               |              |         destroyed]         |
   |               |              |             |              | 
```


To introduce the 'quorum' we have to receive confirmations from replicas
to decide whether the quorum is actually present. The Leader collects
the necessary number of replica confirmations plus its own WAL success.
This state is named 'quorum' and gives the Leader the right to complete
the customer's request. So the picture changes to:
```
Customer        Leader          WAL(L)        Replica        WAL(R)
   |------TXN----->|              |             |              |
   |               |              |             |              | 
   |         [TXN Rollback        |             |              |
   |            created]          |             |              |
   |               |              |             |              | 
   |               |-----TXN----->|             |              |
   |               |              |             |              | 
   |               |-------Replicate TXN------->|              |
   |               |              |             |              | 
   |               |              |       [TXN Rollback        |
   |               |<---WAL Ok----|          created]          |
   |               |              |             |              | 
   |           [Waiting           |             |-----TXN----->|
   |         of a quorum]         |             |              | 
   |               |              |             |<---WAL Ok----|
   |               |              |             |              | 
   |               |<------Replication Ok-------|              | 
   |               |              |             |              | 
   |            [Quorum           |             |              |
   |           achieved]          |             |              |
   |               |              |             |              | 
   |         [TXN Rollback        |             |              |
   |           destroyed]         |             |              |
   |               |              |             |              | 
   |               |----Quorum--->|             |              | 
   |               |              |             |              | 
   |               |-----------Quorum---------->|              | 
   |               |              |             |              | 
   |<---TXN Ok-----|              |       [TXN Rollback        |
   |               |              |         destroyed]         |
   |               |              |             |              | 
   |               |              |             |----Quorum--->| 
   |               |              |             |              | 
```

The quorum should be collected as a table for the list of transactions
waiting for the quorum. The latest transaction that collects the quorum
is considered complete, as well as all transactions prior to it, since
all transactions should be applied in order. The Leader writes a 'quorum'
message to the WAL, and it is delivered to the Replicas.
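
To illustrate the idea (not a description of the actual implementation),
the table of transactions waiting for quorum could behave roughly like
the Lua sketch below; the names `limbo`, `add_pending`, and
`on_replica_ack` are made up for this example:

```
-- Sketch: transactions awaiting quorum are kept in LSN order.
-- Confirming the newest entry that reached the quorum implicitly
-- confirms every older entry as well.
local limbo = {}     -- each entry: { lsn = <number>, acks = <number> }
local quorum = 2     -- the Leader's own WAL success counts as one ack

local function add_pending(lsn)
    table.insert(limbo, { lsn = lsn, acks = 1 })  -- 1 = Leader's WAL Ok
end

local function on_replica_ack(ack_lsn)
    local confirmed = 0
    for _, e in ipairs(limbo) do
        if e.lsn <= ack_lsn then
            e.acks = e.acks + 1
        end
        if e.acks >= quorum then
            confirmed = e.lsn
        end
    end
    if confirmed > 0 then
        -- a single 'quorum' record covers every LSN up to `confirmed`
        print(('write quorum message for LSN <= %d'):format(confirmed))
        while limbo[1] ~= nil and limbo[1].lsn <= confirmed do
            table.remove(limbo, 1)
        end
    end
end

add_pending(101)
add_pending(102)
on_replica_ack(102)  -- one replica ack + Leader's WAL Ok = quorum of 2
```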
 
A Replica should explicitly report a positive or a negative result of the
TXN to the Leader via IPROTO, to allow the Leader to collect the quorum
or the anti-quorum for the TXN. If a negative result for the TXN is
received from a minority of the Replicas, the Leader has to send an error
message to each such Replica, which in turn has to disconnect from
replication the same way as is done now in case of a conflict.
 
If the Leader receives enough error messages that the quorum can no
longer be achieved, it should write a 'rollback' message to the WAL.
After that the Leader and the Replicas will perform the rollback for all
TXNs that didn't receive the quorum.
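
As a rough sketch of the decision the Leader could make for a pending
TXN (all names and numbers below are illustrative assumptions, not part
of the existing code base):

```
-- Sketch of the Leader's decision for one pending transaction.
-- `acks` counts positive replies (the Leader's own WAL Ok included),
-- `nacks` counts negative replies reported over IPROTO.
local members = 5    -- instances participating in the quorum
local quorum  = 3

local function decide(acks, nacks)
    if acks >= quorum then
        return 'confirm'       -- write the 'quorum' message, reply TXN Ok
    elseif members - nacks < quorum then
        return 'rollback'      -- the quorum is no longer reachable
    else
        return 'wait'          -- keep the transaction pending
    end
end

assert(decide(3, 0) == 'confirm')
assert(decide(1, 3) == 'rollback')  -- only 2 potential acks remain
assert(decide(2, 1) == 'wait')
```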
 
### Recovery and failover.
 
A Tarantool instance, while reading the WAL, should postpone the commit
until the quorum entry is read. If the WAL EOF is reached, the instance
should keep the rollback ready for all transactions that are still
waiting for a quorum entry until the role of the instance is set. If this
instance becomes a Replica, there are no additional actions needed, since
all info about quorum/rollback will arrive via replication. If this
instance is assigned the Leader role, it should write 'rollback' to its
WAL and perform the rollback for all transactions waiting for a quorum.
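
A minimal sketch of this recovery rule, assuming WAL rows are already
parsed into Lua tables (the row shapes and the function name are
assumptions of this example):

```
-- Sketch: rows are applied but kept pending until a 'quorum' entry
-- covering their LSN is read; what is left at WAL EOF depends on the
-- role assigned to the instance afterwards.
local function recover(rows, role)
    local pending = {}
    for _, row in ipairs(rows) do
        if row.type == 'txn' then
            table.insert(pending, row.lsn)
        elseif row.type == 'quorum' then
            while pending[1] ~= nil and pending[1] <= row.quorum_lsn do
                table.remove(pending, 1)   -- commit confirmed rows
            end
        elseif row.type == 'rollback' then
            pending = {}                   -- undo all unconfirmed rows
        end
    end
    if role == 'leader' and #pending > 0 then
        return 'write rollback, undo ' .. #pending .. ' pending txns'
    end
    return 'keep ' .. #pending .. ' txns pending until replication resolves them'
end

print(recover({ { type = 'txn', lsn = 7 },
                { type = 'quorum', quorum_lsn = 7 },
                { type = 'txn', lsn = 8 } }, 'leader'))
```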
 
In case of a Leader failure, the Replica with the biggest LSN for the
former Leader's ID is elected as the new Leader. The Replica should
record 'rollback' in its WAL, which effectively means that all
transactions without a quorum should be rolled back. This rollback will
be delivered to all Replicas, and they will perform the rollback of all
transactions waiting for the quorum.
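
The election rule can be sketched as follows; the vclock layout and the
instance ids here are assumptions for illustration only:

```
-- Sketch: the new Leader is the Replica that has applied the most of
-- the failed Leader's rows, i.e. has the biggest LSN for its ID.
local function elect_new_leader(replicas, old_leader_id)
    local best
    for _, r in ipairs(replicas) do
        local applied = r.vclock[old_leader_id] or 0
        if best == nil or applied > (best.vclock[old_leader_id] or 0) then
            best = r
        end
    end
    return best
end

local new_leader = elect_new_leader({
    { name = 'replica1', vclock = { [1] = 120, [2] = 5 } },
    { name = 'replica2', vclock = { [1] = 125, [2] = 5 } },
}, 1)
print(new_leader.name)   -- replica2: it holds more of Leader 1's rows
```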
 
### Snapshot generation.
 
We can also reuse the current machinery of snapshot generation. Upon
receiving a request to create a snapshot, an instance should request a
read view for the current commit operation. However, the start of the
snapshot generation should be postponed until this commit operation
receives its quorum. If the operation is rolled back, the snapshot
generation should be aborted and restarted using the current transaction
after the rollback is complete.
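
The control flow of this rule might look like the sketch below; the
three helpers are placeholders standing in for the real read view,
quorum waiting, and snapshot writing machinery:

```
-- Sketch: snapshot start is gated on the fate of the commit whose
-- read view was taken; a rollback forces a retry from the new state.
local function make_snapshot(take_read_view, wait_fate, write_snapshot)
    while true do
        local rv, lsn = take_read_view()    -- read view at the current commit
        if wait_fate(lsn) == 'quorum' then  -- blocks: 'quorum' or 'rollback'
            return write_snapshot(rv)       -- safe: the data in rv is durable
        end
        -- the commit was rolled back: drop the read view and retry
    end
end
```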
 
After the snapshot is created, the WAL should start from the first
operation that follows the commit operation the snapshot is generated
for. That means the WAL will contain a quorum message that refers to a
transaction not present in the WAL. Apparently, we have to allow this for
the case when the quorum message refers to a transaction with an LSN less
than that of the first entry in the WAL, and only once.
 
### Asynchronous replication.
 
Along with synchronous Replicas, the cluster can contain asynchronous
Replicas. An async Replica doesn't reply to the Leader with errors, since
it doesn't contribute to the quorum. Still, async Replicas have to follow
the new WAL operations, such as keeping the rollback info until the
'quorum' message is received. This is essential for the case when a
'rollback' message appears in the WAL: this message assumes a Replica is
able to perform all the necessary rollback by itself. The cluster
information should contain an explicit notification of each Replica's
operation mode.
 
### Synchronous replication enabling.
 
Synchronous operation can be required for a set of spaces in the data
scheme. That means only transactions that modify data in these spaces
should require a quorum. Such transactions are named synchronous. As soon
as the last operation of a synchronous transaction appears in the
Leader's WAL, it causes all following transactions - no matter whether
they are synchronous or not - to wait for the quorum. If the quorum is
not achieved, the 'rollback' operation causes the rollback of all
transactions after the synchronous one. This ensures a consistent state
of the data both on the Leader and the Replicas. If the user doesn't
require synchronous operation for any space, then no changes to WAL
generation and replication appear.
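
From the user's side this could look like the hypothetical sketch below;
the `is_sync` space option is an illustration of the proposed interface,
not an existing one:

```
-- Hypothetical interface sketch: only spaces created with is_sync = true
-- make their transactions wait for the quorum.
box.cfg{}    -- assume a replica set is already configured

local bank = box.schema.space.create('bank', { is_sync = true })
bank:create_index('pk')

local cache = box.schema.space.create('cache')   -- stays asynchronous
cache:create_index('pk')

-- This transaction touches a synchronous space, so the Leader replies
-- only after the quorum is collected for it; any transaction that
-- follows it in the WAL waits for that quorum as well.
box.begin()
bank:insert{1, 'alice', 100}
box.commit()
```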
 
The cluster description should contain an explicit attribute for each
Replica to denote whether it participates in synchronous activities. The
description should also contain the criterion of how many Replica
responses are needed to achieve the quorum.
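
For instance, the cluster description could express both attributes
through configuration options along these lines; the option name
`replication_synchro_quorum` and the per-replica comments are
assumptions of this sketch:

```
-- Hypothetical configuration sketch: which replicas take part in the
-- synchronous activities and how big the quorum is.
box.cfg{
    listen = 3301,
    replication = {
        'replicator:pass@leader:3301',
        'replicator:pass@replica1:3301',   -- synchronous
        'replicator:pass@replica2:3301',   -- synchronous
        'replicator:pass@replica3:3301',   -- asynchronous
    },
    -- number of positive responses (the Leader's own WAL Ok included)
    -- needed to confirm a synchronous transaction
    replication_synchro_quorum = 2,
}
```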
 

## Rationale and alternatives

There is an implementation of synchronous replication as part of the
gh-980 activities; still, it is not in a state to get into the product.
Moreover, it intentionally breaks backward compatibility, which is a
prerequisite for this proposal.


* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-03 21:08 [Tarantool-patches] [RFC] Quorum-based synchronous replication Sergey Ostanevich
@ 2020-04-07 13:02 ` Aleksandr Lyapunov
  2020-04-08  9:18   ` Sergey Ostanevich
  2020-04-14 12:58 ` Sergey Bronnikov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 53+ messages in thread
From: Aleksandr Lyapunov @ 2020-04-07 13:02 UTC (permalink / raw)
  To: Sergey Ostanevich, tarantool-patches


On 4/4/20 12:08 AM, Sergey Ostanevich wrote:
> * **Status**: In progress
> * **Start date**: 31-03-2020
> * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\>
> * **Issues**:
>
> ## Summary
>
> The aim of this RFC is to address the following list of problems
> formulated at MRG planning meeting:
>    - protocol backward compatibility to enable cluster upgrade w/o
>      downtime
>    - consistency of data on replica and leader
>    - switch from leader to replica without data loss
>    - up to date replicas to run read-only requests
>    - ability to switch async replicas into sync ones
>    - guarantee of rollback on leader and sync replicas
>    - simplicity of cluster orchestration
>   
> What this RFC is not:
>   
>    - high availability (HA) solution with automated failover, roles
>      assignments an so on
>    - master-master configuration support
>
>
> ## Background and motivation
>
> There are number of known implemenatation of consistent data presence in
> a cluster. They can be commonly named as "wait for LSN" technique. The
> biggest issue with this technique is the abscence of rollback gauarantees
> at replica in case of transaction failure on one master or some of the
> replics in the cluster.
>
> To provide such capabilities a new functionality should be introduced in
> Tarantool core, with limitation mentioned before - backward compatilibity
> and ease of cluster orchestration.
>
> ## Detailed design
>
> ### Quorum commit
> The main idea behind the proposal is to reuse existent machinery as much
> as possible. It will ensure the well-tested and proven functionality
> across many instances in MRG and beyond is used. The transaction rollback
> mechanism is in place and works for WAL write failure. If we substitute
> the WAL success with a new situation which is named 'quorum' later in
> this document then no changes to the machinery is needed. The same is
> true for snapshot machinery that allows to create a copy of the database
> in memory for the whole period of snapshot file write. Adding quorum here
> also minimizes changes.
>
> Currently replication represented by the following scheme:
> ```
> Customer        Leader          WAL(L)        Replica        WAL(R)
>     |------TXN----->|              |             |              |
>     |               |              |             |              |
>     |         [TXN Rollback        |             |              |
>     |            created]          |             |              |
>     |               |              |             |              |
>     |               |-----TXN----->|             |              |
>     |               |              |             |              |
>     |               |<---WAL Ok----|             |              |
>     |               |              |             |              |
>     |         [TXN Rollback        |             |              |
>     |           destroyed]         |             |              |
>     |               |              |             |              |
>     |<----TXN Ok----|              |             |              |
>     |               |-------Replicate TXN------->|              |
>     |               |              |             |              |
>     |               |              |       [TXN Rollback        |
>     |               |              |          created]          |
>     |               |              |             |              |
>     |               |              |             |-----TXN----->|
>     |               |              |             |              |
>     |               |              |             |<---WAL Ok----|
>     |               |              |             |              |
>     |               |              |       [TXN Rollback        |
>     |               |              |         destroyed]         |
>     |               |              |             |              |
> ```
>
>
> To introduce the 'quorum' we have to receive confirmation from replicas
> to make a decision on whether the quorum is actually present. Leader
> collects necessary amount of replicas confirmation plus its own WAL
> success. This state is named 'quorum' and gives leader the right to
> complete the customers' request. So the picture will change to:
> ```
> Customer        Leader          WAL(L)        Replica        WAL(R)
>     |------TXN----->|              |             |              |
>     |               |              |             |              |
>     |         [TXN Rollback        |             |              |
>     |            created]          |             |              |
>     |               |              |             |              |
>     |               |-----TXN----->|             |              |
>     |               |              |             |              |
>     |               |-------Replicate TXN------->|              |
>     |               |              |             |              |
>     |               |              |       [TXN Rollback        |
>     |               |<---WAL Ok----|          created]          |
>     |               |              |             |              |
>     |           [Waiting           |             |-----TXN----->|
>     |         of a quorum]         |             |              |
>     |               |              |             |<---WAL Ok----|
>     |               |              |             |              |
>     |               |<------Replication Ok-------|              |
>     |               |              |             |              |
>     |            [Quorum           |             |              |
>     |           achieved]          |             |              |
>     |               |              |             |              |
>     |         [TXN Rollback        |             |              |
>     |           destroyed]         |             |              |
>     |               |              |             |              |
>     |               |----Quorum--->|             |              |
>     |               |              |             |              |
>     |               |-----------Quorum---------->|              |
>     |               |              |             |              |
>     |<---TXN Ok-----|              |       [TXN Rollback        |
>     |               |              |         destroyed]         |
>     |               |              |             |              |
>     |               |              |             |----Quorum--->|
>     |               |              |             |              |
> ```
>
> The quorum should be collected as a table for a list of transactions
> waiting for quorum. The latest transaction that collects the quorum is
> considered as complete, as well as all transactions prior to it, since
> all transactions should be applied in order. Leader writes a 'quorum'
> message to the WAL and it is delivered to Replicas.
I think we should call the message something like 'confirm'
(not 'quorum'), and mention here that it has its own LSN.
Besides, it's very similar to phase two of two-phase-commit,
we'll need it later.
>   
> Replica should report a positive or a negative result of the TXN to the
> Leader via the IPROTO explicitly to allow Leader to collect the quorum
> or anti-quorum for the TXN. In case negative result for the TXN received
> from minor number of Replicas, then Leader has to send an error message
> to each Replica, which in turn has to disconnect from the replication
> the same way as it is done now in case of conflict.
I'm sure that unconfirmed transactions must not be visible either
on the master or on the replica, since they could be aborted.
We need read-committed.
>   
> In case Leader receives enough error messages to do not achieve the
> quorum it should write the 'rollback' message in the WAL. After that
> Leader and Replicas will perform the rollback for all TXN that didn't
> receive quorum.
>   
> ### Recovery and failover.
>   
> Tarantool instance during reading WAL should postpone the commit until
> the quorum is read. In case the WAL eof is achieved, the instance should
> keep rollback for all transactions that are waiting for a quorum entry
> until the role of the instance is set. In case this instance become a
> Replica there are no additional actions needed, sine all info about
> quorum/rollback will arrive via replication. In case this instance is
> assigned a Leader role, it should write 'rollback' in its WAL and
> perform rollback for all transactions waiting for a quorum.
>   
> In case of a Leader failure a Replica with the biggest LSN with former
> leader's ID is elected as a new leader. The replica should record
> 'rollback' in its WAL which effectively means that all transactions
> without quorum should be rolled back. This rollback will be delivered to
> all replicas and they will perform rollbacks of all transactions waiting
> for quorum.
>   
> ### Snapshot generation.
>   
> We also can reuse current machinery of snapshot generation. Upon
> receiving a request to create a snapshot an instance should request a
> readview for the current commit operation. Although start of the
> snapshot generation should be postponed until this commit operation
> receives its quorum. In case operation is rolled back, the snapshot
> generation should be aborted and restarted using current transaction
> after rollback is complete.
There is no guarantee that the replica will ever receive the 'confirm'
('quorum') message, for example when the master is dead forever.
That means that in some cases we are unable to make a snapshot.
But if we make unconfirmed transactions invisible, the current
read view will give us exactly what we need, though I have no idea
how to handle WAL rotation ('restart') in this case.
>   
> After snapshot is created the WAL should start from the first operation
> that follows the commit operation snapshot is generated for. That means
> WAL will contain a quorum message that refers to a transaction that is
> not present in the WAL. Apparently, we have to allow this for the case
> quorum refers to a transaction with LSN less than the first entry in the
> WAL and only once.
Not 'only once', there could be several unconfirmed transactions
and thus several 'confirm' messages.
>   
> ### Asynchronous replication.
>   
> Along with synchronous Replicas the cluster can contain asynchronous
> Replicas. That means async Replica doesn't reply to the Leader with
> errors since they're not contributing into quorum. Still, async
> Replicas have to follow the new WAL operation, such as keep rollback
> info until 'quorum' message is received. This is essential for the case
> of 'rollback' message appearance in the WAL. This message assumes
> Replica is able to perform all necessary rollback by itself. Cluster
> information should contain explicit notification of each Replica
> operation mode.
>   
> ### Synchronous replication enabling.
>   
> Synchronous operation can be required for a set of spaces in the data
> scheme. That means only transactions that contain data modification for
> these spaces should require quorum. Such transactions named synchronous.
> As soon as last operation of synchronous transaction appeared in Leader's
> WAL, it will cause all following transactions - matter if they are
> synchronous or not - wait for the quorum. In case quorum is not achieved
> the 'rollback' operation will cause rollback of all transactions after
> the synchronous one. It will ensure the consistent state of the data both
> on Leader and Replicas. In case user doesn't require synchronous operation
> for any space then no changes to the WAL generation and replication will
> appear.
>   
> Cluster description should contain explicit attribute for each Replica
> to denote it participates in synchronous activities. Also the description
> should contain criterion on how many Replicas responses are needed to
> achieve the quorum.
>   
>
> ## Rationale and alternatives
>
> There is an implementation of synchronous replication as part of gh-980
> activities, still it is not in a state to get into the product. More
> than that it intentionally breaks backward compatibility which is a
> prerequisite for this proposal.


* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-07 13:02 ` Aleksandr Lyapunov
@ 2020-04-08  9:18   ` Sergey Ostanevich
  2020-04-08 14:05     ` Konstantin Osipov
  0 siblings, 1 reply; 53+ messages in thread
From: Sergey Ostanevich @ 2020-04-08  9:18 UTC (permalink / raw)
  To: Aleksandr Lyapunov; +Cc: tarantool-patches

Hi!
Thanks for review!

The latest version is available at
https://github.com/tarantool/tarantool/blob/sergos/quorum-based-synchro/doc/rfc/quorum-based-synchro.md

> > The quorum should be collected as a table for a list of transactions
> > waiting for quorum. The latest transaction that collects the quorum is
> > considered as complete, as well as all transactions prior to it, since
> > all transactions should be applied in order. Leader writes a 'quorum'
> > message to the WAL and it is delivered to Replicas.
> I think we should cal the the message something like 'confirm'
> (not 'quorum'), and mention here that it has its own LSN.

I believe it was clear from the mention that it goes to WAL. Updated.

The quorum should be collected as a table for the list of transactions
waiting for the quorum. The latest transaction that collects the quorum
is considered complete, as well as all transactions prior to it, since
all transactions should be applied in order. The leader writes a
'confirm' message to the WAL that refers to the transaction's LSN, and
the message has its own LSN. This confirm message is delivered to all
replicas through the existing replication mechanism.
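
For illustration, such a 'confirm' entry could carry two LSNs roughly
like this (the field names are made up for the example):

```
-- Illustrative shape of a 'confirm' WAL record: it has an LSN of its
-- own and refers to the newest transaction LSN it confirms.
local confirm_entry = {
    type        = 'confirm',
    replica_id  = 1,     -- the leader that issued it
    lsn         = 205,   -- the confirm record's own LSN
    confirm_lsn = 203,   -- every txn with LSN <= 203 is now committed
}
```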

> Besides, it's very similar to phase two of two-phase-commit,
> we'll need it later.

We have already discussed this: the similarity ends as soon as one
quorum means confirmation of the whole bunch of transactions before it,
not just the one.

> > Replica should report a positive or a negative result of the TXN to the
> > Leader via the IPROTO explicitly to allow Leader to collect the quorum
> > or anti-quorum for the TXN. In case negative result for the TXN received
> > from minor number of Replicas, then Leader has to send an error message
> > to each Replica, which in turn has to disconnect from the replication
> > the same way as it is done now in case of conflict.
> I'm sure that unconfirmed transactions must not be visible both
> on master and on replica since the could be aborted.
> We need read-committed.

So far I don't envision any problems with read-committed after we enable
a transaction manager similar to vinyl's. From the standpoint of
replication, the rollback message will cancel all transactions that are
later than the confirmed one, no matter whether they are visible or not.

> > ### Snapshot generation.
> > We also can reuse current machinery of snapshot generation. Upon
> > receiving a request to create a snapshot an instance should request a
> > readview for the current commit operation. Although start of the
> > snapshot generation should be postponed until this commit operation
> > receives its quorum. In case operation is rolled back, the snapshot
> > generation should be aborted and restarted using current transaction
> > after rollback is complete.
> There is no guarantee that the replica will ever receive 'confirm'
> ('quorum') message, for example when the master is dead forever.
> That means that in some cases we are unable to make a snapshot..
> But if we make unconfirmed transactions invisible, the current
> read view will give us exactly what we need, but I have no idea
> how to handle WAL rotation ('restart') in this case.

Updated.

In case the master appears unavailable, a replica still has to be able to
create a snapshot. The replica can perform a rollback for all
transactions that are not confirmed and claim its LSN as the latest
confirmed txn. Then it can create a snapshot in the regular way and start
with a blank xlog file. All rolled back transactions will reappear
through the regular replication in case the master comes back later on.

> > After snapshot is created the WAL should start from the first operation
> > that follows the commit operation snapshot is generated for. That means
> > WAL will contain a quorum message that refers to a transaction that is
> > not present in the WAL. Apparently, we have to allow this for the case
> > quorum refers to a transaction with LSN less than the first entry in the
> > WAL and only once.
> Not 'only once', there could be several unconfirmed transactions
> and thus several 'confirm' messages.

Updated.

After the snapshot is created, the WAL should start from the first
operation that follows the commit operation the snapshot is generated
for. That means the WAL will contain 'confirm' messages that refer to
transactions that are not present in the WAL. Apparently, we have to
allow this for the case when a 'confirm' message refers to a transaction
with an LSN less than that of the first entry in the WAL.


* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-08  9:18   ` Sergey Ostanevich
@ 2020-04-08 14:05     ` Konstantin Osipov
  2020-04-08 15:06       ` Sergey Ostanevich
  0 siblings, 1 reply; 53+ messages in thread
From: Konstantin Osipov @ 2020-04-08 14:05 UTC (permalink / raw)
  To: Sergey Ostanevich; +Cc: tarantool-patches

* Sergey Ostanevich <sergos@tarantool.org> [20/04/08 12:23]:

One thing I continue not to understand is why settle on the RFC
now, when the in-memory WAL is not in yet?

There is an unpleasant risk of committing to something that turns
out not to work in the best possible way.


-- 
Konstantin Osipov, Moscow, Russia


* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-08 14:05     ` Konstantin Osipov
@ 2020-04-08 15:06       ` Sergey Ostanevich
  0 siblings, 0 replies; 53+ messages in thread
From: Sergey Ostanevich @ 2020-04-08 15:06 UTC (permalink / raw)
  To: Konstantin Osipov, Aleksandr Lyapunov, tarantool-patches

Hi!

Thanks for review!

On 08 Apr 17:05, Konstantin Osipov wrote:
> * Sergey Ostanevich <sergos@tarantool.org> [20/04/08 12:23]:
> 
> One thing I continue not understanding is why settle on RFC
> now when in-memory wal is not in yet? 

Does this RFC depend on the in-memory WAL after all?
The formulation of principles in the RFC neither relies on nor denies any
optimizations of the underlying infrastructure. I believe the in-memory
WAL can be introduced independently. Correct me if I'm wrong.

> There is an unpleasant risk of committing to something that turns
> out to not work out in the best possible way.

It is the maxim of current MRG management: instead of perpetually
inventing the 'best possible' without a clear roadmap - and never
finishing it - identify what's needed and deliver it in a timely manner.

Again, if you see any conflicts between the RFC and any technologies
being developed, name them and let's try to resolve them.

Regards,
Sergos


* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-03 21:08 [Tarantool-patches] [RFC] Quorum-based synchronous replication Sergey Ostanevich
  2020-04-07 13:02 ` Aleksandr Lyapunov
@ 2020-04-14 12:58 ` Sergey Bronnikov
  2020-04-14 14:43   ` Sergey Ostanevich
  2020-04-20 23:32 ` Vladislav Shpilevoy
  2020-04-23 21:38 ` Vladislav Shpilevoy
  3 siblings, 1 reply; 53+ messages in thread
From: Sergey Bronnikov @ 2020-04-14 12:58 UTC (permalink / raw)
  To: Sergey Ostanevich; +Cc: tarantool-patches

Hi,

see 5 comments inline

On 00:08 Sat 04 Apr, Sergey Ostanevich wrote:
> 
> * **Status**: In progress
> * **Start date**: 31-03-2020
> * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\>
> * **Issues**:

1. Just for convenience, please add https://github.com/tarantool/tarantool/issues/4842

> ## Summary
> 
> The aim of this RFC is to address the following list of problems
> formulated at MRG planning meeting:
>   - protocol backward compatibility to enable cluster upgrade w/o
>     downtime
>   - consistency of data on replica and leader
>   - switch from leader to replica without data loss
>   - up to date replicas to run read-only requests
>   - ability to switch async replicas into sync ones

2. Ability to switch async replicas into sync ones and vice-versa? Or not?

>   - guarantee of rollback on leader and sync replicas
>   - simplicity of cluster orchestration
>  
> What this RFC is not:
>  
>   - high availability (HA) solution with automated failover, roles
>     assignments an so on
>   - master-master configuration support
> 
> 
> ## Background and motivation
> 
> There are number of known implemenatation of consistent data presence in
> a cluster. They can be commonly named as "wait for LSN" technique. The
> biggest issue with this technique is the abscence of rollback gauarantees

3. typo: gauarantees -> guarantees

> at replica in case of transaction failure on one master or some of the 
> replics in the cluster. 

4. typo: replics -> replicas
> 
> To provide such capabilities a new functionality should be introduced in
> Tarantool core, with limitation mentioned before - backward compatilibity
> and ease of cluster orchestration.

5. but there is nothing mentioned before about these limitations.

> ## Detailed design
> 
> ### Quorum commit
> The main idea behind the proposal is to reuse existent machinery as much
> as possible. It will ensure the well-tested and proven functionality
> across many instances in MRG and beyond is used. The transaction rollback
> mechanism is in place and works for WAL write failure. If we substitute 
> the WAL success with a new situation which is named 'quorum' later in
> this document then no changes to the machinery is needed. The same is
> true for snapshot machinery that allows to create a copy of the database
> in memory for the whole period of snapshot file write. Adding quorum here
> also minimizes changes.
> 
> Currently replication represented by the following scheme:
> ```
> Customer        Leader          WAL(L)        Replica        WAL(R)
>    |------TXN----->|              |             |              |
>    |               |              |             |              | 
>    |         [TXN Rollback        |             |              |
>    |            created]          |             |              |
>    |               |              |             |              | 
>    |               |-----TXN----->|             |              |
>    |               |              |             |              | 
>    |               |<---WAL Ok----|             |              |
>    |               |              |             |              | 
>    |         [TXN Rollback        |             |              |
>    |           destroyed]         |             |              |
>    |               |              |             |              | 
>    |<----TXN Ok----|              |             |              |
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              | 
>    |               |              |       [TXN Rollback        |
>    |               |              |          created]          |
>    |               |              |             |              | 
>    |               |              |             |-----TXN----->|
>    |               |              |             |              | 
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              |
>    |               |              |       [TXN Rollback        |
>    |               |              |         destroyed]         |
>    |               |              |             |              | 
> ```
> 
> 
> To introduce the 'quorum' we have to receive confirmation from replicas
> to make a decision on whether the quorum is actually present. Leader
> collects necessary amount of replicas confirmation plus its own WAL
> success. This state is named 'quorum' and gives leader the right to
> complete the customers' request. So the picture will change to:
> ```
> Customer        Leader          WAL(L)        Replica        WAL(R)
>    |------TXN----->|              |             |              |
>    |               |              |             |              | 
>    |         [TXN Rollback        |             |              |
>    |            created]          |             |              |
>    |               |              |             |              | 
>    |               |-----TXN----->|             |              |
>    |               |              |             |              | 
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              | 
>    |               |              |       [TXN Rollback        |
>    |               |<---WAL Ok----|          created]          |
>    |               |              |             |              | 
>    |           [Waiting           |             |-----TXN----->|
>    |         of a quorum]         |             |              | 
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              | 
>    |               |<------Replication Ok-------|              | 
>    |               |              |             |              | 
>    |            [Quorum           |             |              |
>    |           achieved]          |             |              |
>    |               |              |             |              | 
>    |         [TXN Rollback        |             |              |
>    |           destroyed]         |             |              |
>    |               |              |             |              | 
>    |               |----Quorum--->|             |              | 
>    |               |              |             |              | 
>    |               |-----------Quorum---------->|              | 
>    |               |              |             |              | 
>    |<---TXN Ok-----|              |       [TXN Rollback        |
>    |               |              |         destroyed]         |
>    |               |              |             |              | 
>    |               |              |             |----Quorum--->| 
>    |               |              |             |              | 
> ```
> 
> The quorum should be collected as a table for a list of transactions 
> waiting for quorum. The latest transaction that collects the quorum is
> considered as complete, as well as all transactions prior to it, since
> all transactions should be applied in order. Leader writes a 'quorum'
> message to the WAL and it is delivered to Replicas.
>  
> Replica should report a positive or a negative result of the TXN to the 
> Leader via the IPROTO explicitly to allow Leader to collect the quorum
> or anti-quorum for the TXN. In case negative result for the TXN received
> from minor number of Replicas, then Leader has to send an error message
> to each Replica, which in turn has to disconnect from the replication
> the same way as it is done now in case of conflict.
>  
> In case Leader receives enough error messages to do not achieve the
> quorum it should write the 'rollback' message in the WAL. After that
> Leader and Replicas will perform the rollback for all TXN that didn't
> receive quorum.
>  
> ### Recovery and failover.
>  
> Tarantool instance during reading WAL should postpone the commit until
> the quorum is read. In case the WAL eof is achieved, the instance should
> keep rollback for all transactions that are waiting for a quorum entry
> until the role of the instance is set. In case this instance become a 
> Replica there are no additional actions needed, sine all info about 
> quorum/rollback will arrive via replication. In case this instance is 
> assigned a Leader role, it should write 'rollback' in its WAL and
> perform rollback for all transactions waiting for a quorum.
>  
> In case of a Leader failure a Replica with the biggest LSN with former
> leader's ID is elected as a new leader. The replica should record
> 'rollback' in its WAL which effectively means that all transactions
> without quorum should be rolled back. This rollback will be delivered to
> all replicas and they will perform rollbacks of all transactions waiting
> for quorum.
>  
> ### Snapshot generation.
>  
> We also can reuse current machinery of snapshot generation. Upon
> receiving a request to create a snapshot an instance should request a
> readview for the current commit operation. Although start of the
> snapshot generation should be postponed until this commit operation
> receives its quorum. In case operation is rolled back, the snapshot
> generation should be aborted and restarted using current transaction
> after rollback is complete.
>  
> After snapshot is created the WAL should start from the first operation
> that follows the commit operation snapshot is generated for. That means
> WAL will contain a quorum message that refers to a transaction that is
> not present in the WAL. Apparently, we have to allow this for the case
> quorum refers to a transaction with LSN less than the first entry in the
> WAL and only once.
>  
> ### Asynchronous replication.
>  
> Along with synchronous Replicas the cluster can contain asynchronous
> Replicas. That means async Replica doesn't reply to the Leader with
> errors since they're not contributing into quorum. Still, async
> Replicas have to follow the new WAL operation, such as keep rollback
> info until 'quorum' message is received. This is essential for the case
> of 'rollback' message appearance in the WAL. This message assumes
> Replica is able to perform all necessary rollback by itself. Cluster
> information should contain explicit notification of each Replica
> operation mode. 
>  
> ### Synchronous replication enabling.
>  
> Synchronous operation can be required for a set of spaces in the data
> scheme. That means only transactions that contain data modification for
> these spaces should require quorum. Such transactions named synchronous.
> As soon as last operation of synchronous transaction appeared in Leader's
> WAL, it will cause all following transactions - matter if they are
> synchronous or not - wait for the quorum. In case quorum is not achieved
> the 'rollback' operation will cause rollback of all transactions after
> the synchronous one. It will ensure the consistent state of the data both
> on Leader and Replicas. In case user doesn't require synchronous operation
> for any space then no changes to the WAL generation and replication will
> appear.
>  
> Cluster description should contain explicit attribute for each Replica
> to denote it participates in synchronous activities. Also the description
> should contain criterion on how many Replicas responses are needed to
> achieve the quorum.  
>  
> 
> ## Rationale and alternatives
> 
> There is an implementation of synchronous replication as part of gh-980 
> activities, still it is not in a state to get into the product. More 
> than that it intentionally breaks backward compatibility which is a
> prerequisite for this proposal.

-- 
sergeyb@


* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-14 12:58 ` Sergey Bronnikov
@ 2020-04-14 14:43   ` Sergey Ostanevich
  2020-04-15 11:09     ` sergos
  0 siblings, 1 reply; 53+ messages in thread
From: Sergey Ostanevich @ 2020-04-14 14:43 UTC (permalink / raw)
  To: Sergey Bronnikov; +Cc: tarantool-patches

Hi!

Thanks for review!

On 14 Apr 15:58, Sergey Bronnikov wrote:
> Hi,
> 
> see 5 comments inline
> 
> On 00:08 Sat 04 Apr , Sergey Ostanevich wrote:
> > 
> > * **Status**: In progress
> > * **Start date**: 31-03-2020
> > * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\>
> > * **Issues**:
> 
> 1. Just for convenience, please add https://github.com/tarantool/tarantool/issues/4842
> 
Done.

> > ## Summary
> > 
> > The aim of this RFC is to address the following list of problems
> > formulated at MRG planning meeting:
> >   - protocol backward compatibility to enable cluster upgrade w/o
> >     downtime
> >   - consistency of data on replica and leader
> >   - switch from leader to replica without data loss
> >   - up to date replicas to run read-only requests
> >   - ability to switch async replicas into sync ones
> 
> 2. Ability to switch async replicas into sync ones and vice-versa? Or not?
> 
Both ways, updated.

> >   - guarantee of rollback on leader and sync replicas
> >   - simplicity of cluster orchestration
> >  
> > What this RFC is not:
> >  
> >   - high availability (HA) solution with automated failover, roles
> >     assignments an so on
> >   - master-master configuration support
> > 
> > 
> > ## Background and motivation
> > 
> > There are number of known implemenatation of consistent data presence in
> > a cluster. They can be commonly named as "wait for LSN" technique. The
> > biggest issue with this technique is the abscence of rollback gauarantees
> 
> 3. typo: gauarantees -> guarantees
> 
done

> > at replica in case of transaction failure on one master or some of the 
> > replics in the cluster. 
> 
> 4. typo: replics -> replicas
> > 
done

> > To provide such capabilities a new functionality should be introduced in
> > Tarantool core, with limitation mentioned before - backward compatilibity
> > and ease of cluster orchestration.
> 
> 5. but there is nothing mentioned before about these limitations.
> 
They were named as problems to address, so I renamed them as
requirements.

[cut]

Pushed updated version to the branch.

Thanks,
Sergos


* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-14 14:43   ` Sergey Ostanevich
@ 2020-04-15 11:09     ` sergos
  2020-04-15 14:50       ` sergos
  0 siblings, 1 reply; 53+ messages in thread
From: sergos @ 2020-04-15 11:09 UTC (permalink / raw)
  To: tarantool-patches
  Cc: Николай Карлов, Тимур Сафин

Hi!

The latest version is below, also available at
https://github.com/tarantool/tarantool/blob/sergos/quorum-based-synchro/doc/rfc/quorum-based-synchro.md

---
* **Status**: In progress
* **Start date**: 31-03-2020
* **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\>
* **Issues**: https://github.com/tarantool/tarantool/issues/4842

## Summary

The aim of this RFC is to address the following list of problems
formulated at MRG planning meeting:
  - protocol backward compatibility to enable cluster upgrade w/o
    downtime
  - consistency of data on replica and leader
  - switch from leader to replica without data loss
  - up to date replicas to run read-only requests
  - ability to switch async replicas into sync ones and vice versa
  - guarantee of rollback on leader and sync replicas
  - simplicity of cluster orchestration

What this RFC is not:

  - high availability (HA) solution with automated failover, role
    assignments and so on
  - master-master configuration support

## Background and motivation

There are a number of known implementations of consistent data presence
in a cluster. They can be commonly described as the "wait for LSN"
technique. The biggest issue with this technique is the absence of
rollback guarantees at a replica in case of transaction failure on the
master or on some of the replicas in the cluster.

To provide such capabilities, new functionality should be introduced in
the Tarantool core, with the requirements mentioned before - backward
compatibility and ease of cluster orchestration.

## Detailed design

### Quorum commit

The main idea behind the proposal is to reuse the existing machinery as
much as possible. This ensures that functionality which is well tested
and proven across many instances in MRG and beyond is used. The
transaction rollback mechanism is in place and works for WAL write
failures. If we substitute the WAL success with a new condition, named
'quorum' later in this document, then no changes to this machinery are
needed. The same is true for the snapshot machinery, which allows keeping
a copy of the database in memory for the whole period of the snapshot
file write. Adding the quorum here also minimizes changes.

Currently, replication is represented by the following scheme:
```
Customer        Leader          WAL(L)        Replica        WAL(R)
   |------TXN----->|              |             |              |
   |               |              |             |              |
   |         [TXN Rollback        |             |              |
   |            created]          |             |              |
   |               |              |             |              |
   |               |-----TXN----->|             |              |
   |               |              |             |              |
   |               |<---WAL Ok----|             |              |
   |               |              |             |              |
   |         [TXN Rollback        |             |              |
   |           destroyed]         |             |              |
   |               |              |             |              |
   |<----TXN Ok----|              |             |              |
   |               |-------Replicate TXN------->|              |
   |               |              |             |              |
   |               |              |       [TXN Rollback        |
   |               |              |          created]          |
   |               |              |             |              |
   |               |              |             |-----TXN----->|
   |               |              |             |              |
   |               |              |             |<---WAL Ok----|
   |               |              |             |              |
   |               |              |       [TXN Rollback        |
   |               |              |         destroyed]         |
   |               |              |             |              |
```

To introduce the 'quorum' we have to receive confirmations from replicas
to decide whether the quorum is actually present. The leader collects
the necessary number of replica confirmations plus its own WAL success.
This state is named 'quorum' and gives the leader the right to complete
the customer's request. So the picture changes to:
```
Customer        Leader          WAL(L)        Replica        WAL(R)
   |------TXN----->|              |             |              |
   |               |              |             |              |
   |         [TXN Rollback        |             |              |
   |            created]          |             |              |
   |               |              |             |              |
   |               |-----TXN----->|             |              |
   |               |              |             |              |
   |               |-------Replicate TXN------->|              |
   |               |              |             |              |
   |               |              |       [TXN Rollback        |
   |               |<---WAL Ok----|          created]          |
   |               |              |             |              |
   |           [Waiting           |             |-----TXN----->|
   |         of a quorum]         |             |              |
   |               |              |             |<---WAL Ok----|
   |               |              |             |              |
   |               |<------Replication Ok-------|              |
   |               |              |             |              |
   |            [Quorum           |             |              |
   |           achieved]          |             |              |
   |               |              |             |              |
   |         [TXN Rollback        |             |              |
   |           destroyed]         |             |              |
   |               |              |             |              |
   |               |---Confirm--->|             |              |
   |               |              |             |              |
   |               |----------Confirm---------->|              |
   |               |              |             |              |
   |<---TXN Ok-----|              |       [TXN Rollback        |
   |               |              |         destroyed]         |
   |               |              |             |              |
   |               |              |             |---Confirm--->|
   |               |              |             |              |
```

The quorum should be collected as a table for the list of transactions
waiting for the quorum. The latest transaction that collects the quorum
is considered complete, as well as all transactions prior to it, since
all transactions should be applied in order. The leader writes a
'confirm' message to the WAL that refers to the transaction's LSN, and
the message has its own LSN. This confirm message is delivered to all
replicas through the existing replication mechanism.

A replica should explicitly report a positive or a negative result of the
TXN to the leader via IPROTO, to allow the leader to collect the quorum
or the anti-quorum for the TXN. If a negative result for the TXN is
received from a minority of the replicas, the leader has to send an error
message to these replicas, which in turn have to disconnect from
replication the same way as is done now in case of a conflict.

If the leader receives enough error messages that the quorum can no
longer be achieved, it should write a 'rollback' message to the WAL.
After that the leader and the replicas will perform the rollback for all
TXNs that didn't receive the quorum.

### Recovery and failover.

A Tarantool instance, while reading the WAL, should postpone the commit
until the 'confirm' is read. If the WAL EOF is reached, the instance
should keep the rollback ready for all transactions that are still
waiting for a confirm entry until the role of the instance is set. If
this instance becomes a replica, there are no additional actions needed,
since all info about quorum/rollback will arrive via replication. If this
instance is assigned the leader role, it should write 'rollback' to its
WAL and perform the rollback for all transactions waiting for a quorum.

In case of a leader failure, the replica with the biggest LSN for the
former leader's ID is elected as the new leader. The replica should
record 'rollback' in its WAL, which effectively means that all
transactions without a quorum should be rolled back. This rollback will
be delivered to all replicas, and they will perform the rollback of all
transactions waiting for the quorum.

An interface to force-apply pending transactions by issuing a confirm
entry for them has to be introduced for manual recovery.

### Snapshot generation.

We can also reuse the current machinery of snapshot generation. Upon
receiving a request to create a snapshot, an instance should request a
read view for the current commit operation. However, the start of the
snapshot generation should be postponed until this commit operation
receives its confirmation. If the operation is rolled back, the snapshot
generation should be aborted and restarted using the current transaction
after the rollback is complete.

After the snapshot is created, the WAL should start from the first
operation that follows the commit operation the snapshot is generated
for. That means the WAL will contain 'confirm' messages that refer to
transactions that are not present in the WAL. Apparently, we have to
allow this for the case when a 'confirm' message refers to a transaction
with an LSN less than that of the first entry in the WAL.

In case the master appears unavailable, a replica still has to be able to
create a snapshot. The replica can perform a rollback for all
transactions that are not confirmed and claim its LSN as the latest
confirmed txn. Then it can create a snapshot in the regular way and start
with a blank xlog file. All rolled back transactions will reappear
through the regular replication in case the master comes back later on.

### Asynchronous replication.

Along with synchronous replicas, the cluster can contain asynchronous
replicas. An async replica doesn't reply to the leader with errors, since
it doesn't contribute to the quorum. Still, async replicas have to follow
the new WAL operations, such as keeping the rollback info until the
'quorum' message is received. This is essential for the case when a
'rollback' message appears in the WAL: this message assumes a replica is
able to perform all the necessary rollback by itself. The cluster
information should contain an explicit notification of each replica's
operation mode.

### Synchronous replication enabling.

Synchronous operation can be required for a set of spaces in the data
scheme. That means only transactions that modify data in these spaces
should require a quorum. Such transactions are named synchronous. As soon
as the last operation of a synchronous transaction appears in the
leader's WAL, it causes all following transactions - no matter whether
they are synchronous or not - to wait for the quorum. If the quorum is
not achieved, the 'rollback' operation causes the rollback of all
transactions after the synchronous one. This ensures a consistent state
of the data both on the leader and the replicas. If the user doesn't
require synchronous operation for any space, then no changes to WAL
generation and replication appear.

The cluster description should contain an explicit attribute for each
replica denoting whether it participates in synchronous activities.
The description should also contain a criterion for how many replica
responses are needed to achieve the quorum.

## Rationale and alternatives

There is an implementation of synchronous replication as part of the
gh-980 activities, but it is not in a state to get into the product.
Moreover, it intentionally breaks backward compatibility, which is a
prerequisite for this proposal.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-15 11:09     ` sergos
@ 2020-04-15 14:50       ` sergos
  2020-04-16  7:13         ` Aleksandr Lyapunov
                           ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: sergos @ 2020-04-15 14:50 UTC (permalink / raw)
  To: Николай
	Карлов,
	Тимур
	Сафин,
	Mons Anderson, Aleksandr Lyapunov, Sergey Bronnikov
  Cc: tarantool-patches

Sorry for the mess introduced by the mail client in the previous
message. Here's the correct version with 3 more misprints fixed.

The version is available here
https://github.com/tarantool/tarantool/blob/sergos/quorum-based-synchro/doc/rfc/quorum-based-synchro.md

Please reply-all with your comments/blessings today.

Regards,
Sergos

---
* **Status**: In progress
* **Start date**: 31-03-2020
* **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\>
* **Issues**: https://github.com/tarantool/tarantool/issues/4842

## Summary

The aim of this RFC is to address the following list of problems
formulated at MRG planning meeting:
  - protocol backward compatibility to enable cluster upgrade w/o
    downtime
  - consistency of data on replica and leader
  - switch from leader to replica without data loss
  - up to date replicas to run read-only requests
  - ability to switch async replicas into sync ones and vice versa
  - guarantee of rollback on leader and sync replicas
  - simplicity of cluster orchestration

What this RFC is not:

  - high availability (HA) solution with automated failover, roles
    assignments and so on
  - master-master configuration support

## Background and motivation

There are a number of known implementations of consistent data
presence in a cluster. They can be commonly described as the "wait for
LSN" technique. The biggest issue with this technique is the absence
of rollback guarantees on a replica in case of a transaction failure
on the master or on some of the replicas in the cluster.

To provide such capabilities, new functionality should be introduced
in the Tarantool core, with the requirements mentioned before -
backward compatibility and ease of cluster orchestration.

## Detailed design

### Quorum commit

The main idea behind the proposal is to reuse the existing machinery
as much as possible. It will ensure that well-tested functionality,
proven across many instances in MRG and beyond, is used. The
transaction rollback mechanism is in place and works for a WAL write
failure. If we substitute the WAL success with a new state, named
'quorum' later in this document, then no changes to the machinery are
needed. The same is true for the snapshot machinery that allows
creating a copy of the database in memory for the whole period of the
snapshot file write. Adding the quorum here also minimizes changes.

Currently replication is represented by the following scheme:
```
Customer        Leader          WAL(L)        Replica        WAL(R)
   |------TXN----->|              |             |              |
   |               |              |             |              |
   |         [TXN Rollback        |             |              |
   |            created]          |             |              |
   |               |              |             |              |
   |               |-----TXN----->|             |              |
   |               |              |             |              |
   |               |<---WAL Ok----|             |              |
   |               |              |             |              |
   |         [TXN Rollback        |             |              |
   |           destroyed]         |             |              |
   |               |              |             |              |
   |<----TXN Ok----|              |             |              |
   |               |-------Replicate TXN------->|              |
   |               |              |             |              |
   |               |              |       [TXN Rollback        |
   |               |              |          created]          |
   |               |              |             |              |
   |               |              |             |-----TXN----->|
   |               |              |             |              |
   |               |              |             |<---WAL Ok----|
   |               |              |             |              |
   |               |              |       [TXN Rollback        |
   |               |              |         destroyed]         |
   |               |              |             |              |
```

To introduce the 'quorum' we have to receive confirmations from the
replicas to make a decision on whether the quorum is actually present.
The leader collects the necessary number of replica confirmations plus
its own WAL success. This state is named 'quorum' and gives the leader
the right to complete the customer's request. So the picture changes
to:
```
Customer        Leader          WAL(L)        Replica        WAL(R)
   |------TXN----->|              |             |              |
   |               |              |             |              |
   |         [TXN Rollback        |             |              |
   |            created]          |             |              |
   |               |              |             |              |
   |               |-----TXN----->|             |              |
   |               |              |             |              |
   |               |-------Replicate TXN------->|              |
   |               |              |             |              |
   |               |              |       [TXN Rollback        |
   |               |<---WAL Ok----|          created]          |
   |               |              |             |              |
   |           [Waiting           |             |-----TXN----->|
   |         of a quorum]         |             |              |
   |               |              |             |<---WAL Ok----|
   |               |              |             |              |
   |               |<------Replication Ok-------|              |
   |               |              |             |              |
   |            [Quorum           |             |              |
   |           achieved]          |             |              |
   |               |              |             |              |
   |         [TXN Rollback        |             |              |
   |           destroyed]         |             |              |
   |               |              |             |              |
   |               |---Confirm--->|             |              |
   |               |              |             |              |
   |               |----------Confirm---------->|              |
   |               |              |             |              |
   |<---TXN Ok-----|              |       [TXN Rollback        |
   |               |              |         destroyed]         |
   |               |              |             |              |
   |               |              |             |---Confirm--->|
   |               |              |             |              |
```

The quorum should be collected as a table for the list of transactions
waiting for a quorum. The latest transaction that collects the quorum
is considered complete, as well as all transactions prior to it, since
all transactions should be applied in order. The leader writes a
'confirm' message to the WAL that refers to the transaction's LSN, and
the message has its own LSN. This confirm message is delivered to all
replicas through the existing replication mechanism.
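
A minimal sketch of this bookkeeping in plain Lua, assuming cumulative
ACKs (an ACK for an LSN covers every smaller LSN) and a fixed quorum
size; all names are illustrative, not a proposed API:
```
-- Highest LSN acknowledged by each participant; the leader is one of
-- the participants and reports its local WAL success here.
local quorum_size = 3   -- counting the leader's own WAL success as one vote
local acked = { leader = 0, replica1 = 0, replica2 = 0, replica3 = 0 }

local function on_ack(who, lsn)
    if lsn > acked[who] then acked[who] = lsn end
end

-- The latest LSN acknowledged by at least `quorum_size` participants.
-- All transactions up to it are complete too, so a single 'confirm'
-- entry referring to this LSN covers them all.
local function confirmed_lsn()
    local lsns = {}
    for _, lsn in pairs(acked) do
        table.insert(lsns, lsn)
    end
    table.sort(lsns, function(a, b) return a > b end)
    return lsns[quorum_size]
end

on_ack('leader', 5)
on_ack('replica1', 5)
on_ack('replica2', 3)
print(confirmed_lsn())  -- 3: txns with LSN <= 3 have collected the quorum
```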

A replica should explicitly report a positive or a negative result of
the TXN to the leader via IPROTO, to allow the leader to collect the
quorum or anti-quorum for the TXN. If a negative result for the TXN is
received from a minority of replicas, the leader has to send an error
message to those replicas, which in turn have to disconnect from
replication the same way as is done now in case of a conflict.

If the leader receives enough error messages to make the quorum
unreachable, it should write a 'rollback' message to the WAL. After
that the leader and the replicas will perform a rollback for every TXN
that didn't receive a quorum.

### Recovery and failover.

A Tarantool instance reading the WAL should postpone the commit until
the 'confirm' entry is read. If the WAL EOF is reached, the instance
should keep the rollback information for all transactions that are
waiting for a confirm entry until the role of the instance is set. If
this instance becomes a replica, no additional actions are needed,
since all info about quorum/rollback will arrive via replication. If
this instance is assigned the leader role, it should write 'rollback'
to its WAL and perform a rollback for all transactions waiting for a
quorum.
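
A sketch of this recovery pass in plain Lua, with illustrative names
and stubbed commit/rollback actions:
```
local function commit(txn)   print('commit   lsn ' .. txn.lsn) end
local function rollback(txn) print('rollback lsn ' .. txn.lsn) end

local pending = {}   -- txns read from the WAL but not yet confirmed

local function recover_entry(entry)
    if entry.type == 'txn' then
        table.insert(pending, entry)
    elseif entry.type == 'confirm' then
        -- commit every pending txn covered by the confirm entry
        local rest = {}
        for _, txn in ipairs(pending) do
            if txn.lsn <= entry.lsn then
                commit(txn)
            else
                table.insert(rest, txn)
            end
        end
        pending = rest
    elseif entry.type == 'rollback' then
        -- drop everything that never got a confirm, newest first
        for i = #pending, 1, -1 do rollback(pending[i]) end
        pending = {}
    end
end

recover_entry({ type = 'txn', lsn = 1 })
recover_entry({ type = 'txn', lsn = 2 })
recover_entry({ type = 'confirm', lsn = 1 })  -- commits txn 1; txn 2 stays pending
-- At WAL EOF `pending` is kept until the instance role is known: a
-- replica waits for confirm/rollback via replication, while a new
-- leader writes 'rollback' and rolls the pending txns back itself.
```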

In case of a leader failure, the replica with the biggest LSN for the
former leader's ID is elected as the new leader. The replica should
record 'rollback' in its WAL, which effectively means that all
transactions without a quorum should be rolled back. This rollback
will be delivered to all replicas, and they will perform rollbacks of
all transactions waiting for a quorum.
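
A sketch of the election rule in plain Lua, assuming each candidate
exposes its vclock; the data below is illustrative only:
```
-- Pick the replica with the biggest LSN in the former leader's vclock
-- component, i.e. the one that received the most of the old leader's WAL.
local former_leader_id = 1

local replicas = {
    { name = 'replica_a', vclock = { [1] = 120, [2] = 15 } },
    { name = 'replica_b', vclock = { [1] = 118, [2] = 15 } },
}

local function elect_new_leader(candidates)
    local best
    for _, r in ipairs(candidates) do
        local lsn = r.vclock[former_leader_id] or 0
        if best == nil or lsn > (best.vclock[former_leader_id] or 0) then
            best = r
        end
    end
    return best
end

print(elect_new_leader(replicas).name)  -- replica_a, it has LSN 120
```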

An interface to force-apply pending transactions by issuing a confirm
entry for them has to be introduced for manual recovery.

### Snapshot generation.

We can also reuse the current snapshot generation machinery. Upon
receiving a request to create a snapshot, an instance should request a
read view for the current commit operation. However, the start of the
snapshot generation should be postponed until this commit operation
receives its confirmation. If the operation is rolled back, the
snapshot generation should be aborted and restarted using the current
transaction after the rollback is complete.

After the snapshot is created, the WAL should start from the first
operation that follows the commit operation the snapshot is generated
for. That means the WAL will contain 'confirm' messages that refer to
transactions that are not present in the WAL. Hence, we have to allow
the case where a 'confirm' refers to a transaction with an LSN less
than that of the first entry in the WAL.
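
A sketch of this relaxed check during WAL replay in plain Lua,
assuming the recovery code knows the first LSN present in the xlog;
names are illustrative only:
```
local first_lsn_in_wal = 1000

local function on_confirm(confirm_lsn)
    if confirm_lsn < first_lsn_in_wal then
        -- nothing to do: the confirmed txns are already in the snapshot
        return
    end
    -- otherwise commit the pending txns with LSN <= confirm_lsn as usual
end

on_confirm(995)   -- accepted silently instead of being treated as an error
```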

If the master appears unavailable, a replica still has to be able to
create a snapshot. The replica can perform a rollback for all
transactions that are not confirmed and claim its LSN as the latest
confirmed txn. Then it can create a snapshot in the regular way and
start with a blank xlog file. All rolled-back transactions will
reappear through regular replication if the master comes back later
on.

### Asynchronous replication.

Along with synchronous replicas, the cluster can contain asynchronous
replicas. An async replica doesn't reply to the leader with errors,
since it doesn't contribute to the quorum. Still, async replicas have
to follow the new WAL operations, such as keeping the rollback info
until the 'confirm' message is received. This is essential for the
case when a 'rollback' message appears in the WAL: this message
assumes the replica is able to perform all the necessary rollback by
itself. The cluster information should contain an explicit
notification of each replica's operation mode.

### Synchronous replication enabling.

Synchronous operation can be required for a set of spaces in the data
scheme. That means only transactions that contain data modifications
for these spaces should require a quorum. Such transactions are named
synchronous. As soon as the last operation of a synchronous
transaction appears in the leader's WAL, it will cause all following
transactions - no matter whether they are synchronous or not - to wait
for the quorum. If the quorum is not achieved, the 'rollback'
operation will cause a rollback of all transactions after the
synchronous one. This ensures a consistent state of the data both on
the leader and the replicas. If the user doesn't require synchronous
operation for any space, then no changes to the WAL generation and
replication will appear.

The cluster description should contain an explicit attribute for each
replica denoting whether it participates in synchronous activities.
The description should also contain a criterion for how many replica
responses are needed to achieve the quorum.
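
A hypothetical sketch of such a description in Lua; none of the option
or field names below are part of this proposal, they merely illustrate
which attributes have to be present:
```
local cluster_description = {
    -- how many responses, counting the leader's own WAL success,
    -- are needed to reach the quorum
    synchro_quorum = 2,
    replicas = {
        { uri = 'replica1.example:3301', mode = 'sync'  },
        { uri = 'replica2.example:3301', mode = 'sync'  },
        { uri = 'replica3.example:3301', mode = 'async' },
    },
    -- spaces whose transactions are synchronous and require the quorum
    sync_spaces = { 'accounts', 'payments' },
}
return cluster_description
```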

## Rationale and alternatives

There is an implementation of synchronous replication as part of the
gh-980 activities, but it is not in a state to get into the product.
Moreover, it intentionally breaks backward compatibility, which is a
prerequisite for this proposal.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-15 14:50       ` sergos
@ 2020-04-16  7:13         ` Aleksandr Lyapunov
  2020-04-17 10:10         ` Konstantin Osipov
  2020-04-20 11:20         ` Serge Petrenko
  2 siblings, 0 replies; 53+ messages in thread
From: Aleksandr Lyapunov @ 2020-04-16  7:13 UTC (permalink / raw)
  To: sergos,
	Николай
	Карлов,
	Тимур
	Сафин,
	Mons Anderson, Sergey Bronnikov
  Cc: tarantool-patches

lgtm

On 4/15/20 5:50 PM, sergos@tarantool.org wrote:
> [...]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-15 14:50       ` sergos
  2020-04-16  7:13         ` Aleksandr Lyapunov
@ 2020-04-17 10:10         ` Konstantin Osipov
  2020-04-17 13:45           ` Sergey Ostanevich
  2020-04-20 11:20         ` Serge Petrenko
  2 siblings, 1 reply; 53+ messages in thread
From: Konstantin Osipov @ 2020-04-17 10:10 UTC (permalink / raw)
  To: sergos
  Cc: Николай
	Карлов,
	Mons Anderson, tarantool-patches,
	Тимур
	Сафин

* sergos@tarantool.org <sergos@tarantool.org> [20/04/15 17:51]:
> ### Quorum commit

This part looks correct. It only describes two paths out of many
though:
- leader is able to collect the majority
- leader is not able to collect the majority

What happens when a leader receives a message for a round which is
complete?
How does a replica which missed a round catch up? 
What happens if a replica fails to apply txn 1 (e.g. because of a
duplicate key), but confirms txn 2?

What happens if txn1 gets no majority at the leader, but txn 2
gets a majority? How are the followers rolled back?

> The main idea behind the proposal is to reuse existent machinery as much
> as possible. It will ensure the well-tested and proven functionality
> across many instances in MRG and beyond is used. The transaction rollback
> mechanism is in place and works for WAL write failure. If we substitute
> the WAL success with a new situation which is named 'quorum' later in
> this document then no changes to the machinery is needed. The same is
> true for snapshot machinery that allows to create a copy of the database
> in memory for the whole period of snapshot file write. Adding quorum here
> also minimizes changes.
> 
> Currently replication represented by the following scheme:
> ```
> Customer        Leader          WAL(L)        Replica        WAL(R)
>    |------TXN----->|              |             |              |
>    |               |              |             |              |
>    |         [TXN Rollback        |             |              |
>    |            created]          |             |              |
>    |               |              |             |              |
>    |               |-----TXN----->|             |              |
>    |               |              |             |              |
>    |               |<---WAL Ok----|             |              |
>    |               |              |             |              |
>    |         [TXN Rollback        |             |              |
>    |           destroyed]         |             |              |
>    |               |              |             |              |
>    |<----TXN Ok----|              |             |              |
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              |
>    |               |              |       [TXN Rollback        |
>    |               |              |          created]          |
>    |               |              |             |              |
>    |               |              |             |-----TXN----->|
>    |               |              |             |              |
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              |
>    |               |              |       [TXN Rollback        |
>    |               |              |         destroyed]         |
>    |               |              |             |              |
> ```
> 
> To introduce the 'quorum' we have to receive confirmation from replicas
> to make a decision on whether the quorum is actually present. Leader
> collects necessary amount of replicas confirmation plus its own WAL
> success. This state is named 'quorum' and gives leader the right to
> complete the customers' request. So the picture will change to:
> ```
> Customer        Leader          WAL(L)        Replica        WAL(R)
>    |------TXN----->|              |             |              |
>    |               |              |             |              |
>    |         [TXN Rollback        |             |              |
>    |            created]          |             |              |
>    |               |              |             |              |
>    |               |-----TXN----->|             |              |
>    |               |              |             |              |
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              |
>    |               |              |       [TXN Rollback        |
>    |               |<---WAL Ok----|          created]          |
>    |               |              |             |              |
>    |           [Waiting           |             |-----TXN----->|
>    |         of a quorum]         |             |              |
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              |
>    |               |<------Replication Ok-------|              |
>    |               |              |             |              |
>    |            [Quorum           |             |              |
>    |           achieved]          |             |              |
>    |               |              |             |              |
>    |         [TXN Rollback        |             |              |
>    |           destroyed]         |             |              |
>    |               |              |             |              |
>    |               |---Confirm--->|             |              |
>    |               |              |             |              |
>    |               |----------Confirm---------->|              |
>    |               |              |             |              |
>    |<---TXN Ok-----|              |       [TXN Rollback        |
>    |               |              |         destroyed]         |
>    |               |              |             |              |
>    |               |              |             |---Confirm--->|
>    |               |              |             |              |
> ```
> 
> The quorum should be collected as a table for a list of transactions
> waiting for quorum. The latest transaction that collects the quorum is
> considered as complete, as well as all transactions prior to it, since
> all transactions should be applied in order. Leader writes a 'confirm'
> message to the WAL that refers to the transaction's LSN and it has its
> own LSN. This confirm message is delivered to all replicas through the
> existing replication mechanism.
> 
> Replica should report a positive or a negative result of the TXN to the
> leader via the IPROTO explicitly to allow leader to collect the quorum
> or anti-quorum for the TXN. In case a negative result for the TXN is
> received from minor number of replicas, then leader has to send an error
> message to the replicas, which in turn have to disconnect from the
> replication the same way as it is done now in case of conflict.
> 
> In case leader receives enough error messages to do not achieve the
> quorum it should write the 'rollback' message in the WAL. After that
> leader and replicas will perform the rollback for all TXN that didn't
> receive quorum.
> 
> ### Recovery and failover.
> 
> Tarantool instance during reading WAL should postpone the commit until
> the 'confirm' is read. In case the WAL eof is achieved, the instance
> should keep rollback for all transactions that are waiting for a confirm
> entry until the role of the instance is set. In case this instance
> become a replica there are no additional actions needed, since all info
> about quorum/rollback will arrive via replication. In case this instance
> is assigned a leader role, it should write 'rollback' in its WAL and
> perform rollback for all transactions waiting for a quorum.
> 
> In case of a leader failure a replica with the biggest LSN with former
> leader's ID is elected as a new leader.

As long as multi-master is not banned, there may be multiple
leaders. Does this proposal suggest multi-master is banned? Then
it should describe the implementation of this, and in the absence of
transparent query forwarding it will break all clients.

> The replica should record
> 'rollback' in its WAL which effectively means that all transactions
> without quorum should be rolled back. This rollback will be delivered to
> all replicas and they will perform rollbacks of all transactions waiting
> for quorum.
> 
> An interface to force apply pending transactions by issuing a confirm
> entry for them have to be introduced for manual recovery.
> 
> ### Snapshot generation.
> 
> We also can reuse current machinery of snapshot generation. Upon
> receiving a request to create a snapshot an instance should request a
> readview for the current commit operation. Although start of the
> snapshot generation should be postponed until this commit operation
> receives its confirmation. In case operation is rolled back, the snapshot
> generation should be aborted and restarted using current transaction
> after rollback is complete.
> 
> After snapshot is created the WAL should start from the first operation
> that follows the commit operation snapshot is generated for. That means
> WAL will contain 'confirm' messages that refer to transactions that are
> not present in the WAL. Apparently, we have to allow this for the case
> 'confirm' refers to a transaction with LSN less than the first entry in
> the WAL.
> 
> In case master appears unavailable a replica still have to be able to
> create a snapshot. Replica can perform rollback for all transactions that
> are not confirmed and claim its LSN as the latest confirmed txn. Then it
> can create a snapshot in a regular way and start with blank xlog file.
> All rolled back transactions will appear through the regular replication
> in case master reappears later on.
> 
> ### Asynchronous replication.
> 
> Along with synchronous replicas the cluster can contain asynchronous
> replicas. That means async replica doesn't reply to the leader with
> errors since they're not contributing into quorum. Still, async
> replicas have to follow the new WAL operation, such as keep rollback
> info until 'quorum' message is received. This is essential for the case
> of 'rollback' message appearance in the WAL. This message assumes
> replica is able to perform all necessary rollback by itself. Cluster
> information should contain explicit notification of each replica
> operation mode.
> 
> ### Synchronous replication enabling.
> 
> Synchronous operation can be required for a set of spaces in the data
> scheme. That means only transactions that contain data modification for
> these spaces should require quorum. Such transactions named synchronous.
> As soon as last operation of synchronous transaction appeared in leader's
> WAL, it will cause all following transactions - matter if they are
> synchronous or not - wait for the quorum. In case quorum is not achieved
> the 'rollback' operation will cause rollback of all transactions after
> the synchronous one. It will ensure the consistent state of the data both
> on leader and replicas. In case user doesn't require synchronous operation
> for any space then no changes to the WAL generation and replication will
> appear.
> 
> Cluster description should contain explicit attribute for each replica
> to denote it participates in synchronous activities. Also the description
> should contain criterion on how many replicas responses are needed to
> achieve the quorum.
> 
> ## Rationale and alternatives
> 
> There is an implementation of synchronous replication as part of gh-980
> activities, still it is not in a state to get into the product. More
> than that it intentionally breaks backward compatibility which is a
> prerequisite for this proposal.
> 
> 

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-17 10:10         ` Konstantin Osipov
@ 2020-04-17 13:45           ` Sergey Ostanevich
  0 siblings, 0 replies; 53+ messages in thread
From: Sergey Ostanevich @ 2020-04-17 13:45 UTC (permalink / raw)
  To: Konstantin Osipov,
	Николай
	Карлов,
	Тимур
	Сафин,
	Mons Anderson, Aleksandr Lyapunov, Sergey Bronnikov,
	tarantool-patches

Hi, thanks for the review!

On 17 апр 13:10, Konstantin Osipov wrote:
> * sergos@tarantool.org <sergos@tarantool.org> [20/04/15 17:51]:
> > ### Quorum commit
> 
> This part looks correct. It only describes two paths out of many
> though:
> - leader is able to collect the majority
> - leader is not able to collect the majority
> 
> What happens when a leader receives a message for a round which is
> complete?

It just ignores it; for the reason, see the next comment.

> How does a replica which missed a round catch up? 
> What happens if replica fails to apply txn 1 (e.g. because of a
> duplciate key), but confirms txn 2? 

This should never happen, since each replica applies txns in strict
order, which means a failure of txn 1 will happen before the
confirmation of txn 2. As soon as a replica fails to apply a txn, it
should report an error, disconnect, and roll back all txns in its
pipeline. After that the replica will be in a consistent state
matching the Leader's LSN before txn 1.
> 
> What happens if txn1 gets no majority at the leader, but txn 2
> gets a majority? How are the followers rolled back?

This situation means that some of the ACKs from replicas didn't
arrive, which doesn't mean they failed to apply txn 1. Although, the
success of txn 2 means txn 1 was also applied - hence, receiving a
txn N ACK from a replica implies an ACK for each txn M: M < N. For
example, with a quorum of 3, ACKs for txn 2 from two replicas plus the
leader's own WAL success give txn 1 a quorum as well, even if the
explicit ACKs for txn 1 were lost.

> > In case of a leader failure a replica with the biggest LSN with former
> > leader's ID is elected as a new leader.
> 
> As long as multi-master is not banned, there may be multiple
> leaders. Does this proposal suggest multi-master is banned? Then
> it should describe the implementation of this, and in absense of
> transparent query forwarding it will break all clients.
> 

It was mentioned at the top of the RFC:

> What this RFC is not:
>
>    - high availability (HA) solution with automated failover, roles
>      assignments an so on
>    - master-master configuration support

Which I tend to describe as 'not recommended', similar to what we have
in the documentation about the cascading replication configuration.
Although, I have heard from some users that they successfully use such
a config.

Regards,
Sergos

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-15 14:50       ` sergos
  2020-04-16  7:13         ` Aleksandr Lyapunov
  2020-04-17 10:10         ` Konstantin Osipov
@ 2020-04-20 11:20         ` Serge Petrenko
  2 siblings, 0 replies; 53+ messages in thread
From: Serge Petrenko @ 2020-04-20 11:20 UTC (permalink / raw)
  To: Sergey Ostanevich
  Cc: Николай
	Карлов,
	Mons Anderson, tarantool-patches,
	Тимур
	Сафин

LGTM.

--
Serge Petrenko
sergepetrenko@tarantool.org




> On 15 Apr 2020, at 17:50, sergos@tarantool.org wrote:
> 
> [...]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-03 21:08 [Tarantool-patches] [RFC] Quorum-based synchronous replication Sergey Ostanevich
  2020-04-07 13:02 ` Aleksandr Lyapunov
  2020-04-14 12:58 ` Sergey Bronnikov
@ 2020-04-20 23:32 ` Vladislav Shpilevoy
  2020-04-21 10:49   ` Sergey Ostanevich
  2020-04-23 21:38 ` Vladislav Shpilevoy
  3 siblings, 1 reply; 53+ messages in thread
From: Vladislav Shpilevoy @ 2020-04-20 23:32 UTC (permalink / raw)
  To: Sergey Ostanevich, tarantool-patches

Hi!

This is the latest version I found on the branch. I give my
comments for it.

Keep in mind that I didn't read the other reviews before writing my
own, assuming that all questions were fixed, so in theory I should be
able to understand everything just from reading this.

Nonetheless see 12 comments below.

> * **Status**: In progress
> * **Start date**: 31-03-2020
> * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\>
> * **Issues**: https://github.com/tarantool/tarantool/issues/4842
> 
> ## Summary
> 
> The aim of this RFC is to address the following list of problems
> formulated at MRG planning meeting:
>   - protocol backward compatibility to enable cluster upgrade w/o
>     downtime
>   - consistency of data on replica and leader
>   - switch from leader to replica without data loss
>   - up to date replicas to run read-only requests
>   - ability to switch async replicas into sync ones and vice versa
>   - guarantee of rollback on leader and sync replicas
>   - simplicity of cluster orchestration
> 
> What this RFC is not:
> 
>   - high availability (HA) solution with automated failover, roles
>     assignments an so on

1. So no leader election? That essentially makes a single point of
failure for RW requests, is that correct?

On the other hand, I see the section 'Recovery and failover.' below,
and it seems to be automated, with the selection of a replica with the
biggest LSN. Where is the truth?

>   - master-master configuration support
> 
> ## Background and motivation
> 
> There are number of known implementation of consistent data presence in
> a cluster. They can be commonly named as "wait for LSN" technique. The
> biggest issue with this technique is the absence of rollback guarantees
> at replica in case of transaction failure on one master or some of the
> replicas in the cluster.
> 
> To provide such capabilities a new functionality should be introduced in
> Tarantool core, with requirements mentioned before - backward
> compatibility and ease of cluster orchestration.
> 
> ## Detailed design
> 
> ### Quorum commit
> 
> The main idea behind the proposal is to reuse existent machinery as much
> as possible. It will ensure the well-tested and proven functionality
> across many instances in MRG and beyond is used. The transaction rollback
> mechanism is in place and works for WAL write failure. If we substitute
> the WAL success with a new situation which is named 'quorum' later in
> this document then no changes to the machinery is needed.

2. The problem here is that you create dependency on WAL. According to
your words, replication is inside WAL, and if WAL gave ok, then all is
replicated and applied. But that makes current code structure even worse
than it is. Now WAL, GC, and replication code is spaghetti, basically.
All depends on all. I was rather thinking, that we should fix that first.
Not aggravate.

WAL should provide API for writing to disk. Replication should not bother
about WAL. GC should not bother about replication. All should be independent,
and linked in one place by some kind of a manager, which would just use their
APIs. I believe Cyrill G. would agree with me here, I remember him
complaining about replication-wal-gc code inter-dependencies too. Please,
request his review on this, if you didn't yet.

> The same is
> true for snapshot machinery that allows to create a copy of the database
> in memory for the whole period of snapshot file write. Adding quorum here
> also minimizes changes.
> Currently replication represented by the following scheme:
> ```
> Customer        Leader          WAL(L)        Replica        WAL(R)

3. Are you saying 'leader' === 'master'?

>    |------TXN----->|              |             |              |
>    |               |              |             |              |
>    |         [TXN Rollback        |             |              |
>    |            created]          |             |              |

4. What is 'txn rollback', and why is it created before even a transaction
is started? At least, rollback is a verb. Maybe you meant 'undo log'?

>    |               |              |             |              |
>    |               |-----TXN----->|             |              |
>    |               |              |             |              |
>    |               |<---WAL Ok----|             |              |
>    |               |              |             |              |
>    |         [TXN Rollback        |             |              |
>    |           destroyed]         |             |              |
>    |               |              |             |              |
>    |<----TXN Ok----|              |             |              |
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              |
>    |               |              |       [TXN Rollback        |
>    |               |              |          created]          |
>    |               |              |             |              |
>    |               |              |             |-----TXN----->|
>    |               |              |             |              |
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              |
>    |               |              |       [TXN Rollback        |
>    |               |              |         destroyed]         |
>    |               |              |             |              |
> ```
> 
> To introduce the 'quorum' we have to receive confirmation from replicas
> to make a decision on whether the quorum is actually present. Leader
> collects necessary amount of replicas confirmation plus its own WAL

5. Please, define 'necessary amount'?

> success. This state is named 'quorum' and gives leader the right to
> complete the customers' request. So the picture will change to:
> ```
> Customer        Leader          WAL(L)        Replica        WAL(R)
>    |------TXN----->|              |             |              |
>    |               |              |             |              |
>    |         [TXN Rollback        |             |              |
>    |            created]          |             |              |
>    |               |              |             |              |
>    |               |-----TXN----->|             |              |

6. Are we going to replicate transaction after user writes commit()?
Or will we replicate it while it is in progress? So called 'presumed
commit'. I remember I read some papers explaining how it significantly
speeds up synchronous transactions. Probably that was a paper about
2-phase commit, can't remember already. But the idea is still applicable
for the replication too.

>    |               |              |             |              |
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              |
>    |               |              |       [TXN Rollback        |
>    |               |<---WAL Ok----|          created]          |
>    |               |              |             |              |
>    |           [Waiting           |             |-----TXN----->|
>    |         of a quorum]         |             |              |
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              |
>    |               |<------Replication Ok-------|              |
>    |               |              |             |              |
>    |            [Quorum           |             |              |
>    |           achieved]          |             |              |
>    |               |              |             |              |
>    |         [TXN Rollback        |             |              |
>    |           destroyed]         |             |              |
>    |               |              |             |              |
>    |               |---Confirm--->|             |              |
>    |               |              |             |              |
>    |               |----------Confirm---------->|              |
>    |               |              |             |              |
>    |<---TXN Ok-----|              |       [TXN Rollback        |
>    |               |              |         destroyed]         |
>    |               |              |             |              |
>    |               |              |             |---Confirm--->|
>    |               |              |             |              |
> ```
> 
> The quorum should be collected as a table for a list of transactions
> waiting for quorum. The latest transaction that collects the quorum is
> considered as complete, as well as all transactions prior to it, since
> all transactions should be applied in order. Leader writes a 'confirm'
> message to the WAL that refers to the transaction's LSN and it has its
> own LSN. This confirm message is delivered to all replicas through the
> existing replication mechanism.
> 
> Replica should report a positive or a negative result of the TXN to the
> leader via the IPROTO explicitly to allow leader to collect the quorum
> or anti-quorum for the TXN. In case a negative result for the TXN is
> received from minor number of replicas, then leader has to send an error
> message to the replicas, which in turn have to disconnect from the
> replication the same way as it is done now in case of conflict.
> In case leader receives enough error messages to do not achieve the
> quorum it should write the 'rollback' message in the WAL. After that
> leader and replicas will perform the rollback for all TXN that didn't
> receive quorum.
> 
> ### Recovery and failover.
> 
> Tarantool instance during reading WAL should postpone the commit until
> the 'confirm' is read. In case the WAL eof is achieved, the instance
> should keep rollback for all transactions that are waiting for a confirm
> entry until the role of the instance is set. In case this instance
> become a replica there are no additional actions needed, since all info
> about quorum/rollback will arrive via replication. In case this instance
> is assigned a leader role, it should write 'rollback' in its WAL and
> perform rollback for all transactions waiting for a quorum.
> 
> In case of a leader failure a replica with the biggest LSN with former
> leader's ID is elected as a new leader. The replica should record
> 'rollback' in its WAL which effectively means that all transactions
> without quorum should be rolled back. This rollback will be delivered to
> all replicas and they will perform rollbacks of all transactions waiting
> for quorum.

7. Please, elaborate leader election. It is not as trivial as just 'elect'.
What if the replica with the biggest LSN is temporary not available, but
it knows that it has the biggest LSN? Will it become a leader without
asking other nodes? What will the other nodes do? Will they wait for the
new leader node to become available? Do they have a timeout on that?

Basically, it would be nice to see the split-brain problem description here,
and its solution for us.

How leader failure is detected? Do you rely on our heartbeat messages?
Are you going to adapt SWIM for this?

Raft has a dedicated subsystem for election, it is not that simple. It
involves voting, randomized algorithms. Am I missing something obvious in
this RFC, which makes the leader election much simpler specifically for
Tarantool?

> An interface to force apply pending transactions by issuing a confirm
> entry for them have to be introduced for manual recovery.
> 
> ### Snapshot generation.
> 
> We also can reuse current machinery of snapshot generation. Upon
> receiving a request to create a snapshot an instance should request a
> readview for the current commit operation. Although start of the
> snapshot generation should be postponed until this commit operation
> receives its confirmation. In case operation is rolled back, the snapshot
> generation should be aborted and restarted using current transaction
> after rollback is complete.

8. This section highly depends on transaction manager for memtx. If you
have a transaction manager, you always have a ready-to-use read-view
of the latest committed data. At least this is my understanding.

After all, the manager should provide transaction isolation. And it means,
that all non-committed transactions are not visible. And for that we need
a read-view. Therefore, it could be used to make a snapshot.

> After snapshot is created the WAL should start from the first operation
> that follows the commit operation snapshot is generated for. That means
> WAL will contain 'confirm' messages that refer to transactions that are
> not present in the WAL. Apparently, we have to allow this for the case
> 'confirm' refers to a transaction with LSN less than the first entry in
> the WAL.

9. I couldn't understand that. Why confirm is in WAL for data stored in
the snap? I thought you said above, that snapshot should be done for all
confirmed data. Besides, having confirm out of snap means the snap is
not self-sufficient anymore.

> In case master appears unavailable a replica still have to be able to
> create a snapshot. Replica can perform rollback for all transactions that
> are not confirmed and claim its LSN as the latest confirmed txn. Then it
> can create a snapshot in a regular way and start with blank xlog file.
> All rolled back transactions will appear through the regular replication
> in case master reappears later on.

10. You should be able to make a snapshot without rollback. Read-views are
available anyway. At least it is so in Vinyl, from what I remember. And this
is going to be similar in memtx.

> 
> ### Asynchronous replication.
> 
> Along with synchronous replicas the cluster can contain asynchronous
> replicas. That means async replica doesn't reply to the leader with
> errors since they're not contributing into quorum. Still, async
> replicas have to follow the new WAL operation, such as keep rollback
> info until 'quorum' message is received. This is essential for the case
> of 'rollback' message appearance in the WAL. This message assumes
> replica is able to perform all necessary rollback by itself. Cluster
> information should contain explicit notification of each replica
> operation mode.
> 
> ### Synchronous replication enabling.
> 
> Synchronous operation can be required for a set of spaces in the data
> scheme. That means only transactions that contain data modification for
> these spaces should require quorum. Such transactions named synchronous.
> As soon as last operation of synchronous transaction appeared in leader's
> WAL, it will cause all following transactions - matter if they are
> synchronous or not - wait for the quorum. In case quorum is not achieved
> the 'rollback' operation will cause rollback of all transactions after
> the synchronous one. It will ensure the consistent state of the data both
> on leader and replicas. In case user doesn't require synchronous operation
> for any space then no changes to the WAL generation and replication will
> appear.
> 
> Cluster description should contain explicit attribute for each replica
> to denote it participates in synchronous activities. Also the description
> should contain criterion on how many replicas responses are needed to
> achieve the quorum.

11. Aha, I see 'necessary amount' from above is a manually set value. Ok.

> 
> ## Rationale and alternatives
> 
> There is an implementation of synchronous replication as part of gh-980
> activities, still it is not in a state to get into the product. More
> than that it intentionally breaks backward compatibility which is a
> prerequisite for this proposal.

12. How are we going to deal with fsync()? Will it be forcefully enabled
on sync replicas and the leader?


* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-20 23:32 ` Vladislav Shpilevoy
@ 2020-04-21 10:49   ` Sergey Ostanevich
  2020-04-21 22:17     ` Vladislav Shpilevoy
  0 siblings, 1 reply; 53+ messages in thread
From: Sergey Ostanevich @ 2020-04-21 10:49 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

Hi!

Thanks for review!

> >   - high availability (HA) solution with automated failover, roles
> >     assignments an so on
> 
> 1. So no leader election? That essentially makes single failure point
> for RW requests, is it correct?
> 
> On the other hand I see section 'Recovery and failover.' below. And
> it seems to be automated, with selecting a replica with the biggest
> LSN. Where is the truth?
> 

The failover can be manual or implemented independently. That by no
means implies we should not explain how it should be done under the
replication schema discussed.

And yes, the SPOF is the leader of the cluster. This is expected and is
Ok according to all MRG planning meeting participants.

> >   - master-master configuration support
> > 
> > ## Background and motivation
> > 
> > There are number of known implementation of consistent data presence in
> > a cluster. They can be commonly named as "wait for LSN" technique. The
> > biggest issue with this technique is the absence of rollback guarantees
> > at replica in case of transaction failure on one master or some of the
> > replicas in the cluster.
> > 
> > To provide such capabilities a new functionality should be introduced in
> > Tarantool core, with requirements mentioned before - backward
> > compatibility and ease of cluster orchestration.
> > 
> > ## Detailed design
> > 
> > ### Quorum commit
> > 
> > The main idea behind the proposal is to reuse existent machinery as much
> > as possible. It will ensure the well-tested and proven functionality
> > across many instances in MRG and beyond is used. The transaction rollback
> > mechanism is in place and works for WAL write failure. If we substitute
> > the WAL success with a new situation which is named 'quorum' later in
> > this document then no changes to the machinery is needed.
> 
> 2. The problem here is that you create dependency on WAL. According to
> your words, replication is inside WAL, and if WAL gave ok, then all is

'Replication is inside WAL' - what do you mean by that? The replication in
its current state works from the WAL, although it's an exaggeration to say
it is 'inside WAL'. Why does that introduce a new dependency?

> replicated and applied. But that makes current code structure even worse
> than it is. Now WAL, GC, and replication code is spaghetti, basically.
> All depends on all. I was rather thinking, that we should fix that first.
> Not aggravate.
> 
> WAL should provide API for writing to disk. Replication should not bother
> about WAL. GC should not bother about replication. All should be independent,
> and linked in one place by some kind of a manager, which would just use their

So you want to introduce a single point that will translate all messages
between all participants? I believe the current state was introduced exactly
to avoid this situation. Each participant can subscribe to a particular
trigger inside another participant and take it into account in its own
activities - at the right time for itself.

> APIs. I believe Cyrill G. would agree with me here, I remember him
> complaining about replication-wal-gc code inter-dependencies too. Please,
> request his review on this, if you didn't yet.
> 
I personally have the same problem when trying to implement a trivial test,
just figuring out the layers and dependencies of the participants. This
is about poor documentation in my understanding, not poor design.

> > The same is
> > true for snapshot machinery that allows to create a copy of the database
> > in memory for the whole period of snapshot file write. Adding quorum here
> > also minimizes changes.
> > Currently replication represented by the following scheme:
> > ```
> > Customer        Leader          WAL(L)        Replica        WAL(R)
> 
> 3. Are you saying 'leader' === 'master'?

Yes - according to international polite naming.
Mark Twain nowadays reads as obscene with his 'Mars Tom'.

> 
> >    |------TXN----->|              |             |              |
> >    |               |              |             |              |
> >    |         [TXN Rollback        |             |              |
> >    |            created]          |             |              |
> 
> 4. What is 'txn rollback', and why is it created before even a transaction
> is started? At least, rollback is a verb. Maybe you meant 'undo log'?

No, 'rollback' has both a verb and a noun meaning. Nevertheless, if you
tripped over this, it should be fixed.

> 
> >    |               |              |             |              |
> >    |               |-----TXN----->|             |              |
> >    |               |              |             |              |
> >    |               |<---WAL Ok----|             |              |
> >    |               |              |             |              |
> >    |         [TXN Rollback        |             |              |
> >    |           destroyed]         |             |              |
> >    |               |              |             |              |
> >    |<----TXN Ok----|              |             |              |
> >    |               |-------Replicate TXN------->|              |
> >    |               |              |             |              |
> >    |               |              |       [TXN Rollback        |
> >    |               |              |          created]          |
> >    |               |              |             |              |
> >    |               |              |             |-----TXN----->|
> >    |               |              |             |              |
> >    |               |              |             |<---WAL Ok----|
> >    |               |              |             |              |
> >    |               |              |       [TXN Rollback        |
> >    |               |              |         destroyed]         |
> >    |               |              |             |              |
> > ```
> > 
> > To introduce the 'quorum' we have to receive confirmation from replicas
> > to make a decision on whether the quorum is actually present. Leader
> > collects necessary amount of replicas confirmation plus its own WAL
> 
> 5. Please, define 'necessary amount'?

Apparently, resolved with comment #11

> 
> > success. This state is named 'quorum' and gives leader the right to
> > complete the customers' request. So the picture will change to:
> > ```
> > Customer        Leader          WAL(L)        Replica        WAL(R)
> >    |------TXN----->|              |             |              |
> >    |               |              |             |              |
> >    |         [TXN Rollback        |             |              |
> >    |            created]          |             |              |
> >    |               |              |             |              |
> >    |               |-----TXN----->|             |              |
> 
> 6. Are we going to replicate transaction after user writes commit()?

Does your 'user' mean the customer from the picture? In such a case, do you
expect to have an interactive transaction? We definitely do not consider
it here in any form, since replication happens only for the complete
transaction.

> Or will we replicate it while it is in progress? So called 'presumed
> commit'. I remember I read some papers explaining how it significantly
> speeds up synchronous transactions. Probably that was a paper about
> 2-phase commit, can't remember already. But the idea is still applicable
> for the replication too.

This can be considered only after MVCC is introduced - currently running
as a separate activity. Then we can replicate a transaction 'on the fly'
into a separate blob/readview/better_name. For now this would mean we are
too interwoven to correctly roll back afterwards, when the quorum fails.

> > In case of a leader failure a replica with the biggest LSN with former
> > leader's ID is elected as a new leader. The replica should record
> > 'rollback' in its WAL which effectively means that all transactions
> > without quorum should be rolled back. This rollback will be delivered to
> > all replicas and they will perform rollbacks of all transactions waiting
> > for quorum.
> 
> 7. Please, elaborate leader election. It is not as trivial as just 'elect'.
> What if the replica with the biggest LSN is temporary not available, but
> it knows that it has the biggest LSN? Will it become a leader without
> asking other nodes? What will do the other nodes? Will they wait for the
> new leader node to become available? Do they have a timeout on that?
> 
For now I do not plan any activities on HA - including automated failover
and leader re-election. In case the leader sees an insufficient number of
replicas to achieve the quorum, it stops, reporting the problem to the
external orchestrator.

> Basically, it would be nice to see the split-brain problem description here,
> and its solution for us.
> 
I believe the split-brain is under orchestrator control as well - we
should provide an API to switch the leader in the cluster, so that when a
former leader comes back it will not get a quorum for any txn it has,
replying to customers with failure as a result.

> How leader failure is detected? Do you rely on our heartbeat messages?
> Are you going to adapt SWIM for this?
> 
> Raft has a dedicated subsystem for election, it is not that simple. It
> involves voting, randomized algorithms. Am I missing something obvious in
> this RFC, which makes the leader election much simpler specifically for
> Tarantool?
> 
All of these I consider HA features, for when Tarantool can automate
failover and leader re-election. Out of scope for now.

> > An interface to force apply pending transactions by issuing a confirm
> > entry for them have to be introduced for manual recovery.
> > 
> > ### Snapshot generation.
> > 
> > We also can reuse current machinery of snapshot generation. Upon
> > receiving a request to create a snapshot an instance should request a
> > readview for the current commit operation. Although start of the
> > snapshot generation should be postponed until this commit operation
> > receives its confirmation. In case operation is rolled back, the snapshot
> > generation should be aborted and restarted using current transaction
> > after rollback is complete.
> 
> 8. This section highly depends on transaction manager for memtx. If you
> have a transaction manager, you always have a ready-to-use read-view
> of the latest committed data. At least this is my understanding.
> 
> After all, the manager should provide transaction isolation. And it means,
> that all non-committed transactions are not visible. And for that we need
> a read-view. Therefore, it could be used to make a snapshot.
> 
Currently there's no such manager for memtx. So I proposed this
workaround with minimal impact on our current machinery. 
Alexander Lyapunov is working on the manager in parallel, he reviewed
and blessed this RFC, so apparently there's no contradiction with his
plans.

> > After snapshot is created the WAL should start from the first operation
> > that follows the commit operation snapshot is generated for. That means
> > WAL will contain 'confirm' messages that refer to transactions that are
> > not present in the WAL. Apparently, we have to allow this for the case
> > 'confirm' refers to a transaction with LSN less than the first entry in
> > the WAL.
> 
> 9. I couldn't understand that. Why confirm is in WAL for data stored in
> the snap? I thought you said above, that snapshot should be done for all
> confirmed data. Besides, having confirm out of snap means the snap is
> not self-sufficient anymore.
>
The snap waits for a confirm message to start. During this wait the WAL
keeps growing. At the moment the confirm arrives the snap will be created -
say, for txn #10. The new WAL will start with lsn #11 and the commit can be
somewhere around lsn #30.
So, starting with this snap the data appears consistent up to lsn #10 - it is
guaranteed by the wait for the commit message. Then replay of the WAL will
come to a confirm message at lsn #30 - referring to lsn #10 - which is
actually ignored, since it points before the WAL start. There could be
confirm messages for even earlier txns if the wait takes long enough - all
of them will refer to an lsn beyond the start of the WAL. And it is Ok.
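
To make the numbers above concrete, here is a minimal Lua sketch (all names
are illustrative, this is not the real recovery code): a 'confirm' entry
whose referred LSN is at or below the snapshot LSN is simply skipped, while
anything newer releases the pending undo info.

```
-- Illustrative replay of the example above: snapshot taken at lsn 10,
-- the new xlog starts at lsn 11, the confirm for lsn 10 shows up near lsn 30.
local snap_lsn = 10
local pending = {}                  -- lsn -> undo info of unconfirmed txns

local function apply_confirm(ref_lsn)
    if ref_lsn <= snap_lsn then
        return                      -- refers to data already in the snapshot
    end
    for lsn in pairs(pending) do
        if lsn <= ref_lsn then
            pending[lsn] = nil      -- txn confirmed, drop its undo info
        end
    end
end

apply_confirm(10)                   -- the confirm written near lsn 30: ignored
```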

> > In case master appears unavailable a replica still have to be able to
> > create a snapshot. Replica can perform rollback for all transactions that
> > are not confirmed and claim its LSN as the latest confirmed txn. Then it
> > can create a snapshot in a regular way and start with blank xlog file.
> > All rolled back transactions will appear through the regular replication
> > in case master reappears later on.
> 
> 10. You should be able to make a snapshot without rollback. Read-views are
> available anyway. At least it is so in Vinyl, from what I remember. And this
> is going to be similar in memtx.
> 
You have to make a snapshot of a consistent data state. Until we have a
transaction manager in memtx, this is the way to do so. And as I
mentioned, that is a separate activity.

> > Cluster description should contain explicit attribute for each replica
> > to denote it participates in synchronous activities. Also the description
> > should contain criterion on how many replicas responses are needed to
> > achieve the quorum.
> 
> 11. Aha, I see 'necessary amount' from above is a manually set value. Ok.
> 
> > 
> > ## Rationale and alternatives
> > 
> > There is an implementation of synchronous replication as part of gh-980
> > activities, still it is not in a state to get into the product. More
> > than that it intentionally breaks backward compatibility which is a
> > prerequisite for this proposal.
> 
> 12. How are we going to deal with fsync()? Will it be forcefully enabled
> on sync replicas and the leader?

To my understanding - it's up to the user. I was considering a cluster that
has no WAL at all - relying on synchronous replication and a sufficient
number of replicas. Everyone I asked about it told me I'm nuts. To my great
surprise Alexander Lyapunov brought exactly the same idea to discuss.

All of this leads to one resolution: I would leave it for the user to decide.
Obviously, to speed up processing the leader can disable the WAL completely,
but to do so we have to re-work the relay to work from memory. Replicas
can use the WAL in the way the user wants: 2 replicas with slow HDDs needn't
wait for fsync(), while a super-fast Intel DCPMM one can enable it. Balancing
is up to the user.
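
For illustration only - a sketch of what that per-instance balancing could
look like with the existing wal_mode option, assuming the quorum machinery
itself does not force a particular setting:

```
-- Leader: keep the WAL, but don't pay for fsync() on every write.
box.cfg{ wal_mode = 'write' }

-- Replica on fast persistent storage: enable fsync() durability.
box.cfg{ wal_mode = 'fsync' }

-- Replica with a slow HDD: keep 'write' and rely on the quorum instead.
box.cfg{ wal_mode = 'write' }
```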


* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-21 10:49   ` Sergey Ostanevich
@ 2020-04-21 22:17     ` Vladislav Shpilevoy
  2020-04-22 16:50       ` Sergey Ostanevich
  2020-04-23  6:58       ` Konstantin Osipov
  0 siblings, 2 replies; 53+ messages in thread
From: Vladislav Shpilevoy @ 2020-04-21 22:17 UTC (permalink / raw)
  To: Sergey Ostanevich; +Cc: tarantool-patches

>>>   - high availability (HA) solution with automated failover, roles
>>>     assignments an so on
>>
>> 1. So no leader election? That essentially makes single failure point
>> for RW requests, is it correct?
>>
>> On the other hand I see section 'Recovery and failover.' below. And
>> it seems to be automated, with selecting a replica with the biggest
>> LSN. Where is the truth?
>>
> 
> The failover can be manual or implemented independnetnly. By no means
> this means we should not explain how this should be done according to
> the replication schema discussed.

But it is not explained. You just said 'the biggest LSN owner is chosen'.
This looks very good in theory and on paper, but it is not that simple if
you are going to explain how it actually works. You said 'by no means we
should not explain'.

Talking of the election in scope of our replication schema, I don't see
where it is discussed. Is there a separate RFC I am missing? I am asking exactly
about that - where are pings, healthchecks, timeouts, voting on top of our
replication schema? If you don't want to make the election a part of this RFC
at all, then why is there a section, which literally says that the election
is present and it is 'the biggest LSN owner is chosen'?

In case the election is out of this for now, did you think about how a possible
future leader election algorithm could be implemented on top of this sync
replication? Just to be sure we are not ruining some things which will be
necessary for auto election later on.

> And yes, the SPOF is the leader of the cluster. This is expected and is
> Ok according to all MRG planning meeting participants.

That is not about MRG only, Tarantool is not a closed-source MRG-only DB.
I am not against making it non-automated for now, but I want to be sure it
will be possible to implement this as an enhancement.

>>>   - master-master configuration support
>>>
>>> ## Background and motivation
>>>
>>> There are number of known implementation of consistent data presence in
>>> a cluster. They can be commonly named as "wait for LSN" technique. The
>>> biggest issue with this technique is the absence of rollback guarantees
>>> at replica in case of transaction failure on one master or some of the
>>> replicas in the cluster.
>>>
>>> To provide such capabilities a new functionality should be introduced in
>>> Tarantool core, with requirements mentioned before - backward
>>> compatibility and ease of cluster orchestration.
>>>
>>> ## Detailed design
>>>
>>> ### Quorum commit
>>>
>>> The main idea behind the proposal is to reuse existent machinery as much
>>> as possible. It will ensure the well-tested and proven functionality
>>> across many instances in MRG and beyond is used. The transaction rollback
>>> mechanism is in place and works for WAL write failure. If we substitute
>>> the WAL success with a new situation which is named 'quorum' later in
>>> this document then no changes to the machinery is needed.
>>
>> 2. The problem here is that you create dependency on WAL. According to
>> your words, replication is inside WAL, and if WAL gave ok, then all is
> 
> 'Replication is inside WAL' - what do you mean by that? The replication in
> its current state works from WAL, although it's an exaggregation to say
> it is 'inside WAL'. Why it means a new dependency after that?

You said 'WAL success' is substituted with a new situation 'quorum'. It
means, strictly interpreting your words, that 'wal_write()' function
won't return 0 until quorum is collected.

This is what I mean by moving replication into WAL subsystem.

>> replicated and applied. But that makes current code structure even worse
>> than it is. Now WAL, GC, and replication code is spaghetti, basically.
>> All depends on all. I was rather thinking, that we should fix that first.
>> Not aggravate.
>>
>> WAL should provide API for writing to disk. Replication should not bother
>> about WAL. GC should not bother about replication. All should be independent,
>> and linked in one place by some kind of a manager, which would just use their
> 
> So you want to introduce a single point that will translate all messages
> between all participants?

Well, this is called 'cbus', we already have it. This is a separate headache,
which no one understands except Georgy. However I was not talking about it.
I am wrong, it should not be a single manager. But all the subsystems should
be as independent as possible still.

> I believe current state was introduced exactly
> to avoid this situation. Each participant can be subscribed for a
> particular trigger inside another participant and take it into account
> in its activities - at the right time for itself. 

Whatever. I don't know the code well enough, so I am probably wrong
somewhere here. But every time looking at these numerous triggers depending
on each other and called at arbitrary moments was enough. I tried to fix
these trigger-dependencies in scope of different tasks, but cleaning this
code appears to be a huge task by itself.

This can be done without obscure triggers called from arbitrary threads at
arbitrary moments of time. That in fact was the main blocker for the in-memory
WAL, when I tried to finish it after Georgy in January. We have fibers exactly
to avoid triggers, to be able to write linear and simple code. The triggers
can be replaced by dedicated fibers, and fiber condition variables can be used
to wait for the exact events where the functionality is event based.
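
For illustration, a minimal Lua-level sketch of that pattern (the real code
would use the C fiber_cond API; all names here are made up):

```
local fiber = require('fiber')

local wal_written = fiber.cond()    -- condition variable instead of a trigger
local last_lsn = 0

local function on_wal_write(lsn)    -- producer side: announce a new WAL write
    last_lsn = lsn
    wal_written:broadcast()
end

fiber.create(function()             -- dedicated consumer fiber, linear code
    local seen = 0
    while true do
        while last_lsn == seen do
            wal_written:wait()      -- sleep until the next event
        end
        seen = last_lsn
        -- react to the new WAL entries here
    end
end)
```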

>> APIs. I believe Cyrill G. would agree with me here, I remember him
>> complaining about replication-wal-gc code inter-dependencies too. Please,
>> request his review on this, if you didn't yet.
>>
> I personally have the same problem trying to implement a trivial test,
> by just figuring out the layers and dependencies of participants. This
> is about poor documentation im my understanding, not poor design. 

There is almost more documentation than code. Just look at the number
of comments and the level of their detail. And still it does not help. So
it looks like just a bad API and dependency design. No matter how heavily
it is documented, it just becomes more interdependent and harder to wrap a
mind around.

>>> In case of a leader failure a replica with the biggest LSN with former
>>> leader's ID is elected as a new leader. The replica should record
>>> 'rollback' in its WAL which effectively means that all transactions
>>> without quorum should be rolled back. This rollback will be delivered to
>>> all replicas and they will perform rollbacks of all transactions waiting
>>> for quorum.
>>
>> 7. Please, elaborate leader election. It is not as trivial as just 'elect'.
>> What if the replica with the biggest LSN is temporary not available, but
>> it knows that it has the biggest LSN? Will it become a leader without
>> asking other nodes? What will do the other nodes? Will they wait for the
>> new leader node to become available? Do they have a timeout on that?
>>
> By now I do not plan any activities on HA - including automated failover
> and leader re-election. In case leader sees insufficient number of
> replicas to achieve quorum - it stops, reporting the problem to the
> external orchestrator. 

From the text in this section it does not look like an unplanned activity,
but like an already-made decision. It is not even in a 'Plans' section.
You just said 'is elected'. By whom? How?

If the election is not a part of the RFC, I would suggest moving this out,
or into a separate section 'Plans' or something. Or reformulate this by
saying like 'the cluster stops serving write requests until external
intervention sets a new leader'. And 'it is *advised* to use the one with
the biggest LSN in the old leader's vclock component'. Something like that.

>> Basically, it would be nice to see the split-brain problem description here,
>> and its solution for us.
>>
> I believe the split-brain is under orchestrator control either - we
> should provide API to switch leader in the cluster, so that when a
> former leader came back it will not get quorum for any txn it has,
> replying to customers with failure as a result.

Exactly. We should provide something for this from inside. But are there
any details? How should that work? Should all the healthy replicas reject
everything from the false-leader? Should the false-leader somehow realize,
that it is not considered a leader anymore, and should stop itself? If we
choose the former way, how a replica defines who is the true leader? For
example, some replicas still may consider the old leader as a true master.
If we choose the latter way, what is the algorithm of determining that we
are not a leader anymore?

>>> After snapshot is created the WAL should start from the first operation
>>> that follows the commit operation snapshot is generated for. That means
>>> WAL will contain 'confirm' messages that refer to transactions that are
>>> not present in the WAL. Apparently, we have to allow this for the case
>>> 'confirm' refers to a transaction with LSN less than the first entry in
>>> the WAL.
>>
>> 9. I couldn't understand that. Why confirm is in WAL for data stored in
>> the snap? I thought you said above, that snapshot should be done for all
>> confirmed data. Besides, having confirm out of snap means the snap is
>> not self-sufficient anymore.
>>
> Snap waits for confirm message to start. During this wait the WAL keep
> growing. At the moment confirm arrived the snap will be created - say,
> for txn #10. The WAL will be started with lsn #11 and commit can be
> somewhere lsn #30. 
> So, starting with this snap data appears consistent for lsn #10 - it is
> guaranteed by the wait of commit message. Then replay of WAL will come 
> to a confirm message lsn #30 - referring to lsn #10 - that actually
> ignored, since it looks beyond the WAL start. There could be confirm
> messages for even earlier txns if wait takes sufficient time - all of
> them will refer to lsn beyond the WAL. And it is Ok.

What is 'commit message'? I don't see it on the schema above. I see only
confirms.

So the problem is that some data may be written to WAL after we started
committing our transactions going to the snap, but before we received a
quorum. And we can't truncate the WAL by the quorum, because there is
already newer data, which was not included into the snap. Because WAL is
not stopped, it still accepts new transactions. Now I understand.

Would be good to have this example in the RFC.

>>> Cluster description should contain explicit attribute for each replica
>>> to denote it participates in synchronous activities. Also the description
>>> should contain criterion on how many replicas responses are needed to
>>> achieve the quorum.
>>
>> 11. Aha, I see 'necessary amount' from above is a manually set value. Ok.
>>
>>>
>>> ## Rationale and alternatives
>>>
>>> There is an implementation of synchronous replication as part of gh-980
>>> activities, still it is not in a state to get into the product. More
>>> than that it intentionally breaks backward compatibility which is a
>>> prerequisite for this proposal.
>>
>> 12. How are we going to deal with fsync()? Will it be forcefully enabled
>> on sync replicas and the leader?
> 
> To my understanding - it's up to user. I was considering a cluster that
> has no WAL at all - relying on sychro replication and sufficient number
> of replicas. Everyone who I asked about it told me I'm nuts. To my great
> surprise Alexander Lyapunov brought exactly the same idea to discuss. 

I didn't see an RFC on that, and this can easily become possible when the
in-memory relay is implemented, if it is implemented in a clean way. We
can just turn off the disk backoff, and it will work from memory only.

> All of these is for one resolution: I would keep it for user to decide.
> Obviously, to speed up the processing leader can disable wal completely,
> but to do so we have to re-work the relay to work from memory. Replicas
> can use WAL in a way user wants: 2 replicas with slow HDD should'n wait
> for fsync(), while super-fast Intel DCPMM one can enable it. Balancing
> is up to user.

The possibility of omitting fsync means it is possible that all nodes
write a confirm, which is reported to the client, then the nodes restart,
and the data is lost. I would state that somewhere.


* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-21 22:17     ` Vladislav Shpilevoy
@ 2020-04-22 16:50       ` Sergey Ostanevich
  2020-04-22 20:28         ` Vladislav Shpilevoy
  2020-04-23  6:58       ` Konstantin Osipov
  1 sibling, 1 reply; 53+ messages in thread
From: Sergey Ostanevich @ 2020-04-22 16:50 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

Hi!

On 22 Apr 00:17, Vladislav Shpilevoy wrote:
> >>>   - high availability (HA) solution with automated failover, roles
> >>>     assignments an so on
> >>
> >> 1. So no leader election? That essentially makes single failure point
> >> for RW requests, is it correct?
> >>
> >> On the other hand I see section 'Recovery and failover.' below. And
> >> it seems to be automated, with selecting a replica with the biggest
> >> LSN. Where is the truth?
> >>
> > 
> > The failover can be manual or implemented independnetnly. By no means
> > this means we should not explain how this should be done according to
> > the replication schema discussed.
> 
> But it is not explained. You just said 'the biggest LSN owner is chosen'.
> This looks very good in theory and on the paper, but it is not that simple.
> If you are going to explain how it works. You said 'by no means we should not
> explain'.
> 
I expect this to be a reference to what is currently implemented by
Tarantool users in many ways. I think I have to rephrase it as: one 'can
keep the current election approach using the biggest LSN', since the proposed
solution does not change the current semantics of WAL generation, it just
adds 'confirm' and 'rollback' operations that are regular entries in the WAL.
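
As an illustration of that 'biggest LSN' rule - a sketch an external
orchestrator could follow, assuming it has collected box.info.vclock from
every surviving replica (names are made up):

```
-- Pick the replica that has applied the most of the old leader's changes.
local function pick_new_leader(vclocks, old_leader_id)
    -- vclocks: candidate name -> its vclock table (replica id -> lsn)
    local best_name, best_lsn = nil, -1
    for name, vclock in pairs(vclocks) do
        local lsn = vclock[old_leader_id] or 0
        if lsn > best_lsn then
            best_name, best_lsn = name, lsn
        end
    end
    return best_name
end

print(pick_new_leader({ r1 = {[1] = 30}, r2 = {[1] = 28} }, 1))  -- 'r1'
```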

> Talking of the election in scope of our replication schema, I don't see
> where it is discussed. Is there a separate RFC I am missing? I am asking exactly
> about that - where are pings, healthchecks, timeouts, voting on top of our
> replication schema? If you don't want to make the election a part of this RFC
> at all, then why is there a section, which literally says that the election
> is present and it is 'the biggest LSN owner is chosen'?
> 
> In case the election is out of this for now, did you think about how a possible
> future leader election algorithm could be implemented on top of this sync
> replication? Just to be sure we are not ruining some things which will be
> necessary for auto election later on.
> 
My answer will be the same - no changes to the WAL from this point of view;
the replicas just have their respective undo logs and can roll back to a
consistent state.

> > And yes, the SPOF is the leader of the cluster. This is expected and is
> > Ok according to all MRG planning meeting participants.
> 
> That is not about MRG only, Tarantool is not a closed-source MRG-only DB.
> I am not against making it non-automated for now, but I want to be sure it
> will be possible to implement this as an enhancement.
> 
Sure we want to - using the existing SWIM module for membership and
elaborating something close to Raft - still, it is not in our immediate plans.

> >>>   - master-master configuration support
> >>>
> >>> ## Background and motivation
> >>>
> >>> There are number of known implementation of consistent data presence in
> >>> a cluster. They can be commonly named as "wait for LSN" technique. The
> >>> biggest issue with this technique is the absence of rollback guarantees
> >>> at replica in case of transaction failure on one master or some of the
> >>> replicas in the cluster.
> >>>
> >>> To provide such capabilities a new functionality should be introduced in
> >>> Tarantool core, with requirements mentioned before - backward
> >>> compatibility and ease of cluster orchestration.
> >>>
> >>> ## Detailed design
> >>>
> >>> ### Quorum commit
> >>>
> >>> The main idea behind the proposal is to reuse existent machinery as much
> >>> as possible. It will ensure the well-tested and proven functionality
> >>> across many instances in MRG and beyond is used. The transaction rollback
> >>> mechanism is in place and works for WAL write failure. If we substitute
> >>> the WAL success with a new situation which is named 'quorum' later in
> >>> this document then no changes to the machinery is needed.
> >>
> >> 2. The problem here is that you create dependency on WAL. According to
> >> your words, replication is inside WAL, and if WAL gave ok, then all is
> > 
> > 'Replication is inside WAL' - what do you mean by that? The replication in
> > its current state works from WAL, although it's an exaggregation to say
> > it is 'inside WAL'. Why it means a new dependency after that?
> 
> You said 'WAL success' is substituted with a new situation 'quorum'. It
> means, strictly interpreting your words, that 'wal_write()' function
> won't return 0 until quorum is collected.
> 
> This is what I mean by moving replication into WAL subsystem.
> 
wal_write() should report the result of the WAL operation. It should not
return the quorum - the WAL result should be used along with the quorum
messages from the replicas to declare the txn complete. This shouldn't be
part of the WAL.
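
A minimal sketch of what is meant here, kept outside the WAL code
(write_confirm() and the counting scheme are illustrative, not the
proposed API):

```
local quorum = 3                    -- configured acks needed, leader included
local pending = {}                  -- lsn -> acknowledgements collected so far

local function write_confirm(lsn)
    -- illustrative stub: append a 'confirm' entry for 'lsn' to the WAL
end

-- Called both when the leader's own wal_write() succeeds and when a
-- sync replica acknowledges the txn over IPROTO.
local function register_ack(lsn)
    pending[lsn] = (pending[lsn] or 0) + 1
    if pending[lsn] >= quorum then
        write_confirm(lsn)          -- this also confirms all earlier txns
        pending[lsn] = nil
    end
end
```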

> >> replicated and applied. But that makes current code structure even worse
> >> than it is. Now WAL, GC, and replication code is spaghetti, basically.
> >> All depends on all. I was rather thinking, that we should fix that first.
> >> Not aggravate.
> >>
> >> WAL should provide API for writing to disk. Replication should not bother
> >> about WAL. GC should not bother about replication. All should be independent,
> >> and linked in one place by some kind of a manager, which would just use their
> > 
> > So you want to introduce a single point that will translate all messages
> > between all participants?
> 
> Well, this is called 'cbus', we already have it. This is a separate headache,
> which no one understands except Georgy. However I was not talking about it.
> I am wrong, it should not be a single manager. But all the subsystems should
> be as independent as possible still.
> 
> > I believe current state was introduced exactly
> > to avoid this situation. Each participant can be subscribed for a
> > particular trigger inside another participant and take it into account
> > in its activities - at the right time for itself. 
> 
> Whatever. I don't know the code good enough, so I am probably wrong
> somewhere here. But every time looking at these numerous triggers depending
> on each other and called at arbitrary moments was enough. I tried to fix
> these trigger-dependencies in scope of different tasks, but cleaning this
> code appears to be a huge task by itself.
> 
> This can be done without obscure triggers called from arbitrary threads at
> atribtrary moments of time. That in fact was the main blocker for the in-memory
> WAL, when I tried to finish it after Georgy in January. We have fibers exactly
> to avoid triggers. To be able to write linear and simple code. The triggers
> can be replaced by dedicated fibers, and fiber condition variables can be used
> to wait for exact moments of time of needed events where functionality is
> event based.
> 
I totally agree with you that it is a big task by itself. I believe we won't
introduce too many extra dependencies between the parties - just tweak
some of them.
So far I want to start an activity - Cyrill Gorcunov supports me - to
draw a mutual dependency map of all participants: their threads, fibers,
triggers and how they are connected. I believe it will help us to
prepare a first step to redesign the system - or make a thoughtful
decision to keep it as is.

> >> APIs. I believe Cyrill G. would agree with me here, I remember him
> >> complaining about replication-wal-gc code inter-dependencies too. Please,
> >> request his review on this, if you didn't yet.
> >>
> > I personally have the same problem trying to implement a trivial test,
> > by just figuring out the layers and dependencies of participants. This
> > is about poor documentation im my understanding, not poor design. 
> 
> There is almost more documentation than the code. Just look at the number
> of comments and level of their details. And still it does not help. So
> looks like just a bad API and dependency design. Does not matter how hard
> it is documented, it just becomes more interdepending and harder to wrap a
> mind around it.
> 
You can document every line of code with an explanation of what it does,
but no single line will ever require drawing the 'big picture'. I
believe that is the problem. Design is not the code, rather a guideline for
the code. To decipher it back from the (perhaps sometimes not-so-good) code
is a big task by itself. Such a map should help us understand - and only then
improve - the implementation.

> >>> In case of a leader failure a replica with the biggest LSN with former
> >>> leader's ID is elected as a new leader. The replica should record
> >>> 'rollback' in its WAL which effectively means that all transactions
> >>> without quorum should be rolled back. This rollback will be delivered to
> >>> all replicas and they will perform rollbacks of all transactions waiting
> >>> for quorum.
> >>
> >> 7. Please, elaborate leader election. It is not as trivial as just 'elect'.
> >> What if the replica with the biggest LSN is temporary not available, but
> >> it knows that it has the biggest LSN? Will it become a leader without
> >> asking other nodes? What will do the other nodes? Will they wait for the
> >> new leader node to become available? Do they have a timeout on that?
> >>
> > By now I do not plan any activities on HA - including automated failover
> > and leader re-election. In case leader sees insufficient number of
> > replicas to achieve quorum - it stops, reporting the problem to the
> > external orchestrator. 
> 
> From the text in this section it does not look like a not planned activity,
> but like an already made decision. It is not even in the 'Plans' section.
> You just said 'is elected'. By whom? How?
> 
Again, I want this to refer to what current users already do - I will rephrase.

> If the election is not a part of the RFC, I would suggest moving this out,
> or into a separate section 'Plans' or something. Or reformulate this by
> saying like 'the cluster stops serving write requests until external
> intervention sets a new leader'. And 'it is *advised* to use the one with
> the biggest LSN in the old leader's vclock component'. Something like that.
> 
I don't think the automated election should even be a plan for SR. It is
a feature on top of it and shouldn't be a prerequisite in any form.

> >> Basically, it would be nice to see the split-brain problem description here,
> >> and its solution for us.
> >>
> > I believe the split-brain is under orchestrator control either - we
> > should provide API to switch leader in the cluster, so that when a
> > former leader came back it will not get quorum for any txn it has,
> > replying to customers with failure as a result.
> 
> Exactly. We should provide something for this from inside. But are there
> any details? How should that work? Should all the healthy replicas reject
> everything from the false-leader? Should the false-leader somehow realize,
> that it is not considered a leader anymore, and should stop itself? If we
> choose the former way, how a replica defines who is the true leader? For
> example, some replicas still may consider the old leader as a true master.
> If we choose the latter way, what is the algorithm of determining that we
> are not a leader anymore?
> 
It is all about external orchestration - if a replica can't get a ping from
the leader, it stops, reporting its status to the orchestrator.
If the leader has lost so many replicas that the quorum becomes impossible,
it stops replication, reporting to the orchestrator.
Will that be sufficient to cover the question?
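
A sketch of the leader-side half of that contract, assuming a configured
quorum and counting followers via box.info.replication (the reporting hook
is made up):

```
local fiber = require('fiber')

local quorum = 2                          -- sync replicas needed for the quorum

local function report_to_orchestrator()
    -- illustrative stub: alert the external failover tooling
end

local function followers()
    local n = 0
    for _, r in pairs(box.info.replication) do
        if r.downstream ~= nil and r.downstream.status == 'follow' then
            n = n + 1
        end
    end
    return n
end

fiber.create(function()
    while true do
        if followers() < quorum then
            box.cfg{ read_only = true }   -- stop serving writes
            report_to_orchestrator()
        end
        fiber.sleep(1)
    end
end)
```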

> >>> After snapshot is created the WAL should start from the first operation
> >>> that follows the commit operation snapshot is generated for. That means
> >>> WAL will contain 'confirm' messages that refer to transactions that are
> >>> not present in the WAL. Apparently, we have to allow this for the case
> >>> 'confirm' refers to a transaction with LSN less than the first entry in
> >>> the WAL.
> >>
> >> 9. I couldn't understand that. Why confirm is in WAL for data stored in
> >> the snap? I thought you said above, that snapshot should be done for all
> >> confirmed data. Besides, having confirm out of snap means the snap is
> >> not self-sufficient anymore.
> >>
> > Snap waits for confirm message to start. During this wait the WAL keep
> > growing. At the moment confirm arrived the snap will be created - say,
> > for txn #10. The WAL will be started with lsn #11 and commit can be
> > somewhere lsn #30. 
> > So, starting with this snap data appears consistent for lsn #10 - it is
> > guaranteed by the wait of commit message. Then replay of WAL will come 
> > to a confirm message lsn #30 - referring to lsn #10 - that actually
> > ignored, since it looks beyond the WAL start. There could be confirm
> > messages for even earlier txns if wait takes sufficient time - all of
> > them will refer to lsn beyond the WAL. And it is Ok.
> 
> What is 'commit message'? I don't see it on the schema above. I see only
> confirms.
> 
Sorry, it is a misprint - I meant 'confirm'.

> So the problem is that some data may be written to WAL after we started
> committing our transactions going to the snap, but before we received a
> quorum. And we can't truncate the WAL by the quorum, because there is
> already newer data, which was not included into the snap. Because WAL is
> not stopped, it still accepts new transactions. Now I understand.
> 
> Would be good to have this example in the RFC.
>
Ok, I will try to elaborate on this.

> >>> Cluster description should contain explicit attribute for each replica
> >>> to denote it participates in synchronous activities. Also the description
> >>> should contain criterion on how many replicas responses are needed to
> >>> achieve the quorum.
> >>
> >> 11. Aha, I see 'necessary amount' from above is a manually set value. Ok.
> >>
> >>>
> >>> ## Rationale and alternatives
> >>>
> >>> There is an implementation of synchronous replication as part of gh-980
> >>> activities, still it is not in a state to get into the product. More
> >>> than that it intentionally breaks backward compatibility which is a
> >>> prerequisite for this proposal.
> >>
> >> 12. How are we going to deal with fsync()? Will it be forcefully enabled
> >> on sync replicas and the leader?
> > 
> > To my understanding - it's up to user. I was considering a cluster that
> > has no WAL at all - relying on sychro replication and sufficient number
> > of replicas. Everyone who I asked about it told me I'm nuts. To my great
> > surprise Alexander Lyapunov brought exactly the same idea to discuss. 
> 
> I didn't see an RFC on that, and this can become easily possible, when
> in-memory relay is implemented. If it is implemented in a clean way. We
> just can turn off the disk backoff, and it will work from memory-only.
> 
It is not in the RFC, and we had no support for it from the customers in question.

> > All of these is for one resolution: I would keep it for user to decide.
> > Obviously, to speed up the processing leader can disable wal completely,
> > but to do so we have to re-work the relay to work from memory. Replicas
> > can use WAL in a way user wants: 2 replicas with slow HDD should'n wait
> > for fsync(), while super-fast Intel DCPMM one can enable it. Balancing
> > is up to user.
> 
> Possibility of omitting fsync means that it is possible, that all nodes
> write confirm, which is reported to the client, then the nodes restart,
> and the data is lost. I would say it somewhere.

The data will not be lost, unless _all_ nodes fail at the same time -
including the leader. Otherwise the data will be propagated from the
survivors through regular replication. No change here compared to what
we have currently.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-22 16:50       ` Sergey Ostanevich
@ 2020-04-22 20:28         ` Vladislav Shpilevoy
  0 siblings, 0 replies; 53+ messages in thread
From: Vladislav Shpilevoy @ 2020-04-22 20:28 UTC (permalink / raw)
  To: Sergey Ostanevich; +Cc: tarantool-patches

>>>> Basically, it would be nice to see the split-brain problem description here,
>>>> and its solution for us.
>>>>
>>> I believe the split-brain is under orchestrator control either - we
>>> should provide API to switch leader in the cluster, so that when a
>>> former leader came back it will not get quorum for any txn it has,
>>> replying to customers with failure as a result.
>>
>> Exactly. We should provide something for this from inside. But are there
>> any details? How should that work? Should all the healthy replicas reject
>> everything from the false-leader? Should the false-leader somehow realize,
>> that it is not considered a leader anymore, and should stop itself? If we
>> choose the former way, how a replica defines who is the true leader? For
>> example, some replicas still may consider the old leader as a true master.
>> If we choose the latter way, what is the algorithm of determining that we
>> are not a leader anymore?
>>
> It is all about external orchestration - if replica can't get ping from
> leader it stops, reporting its status to orchestrator. 
> If leader lost number of replicas that makes quorum impossible - it
> stops replication, reporting to the orchestrator. 
> Will it be sufficient to cover the question?

Perhaps.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-21 22:17     ` Vladislav Shpilevoy
  2020-04-22 16:50       ` Sergey Ostanevich
@ 2020-04-23  6:58       ` Konstantin Osipov
  2020-04-23  9:14         ` Konstantin Osipov
  1 sibling, 1 reply; 53+ messages in thread
From: Konstantin Osipov @ 2020-04-23  6:58 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/04/22 01:21]:
> > To my understanding - it's up to user. I was considering a cluster that
> > has no WAL at all - relying on sychro replication and sufficient number
> > of replicas. Everyone who I asked about it told me I'm nuts. To my great
> > surprise Alexander Lyapunov brought exactly the same idea to discuss. 
> 
> I didn't see an RFC on that, and this can become easily possible, when
> in-memory relay is implemented. If it is implemented in a clean way. We
> just can turn off the disk backoff, and it will work from memory-only.

Sync replication must work from in-memory relay only. It works as
a natural failure detector: a replica which is slow or unavailable
is first removed from the subscribers of in-memory relay, and only 
then (possibly much much later) is marked as down.

By looking at the in-memory relay you have a clear idea of what peers
are available and can abort a transaction right away if the cluster is
in a downgraded state. You never wait for impossible events.

If you do have to wait, and say your wait timeout is 1 second, you
quickly run out of fibers in the fiber pool for any other work,
because all of them will be waiting on the sync transactions they
picked up from iproto to finish. The system will lose its
throttling capability.

There are other reasons, too: the protocol will eventually be
quite tricky and the logic has to reside in a single place and not
require inter-thread communication.
Committing a transaction anywhere outside the WAL will require
inter-thread communication, which is costly and should be avoided.

I am surprised I have to explain this again and again - I never
assumed this spec to be a precursor of a half-baked implementation,
only a high-level description of the next steps after the in-memory
relay is in.

> > All of these is for one resolution: I would keep it for user to decide.
> > Obviously, to speed up the processing leader can disable wal completely,
> > but to do so we have to re-work the relay to work from memory. Replicas
> > can use WAL in a way user wants: 2 replicas with slow HDD should'n wait
> > for fsync(), while super-fast Intel DCPMM one can enable it. Balancing
> > is up to user.
> 
> Possibility of omitting fsync means that it is possible, that all nodes
> write confirm, which is reported to the client, then the nodes restart,
> and the data is lost. I would say it somewhere.

Worse yet, you can elect a leader "based on WAL length" and then it
is no longer the leader, because it lost its long WAL after a
restart. fsync() is mandatory during election; in other cases it
shouldn't impact durability in our case.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-23  6:58       ` Konstantin Osipov
@ 2020-04-23  9:14         ` Konstantin Osipov
  2020-04-23 11:27           ` Sergey Ostanevich
  0 siblings, 1 reply; 53+ messages in thread
From: Konstantin Osipov @ 2020-04-23  9:14 UTC (permalink / raw)
  To: Vladislav Shpilevoy, Sergey Ostanevich, tarantool-patches

* Konstantin Osipov <kostja.osipov@gmail.com> [20/04/23 09:58]:
> > > To my understanding - it's up to user. I was considering a cluster that
> > > has no WAL at all - relying on sychro replication and sufficient number
> > > of replicas. Everyone who I asked about it told me I'm nuts. To my great
> > > surprise Alexander Lyapunov brought exactly the same idea to discuss. 
> > 
> > I didn't see an RFC on that, and this can become easily possible, when
> > in-memory relay is implemented. If it is implemented in a clean way. We
> > just can turn off the disk backoff, and it will work from memory-only.
> 
> Sync replication must work from in-memory relay only. It works as
> a natural failure detector: a replica which is slow or unavailable
> is first removed from the subscribers of in-memory relay, and only 
> then (possibly much much later) is marked as down.
> 
> By looking at the in-memory relay you have a clear idea what peers
> are available and can abort a transaction if a cluster is in the
> downgraded state right away. You never wait for impossible events. 
> 
> If you do have to wait, and say your wait timeout is 1 second, you
> quickly run out of any fibers in the fiber pool for any work,
> because all of them will be waiting on the sync transactions they
> picked up from iproto to finish. The system will loose its
> throttling capability. 

The other issue is that if your replicas are alive but
slow/lagging behind, you can't let too many undo records pile up
unacknowledged in the tx thread.
The in-memory relay solves this nicely too, because it kicks
replicas out of memory mode into file mode if they are unable to
keep up with the rate of change.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-23  9:14         ` Konstantin Osipov
@ 2020-04-23 11:27           ` Sergey Ostanevich
  2020-04-23 11:43             ` Konstantin Osipov
  0 siblings, 1 reply; 53+ messages in thread
From: Sergey Ostanevich @ 2020-04-23 11:27 UTC (permalink / raw)
  To: Konstantin Osipov, Vladislav Shpilevoy, tarantool-patches

Hi!

Thanks for review!

On 23 Apr 12:14, Konstantin Osipov wrote:
> * Konstantin Osipov <kostja.osipov@gmail.com> [20/04/23 09:58]:
> > > > To my understanding - it's up to user. I was considering a cluster that
> > > > has no WAL at all - relying on sychro replication and sufficient number
> > > > of replicas. Everyone who I asked about it told me I'm nuts. To my great
> > > > surprise Alexander Lyapunov brought exactly the same idea to discuss. 
> > > 
> > > I didn't see an RFC on that, and this can become easily possible, when
> > > in-memory relay is implemented. If it is implemented in a clean way. We
> > > just can turn off the disk backoff, and it will work from memory-only.
> > 
> > Sync replication must work from in-memory relay only. It works as
> > a natural failure detector: a replica which is slow or unavailable
> > is first removed from the subscribers of in-memory relay, and only 
> > then (possibly much much later) is marked as down.
> > 
> > By looking at the in-memory relay you have a clear idea what peers
> > are available and can abort a transaction if a cluster is in the
> > downgraded state right away. You never wait for impossible events. 
> > 
> > If you do have to wait, and say your wait timeout is 1 second, you
> > quickly run out of any fibers in the fiber pool for any work,
> > because all of them will be waiting on the sync transactions they
> > picked up from iproto to finish. The system will loose its
> > throttling capability. 
> 
There's no need to explain it to the customer: sync replication is not
expected to be as fast as pure in-memory operation - by no means. We
have network communication, disk operations, a quorum of multiple
entities - all of these can't be as fast. No need to try to cram more
than the network can push through, obviously.

The quality one buys for this price is consistency of data across
multiple instances distributed over different locations.

> The other issue is that if your replicas are alive but
> slow/lagging behind, you can't let too many undo records to
> pile up unacknowledged in tx thread.
> The in-memory relay solves this nicely too, because it kicks out
> replicas from memory to file mode if they are unable to keep up
> with the speed of change.
> 
That is the same problem - the leader's resources, hence a natural
limit for throughput. I bet Tarantool faces similar limitations even
now, although different ones.

The in-memory relay is supposed to keep the same interface, so we
expect to hop easily onto this new shiny express as soon as it
appears. This will be an optimization: we implement something first
and then speed it up.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-23 11:27           ` Sergey Ostanevich
@ 2020-04-23 11:43             ` Konstantin Osipov
  2020-04-23 15:11               ` Sergey Ostanevich
  0 siblings, 1 reply; 53+ messages in thread
From: Konstantin Osipov @ 2020-04-23 11:43 UTC (permalink / raw)
  To: Sergey Ostanevich; +Cc: tarantool-patches, Vladislav Shpilevoy

* Sergey Ostanevich <sergos@tarantool.org> [20/04/23 14:29]:
> Hi!
> 
> Thanks for review!
> 
> On 23 апр 12:14, Konstantin Osipov wrote:
> > * Konstantin Osipov <kostja.osipov@gmail.com> [20/04/23 09:58]:
> > > > > To my understanding - it's up to user. I was considering a cluster that
> > > > > has no WAL at all - relying on sychro replication and sufficient number
> > > > > of replicas. Everyone who I asked about it told me I'm nuts. To my great
> > > > > surprise Alexander Lyapunov brought exactly the same idea to discuss. 
> > > > 
> > > > I didn't see an RFC on that, and this can become easily possible, when
> > > > in-memory relay is implemented. If it is implemented in a clean way. We
> > > > just can turn off the disk backoff, and it will work from memory-only.
> > > 
> > > Sync replication must work from in-memory relay only. It works as
> > > a natural failure detector: a replica which is slow or unavailable
> > > is first removed from the subscribers of in-memory relay, and only 
> > > then (possibly much much later) is marked as down.
> > > 
> > > By looking at the in-memory relay you have a clear idea what peers
> > > are available and can abort a transaction if a cluster is in the
> > > downgraded state right away. You never wait for impossible events. 
> > > 
> > > If you do have to wait, and say your wait timeout is 1 second, you
> > > quickly run out of any fibers in the fiber pool for any work,
> > > because all of them will be waiting on the sync transactions they
> > > picked up from iproto to finish. The system will loose its
> > > throttling capability. 
> > 
> There's no need to explain it to customer: sync replication is not
> expected to be as fast as pure in-memory. By no means. We have network
> communication, disk operation, multiple entities quorum - all of these
> can't be as fast. No need to try cramp more than network can push
> through, obvoiusly.

This expected performance overhead is not an excuse to run out of
memory or available fibers on a node failure or network partitioning.

> The quality one buys for this price: consistency of data in multiple
> instances distributed across different locations. 

The spec should demonstrate that consistency is guaranteed: right
now it can easily be violated during a leader change, and this is
left out of the scope of the spec.

My take is that any implementation which is not close enough to a
TLA+ proven spec is not trustworthy, so I would not claim myself
or trust anyone else's claims that it is consistent. At best this
RFC could achieve durability, by ensuring that no transaction is
committed unless it is delivered to a majority of replicas.
Consistency requires implementing the Raft spec in full and showing
that leader changes preserve write-ahead log linearizability.

> > The other issue is that if your replicas are alive but
> > slow/lagging behind, you can't let too many undo records to
> > pile up unacknowledged in tx thread.
> > The in-memory relay solves this nicely too, because it kicks out
> > replicas from memory to file mode if they are unable to keep up
> > with the speed of change.
> > 
> That is the same problem - resources of leader, so natural limit for
> throughput. I bet Tarantool faces similar limitations even now,
> although different ones. 
> 
> The in-memory relay supposed to keep the same interface, so we expect to
> hop easily to this new shiny express as soon as it appears. This will be
> an optimization and we're trying to implement something and then speed
> it up.

It is pretty clear that the implementation will be different. 

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-23 11:43             ` Konstantin Osipov
@ 2020-04-23 15:11               ` Sergey Ostanevich
  2020-04-23 20:39                 ` Konstantin Osipov
  0 siblings, 1 reply; 53+ messages in thread
From: Sergey Ostanevich @ 2020-04-23 15:11 UTC (permalink / raw)
  To: Konstantin Osipov, Vladislav Shpilevoy, tarantool-patches

On 23 Apr 14:43, Konstantin Osipov wrote:
> > The quality one buys for this price: consistency of data in multiple
> > instances distributed across different locations. 
> 
> The spec should demonstrate the consistency is guaranteed: right
> now it can easily be violated during a leader change, and this is
> left out of scope of the spec.
> 
> My take is that any implementation which is not close enough to a
> TLA+ proven spec is not trustworthy, so I would not claim myself
> or trust any one elses claims that it is consistent. At best this
> RFC could achieve durability, by ensuring that no transaction is
> committed unless it is delivered to a majority of replicas.

Which is exactly what is mentioned in the RFC goals.

> Consistency requires implementing RAFT spec in full and showing
> that leader changes preserve the write ahead log linearizability.
> 
So the leader should stop accepting transactions and wait for all
txns in the queue to resolve into confirmed, or issue a rollback -
after a timeout, as a last resort.
Since there is no automation in leader election, the cluster will
end up in a consistent state after this. Now a new leader can be
appointed with all circumstances taken into account - node
availability, ping from the proxy, LSN, etc.
Again, this RFC is not about any HA features, such as auto-failover.
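
To make the step above concrete, here is a minimal sketch; every name
in it (wait_confirm, issue_rollback, and so on) is made up for
illustration and is not an existing Tarantool symbol:

```c
#include <stdbool.h>
#include <stdio.h>

/* A toy queue of transactions still waiting for the quorum. */
static int pending_lsn[] = { 11, 12, 13 };
static int pending_count = 3;

/* Stub: pretend every waiter resolves into 'confirmed' within the timeout. */
static bool
wait_confirm(int lsn, int timeout_sec)
{
	printf("waiting for confirm of lsn %d (up to %d sec)\n", lsn, timeout_sec);
	return true;
}

static void
issue_rollback(int lsn)
{
	printf("rollback starting from lsn %d\n", lsn);
}

/*
 * Bring the old leader into a consistent state before a new leader is
 * appointed: stop accepting transactions (not shown), resolve the
 * queue, or roll back the tail after a timeout as a last resort.
 */
static void
demote_leader(void)
{
	for (int i = 0; i < pending_count; i++) {
		if (!wait_confirm(pending_lsn[i], 10)) {
			issue_rollback(pending_lsn[i]);
			break;
		}
	}
	pending_count = 0;
}

int
main(void)
{
	demote_leader();
	return 0;
}
```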

> > > The other issue is that if your replicas are alive but
> > > slow/lagging behind, you can't let too many undo records to
> > > pile up unacknowledged in tx thread.
> > > The in-memory relay solves this nicely too, because it kicks out
> > > replicas from memory to file mode if they are unable to keep up
> > > with the speed of change.
> > > 
> > That is the same problem - resources of leader, so natural limit for
> > throughput. I bet Tarantool faces similar limitations even now,
> > although different ones. 
> > 
> > The in-memory relay supposed to keep the same interface, so we expect to
> > hop easily to this new shiny express as soon as it appears. This will be
> > an optimization and we're trying to implement something and then speed
> > it up.
> 
> It is pretty clear that the implementation will be different. 
> 
Which contradicts the interface preservation, right?

> -- 
> Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-23 15:11               ` Sergey Ostanevich
@ 2020-04-23 20:39                 ` Konstantin Osipov
  0 siblings, 0 replies; 53+ messages in thread
From: Konstantin Osipov @ 2020-04-23 20:39 UTC (permalink / raw)
  To: Sergey Ostanevich; +Cc: tarantool-patches, Vladislav Shpilevoy

* Sergey Ostanevich <sergos@tarantool.org> [20/04/23 18:11]:
> > The spec should demonstrate the consistency is guaranteed: right
> > now it can easily be violated during a leader change, and this is
> > left out of scope of the spec.
> > 
> > My take is that any implementation which is not close enough to a
> > TLA+ proven spec is not trustworthy, so I would not claim myself
> > or trust any one elses claims that it is consistent. At best this
> > RFC could achieve durability, by ensuring that no transaction is
> > committed unless it is delivered to a majority of replicas.
> 
> What is exactly mentioned in RFC goals.

This is durability, though, not consistency. My point is: if
consistency cannot be guaranteed anyway, why assume a single leader?
Let's consider what happens if all replicas are allowed to collect
acks, and define for it the same semantics as we have today in case
of async multi-master. Then add the remaining bits of Raft.
> 
> > Consistency requires implementing RAFT spec in full and showing
> > that leader changes preserve the write ahead log linearizability.
> > 
> So the leader should stop accepting transactions, wait for all txn in
> queue resolved into confirmed either issue a rollback - after a 
> timeout as a last resort.
> Since no automation in leader election the cluster will appear in a
> consistent state after this. Now a new leader can be appointed with
> all circumstances taken into account - nodes availability, ping from
> the proxy, lsn, etc.
> Again, this RFC is not about any HA features, such as auto-failover.
> 
> > > > The other issue is that if your replicas are alive but
> > > > slow/lagging behind, you can't let too many undo records to
> > > > pile up unacknowledged in tx thread.
> > > > The in-memory relay solves this nicely too, because it kicks out
> > > > replicas from memory to file mode if they are unable to keep up
> > > > with the speed of change.
> > > > 
> > > That is the same problem - resources of leader, so natural limit for
> > > throughput. I bet Tarantool faces similar limitations even now,
> > > although different ones. 
> > > 
> > > The in-memory relay supposed to keep the same interface, so we expect to
> > > hop easily to this new shiny express as soon as it appears. This will be
> > > an optimization and we're trying to implement something and then speed
> > > it up.
> > 
> > It is pretty clear that the implementation will be different. 
> > 
> Which contradicts to the interface preservance, right?

I don't believe the internals and the API can be so disconnected. I
think the in-memory relay is such a significant change that the
implementation has to build upon it.
The trigger-based implementation was contributed back in 2015 and
went nowhere; in fact, it was an inspiration to create a backlog of
such items as parallel applier, applier in iproto, in-memory
relay, and so on - all of these are "review items" for the
trigger-based syncrep:

https://github.com/Alexey-Ivanensky/tarantool/tree/bsync

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-03 21:08 [Tarantool-patches] [RFC] Quorum-based synchronous replication Sergey Ostanevich
                   ` (2 preceding siblings ...)
  2020-04-20 23:32 ` Vladislav Shpilevoy
@ 2020-04-23 21:38 ` Vladislav Shpilevoy
  2020-04-23 22:28   ` Konstantin Osipov
  2020-04-30 14:50   ` Sergey Ostanevich
  3 siblings, 2 replies; 53+ messages in thread
From: Vladislav Shpilevoy @ 2020-04-23 21:38 UTC (permalink / raw)
  To: Sergey Ostanevich, tarantool-patches, Timur Safin, Mons Anderson

Hi!

Here is a short summary of our late-night discussion and the
questions it brought up while I was trying to design a draft
plan of an implementation, since the RFC is too far from the
code and I needed a more 'pedestrian' and detailed plan.

The question is about 'confirm' message and quorum collection.
Here is the schema presented in the RFC:

> Customer        Leader          WAL(L)        Replica        WAL(R)
>    |------TXN----->|              |             |              |
>    |               |              |             |              |
>    |         [TXN undo log        |             |              |
>    |            created]          |             |              |
>    |               |              |             |              |
>    |               |-----TXN----->|             |              |
>    |               |              |             |              |
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              |
>    |               |              |       [TXN undo log        |
>    |               |<---WAL Ok----|          created]          |
>    |               |              |             |              |
>    |           [Waiting           |             |-----TXN----->|
>    |         of a quorum]         |             |              |
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              |
>    |               |<------Replication Ok-------|              |
>    |               |              |             |              |
>    |            [Quorum           |             |              |
>    |           achieved]          |             |              |
>    |               |              |             |              |
>    |         [TXN undo log        |             |              |
>    |           destroyed]         |             |              |
>    |               |              |             |              |
>    |               |---Confirm--->|             |              |
>    |               |              |             |              |
>    |               |----------Confirm---------->|              |
>    |               |              |             |              |
>    |<---TXN Ok-----|              |       [TXN undo log        |
>    |               |              |         destroyed]         |
>    |               |              |             |              |
>    |               |              |             |---Confirm--->|
>    |               |              |             |              |

It says that once the quorum is collected and 'confirm' is written
to the leader's local WAL, the transaction is considered committed
and is reported to the client as successful.

On the other hand it is said that in case of a leader change the
new leader will roll back all unconfirmed transactions. That leads
to the following bug:

Assume we have 4 instances: i1, i2, i3, i4. Leader is i1. It
writes a transaction with LSN1. The LSN1 is sent to other nodes,
they apply it ok, and send acks to the leader. The leader sees
i2-i4 all applied the transaction (propagated their LSNs to LSN1).
It writes 'confirm' to its local WAL, reports it to the client as
success, the client's request is over, it is returned back to
some remote node, etc. The transaction is officially synchronously
committed.

Then the leader's machine dies - disk is dead. The confirm was
not sent to any of the other nodes. For example, it started having
problems with network connection to the replicas recently before
the death. Or it just didn't manage to hand the confirm out.

From now on, if any of the other nodes i2-i4 becomes a leader, it
will roll back the officially confirmed transaction, even though it
has it, and so do all the other nodes.

That basically means this sync replication gives exactly the same
guarantees as async replication - 'confirm' on the leader tells
nothing about replicas except that they *are able to apply the
transaction*, but still may not apply it.

Am I missing something?

Another issue is with failure detection. Let's assume that we wait
for 'confirm' to be propagated to a quorum of replicas too. Assume
some replicas responded with an error. So they first said they could
apply the transaction and saved it into their WALs, and then they
couldn't apply the confirm. That could happen for two reasons: the
replica has problems with its WAL, or the replica became unreachable
from the master.

WAL-problematic replicas can be disconnected forcefully, since they
are clearly not able to work properly anymore. But what to do with
disconnected replicas? 'Confirm' can't wait for them forever - we
will run out of fibers if we have even just hundreds of RPS of
sync transactions and wait for, let's say, a few minutes. On the
other hand we can't roll them back, because 'confirm' has been
written to the local WAL already.

Note for those who are concerned: this has nothing to do with the
in-memory relay. It has the same problems, which are in the
protocol, not in the implementation.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-23 21:38 ` Vladislav Shpilevoy
@ 2020-04-23 22:28   ` Konstantin Osipov
  2020-04-30 14:50   ` Sergey Ostanevich
  1 sibling, 0 replies; 53+ messages in thread
From: Konstantin Osipov @ 2020-04-23 22:28 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/04/24 00:42]:

> It says, that once the quorum is collected, and 'confirm' is written
> to local leader's WAL, it is considered committed and is reported
> to the client as successful.
> 
> On the other hand it is said, that in case of leader change the
> new leader will rollback all not confirmed transactions. That leads
> to the following bug:
> 
> Assume we have 4 instances: i1, i2, i3, i4. Leader is i1. It
> writes a transaction with LSN1. The LSN1 is sent to other nodes,
> they apply it ok, and send acks to the leader. The leader sees
> i2-i4 all applied the transaction (propagated their LSNs to LSN1).
> It writes 'confirm' to its local WAL, reports it to the client as
> success, the client's request is over, it is returned back to
> some remote node, etc. The transaction is officially synchronously
> committed.
> 
> Then the leader's machine dies - disk is dead. The confirm was
> not sent to any of the other nodes. For example, it started having
> problems with network connection to the replicas recently before
> the death. Or it just didn't manage to hand the confirm out.
> 
> >From now on if any of the other nodes i2-i4 becomes a leader, it
> will rollback the officially confirmed transaction, even if it
> has it, and all the other nodes too.
> 
> That basically means, this sync replication gives exactly the same
> guarantees as the async replication - 'confirm' on the leader tells
> nothing about replicas except that they *are able to apply the
> transaction*, but still may not apply it.
> 
> Am I missing something?

This video explains what leader has to do after it's been elected:

https://www.youtube.com/watch?v=YbZ3zDzDnrw

In short, the transactions in the leader's WAL have to be
committed, not rolled back.

The Raft paper https://raft.github.io/raft.pdf has the answers in a
concise single-page summary.

Why have this discussion at all? Any ambiguity or discrepancy
between this document and the Raft paper should be treated as a
mistake in this document. Or do you actually think it's possible
to invent a new consensus algorithm this way?

> Note for those who is concerned: this has nothing to do with
> in-memory relay. It has the same problems, which are in the protocol,
> not in the implementation.

No, the issues are distinct:
1) there may be cases where this paper doesn't follow Raft. It
   should be obvious to everyone that, with the exception of
   external leader election and failure detection, it has to if
   correctness is of any concern, so it's simply a matter of
   fixing this doc to match Raft.

   As to the leader election, there are two alternatives: either
   spec out in this paper how the external election interacts
   with the cluster, including finishing up old transactions and
   neutralizing old leaders, or allow multi-master and so forget
   about consistency for now.
2) an implementation based on triggers will be complicated and
   will have performance/stability implications. This is what I
   hope I was able to convey and in this case we can put the
   matter to rest. 
   
-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-23 21:38 ` Vladislav Shpilevoy
  2020-04-23 22:28   ` Konstantin Osipov
@ 2020-04-30 14:50   ` Sergey Ostanevich
  2020-05-06  8:52     ` Konstantin Osipov
                       ` (2 more replies)
  1 sibling, 3 replies; 53+ messages in thread
From: Sergey Ostanevich @ 2020-04-30 14:50 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

Hi!

Thanks for the review!

After a long discussion we agreed to rework the RFC. 

On 23 Apr 23:38, Vladislav Shpilevoy wrote:
> It says, that once the quorum is collected, and 'confirm' is written
> to local leader's WAL, it is considered committed and is reported
> to the client as successful.
> 
> On the other hand it is said, that in case of leader change the
> new leader will rollback all not confirmed transactions. That leads

This is no longer right; we decided to follow Raft's approach that
the leader rules the world, hence committing all changes in its WAL.

> 
> Another issue is with failure detection. Lets assume, that we wait
> for 'confirm' to be propagated on quorum of replicas too. Assume
> some replicas responded with an error. So they first said they can
> apply the transaction, and saved it into their WALs, and then they
> couldn't apply confirm. That could happen because of 2 reasons:
> replica has problems with WAL, or the replica becomes unreachable
> from the master.
> 
> WAL-problematic replicas can be disconnected forcefully, since they
> are clearly not able to work properly anymore. But what to do with
> disconnected replicas? 'Confirm' can't wait for them forever - we
> will run out of fibers, if we have even just hundreds of RPS of
> sync transactions, and wait for, lets say, a few minutes. On the
> other hand we can't roll them back, because 'confirm' has been
> written to the local WAL already.

Here we agreed that the replica will be kicked out of the cluster
and wait for human intervention to fix the problems - probably with
a rejoin. In case the available replicas are not enough to achieve
the quorum, the leader also reports the problem and stops the
cluster operation until the cluster is reconfigured or the number of
replicas becomes sufficient.

Below is the new RFC, available at
https://github.com/tarantool/tarantool/blob/sergos/quorum-based-synchro/doc/rfc/quorum-based-synchro.md

---
* **Status**: In progress
* **Start date**: 31-03-2020
* **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\>
* **Issues**: https://github.com/tarantool/tarantool/issues/4842

## Summary

The aim of this RFC is to address the following list of problems
formulated at MRG planning meeting:
  - protocol backward compatibility to enable cluster upgrade w/o
    downtime
  - consistency of data on replica and leader
  - switch from leader to replica without data loss
  - up to date replicas to run read-only requests
  - ability to switch async replicas into sync ones and vice versa
  - guarantee of rollback on leader and sync replicas
  - simplicity of cluster orchestration

What this RFC is not:

  - high availability (HA) solution with automated failover, role
    assignments and so on
  - master-master configuration support

## Background and motivation

There are a number of known implementations of consistent data
presence in a Tarantool cluster. They can be commonly named the "wait
for LSN" technique. The biggest issue with this technique is the
absence of rollback guarantees at a replica in case of transaction
failure on the master or on some of the replicas in the cluster.

To provide such capabilities, new functionality should be introduced
in the Tarantool core, with the requirements mentioned before -
backward compatibility and ease of cluster orchestration.

## Detailed design

### Quorum commit

The main idea behind the proposal is to reuse the existing machinery
as much as possible. It will ensure that well-tested functionality,
proven across many instances in MRG and beyond, is used. The
transaction rollback mechanism is in place and works for WAL write
failures. If we substitute the WAL success with a new condition,
named 'quorum' later in this document, then no changes to the
machinery are needed. The same is true for the snapshot machinery
that allows creating a copy of the database in memory for the whole
period of the snapshot file write. Adding the quorum here also
minimizes changes.

Currently replication is represented by the following scheme:
```
Customer        Leader          WAL(L)        Replica        WAL(R)
   |------TXN----->|              |             |              |
   |               |              |             |              |
   |         [TXN undo log        |             |              |
   |            created]          |             |              |
   |               |              |             |              |
   |               |-----TXN----->|             |              |
   |               |              |             |              |
   |               |<---WAL Ok----|             |              |
   |               |              |             |              |
   |         [TXN undo log        |             |              |
   |           destroyed]         |             |              |
   |               |              |             |              |
   |<----TXN Ok----|              |             |              |
   |               |-------Replicate TXN------->|              |
   |               |              |             |              |
   |               |              |       [TXN undo log        |
   |               |              |          created]          |
   |               |              |             |              |
   |               |              |             |-----TXN----->|
   |               |              |             |              |
   |               |              |             |<---WAL Ok----|
   |               |              |             |              |
   |               |              |       [TXN undo log        |
   |               |              |         destroyed]         |
   |               |              |             |              |
```

To introduce the 'quorum' we have to receive confirmations from
replicas to decide whether the quorum is actually reached. The leader
collects the necessary number of replica confirmations plus its own
WAL success. This state is named 'quorum' and gives the leader the
right to complete the customer's request. So the picture will change
to:
```
Customer        Leader          WAL(L)        Replica        WAL(R)
   |------TXN----->|              |             |              |
   |               |              |             |              |
   |         [TXN undo log        |             |              |
   |            created]          |             |              |
   |               |              |             |              |
   |               |-----TXN----->|             |              |
   |               |              |             |              |
   |               |-------Replicate TXN------->|              |
   |               |              |             |              |
   |               |              |       [TXN undo log        |
   |               |<---WAL Ok----|          created]          |
   |               |              |             |              |
   |           [Waiting           |             |-----TXN----->|
   |         of a quorum]         |             |              |
   |               |              |             |<---WAL Ok----|
   |               |              |             |              |
   |               |<------Replication Ok-------|              |
   |               |              |             |              |
   |            [Quorum           |             |              |
   |           achieved]          |             |              |
   |               |              |             |              |
   |         [TXN undo log        |             |              |
   |           destroyed]         |             |              |
   |               |              |             |              |
   |               |---Confirm--->|             |              |
   |               |              |             |              |
   |               |----------Confirm---------->|              |
   |               |              |             |              |
   |<---TXN Ok-----|              |       [TXN undo log        |
   |               |              |         destroyed]         |
   |               |              |             |              |
   |               |              |             |---Confirm--->|
   |               |              |             |              |
```

The quorum should be collected as a table of transactions waiting
for the quorum. The latest transaction that collects the quorum is
considered complete, as well as all transactions prior to it, since
all transactions should be applied in order. The leader writes a
'confirm' message to the WAL that refers to the transaction's
[LEADER_ID, LSN], and the confirm has its own LSN. This confirm
message is delivered to all replicas through the existing replication
mechanism.
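
A minimal sketch of such a table on the leader (all names here are
hypothetical, this is not actual Tarantool code):

```c
#include <stdint.h>
#include <stdio.h>

enum { MAX_PENDING = 1024 };

struct quorum_entry {
	int64_t lsn;    /* LSN of the transaction in the leader's WAL */
	int ack_count;  /* instances (leader included) that wrote it to WAL */
};

static struct quorum_entry pending[MAX_PENDING];
static int pending_count = 0;
static int quorum = 2;  /* configured quorum size, leader included */

/* Register a transaction once its local WAL write succeeded. */
static void
quorum_register(int64_t lsn)
{
	pending[pending_count].lsn = lsn;
	pending[pending_count].ack_count = 1; /* the leader's own WAL ok */
	pending_count++;
}

/*
 * A replica acknowledged everything up to @lsn. Since transactions are
 * applied in order, the latest entry reaching the quorum confirms all
 * entries before it as well.
 */
static void
quorum_ack(int64_t lsn)
{
	int confirmed = 0;
	for (int i = 0; i < pending_count; i++) {
		if (pending[i].lsn > lsn)
			break;
		if (++pending[i].ack_count >= quorum)
			confirmed = i + 1;
	}
	if (confirmed > 0) {
		/* Here the leader writes 'confirm' for the last such LSN. */
		printf("confirm up to lsn %lld\n",
		       (long long)pending[confirmed - 1].lsn);
		for (int i = confirmed; i < pending_count; i++)
			pending[i - confirmed] = pending[i];
		pending_count -= confirmed;
	}
}

int
main(void)
{
	quorum_register(11);
	quorum_register(12);
	quorum_ack(12); /* one replica caught up to lsn 12 -> quorum of 2 */
	return 0;
}
```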

A replica should explicitly report a TXN application success to the
leader via IPROTO to allow the leader to collect the quorum for the
TXN. In case of an application failure the replica has to disconnect
from replication the same way as it is done now. The replica also has
to report its disconnection to the orchestrator. Further actions
require human intervention, since a failure means either a technical
problem (such as not enough space for the WAL) that has to be
resolved or an inconsistent state that requires a rejoin.

As soon as the leader finds itself in a situation where it has not
enough replicas to achieve the quorum, the cluster should stop
accepting any requests - both write and read. The reason for this is
that replication of transactions can achieve the quorum on replicas
not visible to the leader. On the other hand, the leader can't
achieve the quorum with the available minority. The leader has to
report the state and wait for human intervention. There's an option
to ask the leader to roll back to the latest transaction that has the
quorum: the leader issues a 'rollback' message referring to the
[LEADER_ID, LSN] where the LSN is that of the first transaction in
the leader's undo log. The rollback message replicated to the
available part of the cluster will put it in a consistent state.
After that, the configuration of the cluster can be updated to the
available quorum and the leader can be switched back to write mode.
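
A rough sketch of this check, with made-up names and not tied to the
actual implementation:

```c
#include <stdbool.h>
#include <stdio.h>

struct leader_state {
	int sync_replicas_connected;  /* sync replicas visible to the leader */
	int quorum;                   /* required confirmations, leader included */
	long long undo_log_first_lsn; /* first not yet confirmed transaction */
};

/* The quorum is reachable only if enough sync replicas are connected. */
static bool
quorum_possible(const struct leader_state *s)
{
	/* +1 accounts for the leader's own WAL write. */
	return s->sync_replicas_connected + 1 >= s->quorum;
}

int
main(void)
{
	struct leader_state s = {
		.sync_replicas_connected = 1,
		.quorum = 3,
		.undo_log_first_lsn = 42,
	};
	if (!quorum_possible(&s)) {
		printf("stop serving requests, report incomplete quorum\n");
		/* Optional, on operator request: roll back to the undo log start. */
		printf("rollback refers to [LEADER_ID, %lld]\n",
		       s.undo_log_first_lsn);
	}
	return 0;
}
```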

### Leader role assignment.

To assign a leader role to an instance the following should be
performed (a sketch of the first step follows after this list):
  1. among all available instances pick the one that has the biggest
     vclock element of the former leader ID; an arbitrary instance can
     be selected in case it is the first time the leader is assigned
  2. the leader should make sure that the number of available instances
     in the cluster is enough to achieve the quorum and proceed to step
     3, otherwise the leader should report the situation of incomplete
     quorum, as in the last paragraph of the previous section
  3. the selected instance has to take the responsibility to replicate
     the former leader's entries from its WAL, obtaining the quorum and
     committing confirm messages referring to [FORMER_LEADER_ID, LSN]
     in its WAL, replicating them to the cluster; after that it can
     start adding its own entries into the WAL
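
A minimal sketch of step 1 (hypothetical types and names, for
illustration only):

```c
#include <stdint.h>
#include <stdio.h>

enum { VCLOCK_MAX = 32 };

struct instance {
	int id;
	int64_t vclock[VCLOCK_MAX]; /* LSN per replica id */
};

/*
 * Among the available instances pick the one with the biggest vclock
 * component for the former leader's ID, i.e. the most up-to-date one.
 */
static const struct instance *
pick_new_leader(const struct instance *candidates, int count,
		int former_leader_id)
{
	const struct instance *best = NULL;
	for (int i = 0; i < count; i++) {
		if (best == NULL ||
		    candidates[i].vclock[former_leader_id] >
		    best->vclock[former_leader_id])
			best = &candidates[i];
	}
	return best;
}

int
main(void)
{
	struct instance nodes[2] = {
		{ .id = 2, .vclock = { [1] = 100 } },
		{ .id = 3, .vclock = { [1] = 105 } },
	};
	const struct instance *leader = pick_new_leader(nodes, 2, 1);
	printf("new leader: instance %d\n", leader->id);
	return 0;
}
```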

### Recovery and failover.

A Tarantool instance, while reading the WAL, should postpone the undo
log deletion until the 'confirm' is read. In case the WAL eof is
reached, the instance should keep the undo log for all transactions
that are waiting for a confirm entry until the role of the instance is
set.
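
A compact sketch of that rule, with a made-up WAL record layout:

```c
#include <stdio.h>

enum rec_type { REC_TXN, REC_CONFIRM };

struct wal_record {
	enum rec_type type;
	long long lsn;         /* the record's own LSN */
	long long confirm_lsn; /* for REC_CONFIRM: confirmed up to this LSN */
};

/*
 * Replay the WAL keeping undo logs; a 'confirm' releases everything up
 * to the LSN it refers to, and whatever is still pending at WAL eof
 * stays until the instance's role is decided.
 */
static void
replay(const struct wal_record *wal, int count)
{
	long long confirmed_upto = 0;
	for (int i = 0; i < count; i++) {
		if (wal[i].type == REC_TXN)
			printf("keep undo log for lsn %lld\n", wal[i].lsn);
		else
			confirmed_upto = wal[i].confirm_lsn;
	}
	printf("confirmed up to lsn %lld, the rest waits for a role\n",
	       confirmed_upto);
}

int
main(void)
{
	const struct wal_record wal[] = {
		{ REC_TXN, 11, 0 },
		{ REC_TXN, 12, 0 },
		{ REC_CONFIRM, 13, 11 }, /* lsn 12 is left waiting at eof */
	};
	replay(wal, 3);
	return 0;
}
```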

If this instance is assigned the leader role, then all transactions
that have no corresponding confirm message should be confirmed (see
the leader role assignment).

In case there are not enough replicas to set up a quorum, the cluster
can be switched into a read-only mode. Note, this can't be done by
default since some of the transactions can be in the confirmed state.
It is up to human intervention to force a rollback of all transactions
that have no confirm and to put the cluster into a consistent state.

In case the instance is assigned a replica role, it may appear in a
state where it has conflicting WAL entries - if it recovered from a
leader role and some of its transactions were not replicated to the
current leader. This situation should be resolved through a rejoin of
the instance.

### Snapshot generation.

We can also reuse the current machinery of snapshot generation. Upon
receiving a request to create a snapshot, an instance should request a
readview for the current commit operation. However, the start of the
snapshot generation should be postponed until this commit operation
receives its confirmation. In case the operation is rolled back, the
snapshot generation should be aborted and restarted using the current
transaction after the rollback is complete.

After the snapshot is created, the WAL should start from the first
operation that follows the commit operation the snapshot is generated
for. That means the WAL will contain 'confirm' messages that refer to
transactions that are not present in the WAL. Apparently, we have to
allow this for the case when a 'confirm' refers to a transaction with
an LSN less than that of the first entry in the WAL.
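
A tiny sketch of this allowance (hypothetical names; the example
mirrors a snapshot taken at LSN 10 with the WAL starting at LSN 11):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* A 'confirm' referring below the first WAL entry targets the snapshot. */
static bool
confirm_is_ignorable(int64_t confirm_ref_lsn, int64_t wal_first_lsn)
{
	return confirm_ref_lsn < wal_first_lsn;
}

int
main(void)
{
	/* Snapshot at lsn 10, WAL starts at lsn 11, a later confirm refers
	 * back to lsn 10 - it is ignored, and that is fine. */
	printf("%s\n", confirm_is_ignorable(10, 11) ? "ignored" : "applied");
	return 0;
}
```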

In case the master appears unavailable, a replica still has to be
able to create a snapshot. The replica can perform a rollback for all
transactions that are not confirmed and claim its LSN as that of the
latest confirmed txn. Then it can create a snapshot in a regular way
and start with a blank xlog file. All rolled back transactions will
reappear through the regular replication in case the master comes back
later on.

### Asynchronous replication.

Along with synchronous replicas the cluster can contain asynchronous
replicas. That means an async replica doesn't reply to the leader with
errors, since it is not contributing to the quorum. Still, async
replicas have to follow the new WAL operations, such as keeping
rollback info until the 'confirm' message is received. This is
essential for the case of a 'rollback' message appearing in the WAL.
This message assumes the replica is able to perform all the necessary
rollback by itself. Cluster information should contain an explicit
notification of each replica's operation mode.

### Synchronous replication enabling.

Synchronous operation can be required for a set of spaces in the data
schema. That means only transactions that contain data modifications
for these spaces should require the quorum. Such transactions are
named synchronous. As soon as the last operation of a synchronous
transaction appears in the leader's WAL, it causes all following
transactions - no matter whether they are synchronous or not - to wait
for the quorum. In case the quorum is not achieved, the 'rollback'
operation will cause a rollback of all transactions after the
synchronous one. It will ensure a consistent state of the data both on
the leader and the replicas. In case the user doesn't require
synchronous operation for any space, then no changes to the WAL
generation and replication will appear.
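
A small sketch of that rule (hypothetical structures, not the actual
transaction machinery):

```c
#include <stdbool.h>
#include <stdio.h>

struct space {
	const char *name;
	bool is_sync; /* the space requires synchronous replication */
};

struct txn {
	const struct space **spaces; /* spaces modified by the transaction */
	int space_count;
};

/* A transaction is synchronous if it modifies at least one sync space. */
static bool
txn_is_sync(const struct txn *txn)
{
	for (int i = 0; i < txn->space_count; i++)
		if (txn->spaces[i]->is_sync)
			return true;
	return false;
}

/*
 * A transaction waits for the quorum either because it is synchronous
 * itself or because an earlier, not yet confirmed synchronous
 * transaction is already in the leader's WAL.
 */
static bool
txn_must_wait_for_quorum(const struct txn *txn, bool sync_txn_pending)
{
	return txn_is_sync(txn) || sync_txn_pending;
}

int
main(void)
{
	struct space accounts = { "accounts", true };
	struct space logs = { "logs", false };
	const struct space *sync_mods[] = { &accounts };
	const struct space *async_mods[] = { &logs };
	struct txn t1 = { sync_mods, 1 };
	struct txn t2 = { async_mods, 1 };

	printf("%d\n", txn_must_wait_for_quorum(&t1, false)); /* 1: sync space */
	printf("%d\n", txn_must_wait_for_quorum(&t2, true));  /* 1: sync txn pending */
	return 0;
}
```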

The cluster description should contain an explicit attribute for each
replica to denote whether it participates in synchronous activities.
Also the description should contain a criterion on how many replica
responses are needed to achieve the quorum.

## Rationale and alternatives

There is an implementation of synchronous replication as part of
gh-980 activities, but it is not in a state to get into the product.
More than that, it intentionally breaks backward compatibility, which
is a prerequisite for this proposal.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-30 14:50   ` Sergey Ostanevich
@ 2020-05-06  8:52     ` Konstantin Osipov
  2020-05-06 16:39       ` Sergey Ostanevich
  2020-05-06 18:55     ` Konstantin Osipov
  2020-05-07 23:01     ` Konstantin Osipov
  2 siblings, 1 reply; 53+ messages in thread
From: Konstantin Osipov @ 2020-05-06  8:52 UTC (permalink / raw)
  To: Sergey Ostanevich; +Cc: tarantool-patches, Vladislav Shpilevoy

* Sergey Ostanevich <sergos@tarantool.org> [20/04/30 17:51]:
> Hi!
> 
> Thanks for the review!
> 
> After a long discussion we agreed to rework the RFC. 
> 
> On 23 апр 23:38, Vladislav Shpilevoy wrote:
> > It says, that once the quorum is collected, and 'confirm' is written
> > to local leader's WAL, it is considered committed and is reported
> > to the client as successful.
> > 
> > On the other hand it is said, that in case of leader change the
> > new leader will rollback all not confirmed transactions. That leads
> 
> This is no longer right, we decided to follow the RAFT's approach that
> leader rules the world, hence committing all changes in it's WAL.
> 
> > 
> > Another issue is with failure detection. Lets assume, that we wait
> > for 'confirm' to be propagated on quorum of replicas too. Assume
> > some replicas responded with an error. So they first said they can
> > apply the transaction, and saved it into their WALs, and then they
> > couldn't apply confirm. That could happen because of 2 reasons:
> > replica has problems with WAL, or the replica becomes unreachable
> > from the master.
> > 
> > WAL-problematic replicas can be disconnected forcefully, since they
> > are clearly not able to work properly anymore. But what to do with
> > disconnected replicas? 'Confirm' can't wait for them forever - we
> > will run out of fibers, if we have even just hundreds of RPS of
> > sync transactions, and wait for, lets say, a few minutes. On the
> > other hand we can't roll them back, because 'confirm' has been
> > written to the local WAL already.
> 
> Here we agreed that replica will be kicked out of cluster and wait for
> human intervention to fix the problems - probably with rejoin. In case
> available replics are not enough to achieve the quorum leader also
> reports the problem and stop the cluster operation until cluster
> reconfigured or number of replicas will become sufficient.
> 
> Below is the new RFC, available at
> https://github.com/tarantool/tarantool/blob/sergos/quorum-based-synchro/doc/rfc/quorum-based-synchro.md
> 
> ---
> * **Status**: In progress
> * **Start date**: 31-03-2020
> * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\>
> * **Issues**: https://github.com/tarantool/tarantool/issues/4842
> 
> ## Summary
> 
> The aim of this RFC is to address the following list of problems
> formulated at MRG planning meeting:
>   - protocol backward compatibility to enable cluster upgrade w/o
>     downtime
>   - consistency of data on replica and leader
>   - switch from leader to replica without data loss
>   - up to date replicas to run read-only requests
>   - ability to switch async replicas into sync ones and vice versa
>   - guarantee of rollback on leader and sync replicas
>   - simplicity of cluster orchestration
> 
> What this RFC is not:
> 
>   - high availability (HA) solution with automated failover, roles
>     assignments an so on
>   - master-master configuration support
> 
> ## Background and motivation
> 
> There are number of known implementation of consistent data presence in
> a Tarantool cluster. They can be commonly named as "wait for LSN"
> technique. The biggest issue with this technique is the absence of
> rollback guarantees at replica in case of transaction failure on one
> master or some of the replicas in the cluster.
> 
> To provide such capabilities a new functionality should be introduced in
> Tarantool core, with requirements mentioned before - backward
> compatibility and ease of cluster orchestration.
> 
> ## Detailed design
> 
> ### Quorum commit
> 
> The main idea behind the proposal is to reuse existent machinery as much
> as possible. It will ensure the well-tested and proven functionality
> across many instances in MRG and beyond is used. The transaction rollback
> mechanism is in place and works for WAL write failure. If we substitute
> the WAL success with a new situation which is named 'quorum' later in
> this document then no changes to the machinery is needed. The same is
> true for snapshot machinery that allows to create a copy of the database
> in memory for the whole period of snapshot file write. Adding quorum here
> also minimizes changes.
> 
> Currently replication represented by the following scheme:
> ```
> Customer        Leader          WAL(L)        Replica        WAL(R)
>    |------TXN----->|              |             |              |
>    |               |              |             |              |
>    |         [TXN undo log        |             |              |
>    |            created]          |             |              |
>    |               |              |             |              |
>    |               |-----TXN----->|             |              |
>    |               |              |             |              |
>    |               |<---WAL Ok----|             |              |
>    |               |              |             |              |
>    |         [TXN undo log        |             |              |
>    |           destroyed]         |             |              |
>    |               |              |             |              |
>    |<----TXN Ok----|              |             |              |
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              |
>    |               |              |       [TXN undo log        |
>    |               |              |          created]          |
>    |               |              |             |              |
>    |               |              |             |-----TXN----->|
>    |               |              |             |              |
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              |
>    |               |              |       [TXN undo log        |
>    |               |              |         destroyed]         |
>    |               |              |             |              |
> ```
> 
> To introduce the 'quorum' we have to receive confirmation from replicas
> to make a decision on whether the quorum is actually present. Leader
> collects necessary amount of replicas confirmation plus its own WAL
> success. This state is named 'quorum' and gives leader the right to
> complete the customers' request. So the picture will change to:
> ```
> Customer        Leader          WAL(L)        Replica        WAL(R)
>    |------TXN----->|              |             |              |
>    |               |              |             |              |
>    |         [TXN undo log        |             |              |
>    |            created]          |             |              |
>    |               |              |             |              |
>    |               |-----TXN----->|             |              |
>    |               |              |             |              |
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              |
>    |               |              |       [TXN undo log        |
>    |               |<---WAL Ok----|          created]          |
>    |               |              |             |              |
>    |           [Waiting           |             |-----TXN----->|
>    |         of a quorum]         |             |              |
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              |
>    |               |<------Replication Ok-------|              |
>    |               |              |             |              |
>    |            [Quorum           |             |              |
>    |           achieved]          |             |              |
>    |               |              |             |              |
>    |         [TXN undo log        |             |              |
>    |           destroyed]         |             |              |
>    |               |              |             |              |
>    |               |---Confirm--->|             |              |
>    |               |              |             |              |

What happens if writing Confirm to the WAL fails? The TXN undo log
record is destroyed already. Will the server panic now on a WAL
failure, even if it is intermittent?

>    |               |----------Confirm---------->|              |

What happens if the peers receive and maybe even write Confirm to
their WALs but the local WAL write is lost after a restart? The WAL is
not synced, so we can easily lose the tail of the WAL. Tarantool will
sync up with all replicas on restart, but there will be no "Replication
OK" messages from them, so it wouldn't know that the transaction is
committed on them. How is this handled? We may end up with some
replicas confirming the transaction while the leader rolls it back on
restart. Do you suggest there is a human intervention on restart as
well?


>    |               |              |             |              |
>    |<---TXN Ok-----|              |       [TXN undo log        |
>    |               |              |         destroyed]         |
>    |               |              |             |              |
>    |               |              |             |---Confirm--->|
>    |               |              |             |              |
> ```
> 
> The quorum should be collected as a table for a list of transactions
> waiting for quorum. The latest transaction that collects the quorum is
> considered as complete, as well as all transactions prior to it, since
> all transactions should be applied in order. Leader writes a 'confirm'
> message to the WAL that refers to the transaction's [LEADER_ID, LSN] and
> the confirm has its own LSN. This confirm message is delivered to all
> replicas through the existing replication mechanism.
> 
> Replica should report a TXN application success to the leader via the
> IPROTO explicitly to allow leader to collect the quorum for the TXN.
> In case of application failure the replica has to disconnect from the
> replication the same way as it is done now. The replica also has to
> report its disconnection to the orchestrator. Further actions require
> human intervention, since failure means either technical problem (such
> as not enough space for WAL) that has to be resovled or an inconsistent
> state that requires rejoin.

> As soon as leader appears in a situation it has not enough replicas
> to achieve quorum, the cluster should stop accepting any requests - both
> write and read.

How does *the cluster* know the state of the leader, and if it
doesn't, how can it possibly implement this? Did you mean
the leader should stop accepting transactions here? But how can
the leader know that it does not have enough replicas during a read
transaction, if it doesn't contact any replica to serve a read?

> The reason for this is that replication of transactions
> can achieve quorum on replicas not visible to the leader. On the other
> hand, leader can't achieve quorum with available minority. Leader has to
> report the state and wait for human intervention. There's an option to
> ask leader to rollback to the latest transaction that has quorum: leader
> issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN
> is of the first transaction in the leader's undo log. The rollback
> message replicated to the available cluster will put it in a consistent
> state. After that configuration of the cluster can be updated to
> available quorum and leader can be switched back to write mode.

As you should be able to conclude from the restart scenario, it is
possible that a replica has the record in the *confirmed* state while the
leader has it in the pending state. The replica will not be able to
roll back then. Do you suggest the replica should abort if it
can't roll back? This may lead to an avalanche of rejoins on leader
restart, bringing performance to a halt.


-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-06  8:52     ` Konstantin Osipov
@ 2020-05-06 16:39       ` Sergey Ostanevich
  2020-05-06 18:44         ` Konstantin Osipov
  2020-05-13 21:36         ` Vladislav Shpilevoy
  0 siblings, 2 replies; 53+ messages in thread
From: Sergey Ostanevich @ 2020-05-06 16:39 UTC (permalink / raw)
  To: Konstantin Osipov, Vladislav Shpilevoy, tarantool-patches

Hi!

Thanks for review!

> >    |               |              |             |              |
> >    |            [Quorum           |             |              |
> >    |           achieved]          |             |              |
> >    |               |              |             |              |
> >    |         [TXN undo log        |             |              |
> >    |           destroyed]         |             |              |
> >    |               |              |             |              |
> >    |               |---Confirm--->|             |              |
> >    |               |              |             |              |
> 
> What happens if writing Confirm to WAL fails? TXN und log record
> is destroyed already. Will the server panic now on WAL failure,
> even if it is intermittent?

I would like to have an example of an intermittent WAL failure. Can it be
anything other than a problem with the disk - be it space, availability, or malfunction?

For all of those, the issue should be resolved outside the DBMS anyway. So
the leader should stop and report its problems to the orchestrator/admins.

I would agree that the undo log can be destroyed *after* the Confirm has
landed in the WAL - the same goes for the replica.
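
A minimal sketch of that ordering, assuming made-up helper names
(`wal_write_confirm`, `destroy_undo_log`) rather than real Tarantool functions:

```
-- Placeholders standing in for the real machinery, so the sketch runs:
local function wal_write_confirm(lsn) print('Confirm for LSN ' .. lsn) return true end
local function destroy_undo_log(txn) txn.undo = nil end
local function report_failure(txn) print('stop, keep undo for LSN ' .. txn.lsn) end

-- Destroy the undo log strictly *after* the Confirm record reached the WAL,
-- so an intermittent WAL failure at this point still leaves the transaction
-- revertible instead of forcing a panic.
local function finalize_txn(txn)
    if not wal_write_confirm(txn.lsn) then
        report_failure(txn)      -- keep undo, wait for orchestrator/admin
        return false
    end
    destroy_undo_log(txn)        -- safe only once Confirm is durable
    return true
end

finalize_txn({lsn = 42, undo = {}})
```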

> 
> >    |               |----------Confirm---------->|              |
> 
> What happens if peers receive and maybe even write Confirm to their WALs
> but local WAL write is lost after a restart?

Did you mean the WAL write on the leader as the local one? Then we have a
replica with a bigger LSN for the leader ID.

> WAL is not synced, 
> so we can easily lose the tail of the WAL. Tarantool will sync up
> with all replicas on restart,

But at this point a new leader will be appointed - the old one is
restarted. Then the Confirm message will arrive at the restarted leader
through regular replication.

> but there will be no "Replication
> OK" messages from them, so it wouldn't know that the transaction
> is committed on them. How is this handled? We may end up with some
> replicas confirming the transaction while the leader will roll it
> back on restart. Do you suggest there is a human intervention on
> restart as well?
> 
> 
> >    |               |              |             |              |
> >    |<---TXN Ok-----|              |       [TXN undo log        |
> >    |               |              |         destroyed]         |
> >    |               |              |             |              |
> >    |               |              |             |---Confirm--->|
> >    |               |              |             |              |
> > ```
> > 
> > The quorum should be collected as a table for a list of transactions
> > waiting for quorum. The latest transaction that collects the quorum is
> > considered as complete, as well as all transactions prior to it, since
> > all transactions should be applied in order. Leader writes a 'confirm'
> > message to the WAL that refers to the transaction's [LEADER_ID, LSN] and
> > the confirm has its own LSN. This confirm message is delivered to all
> > replicas through the existing replication mechanism.
> > 
> > Replica should report a TXN application success to the leader via the
> > IPROTO explicitly to allow leader to collect the quorum for the TXN.
> > In case of application failure the replica has to disconnect from the
> > replication the same way as it is done now. The replica also has to
> > report its disconnection to the orchestrator. Further actions require
> > human intervention, since failure means either technical problem (such
> > as not enough space for WAL) that has to be resovled or an inconsistent
> > state that requires rejoin.
> 
> > As soon as leader appears in a situation it has not enough replicas
> > to achieve quorum, the cluster should stop accepting any requests - both
> > write and read.
> 
> How does *the cluster* know the state of the leader and if it
> doesn't, how it can possibly implement this? Did you mean
> the leader should stop accepting transactions here? But how can
> the leader know if it has not enough replicas during a read
> transaction, if it doesn't contact any replica to serve a read?

I expect to have a disconnection trigger assigned to all relays, so that
a disconnection will cause the number of replicas to decrease. The quorum
size is static, so we can stop at the very moment the number dives below it.
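
A minimal sketch of that trigger, assuming a static quorum and hypothetical
names (`on_relay_disconnect` and friends are not an existing API):

```
-- Each relay gets a disconnection trigger that decrements a live-replica
-- counter; the leader stops serving the moment the counter dives below
-- the static quorum. Purely illustrative.
local QUORUM = 3
local connected = 4             -- replicas currently attached to relays
local writable = true

local function on_relay_disconnect()
    connected = connected - 1
    if connected < QUORUM then
        writable = false        -- stop serving, report to the orchestrator
        print('quorum lost: ' .. connected .. ' < ' .. QUORUM)
    end
end

local function on_relay_connect()
    connected = connected + 1
    if connected >= QUORUM then
        writable = true
    end
end

on_relay_disconnect()           -- 3 relays left: still at quorum
on_relay_disconnect()           -- 2 relays left: the leader stops
```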

> 
> > The reason for this is that replication of transactions
> > can achieve quorum on replicas not visible to the leader. On the other
> > hand, leader can't achieve quorum with available minority. Leader has to
> > report the state and wait for human intervention. There's an option to
> > ask leader to rollback to the latest transaction that has quorum: leader
> > issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN
> > is of the first transaction in the leader's undo log. The rollback
> > message replicated to the available cluster will put it in a consistent
> > state. After that configuration of the cluster can be updated to
> > available quorum and leader can be switched back to write mode.
> 
> As you should be able to conclude from restart scenario, it is
> possible a replica has the record in *confirmed* state but the
> leader has it in pending state. The replica will not be able to
> roll back then. Do you suggest the replica should abort if it
> can't rollback? This may lead to an avalanche of rejoins on leader
> restart, bringing performance to a halt.

No, I declare the replica with the biggest LSN the new shining leader. More
than that, the new leader can (and so far it will, by default) finalize the
former leader's life's work by replicating the txns and the appropriate confirms.
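
A sketch of that appointment rule - illustrative only, not an existing function:

```
-- Among the reachable replicas, pick the one with the biggest LSN for the
-- failed leader's ID; it is the best candidate to finalize the old leader's
-- pending transactions and confirms.
local function pick_new_leader(replicas, old_leader_id)
    local best, best_lsn = nil, -1
    for _, r in ipairs(replicas) do
        local lsn = r.vclock[old_leader_id] or 0
        if lsn > best_lsn then
            best, best_lsn = r, lsn
        end
    end
    return best
end

local candidates = {
    { name = 'r1', vclock = { [1] = 100 } },
    { name = 'r2', vclock = { [1] = 105 } },  -- saw txns the old leader lost
    { name = 'r3', vclock = { [1] = 100 } },
}
print(pick_new_leader(candidates, 1).name)    -- r2
```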

Sergos.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-06 16:39       ` Sergey Ostanevich
@ 2020-05-06 18:44         ` Konstantin Osipov
  2020-05-12 15:55           ` Sergey Ostanevich
  2020-05-13 21:36         ` Vladislav Shpilevoy
  1 sibling, 1 reply; 53+ messages in thread
From: Konstantin Osipov @ 2020-05-06 18:44 UTC (permalink / raw)
  To: Sergey Ostanevich; +Cc: tarantool-patches, Vladislav Shpilevoy

* Sergey Ostanevich <sergos@tarantool.org> [20/05/06 19:41]:
> > >    |               |              |             |              |
> > >    |            [Quorum           |             |              |
> > >    |           achieved]          |             |              |
> > >    |               |              |             |              |
> > >    |         [TXN undo log        |             |              |
> > >    |           destroyed]         |             |              |
> > >    |               |              |             |              |
> > >    |               |---Confirm--->|             |              |
> > >    |               |              |             |              |
> > 
> > What happens if writing Confirm to WAL fails? TXN und log record
> > is destroyed already. Will the server panic now on WAL failure,
> > even if it is intermittent?
> 
> I would like to have an example of intermittent WAL failure. Can it be
> other than problem with disc - be it space/availability/malfunction?

For SAN disks it can simply be a networking issue. The same is
true for any virtual filesystem in the cloud. For local disks it
is most often out of space, but this is not an impossible event.

> For all of those it should be resolved outside the DBMS anyways. So,
> leader should stop and report its problems to orchestrator/admins.

Sergey, I understand that the RAFT spec is big and that with this spec you
try to split it into manageable parts. The question is how useful
this particular piece is. I'm trying to point out that "the leader
should stop" is not a silver bullet - especially since each such
stop may mean a rejoin of some other node. The purpose of sync
replication is to provide consistency without reducing
availability (i.e. to make progress as long as the quorum
of nodes makes progress).

The current spec, suggesting there should be a leader stop in case
of most errors, reduces availability significantly and doesn't
make the external coordinator's job any easier - it still has to follow
the prescriptions of RAFT to the letter.

> landed to WAL - same is for replica.

> 
> > 
> > >    |               |----------Confirm---------->|              |
> > 
> > What happens if peers receive and maybe even write Confirm to their WALs
> > but local WAL write is lost after a restart?
> 
> Did you mean WAL write on leader as a local? Then we have a replica with
> a bigger LSN for the leader ID. 

> > WAL is not synced, 
> > so we can easily lose the tail of the WAL. Tarantool will sync up
> > with all replicas on restart,
> 
> But at this point a new leader will be appointed - the old one is
> restarted. Then the Confirm message will arrive to the restarted leader 
> through a regular replication.

This assumes that restart is guaranteed to be noticed by the
external coordinator and there is an election on every restart.

> > but there will be no "Replication
> > OK" messages from them, so it wouldn't know that the transaction
> > is committed on them. How is this handled? We may end up with some
> > replicas confirming the transaction while the leader will roll it
> > back on restart. Do you suggest there is a human intervention on
> > restart as well?
> > 
> > 
> > >    |               |              |             |              |
> > >    |<---TXN Ok-----|              |       [TXN undo log        |
> > >    |               |              |         destroyed]         |
> > >    |               |              |             |              |
> > >    |               |              |             |---Confirm--->|
> > >    |               |              |             |              |
> > > ```
> > > 
> > > The quorum should be collected as a table for a list of transactions
> > > waiting for quorum. The latest transaction that collects the quorum is
> > > considered as complete, as well as all transactions prior to it, since
> > > all transactions should be applied in order. Leader writes a 'confirm'
> > > message to the WAL that refers to the transaction's [LEADER_ID, LSN] and
> > > the confirm has its own LSN. This confirm message is delivered to all
> > > replicas through the existing replication mechanism.
> > > 
> > > Replica should report a TXN application success to the leader via the
> > > IPROTO explicitly to allow leader to collect the quorum for the TXN.
> > > In case of application failure the replica has to disconnect from the
> > > replication the same way as it is done now. The replica also has to
> > > report its disconnection to the orchestrator. Further actions require
> > > human intervention, since failure means either technical problem (such
> > > as not enough space for WAL) that has to be resovled or an inconsistent
> > > state that requires rejoin.
> > 
> > > As soon as leader appears in a situation it has not enough replicas
> > > to achieve quorum, the cluster should stop accepting any requests - both
> > > write and read.
> > 
> > How does *the cluster* know the state of the leader and if it
> > doesn't, how it can possibly implement this? Did you mean
> > the leader should stop accepting transactions here? But how can
> > the leader know if it has not enough replicas during a read
> > transaction, if it doesn't contact any replica to serve a read?
> 
> I expect to have a disconnection trigger assigned to all relays so that
> disconnection will cause the number of replicas decrease. The quorum
> size is static, so we can stop at the very moment the number dives below.

What happens between the moment the leader is partitioned away and
the moment a new leader is elected?

The leader may be unaware of the events and serve a read just
fine.

So at least you can't say the leader shouldn't be serving reads
without quorum - because the only way to achieve it is to collect
a quorum of responses to reads as well.

> > > The reason for this is that replication of transactions
> > > can achieve quorum on replicas not visible to the leader. On the other
> > > hand, leader can't achieve quorum with available minority. Leader has to
> > > report the state and wait for human intervention. There's an option to
> > > ask leader to rollback to the latest transaction that has quorum: leader
> > > issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN
> > > is of the first transaction in the leader's undo log. The rollback
> > > message replicated to the available cluster will put it in a consistent
> > > state. After that configuration of the cluster can be updated to
> > > available quorum and leader can be switched back to write mode.
> > 
> > As you should be able to conclude from restart scenario, it is
> > possible a replica has the record in *confirmed* state but the
> > leader has it in pending state. The replica will not be able to
> > roll back then. Do you suggest the replica should abort if it
> > can't rollback? This may lead to an avalanche of rejoins on leader
> > restart, bringing performance to a halt.
> 
> No, I declare replica with biggest LSN as a new shining leader. More
> than that, new leader can (so far it will be by default) finalize the
> former leader life's work by replicating txns and appropriate confirms.

Right, this also assumes the restart is noticed, so it follows the
same logic.

-- 
Konstantin Osipov, Moscow, Russia
https://scylladb.com

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-30 14:50   ` Sergey Ostanevich
  2020-05-06  8:52     ` Konstantin Osipov
@ 2020-05-06 18:55     ` Konstantin Osipov
  2020-05-06 19:10       ` Konstantin Osipov
  2020-05-13 21:42       ` Vladislav Shpilevoy
  2020-05-07 23:01     ` Konstantin Osipov
  2 siblings, 2 replies; 53+ messages in thread
From: Konstantin Osipov @ 2020-05-06 18:55 UTC (permalink / raw)
  To: Sergey Ostanevich; +Cc: tarantool-patches, Vladislav Shpilevoy

* Sergey Ostanevich <sergos@tarantool.org> [20/04/30 17:51]:

A few more issues:

- the spec assumes there is a full mesh. In any other
  topology electing a leader based on the longest wal can easily
  deadlock. Yet it provides no protection against non-full-mesh
  setups. Currently the server can't even detect that this is not
  a full-mesh setup, so can't check if the precondition for this
  to work correctly is met.

- the spec assumes that quorum is identical to the
  number of replicas, and that the number of replicas is stable across
  the cluster's lifetime. Can I have quorum=2 while the number of
  replicas is 4? Am I allowed to increase the number of replicas
  online? What happens when a replica is added -
  how exactly, and starting from which transaction, is the leader
  required to collect a bigger quorum?

- the same goes for removing a replica. How is the quorum reduced?

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-06 18:55     ` Konstantin Osipov
@ 2020-05-06 19:10       ` Konstantin Osipov
  2020-05-12 16:03         ` Sergey Ostanevich
  2020-05-13 21:42       ` Vladislav Shpilevoy
  1 sibling, 1 reply; 53+ messages in thread
From: Konstantin Osipov @ 2020-05-06 19:10 UTC (permalink / raw)
  To: Sergey Ostanevich, Vladislav Shpilevoy, tarantool-patches

* Konstantin Osipov <kostja.osipov@gmail.com> [20/05/06 21:55]:
> A few more issues:
> 
> - the spec assumes there is a full mesh. In any other
>   topology electing a leader based on the longest wal can easily
>   deadlock. Yet it provides no protection against non-full-mesh
>   setups. Currently the server can't even detect that this is not
>   a full-mesh setup, so can't check if the precondition for this
>   to work correctly is met.

Come to think of it, it's a special case of network partitioning.
A replica with the longest WAL can be reachable by the external
coordinator but partitioned away from the majority, so never able to
make progress.


-- 
Konstantin Osipov, Moscow, Russia
https://scylladb.com

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-30 14:50   ` Sergey Ostanevich
  2020-05-06  8:52     ` Konstantin Osipov
  2020-05-06 18:55     ` Konstantin Osipov
@ 2020-05-07 23:01     ` Konstantin Osipov
  2020-05-12 16:40       ` Sergey Ostanevich
  2 siblings, 1 reply; 53+ messages in thread
From: Konstantin Osipov @ 2020-05-07 23:01 UTC (permalink / raw)
  To: Sergey Ostanevich; +Cc: tarantool-patches, Vladislav Shpilevoy



> ### Synchronous replication enabling.
> 
> Synchronous operation can be required for a set of spaces in the data
> scheme. That means only transactions that contain data modification for
> these spaces should require quorum. Such transactions named synchronous.
> As soon as last operation of synchronous transaction appeared in leader's
> WAL, it will cause all following transactions - no matter if they are
> synchronous or not - wait for the quorum. In case quorum is not achieved
> the 'rollback' operation will cause rollback of all transactions after
> the synchronous one. It will ensure the consistent state of the data both
> on leader and replicas. In case user doesn't require synchronous operation
> for any space then no changes to the WAL generation and replication will
> appear.

1) It's unclear what happens here if async tx follows a sync tx.
   Does it wait for the sync tx? This reduces availability for
   async txs - so it's hardly acceptable. Besides, with
   group=local spaces, one can quickly run out of memory for undo.
   

Then it should be allowed to proceed and commit.

But then mixing sync and async spaces in a single transaction
shouldn't be allowed.

Imagine t1 is sync and t2 is async. tx1 changes t1 and t2, tx2
changes t2. tx1 is not confirmed and must be rolled back. But it
cannot revert the changes of tx2 (a toy model of this scenario is
sketched after point 3 below).

The spec should clarify that.

2) First candidates for "sync" spaces are the system spaces, especially
   _schema (to fix box.once()) and _cluster (to fix the parallel join
   of multiple replicas).

I can't imagine it's possible to make system spaces synchronous
with an external coordinator - the coordinator may not be
available during box.cfg{}.

3) One can quickly run out of memory for undo. Any sync
   transaction should be capped with a timeout to avoid OOMs. I
   don't know how many times I should repeat it. The only good
   solution for load control is the in-memory WAL, which will allow
   rolling back all transactions as soon as network partitioning is
   detected.
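
Coming back to the mixing problem from point 1, a toy model with spaces
reduced to plain Lua tables (nothing here is a real Tarantool API):

```
-- t1 is sync, t2 is async. tx1 touches both; tx2 is async and builds on
-- top of tx1's change to t2. When tx1 misses its quorum and is rolled
-- back, its t1 part can be reverted, but its t2 part cannot be undone
-- without also destroying tx2's already-committed change.
local t1, t2 = {}, {}

-- tx1: writes to the sync space t1 and the async space t2
t1.a = 'tx1'
t2.k = 'tx1'                    -- immediately visible to later txs

-- tx2: async, touches only t2, commits right away
t2.k = t2.k .. '+tx2'           -- t2.k == 'tx1+tx2'

-- tx1 fails to collect the quorum and is rolled back
t1.a = nil                      -- easy: nobody built on top of it
print(t2.k)                     -- 'tx1+tx2': half of tx1 is still visible
```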

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-06 18:44         ` Konstantin Osipov
@ 2020-05-12 15:55           ` Sergey Ostanevich
  2020-05-12 16:42             ` Konstantin Osipov
  2020-05-13 21:39             ` Vladislav Shpilevoy
  0 siblings, 2 replies; 53+ messages in thread
From: Sergey Ostanevich @ 2020-05-12 15:55 UTC (permalink / raw)
  To: Konstantin Osipov, Vladislav Shpilevoy, tarantool-patches

On 06 May 21:44, Konstantin Osipov wrote:
> * Sergey Ostanevich <sergos@tarantool.org> [20/05/06 19:41]:
> > > >    |               |              |             |              |
> > > >    |            [Quorum           |             |              |
> > > >    |           achieved]          |             |              |
> > > >    |               |              |             |              |
> > > >    |         [TXN undo log        |             |              |
> > > >    |           destroyed]         |             |              |
> > > >    |               |              |             |              |
> > > >    |               |---Confirm--->|             |              |
> > > >    |               |              |             |              |
> > > 
> > > What happens if writing Confirm to WAL fails? TXN und log record
> > > is destroyed already. Will the server panic now on WAL failure,
> > > even if it is intermittent?
> > 
> > I would like to have an example of intermittent WAL failure. Can it be
> > other than problem with disc - be it space/availability/malfunction?
> 
> For SAN disks it can simply be a networking issue. The same is
> true for any virtual filesystem in the cloud. For local disks it
> is most often out of space, but this is not an impossible event.

SanDisk is an SSD vendor. I bet you mean NAS - network-attached
storage, don't you? Then I see no difference for a WAL write into NAS in
the current scheme - you will catch a timeout, the WAL will report a failure,
and the replica stops.

> 
> > For all of those it should be resolved outside the DBMS anyways. So,
> > leader should stop and report its problems to orchestrator/admins.
> 
> Sergey, I understand that RAFT spec is big and with this spec you
> try to split it into manageable parts. The question is how useful
> is this particular piece. I'm trying to point out that "the leader
> should stop" is not a silver bullet - especially since each such
> stop may mean a rejoin of some other node. The purpose of sync
> replication is to provide consistency without reducing
> availability (i.e. make progress as long as the quorum
> of nodes make progress). 

I'm not sure if we're talking about the same RAFT - mine is "In Search
of an Understandable Consensus Algorithm (Extended Version)" from
Stanford as of May 2014. And it is 15 pages - including references,
conclusions and intro. That doesn't seem that big.

Although most of it is dedicated to the leader election itself, which
we intentionally put aside from this RFC. This is written at the very
beginning, and I emphasized it by mentioning it explicitly.

> 
> The current spec, suggesting there should be a leader stop in case
> of most errors, reduces availability significantly, and doesn't
> make external coordinator job any easier - it still has to follow to
> the letter the prescriptions of RAFT. 

So, postponing the commit until the quorum is collected is the most
useful part of this RFC; to some extent I'm also trying to address the
WAL inconsistency. Although it can be covered only partly: if a
leader's log diverges in unconfirmed transactions only, then they can be
rolled back easily. Technically, it should be enough if the leader is changed
to a replica from the cluster majority at the moment of failure.
Otherwise it will require pre-parsing of the WAL, and it may well happen
that the WAL is not long enough, hence the ex-leader will still need a complete
bootstrap.
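
A sketch of that "easy" case: the ex-leader simply drops the WAL tail that
was never confirmed. The names are made up for illustration:

```
-- If the ex-leader's WAL diverges from the new leader only in unconfirmed
-- transactions, the tail after the last confirmed LSN can simply be
-- rolled back; otherwise the WAL would have to be pre-parsed, or the node
-- fully re-bootstrapped.
local function drop_unconfirmed_tail(wal, last_confirmed_lsn)
    local kept = {}
    for _, rec in ipairs(wal) do
        if rec.lsn <= last_confirmed_lsn then
            table.insert(kept, rec)
        end
        -- records above last_confirmed_lsn are unconfirmed: drop them and
        -- revert their changes through the undo log
    end
    return kept
end

local wal = { { lsn = 10 }, { lsn = 11 }, { lsn = 12 } }
print(#drop_unconfirmed_tail(wal, 11))   -- 2: LSN 12 was never confirmed
```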

> 
> > landed to WAL - same is for replica.
> 
> > 
> > > 
> > > >    |               |----------Confirm---------->|              |
> > > 
> > > What happens if peers receive and maybe even write Confirm to their WALs
> > > but local WAL write is lost after a restart?
> > 
> > Did you mean WAL write on leader as a local? Then we have a replica with
> > a bigger LSN for the leader ID. 
> 
> > > WAL is not synced, 
> > > so we can easily lose the tail of the WAL. Tarantool will sync up
> > > with all replicas on restart,
> > 
> > But at this point a new leader will be appointed - the old one is
> > restarted. Then the Confirm message will arrive to the restarted leader 
> > through a regular replication.
> 
> This assumes that restart is guaranteed to be noticed by the
> external coordinator and there is an election on every restart.

Sure, yes: if it restarted, then the lost connection can't go unnoticed by
anyone, be it the coordinator or the cluster.

> 
> > > but there will be no "Replication
> > > OK" messages from them, so it wouldn't know that the transaction
> > > is committed on them. How is this handled? We may end up with some
> > > replicas confirming the transaction while the leader will roll it
> > > back on restart. Do you suggest there is a human intervention on
> > > restart as well?
> > > 
> > > 
> > > >    |               |              |             |              |
> > > >    |<---TXN Ok-----|              |       [TXN undo log        |
> > > >    |               |              |         destroyed]         |
> > > >    |               |              |             |              |
> > > >    |               |              |             |---Confirm--->|
> > > >    |               |              |             |              |
> > > > ```
> > > > 
> > > > The quorum should be collected as a table for a list of transactions
> > > > waiting for quorum. The latest transaction that collects the quorum is
> > > > considered as complete, as well as all transactions prior to it, since
> > > > all transactions should be applied in order. Leader writes a 'confirm'
> > > > message to the WAL that refers to the transaction's [LEADER_ID, LSN] and
> > > > the confirm has its own LSN. This confirm message is delivered to all
> > > > replicas through the existing replication mechanism.
> > > > 
> > > > Replica should report a TXN application success to the leader via the
> > > > IPROTO explicitly to allow leader to collect the quorum for the TXN.
> > > > In case of application failure the replica has to disconnect from the
> > > > replication the same way as it is done now. The replica also has to
> > > > report its disconnection to the orchestrator. Further actions require
> > > > human intervention, since failure means either technical problem (such
> > > > as not enough space for WAL) that has to be resovled or an inconsistent
> > > > state that requires rejoin.
> > > 
> > > > As soon as leader appears in a situation it has not enough replicas
> > > > to achieve quorum, the cluster should stop accepting any requests - both
> > > > write and read.
> > > 
> > > How does *the cluster* know the state of the leader and if it
> > > doesn't, how it can possibly implement this? Did you mean
> > > the leader should stop accepting transactions here? But how can
> > > the leader know if it has not enough replicas during a read
> > > transaction, if it doesn't contact any replica to serve a read?
> > 
> > I expect to have a disconnection trigger assigned to all relays so that
> > disconnection will cause the number of replicas decrease. The quorum
> > size is static, so we can stop at the very moment the number dives below.
> 
> What happens between the event the leader is partitioned away and
> a new leader is elected?
> 
> The leader may be unaware of the events and serve a read just
> fine.

As it is stated 20 lines above: 
> > > > As soon as leader appears in a situation it has not enough
> > > > replicas
> > > > to achieve quorum, the cluster should stop accepting any
> > > > requests - both
> > > > write and read.

So it will not serve. 

> 
> So at least you can't say the leader shouldn't be serving reads
> without quorum - because the only way to achieve it is to collect
> a quorum of responses to reads as well.

The leader lost connection to (N-Q)+1 replicas out of the N in a
cluster with a quorum of Q == it stops serving anything. So the quorum
criterion is there: no quorum - no reads.
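
The arithmetic behind that statement, spelled out for a small cluster:

```
-- With N replicas and a quorum of Q, losing up to N - Q of them still
-- leaves a quorum reachable; losing (N - Q) + 1 makes a quorum impossible,
-- so the leader stops serving both writes and reads.
local N, Q = 5, 3
print('tolerated losses: ' .. (N - Q))        -- 2
print('fatal losses:     ' .. ((N - Q) + 1))  -- 3
```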

> 
> > > > The reason for this is that replication of transactions
> > > > can achieve quorum on replicas not visible to the leader. On the other
> > > > hand, leader can't achieve quorum with available minority. Leader has to
> > > > report the state and wait for human intervention. There's an option to
> > > > ask leader to rollback to the latest transaction that has quorum: leader
> > > > issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN
> > > > is of the first transaction in the leader's undo log. The rollback
> > > > message replicated to the available cluster will put it in a consistent
> > > > state. After that configuration of the cluster can be updated to
> > > > available quorum and leader can be switched back to write mode.
> > > 
> > > As you should be able to conclude from restart scenario, it is
> > > possible a replica has the record in *confirmed* state but the
> > > leader has it in pending state. The replica will not be able to
> > > roll back then. Do you suggest the replica should abort if it
> > > can't rollback? This may lead to an avalanche of rejoins on leader
> > > restart, bringing performance to a halt.
> > 
> > No, I declare replica with biggest LSN as a new shining leader. More
> > than that, new leader can (so far it will be by default) finalize the
> > former leader life's work by replicating txns and appropriate confirms.
> 
> Right, this also assumes the restart is noticed, so it follows the
> same logic.

How can a restart go unnoticed if it causes a disconnection?

> 
> -- 
> Konstantin Osipov, Moscow, Russia
> https://scylladb.com

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-06 19:10       ` Konstantin Osipov
@ 2020-05-12 16:03         ` Sergey Ostanevich
  0 siblings, 0 replies; 53+ messages in thread
From: Sergey Ostanevich @ 2020-05-12 16:03 UTC (permalink / raw)
  To: Konstantin Osipov, Vladislav Shpilevoy, tarantool-patches

On 06 May 22:10, Konstantin Osipov wrote:
> * Konstantin Osipov <kostja.osipov@gmail.com> [20/05/06 21:55]:
> > A few more issues:
> > 
> > - the spec assumes there is a full mesh. In any other
> >   topology electing a leader based on the longest wal can easily
> >   deadlock. Yet it provides no protection against non-full-mesh
> >   setups. Currently the server can't even detect that this is not
> >   a full-mesh setup, so can't check if the precondition for this
> >   to work correctly is met.
> 
> Come to think of it, it's a special case of network partitioning.
> A replica with the longest WAL can be reachable by the external
> coordinator but partitioned away from the majority, so never able to
> make progress.

So the answer from this replica upon its appointment will be 'I have no
quorum'. Hence, the orchestration should pick the replica with the
next-longest WAL. What's the problem?

> 
> 
> -- 
> Konstantin Osipov, Moscow, Russia
> https://scylladb.com

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-07 23:01     ` Konstantin Osipov
@ 2020-05-12 16:40       ` Sergey Ostanevich
  2020-05-12 17:47         ` Konstantin Osipov
  0 siblings, 1 reply; 53+ messages in thread
From: Sergey Ostanevich @ 2020-05-12 16:40 UTC (permalink / raw)
  To: Konstantin Osipov, Vladislav Shpilevoy, tarantool-patches

On 08 May 02:01, Konstantin Osipov wrote:
> 
> 
> > ### Synchronous replication enabling.
> > 
> > Synchronous operation can be required for a set of spaces in the data
> > scheme. That means only transactions that contain data modification for
> > these spaces should require quorum. Such transactions named synchronous.
> > As soon as last operation of synchronous transaction appeared in leader's
> > WAL, it will cause all following transactions - no matter if they are
> > synchronous or not - wait for the quorum. In case quorum is not achieved
> > the 'rollback' operation will cause rollback of all transactions after
> > the synchronous one. It will ensure the consistent state of the data both
> > on leader and replicas. In case user doesn't require synchronous operation
> > for any space then no changes to the WAL generation and replication will
> > appear.
> 
> 1) It's unclear what happens here if async tx follows a sync tx.
>    Does it wait for the sync tx? This reduces availability for

Definitely yes, unless we keep the 'dirty read' as it is at the moment
in memtx. This is the essence of the design, and it is temporary until
MVCC similar to the vinyl machinery appears. I intentionally didn't
include this big task in this RFC.

It will provide similar capabilities, although it will keep only
dependent transactions in the undo log. Also, it looks like it will fit
well into the machinery of this RFC. 

>    async txs - so it's hardly acceptable. Besides, with
>    group=local spaces, one can quickly run out of memory for undo.
>    
> 
> Then it should be allowed to proceed and commit.
> 
> Then mixing sync and async tables in a single transaction
> shouldn't be allowed.
> 
> Imagine t1 is sync and t2 is async. tx1 changes t1 and t2, tx2
> changes t2. tx1 is not confirmed and must be rolled back. But it can
> not revert changes of tx2.
> 
> The spec should clarify that.
> 
> 2) First candidates to "sync" spaces are system spaces, especially
>    _schema (to fix box.once()) and _cluster (to fix parallel join
>    of multiple replicas).
> 
> I can't imagine it's possible to make system spaces synchronous
> with an external coordinator - the coordinator may not be
> available during box.cfg{}.

'May not be available' means no coordination, which means the server can't
start. Again, we're not trying to elaborate a self-driven cluster at this
moment; we rely on external coordination.
> 
> 3) One can quickly run out of memory for undo. Any sync
>    transaction should be capped with a timeout to avoid OOMs. I
>    don't know how many times I should repeat it. The only good
>    solution for load control is in-memory WAL, which will allow to
>    rollback all transactions as soon as network partitioning is
>    detected.

How can the in-memory WAL help save on _undo_ memory?
To roll back whatever amount of transactions, one needs to store the undo.

> 
> -- 
> Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-12 15:55           ` Sergey Ostanevich
@ 2020-05-12 16:42             ` Konstantin Osipov
  2020-05-13 21:39             ` Vladislav Shpilevoy
  1 sibling, 0 replies; 53+ messages in thread
From: Konstantin Osipov @ 2020-05-12 16:42 UTC (permalink / raw)
  To: Sergey Ostanevich; +Cc: tarantool-patches, Vladislav Shpilevoy

* Sergey Ostanevich <sergos@tarantool.org> [20/05/12 18:56]:
> On 06 May 21:44, Konstantin Osipov wrote:
> > * Sergey Ostanevich <sergos@tarantool.org> [20/05/06 19:41]:
> > > > >    |               |              |             |              |
> > > > >    |            [Quorum           |             |              |
> > > > >    |           achieved]          |             |              |
> > > > >    |               |              |             |              |
> > > > >    |         [TXN undo log        |             |              |
> > > > >    |           destroyed]         |             |              |
> > > > >    |               |              |             |              |
> > > > >    |               |---Confirm--->|             |              |
> > > > >    |               |              |             |              |
> > > > 
> > > > What happens if writing Confirm to WAL fails? TXN und log record
> > > > is destroyed already. Will the server panic now on WAL failure,
> > > > even if it is intermittent?
> > > 
> > > I would like to have an example of intermittent WAL failure. Can it be
> > > other than problem with disc - be it space/availability/malfunction?
> > 
> > For SAN disks it can simply be a networking issue. The same is
> > true for any virtual filesystem in the cloud. For local disks it
> > is most often out of space, but this is not an impossible event.
> 
> The SANDisk is an SSD vendor. I bet you mean NAS - network array
> storage, isn't it? Then I see no difference in WAL write into NAS in
> current schema - you will catch a timeout, WAL will report failure,
> replica stops.

SAN stands for storage area network.

There is no timeout in the WAL tx bus and no timeout in WAL I/O.

A replica doesn't stop on an intermittent failure. Stopping a
replica on an intermittent failure reduces availability of
non-sync writes.

It seems you have some assumptions in mind which are not in the
document - e.g. that some timeouts are added. They are not in the
POC either.

I suppose the document is expected to explain quite accurately
what has to be done, e.g. how these new timeouts work?

> > > For all of those it should be resolved outside the DBMS anyways. So,
> > > leader should stop and report its problems to orchestrator/admins.
> > 
> > Sergey, I understand that RAFT spec is big and with this spec you
> > try to split it into manageable parts. The question is how useful
> > is this particular piece. I'm trying to point out that "the leader
> > should stop" is not a silver bullet - especially since each such
> > stop may mean a rejoin of some other node. The purpose of sync
> > replication is to provide consistency without reducing
> > availability (i.e. make progress as long as the quorum
> > of nodes make progress). 
> 
> I'm not sure if we're talking about the same RAFT - mine is "In Search
> of an Understandable Consensus Algorithm (Extended Version)" from
> Stanford as of May 2014. And it is 15 pages - including references,
> conclusions and intro. Seems not that big.
> 
> Although, most of it is dedicated to the leader election itself, which
> we intentionally put aside from this RFC. It is written in the very
> beginning and I empasized this by explicit mentioning of it.

I conclude that it is big from the state of this document. It
provides some coverage of normal operation.
Leader election, failure detection, recovery/restart, and
replication configuration changes are either barely
mentioned or not covered at all.
I find no other reason not to cover them except to be able to come
up with an MVP quicker. Do you?


> > The current spec, suggesting there should be a leader stop in case
> > of most errors, reduces availability significantly, and doesn't
> > make external coordinator job any easier - it still has to follow to
> > the letter the prescriptions of RAFT. 
> 
> So, the postponing of a commit until quorum collection is the most
> useful part of this RFC, also to some point I'm trying to address the
> WAL insconsistency.

> Although, it can be covered only partly: if a
> leader's log diverge in unconfirmed transactions only, then they can be
> rolled back easiy. Technically, it should be enough if leader changed
> for a replica from the cluster majority at the moment of failure.
> Otherwise it will require pre-parsing of the WAL and it can well happens
> that WAL is not long enough, hence ex-leader still need a complete
> bootstrap. 

I don't understand what pre-parsing is, or how what you write is
relevant to the fact that reduced availability of non-Raft writes
is bad.

> > > But at this point a new leader will be appointed - the old one is
> > > restarted. Then the Confirm message will arrive to the restarted leader 
> > > through a regular replication.
> > 
> > This assumes that restart is guaranteed to be noticed by the
> > external coordinator and there is an election on every restart.
> 
> Sure yes, if it restarted - then connection lost can't be unnoticed by
> anyone, be it coordinator or cluster.

Well, the spec doesn't say anywhere that the external coordinator
has to establish a TCP connection to every participant. Could you
please add a chapter where this is clarified? It seems you have a
specific coordinator in mind?

> > > > > The quorum should be collected as a table for a list of transactions
> > > > > waiting for quorum. The latest transaction that collects the quorum is
> > > > > considered as complete, as well as all transactions prior to it, since
> > > > > all transactions should be applied in order. Leader writes a 'confirm'
> > > > > message to the WAL that refers to the transaction's [LEADER_ID, LSN] and
> > > > > the confirm has its own LSN. This confirm message is delivered to all
> > > > > replicas through the existing replication mechanism.
> > > > > 
> > > > > Replica should report a TXN application success to the leader via the
> > > > > IPROTO explicitly to allow leader to collect the quorum for the TXN.
> > > > > In case of application failure the replica has to disconnect from the
> > > > > replication the same way as it is done now. The replica also has to
> > > > > report its disconnection to the orchestrator. Further actions require
> > > > > human intervention, since failure means either technical problem (such
> > > > > as not enough space for WAL) that has to be resovled or an inconsistent
> > > > > state that requires rejoin.
> > > > 
> > > > > As soon as leader appears in a situation it has not enough replicas
> > > > > to achieve quorum, the cluster should stop accepting any requests - both
> > > > > write and read.
> > > > 
> > > > How does *the cluster* know the state of the leader and if it
> > > > doesn't, how it can possibly implement this? Did you mean
> > > > the leader should stop accepting transactions here? But how can
> > > > the leader know if it has not enough replicas during a read
> > > > transaction, if it doesn't contact any replica to serve a read?
> > > 
> > > I expect to have a disconnection trigger assigned to all relays so that
> > > disconnection will cause the number of replicas decrease. The quorum
> > > size is static, so we can stop at the very moment the number dives below.
> > 
> > What happens between the event the leader is partitioned away and
> > a new leader is elected?
> > 
> > The leader may be unaware of the events and serve a read just
> > fine.
> 
> As it is stated 20 lines above: 
> > > > > As soon as leader appears in a situation it has not enough
> > > > > replicas
> > > > > to achieve quorum, the cluster should stop accepting any
> > > > > requests - both
> > > > > write and read.
> 
> So it will not serve. 

Sergey, this is recursion. I'm asking you to clarify exactly this
point.
Do you assume that replicas perform some kind of
failure detection? What kind? Is it *in addition* to the failure
detection performed by the external coordinator? 
Any failure detector imaginable would be asynchronous. 
What happens between the failure and the time it's detected?

> > So at least you can't say the leader shouldn't be serving reads
> > without quorum - because the only way to achieve it is to collect
> > a quorum of responses to reads as well.
> 
> The leader lost connection to the (N-Q)+1 repllicas out of the N in
> cluster with a quorum of Q == it stops serving anything. So the quorum
> criteria is there: no quorum - no reads. 

OK, so you assume that the TCP connection *is* the failure detector?

Failure detection in TCP is optional, asynchronous, and worst of
all, unreliable. Why do you think it can be used?

> > > > > The reason for this is that replication of transactions
> > > > > can achieve quorum on replicas not visible to the leader. On the other
> > > > > hand, leader can't achieve quorum with available minority. Leader has to
> > > > > report the state and wait for human intervention. There's an option to
> > > > > ask leader to rollback to the latest transaction that has quorum: leader
> > > > > issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN
> > > > > is of the first transaction in the leader's undo log. The rollback
> > > > > message replicated to the available cluster will put it in a consistent
> > > > > state. After that configuration of the cluster can be updated to
> > > > > available quorum and leader can be switched back to write mode.
> > > > 
> > > > As you should be able to conclude from restart scenario, it is
> > > > possible a replica has the record in *confirmed* state but the
> > > > leader has it in pending state. The replica will not be able to
> > > > roll back then. Do you suggest the replica should abort if it
> > > > can't rollback? This may lead to an avalanche of rejoins on leader
> > > > restart, bringing performance to a halt.
> > > 
> > > No, I declare replica with biggest LSN as a new shining leader. More
> > > than that, new leader can (so far it will be by default) finalize the
> > > former leader life's work by replicating txns and appropriate confirms.
> > 
> > Right, this also assumes the restart is noticed, so it follows the
> > same logic.
> 
> How a restart can be unnoticed, if it causes disconnection?

Honestly, I'm baffled. It's like we speak different languages.
I can't imagine you are unaware of the fallacies of distributed
computing, but I see no other explanation for your question.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-12 16:40       ` Sergey Ostanevich
@ 2020-05-12 17:47         ` Konstantin Osipov
  2020-05-13 21:34           ` Vladislav Shpilevoy
  0 siblings, 1 reply; 53+ messages in thread
From: Konstantin Osipov @ 2020-05-12 17:47 UTC (permalink / raw)
  To: Sergey Ostanevich; +Cc: tarantool-patches, Vladislav Shpilevoy

* Sergey Ostanevich <sergos@tarantool.org> [20/05/12 19:43]:

> > 1) It's unclear what happens here if async tx follows a sync tx.
> >    Does it wait for the sync tx? This reduces availability for
> 
> Definitely yes, unless we keep the 'dirty read' as it is at the moment
> in memtx. This is the essence of the design, and it is temporary until 
> the MVCC similar to the vinyl machinery appears. I intentionally didn't
> include this big task into this RFC. 
> 
> It will provide similar capabilities, although it will keep only
> dependent transactions in the undo log. Also, it looks like it will fit
> well into the machinery of this RFC. 

= reduced availability for all who have at least one sync space.

If different spaces have different quorum sizes, then the quorum size of
the biggest group is effectively used for all spaces.

Replica-local transactions, e.g. those used by vinyl compaction, 
are rolled back if there is no quorum.

What's the value of this?


> 
> >    async txs - so it's hardly acceptable. Besides, with
> >    group=local spaces, one can quickly run out of memory for undo.
> >    
> > 
> > Then it should be allowed to proceed and commit.
> > 
> > Then mixing sync and async tables in a single transaction
> > shouldn't be allowed.
> > 
> > Imagine t1 is sync and t2 is async. tx1 changes t1 and t2, tx2
> > changes t2. tx1 is not confirmed and must be rolled back. But it can
> > not revert changes of tx2.
> > 
> > The spec should clarify that.

You conveniently skip this explanation of the problem - meaning
you don't intend to address it?

> > 
> > 3) One can quickly run out of memory for undo. Any sync
> >    transaction should be capped with a timeout to avoid OOMs. I
> >    don't know how many times I should repeat it. The only good
> >    solution for load control is in-memory WAL, which will allow to
> >    rollback all transactions as soon as network partitioning is
> >    detected.
> 
> How in-memry WAL can help save on _undo_ memory? 
> To rollback whatever amount of transactions one need to store the undo. 

I wrote earlier that it works as a natural failure detector and
throttling mechanism. If there is no quorum, we can see it immediately
by looking at the number of active subscribers of the in-memory WAL,
so we do not accumulate undo.
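
A sketch of that throttling idea, with hypothetical names (`subscribers`,
`begin_sync_txn`) - not the actual in-memory WAL API:

```
-- Before accumulating undo for a new sync transaction, look at how many
-- replicas are currently subscribed to the in-memory WAL; with fewer than
-- the quorum, fail fast instead of piling up undo records.
local QUORUM = 3

local function begin_sync_txn(in_memory_wal)
    if #in_memory_wal.subscribers < QUORUM then
        return nil, 'no quorum reachable: refuse the txn, nothing to undo later'
    end
    return { undo = {} }
end

print(begin_sync_txn({ subscribers = { 'r1', 'r2' } }))        -- nil + error
print(begin_sync_txn({ subscribers = { 'r1', 'r2', 'r3' } }))  -- a txn table
```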

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-12 17:47         ` Konstantin Osipov
@ 2020-05-13 21:34           ` Vladislav Shpilevoy
  2020-05-13 23:31             ` Konstantin Osipov
  0 siblings, 1 reply; 53+ messages in thread
From: Vladislav Shpilevoy @ 2020-05-13 21:34 UTC (permalink / raw)
  To: Konstantin Osipov, Sergey Ostanevich, tarantool-patches

Thanks for the discussion!

On 12/05/2020 19:47, Konstantin Osipov wrote:
> * Sergey Ostanevich <sergos@tarantool.org> [20/05/12 19:43]:
> 
>>> 1) It's unclear what happens here if async tx follows a sync tx.
>>>    Does it wait for the sync tx? This reduces availability for
>>
>> Definitely yes, unless we keep the 'dirty read' as it is at the moment
>> in memtx. This is the essence of the design, and it is temporary until 
>> the MVCC similar to the vinyl machinery appears. I intentionally didn't
>> include this big task into this RFC. 
>>
>> It will provide similar capabilities, although it will keep only
>> dependent transactions in the undo log. Also, it looks like it will fit
>> well into the machinery of this RFC. 
> 
> = reduced availability for all who have at least one sync space.
> 
> If different spaces have different quorum size = quorum size of
> the biggest group is effectively used for all spaces.
> 
> Replica-local transactions, e.g. those used by vinyl compaction, 
> are rolled back if there is no quorum.
> 
> What's the value of this?

There is an example where it leaves the database in an inconsistent
state, with half of a transaction applied. I don't know why Sergey
didn't add it. I propose that he extend the RFC with these examples,
since you are not the first person who finds this strange and wrong.
So clearly the RFC still does not explain this point diligently
enough.

>>>    async txs - so it's hardly acceptable. Besides, with
>>>    group=local spaces, one can quickly run out of memory for undo.
>>>   
>>>
>>> 3) One can quickly run out of memory for undo. Any sync
>>>    transaction should be capped with a timeout to avoid OOMs. I
>>>    don't know how many times I should repeat it. The only good
>>>    solution for load control is in-memory WAL, which will allow to
>>>    rollback all transactions as soon as network partitioning is
>>>    detected.
>>
>> How in-memry WAL can help save on _undo_ memory? 
>> To rollback whatever amount of transactions one need to store the undo. 
> 
> I wrote earlier that it works as a natural failure detector and
> throttling mechanism. If
> there is no quorum, we can see it immediately by looking at the
> number of active subscribers of the in-memory WAL, so do not
> accumulate undo.

Here we go again ...

Talking of throttling: without the in-memory WAL there is no need for
throttling. Everything is 'slow' by design already, as you think.

Talking of failure detection - what??? I don't get it. This is something new.
With or without the in-memory relay you can see if there is a quorum anyway.
This is a matter of the API of the replication and transaction modules, and
their interaction with each other, solved by txn_limbo in my branch.

But still, I don't see how knowing the number of subscribers helps with the
quorum. Subscriber presence does not add to quorums by itself. Anyway, every
transaction needs to be replicated before you can say that its quorum got
+1 replica ack.
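
To make that concrete, a minimal sketch of collecting per-transaction acks
into a quorum table - illustrative Lua with made-up names, not the txn_limbo
API:

```
-- Illustrative only: a quorum table keyed by LSN. The leader's own WAL
-- write counts as the first ack; a replica ack for LSN X also acks every
-- pending transaction with a smaller LSN, since replicas apply in order.
local QUORUM = 3                -- assumed static quorum size
local pending = {}              -- lsn -> number of acks collected

local function write_confirm(lsn)
    print('write Confirm for [LEADER_ID, ' .. lsn .. '] to WAL')
end

local function on_txn_written(lsn)
    pending[lsn] = 1
end

local function on_replica_ack(ack_lsn)
    local confirmed
    for lsn, acks in pairs(pending) do
        if lsn <= ack_lsn then
            pending[lsn] = acks + 1
            if pending[lsn] >= QUORUM and (not confirmed or lsn > confirmed) then
                confirmed = lsn
            end
        end
    end
    if confirmed then
        write_confirm(confirmed)   -- one Confirm covers all earlier pending txns
        for lsn in pairs(pending) do
            if lsn <= confirmed then pending[lsn] = nil end
        end
    end
end

on_txn_written(1); on_txn_written(2)
on_replica_ack(2)               -- 2 acks each: still below the quorum of 3
on_replica_ack(2)               -- 3 acks: Confirm covers both txns
```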

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-06 16:39       ` Sergey Ostanevich
  2020-05-06 18:44         ` Konstantin Osipov
@ 2020-05-13 21:36         ` Vladislav Shpilevoy
  2020-05-13 23:45           ` Konstantin Osipov
  1 sibling, 1 reply; 53+ messages in thread
From: Vladislav Shpilevoy @ 2020-05-13 21:36 UTC (permalink / raw)
  To: Sergey Ostanevich, Konstantin Osipov, tarantool-patches

Thanks for the discussion!

On 06/05/2020 18:39, Sergey Ostanevich wrote:
> Hi!
> 
> Thanks for review!
> 
>>>    |               |              |             |              |
>>>    |            [Quorum           |             |              |
>>>    |           achieved]          |             |              |
>>>    |               |              |             |              |
>>>    |         [TXN undo log        |             |              |
>>>    |           destroyed]         |             |              |
>>>    |               |              |             |              |
>>>    |               |---Confirm--->|             |              |
>>>    |               |              |             |              |
>>
>> What happens if writing Confirm to WAL fails? TXN und log record
>> is destroyed already. Will the server panic now on WAL failure,
>> even if it is intermittent?
> 
> I would like to have an example of intermittent WAL failure. Can it be
> other than problem with disc - be it space/availability/malfunction?
> 
> For all of those it should be resolved outside the DBMS anyways. So,
> leader should stop and report its problems to orchestrator/admins.
> 
> I would agree that undo log can be destroyed *after* the Confirm is
> landed to WAL - same is for replica.

Well, in fact you can't (or can you?), because it won't help. Once you
tried to write 'Confirm', it means you got the quorum. So now, in case
you fail, a new leader will write 'Confirm' for you when it sees
a quorum too. So the current leader has no right to write 'Rollback'
from this moment, from what I understand, because the transaction can
still be confirmed by a new leader later, if you fail before 'Rollback'
is replicated to all.

However, the same problem appears if you write 'Confirm' *successfully*.
The leader can still fail, and a newer leader will write 'Rollback' if it
doesn't collect the quorum again. I don't know what to do with that,
really. Probably nothing.

>>
>>>    |               |----------Confirm---------->|              |
>>
>> What happens if peers receive and maybe even write Confirm to their WALs
>> but local WAL write is lost after a restart?
> 
> Did you mean WAL write on leader as a local? Then we have a replica with
> a bigger LSN for the leader ID. 
> 
>> WAL is not synced, 
>> so we can easily lose the tail of the WAL. Tarantool will sync up
>> with all replicas on restart,
> 
> But at this point a new leader will be appointed - the old one is
> restarted. Then the Confirm message will arrive to the restarted leader 
> through a regular replication.
> 
>> but there will be no "Replication
>> OK" messages from them, so it wouldn't know that the transaction
>> is committed on them. How is this handled? We may end up with some
>> replicas confirming the transaction while the leader will roll it
>> back on restart. Do you suggest there is a human intervention on
>> restart as well?
>>
>>
>>>    |               |              |             |              |
>>>    |<---TXN Ok-----|              |       [TXN undo log        |
>>>    |               |              |         destroyed]         |
>>>    |               |              |             |              |
>>>    |               |              |             |---Confirm--->|
>>>    |               |              |             |              |
>>> ```
>>>
>>> The quorum should be collected as a table for a list of transactions
>>> waiting for quorum. The latest transaction that collects the quorum is
>>> considered as complete, as well as all transactions prior to it, since
>>> all transactions should be applied in order. Leader writes a 'confirm'
>>> message to the WAL that refers to the transaction's [LEADER_ID, LSN] and
>>> the confirm has its own LSN. This confirm message is delivered to all
>>> replicas through the existing replication mechanism.
>>>
>>> Replica should report a TXN application success to the leader via the
>>> IPROTO explicitly to allow leader to collect the quorum for the TXN.
>>> In case of application failure the replica has to disconnect from the
>>> replication the same way as it is done now. The replica also has to
>>> report its disconnection to the orchestrator. Further actions require
>>> human intervention, since failure means either technical problem (such
>>> as not enough space for WAL) that has to be resovled or an inconsistent
>>> state that requires rejoin.
>>
>>> As soon as leader appears in a situation it has not enough replicas
>>> to achieve quorum, the cluster should stop accepting any requests - both
>>> write and read.
>>
>> How does *the cluster* know the state of the leader and if it
>> doesn't, how it can possibly implement this? Did you mean
>> the leader should stop accepting transactions here? But how can
>> the leader know if it has not enough replicas during a read
>> transaction, if it doesn't contact any replica to serve a read?
> 
> I expect to have a disconnection trigger assigned to all relays so that
> disconnection will cause the number of replicas decrease. The quorum
> size is static, so we can stop at the very moment the number dives below.

This is a very dubious statement. In TCP a disconnect may be detected much
later than it actually happened. So to collect a quorum on something you
need to literally collect this quorum, with special WAL records, via the
network, and so on. A disconnect trigger does not help at all here.

Talking of the whole 'read-quorum' idea, I don't like it, because it makes
things unbearably harder to implement, and the nodes become much slower
and less available whenever any problem occurs.

I think reads should be allowed always, and from any node (except during
bootstrap, of course). After all, you have transactions for consistency. So
as long as replication respects transaction boundaries, every node is in a
consistent state. Maybe not all of them are in the same state, but every one
is consistent.

Honestly, I can't even imagine how it is possible to implement a completely
synchronous simultaneous cluster progression. It is impossible even in theory.
There always will be a time period when some nodes are further than the
others, at least because of network delays.

So either we allow reads from the master only, or we allow reads from
everywhere, and in that case nothing will save us from the possibility of
seeing different data on different nodes.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-12 15:55           ` Sergey Ostanevich
  2020-05-12 16:42             ` Konstantin Osipov
@ 2020-05-13 21:39             ` Vladislav Shpilevoy
  2020-05-13 23:54               ` Konstantin Osipov
  2020-05-14 20:38               ` Sergey Ostanevich
  1 sibling, 2 replies; 53+ messages in thread
From: Vladislav Shpilevoy @ 2020-05-13 21:39 UTC (permalink / raw)
  To: Sergey Ostanevich, Konstantin Osipov, tarantool-patches

Hi! Thanks for the discussion!

On 12/05/2020 17:55, Sergey Ostanevich wrote:
> On 06 мая 21:44, Konstantin Osipov wrote:
>> * Sergey Ostanevich <sergos@tarantool.org> [20/05/06 19:41]:
>>>>>    |               |              |             |              |
>>>>>    |            [Quorum           |             |              |
>>>>>    |           achieved]          |             |              |
>>>>>    |               |              |             |              |
>>>>>    |         [TXN undo log        |             |              |
>>>>>    |           destroyed]         |             |              |
>>>>>    |               |              |             |              |
>>>>>    |               |---Confirm--->|             |              |
>>>>>    |               |              |             |              |
>>>>
>>>> What happens if writing Confirm to WAL fails? TXN und log record
>>>> is destroyed already. Will the server panic now on WAL failure,
>>>> even if it is intermittent?
>>>
>>> I would like to have an example of intermittent WAL failure. Can it be
>>> other than problem with disc - be it space/availability/malfunction?
>>
>> For SAN disks it can simply be a networking issue. The same is
>> true for any virtual filesystem in the cloud. For local disks it
>> is most often out of space, but this is not an impossible event.
> 
> The SANDisk is an SSD vendor. I bet you mean NAS - network array
> storage, isn't it? Then I see no difference in WAL write into NAS in
> current schema - you will catch a timeout, WAL will report failure,
> replica stops.
> 
>>
>>> For all of those it should be resolved outside the DBMS anyways. So,
>>> leader should stop and report its problems to orchestrator/admins.
>>
>> Sergey, I understand that RAFT spec is big and with this spec you
>> try to split it into manageable parts. The question is how useful
>> is this particular piece. I'm trying to point out that "the leader
>> should stop" is not a silver bullet - especially since each such
>> stop may mean a rejoin of some other node. The purpose of sync
>> replication is to provide consistency without reducing
>> availability (i.e. make progress as long as the quorum
>> of nodes make progress). 
> 
> I'm not sure if we're talking about the same RAFT - mine is "In Search
> of an Understandable Consensus Algorithm (Extended Version)" from
> Stanford as of May 2014. And it is 15 pages - including references,
> conclusions and intro. Seems not that big.

15 pages of tightly packed theory is a lot. And it is especially big when
it comes to applying it to a real project, with existing infrastructure,
and all. Just my IMHO. I remember implementing SWIM - it is smaller than
RAFT. Much smaller and simpler, and yet it took a year to implement it and
cover all the things described in the paper.

This is not as simple as it looks when it comes to edge cases. This is
why the whole sync replication frustrates me more than anything else
before, and why I am so reluctant to do anything with it.

The RFC mostly covers the normal operation; here I agree with Kostja. But
the normal operation is not that interesting. Failures are much more
important.

> Although, most of it is dedicated to the leader election itself, which
> we intentionally put aside from this RFC. It is written in the very
> beginning and I empasized this by explicit mentioning of it.

And still there will be leader election. Even though not ours for now.
And Tarantool should provide an API and instructions so that external
applications can follow them and do the election.

Usually in RFCs we describe the API. With arguments, behaviour, and all.

>> The current spec, suggesting there should be a leader stop in case
>> of most errors, reduces availability significantly, and doesn't
>> make external coordinator job any easier - it still has to follow to
>> the letter the prescriptions of RAFT. 
>>
>>>
>>>>
>>>>>    |               |----------Confirm---------->|              |
>>>>
>>>> What happens if peers receive and maybe even write Confirm to their WALs
>>>> but local WAL write is lost after a restart?
>>>
>>> Did you mean WAL write on leader as a local? Then we have a replica with
>>> a bigger LSN for the leader ID. 
>>
>>>> WAL is not synced, 
>>>> so we can easily lose the tail of the WAL. Tarantool will sync up
>>>> with all replicas on restart,
>>>
>>> But at this point a new leader will be appointed - the old one is
>>> restarted. Then the Confirm message will arrive to the restarted leader 
>>> through a regular replication.
>>
>> This assumes that restart is guaranteed to be noticed by the
>> external coordinator and there is an election on every restart.
> 
> Sure yes, if it restarted - then connection lost can't be unnoticed by
> anyone, be it coordinator or cluster.

Here comes another problem. Disconnect and restart have nothing to do with
each other. The coordinator can lose connection without the peer leader
restarting. Just because it is a network. Anything can happen. Moreover, while
the coordinator does not have a connection, the leader can restart multiple
times.

We can't tell the coordinator to rely on connectivity as a restart signal.

>>>> but there will be no "Replication
>>>> OK" messages from them, so it wouldn't know that the transaction
>>>> is committed on them. How is this handled? We may end up with some
>>>> replicas confirming the transaction while the leader will roll it
>>>> back on restart. Do you suggest there is a human intervention on
>>>> restart as well?
>>>>
>>>>
>>>>>    |               |              |             |              |
>>>>>    |<---TXN Ok-----|              |       [TXN undo log        |
>>>>>    |               |              |         destroyed]         |
>>>>>    |               |              |             |              |
>>>>>    |               |              |             |---Confirm--->|
>>>>>    |               |              |             |              |
>>>>> ```
>>>>>
>>>>> The quorum should be collected as a table for a list of transactions
>>>>> waiting for quorum. The latest transaction that collects the quorum is
>>>>> considered as complete, as well as all transactions prior to it, since
>>>>> all transactions should be applied in order. Leader writes a 'confirm'
>>>>> message to the WAL that refers to the transaction's [LEADER_ID, LSN] and
>>>>> the confirm has its own LSN. This confirm message is delivered to all
>>>>> replicas through the existing replication mechanism.
>>>>>
>>>>> Replica should report a TXN application success to the leader via the
>>>>> IPROTO explicitly to allow leader to collect the quorum for the TXN.
>>>>> In case of application failure the replica has to disconnect from the
>>>>> replication the same way as it is done now. The replica also has to
>>>>> report its disconnection to the orchestrator. Further actions require
>>>>> human intervention, since failure means either technical problem (such
>>>>> as not enough space for WAL) that has to be resovled or an inconsistent
>>>>> state that requires rejoin.
>>>>
>>>>> As soon as leader appears in a situation it has not enough replicas
>>>>> to achieve quorum, the cluster should stop accepting any requests - both
>>>>> write and read.
>>>>
>>>> How does *the cluster* know the state of the leader and if it
>>>> doesn't, how it can possibly implement this? Did you mean
>>>> the leader should stop accepting transactions here? But how can
>>>> the leader know if it has not enough replicas during a read
>>>> transaction, if it doesn't contact any replica to serve a read?
>>>
>>> I expect to have a disconnection trigger assigned to all relays so that
>>> disconnection will cause the number of replicas decrease. The quorum
>>> size is static, so we can stop at the very moment the number dives below.
>>
>> What happens between the event the leader is partitioned away and
>> a new leader is elected?
>>
>> The leader may be unaware of the events and serve a read just
>> fine.
> 
> As it is stated 20 lines above: 
>>>>> As soon as leader appears in a situation it has not enough
>>>>> replicas
>>>>> to achieve quorum, the cluster should stop accepting any
>>>>> requests - both
>>>>> write and read.
> 
> So it will not serve. 

This breaks compatibility, since now an orphan node is perfectly able
to serve reads. The cluster can't just stop doing everything if the
quorum is lost. Stop writes - yes, since the quorum is lost anyway. But
reads do not need a quorum.

If you say reads need a quorum, then they would need to go through the
WAL, collect confirmations, and all.

>> So at least you can't say the leader shouldn't be serving reads
>> without quorum - because the only way to achieve it is to collect
>> a quorum of responses to reads as well.
> 
> The leader lost connection to the (N-Q)+1 repllicas out of the N in
> cluster with a quorum of Q == it stops serving anything. So the quorum
> criteria is there: no quorum - no reads. 

Connection count tells nothing. Network connectivity is not a reliable
source of information. Only messages and persistent data are reliable
(to a certain extent).

>>>>> The reason for this is that replication of transactions
>>>>> can achieve quorum on replicas not visible to the leader. On the other
>>>>> hand, leader can't achieve quorum with available minority. Leader has to
>>>>> report the state and wait for human intervention. There's an option to
>>>>> ask leader to rollback to the latest transaction that has quorum: leader
>>>>> issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN
>>>>> is of the first transaction in the leader's undo log. The rollback
>>>>> message replicated to the available cluster will put it in a consistent
>>>>> state. After that configuration of the cluster can be updated to
>>>>> available quorum and leader can be switched back to write mode.
>>>>
>>>> As you should be able to conclude from restart scenario, it is
>>>> possible a replica has the record in *confirmed* state but the
>>>> leader has it in pending state. The replica will not be able to
>>>> roll back then. Do you suggest the replica should abort if it
>>>> can't rollback? This may lead to an avalanche of rejoins on leader
>>>> restart, bringing performance to a halt.
>>>
>>> No, I declare replica with biggest LSN as a new shining leader. More
>>> than that, new leader can (so far it will be by default) finalize the
>>> former leader life's work by replicating txns and appropriate confirms.
>>
>> Right, this also assumes the restart is noticed, so it follows the
>> same logic.
> 
> How a restart can be unnoticed, if it causes disconnection?

Disconnection has nothing to do with restart. The coordinator itself may
restart. Or it may lose connection to the leader temporarily. Or the
leader may lose it without any restarts.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-06 18:55     ` Konstantin Osipov
  2020-05-06 19:10       ` Konstantin Osipov
@ 2020-05-13 21:42       ` Vladislav Shpilevoy
  2020-05-14  0:05         ` Konstantin Osipov
  1 sibling, 1 reply; 53+ messages in thread
From: Vladislav Shpilevoy @ 2020-05-13 21:42 UTC (permalink / raw)
  To: Konstantin Osipov, Sergey Ostanevich, tarantool-patches

Thanks for the discussion!

On 06/05/2020 20:55, Konstantin Osipov wrote:
> * Sergey Ostanevich <sergos@tarantool.org> [20/04/30 17:51]:
> 
> A few more issues:
> 
> - the spec assumes there is a full mesh. In any other
>   topology electing a leader based on the longest wal can easily
>   deadlock. Yet it provides no protection against non-full-mesh
>   setups. Currently the server can't even detect that this is not
>   a full-mesh setup, so can't check if the precondition for this
>   to work correctly is met.

Yes, this is a very unstable construction. But we have not come up with a
solution yet which would protect against an accidental non-fullmesh. For
example, how will it work when I add a new node? If non-fullmesh is
forbidden, the new node just can't be added ever, because this can't be
done on all nodes simultaneously.

> - the spec assumes that quorum is identical to the
>   number of replicas, and the number of replicas is stable across
>   cluster life time. Can I have quorum=2 while the number of
>   replicas is 4? Am I allowed to increase the number of replicas
>   online? What happens when a replica is added,
>   how exactly and starting from which transaction is the leader
>   required to collect a bigger quorum?

Quorum <= number of replicas. It is a parameter, just like
replication_connect_quorum.

I think you are allowed to add new replicas. When a replica is added,
it goes through the normal join process.

> - the same goes for removing a replica. How is the quorum reduced?

The node is just removed, I guess. If the total number of nodes becomes
less than the quorum, obviously no transactions will be served.

However, what to do with the existing pending transactions, which already
counted the removed replica towards their quorums? Should they be
decremented?

All that I am talking about here are guesses, which should be clarified in
the RFC in an ideal world, of course.

Tbh, we discussed the sync replication for many hours in voice, and it is
a surprise that all of it fit into such a small update of the RFC. Even
though it didn't really fit, since we obviously still haven't clarified
many things. Especially the exact API.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-13 21:34           ` Vladislav Shpilevoy
@ 2020-05-13 23:31             ` Konstantin Osipov
  0 siblings, 0 replies; 53+ messages in thread
From: Konstantin Osipov @ 2020-05-13 23:31 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/05/14 00:37]:
> >>> 3) One can quickly run out of memory for undo. Any sync
> >>>    transaction should be capped with a timeout to avoid OOMs. I
> >>>    don't know how many times I should repeat it. The only good
> >>>    solution for load control is in-memory WAL, which will allow to
> >>>    rollback all transactions as soon as network partitioning is
> >>>    detected.
> >>
> >> How in-memry WAL can help save on _undo_ memory? 
> >> To rollback whatever amount of transactions one need to store the undo. 
> > 
> > I wrote earlier that it works as a natural failure detector and
> > throttling mechanism. If
> > there is no quorum, we can see it immediately by looking at the
> > number of active subscribers of the in-memory WAL, so do not
> > accumulate undo.
> 
> Here we go again ...
> 
> Talking of throttling. Without in-memory WAL no need for throttling. All is
> 'slow' by design already, as you think.

What is the limit for transactions in the txn_limbo list? How does
this limit work? What about the fibers, which are pinned as long
as the transaction is not committed?
> 
> Talking of failure detection - what??? I don't get it. This is something new.
> With in-memory relay or without you anyway can see if there is a quorum.

How do you "see" it? You write to the WAL and wait for acks. You
could add a wait timeout and assume there is no quorum if there
are no acks within the timeout. This is not the best strategy, but
there is no other. The spec doesn't even say that; it simply says
that somehow the lack of quorum is detected, but how it is detected is
not clear.

With in-memory WAL you can afford to wait longer if you have space
in the ring buffer, and you know immediately if you shouldn't wait
because you see that the ring buffer is full and the majority of
subscribers are behind the start of the buffer.
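
For illustration only, a toy version of that check could look like the
following (plain C, not the actual in-memory relay code; every name here
is made up):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy model of an in-memory WAL ring buffer with relay positions. */
struct inmem_wal {
	int64_t first_lsn;    /* oldest record still kept in the buffer */
	int64_t last_lsn;     /* newest record written */
	int64_t capacity;     /* how many records the buffer can hold */
	int subscriber_count; /* number of relays reading from the buffer */
	int64_t sent_lsn[16]; /* per-subscriber send position */
};

/*
 * Fast "no quorum" check: the buffer is full and fewer than `quorum`
 * subscribers are still inside it, so waiting longer cannot help and
 * undo should stop accumulating.
 */
static bool
quorum_is_hopeless(const struct inmem_wal *w, int quorum)
{
	bool full = w->last_lsn - w->first_lsn + 1 >= w->capacity;
	int inside = 0;
	for (int i = 0; i < w->subscriber_count; i++)
		if (w->sent_lsn[i] >= w->first_lsn)
			inside++;
	return full && inside < quorum;
}

int
main(void)
{
	struct inmem_wal w = {
		.first_lsn = 100, .last_lsn = 199, .capacity = 100,
		.subscriber_count = 2, .sent_lsn = { 42, 150 },
	};
	/* Only one of the two required subscribers is still inside the buffer. */
	printf("no quorum possible: %d\n", quorum_is_hopeless(&w, 2));
	return 0;
}
```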


> This is a matter of API of replication and transaction modules, and their
> interaction with each other, solved by txn_limbo in my branch.

How is it "solved"?

> But still, I don't see how knowing number of subscribers helps with the
> quorum. Subscriber presence does not add to quorums by itself. Anyway every
> transaction needs to be replicated before you can say that its quorum got
> +1 replica ack.

It helps to quickly see the absence of the quorum, not its presence.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-13 21:36         ` Vladislav Shpilevoy
@ 2020-05-13 23:45           ` Konstantin Osipov
  0 siblings, 0 replies; 53+ messages in thread
From: Konstantin Osipov @ 2020-05-13 23:45 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/05/14 00:37]:
> > Thanks for review!
> > 
> >>>    |               |              |             |              |
> >>>    |            [Quorum           |             |              |
> >>>    |           achieved]          |             |              |
> >>>    |               |              |             |              |
> >>>    |         [TXN undo log        |             |              |
> >>>    |           destroyed]         |             |              |
> >>>    |               |              |             |              |
> >>>    |               |---Confirm--->|             |              |
> >>>    |               |              |             |              |
> >>
> >> What happens if writing Confirm to WAL fails? TXN und log record
> >> is destroyed already. Will the server panic now on WAL failure,
> >> even if it is intermittent?
> > 
> > I would like to have an example of intermittent WAL failure. Can it be
> > other than problem with disc - be it space/availability/malfunction?
> > 
> > For all of those it should be resolved outside the DBMS anyways. So,
> > leader should stop and report its problems to orchestrator/admins.
> > 
> > I would agree that undo log can be destroyed *after* the Confirm is
> > landed to WAL - same is for replica.
> 
> Well, in fact you can't (or can you?). Because it won't help. Once you
> tried to write 'Confirm', it means you got the quorum. So now in case
> you will fail, a new leader will write 'Confirm' for you, when will see
> a quorum too. So the current leader has no right to write 'Rollback'
> from this moment, from what I understand. Because it still can be
> confirmed by a new leader later, if you fail before 'Rollback' is
> replicated to all.
> 
> However the same problem appears, if you write 'Confirm' *successfully*.
> Still the leader can fail, and a newer leader will write 'Rollback' if
> won't collect the quorum again. Don't know what to do with that really.
> Probably nothing.

Maybe consult with the raft spec? 

The new leader is guaranteed to see the transaction since it has
reached the majority of replicas. So it will definitely write
"confirm" for it. The reason I asked the question is I want the
case of intermittent failures be described in the spec.
For example, is "confirm" a cbus message, then if there is a
cascading rollback of the batch it is part of, it can be rolled
back. I would like to see all these scenarios covered in the spec.
If one of them ends with panic, I would like to understand how the
external coordinator is going to resolve the new election. 
Raft has answers for all of it.



> >>> As soon as leader appears in a situation it has not enough replicas
> >>> to achieve quorum, the cluster should stop accepting any requests - both
> >>> write and read.
> >>
> >> How does *the cluster* know the state of the leader and if it
> >> doesn't, how it can possibly implement this? Did you mean
> >> the leader should stop accepting transactions here? But how can
> >> the leader know if it has not enough replicas during a read
> >> transaction, if it doesn't contact any replica to serve a read?
> > 
> > I expect to have a disconnection trigger assigned to all relays so that
> > disconnection will cause the number of replicas decrease. The quorum
> > size is static, so we can stop at the very moment the number dives below.
> 
> This is a very dubious statement. In TCP disconnect may be detected much
> later, than it happened. So to collect a quorum on something you need to
> literally collect this quorum, with special WAL records, via network, and
> all. A disconnect trigger does not help at all here.

Erhm, thanks. 

> Talking of the whole 'read-quorum' idea, I don't like it. Because this
> really makes things unbearably harder to implement, the nodes become
> much slower and less available in terms of any problems.
> 
> I think reads should be allowed always, and from any node (except during
> bootstrap, of course). After all, you have transactions for consistency. So
> as far as replication respects transaction boundaries, every node is in a
> consistent state. Maybe not all of them are in the same state, but every one
> is consistent.

In memtx, you read dirty, uncommitted data by default. This was OK for
single-node transactions, since the only chances for them to be rolled back
were running out of space or a disk failure, which are extremely rare. Now
you really read dirty data, because it can easily be rolled back due to a
lack of quorum or a re-election.
So it's a much bigger deal.


> Honestly, I can't even imagine, how is it possible to implement a completely
> synchronous simultaneous cluster progression. It is impossible even in theory.
> There always will be a time period, when some nodes are further than the
> others. At least because of network delays.
> 
> So either we allows reads from master only, or we allow reads from everywhere,
> and in that case nothing will save from a possibility of seeing different
> data on different nodes.

This is why there are many consistency models out there (just
google consistency models in distributed systems), and the minor
details are important. It's indeed hard to implement the strictest model
(serial), but it is also often unnecessary, and there is a consensus among
relational databases about which anomalies are acceptable and which are not.

More specifically, I think for Tarantool sync replication we
should aim at read committed. The spec should say it in no
uncertain terms and explain how it is achieved.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-13 21:39             ` Vladislav Shpilevoy
@ 2020-05-13 23:54               ` Konstantin Osipov
  2020-05-14 20:38               ` Sergey Ostanevich
  1 sibling, 0 replies; 53+ messages in thread
From: Konstantin Osipov @ 2020-05-13 23:54 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/05/14 00:42]:
> > Sure yes, if it restarted - then connection lost can't be unnoticed by
> > anyone, be it coordinator or cluster.
> 
> Here comes another problem. Disconnect and restart have nothing to do with
> each other. The coordinator can loose connection without the peer leader
> restart. Just because it is network. Anything can happen. Moreover, while
> the coordinator does not have a connection, the leader can restart multiple
> times.

yes. 

> We can't tell the coordinator rely on connectivity as a restart signal.

Well, we could demand that the leader always demotes itself after a
restart. But the spec should be explicit about it and explain how
the election happens in this case, because the leader may still have the
longest WAL (but with some junk in it, thanks to lost confirms),
so after a restart it may need to reconcile its WAL with the
majority, fetching missing records back.

Once again, RAFT is very explicit about this. By default it
requires that the leader's commit log is durable, i.e.
wal_mode=sync. This would kill performance. Implementations exist
which run in wal_mode=write (Cassandra is one of them), but they know how to
repair the log at the leader before proceeding with the next
transaction. The reason I brought this up is that it's extremely
tricky, and confusing as hell if the election is external (I agree
there should be an API, or better yet, abandon the idea of an
external election, just have no election for now at all, assume
the leader never changes, and only provide durability in a
multi-master config), with no consistency guarantees (except
eventual ones).

> > How a restart can be unnoticed, if it causes disconnection?
> 
> Disconnection has nothing to do with restart. The coordinator itself may
> restart. Or it may loose connection to the leader temporarily. Or the
> leader may loose it without any restarts.

and yes.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-13 21:42       ` Vladislav Shpilevoy
@ 2020-05-14  0:05         ` Konstantin Osipov
  0 siblings, 0 replies; 53+ messages in thread
From: Konstantin Osipov @ 2020-05-14  0:05 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/05/14 00:47]:
> > A few more issues:
> > 
> > - the spec assumes there is a full mesh. In any other
> >   topology electing a leader based on the longest wal can easily
> >   deadlock. Yet it provides no protection against non-full-mesh
> >   setups. Currently the server can't even detect that this is not
> >   a full-mesh setup, so can't check if the precondition for this
> >   to work correctly is met.
> 
> Yes, this is a very unstable construction. But we failed to come up
> with a solution right now, which would protect against accidental
> non-fullmesh. For example, how will it work, when I add a new node?
> If non-fullmesh is forbidden, the new node just can't be added ever,
> because this can't be done on all nodes simultaneously.

Again, the answer is present in the raft spec. The node is added in two
steps: the first step commits the "add node" event to the durable
state of the entire group, the second step (which is also a raft
transaction) enacts the new node. This could be achieved in a more
or less straightforward manner if _cluster were a sync table with
replication group = all members of the cluster. But as I said, I
can't imagine this is possible with an external coordinator, since
it may not be available during boot.

Regarding detecting the full mesh, remember the task I created for
using swim to discover members and bring non-full-mesh setups to
full-mesh automatically? Is the reason for this task to exist
clear now? Is it clear now why I asked you (multiple times) to
begin working on sync replication by adding built-in swim
instances on every replica and using them, instead of the current
replication heartbeats, for failure detection? I believe there was
a task somewhere for it, too. 


> > - the spec assumes that quorum is identical to the
> >   number of replicas, and the number of replicas is stable across
> >   cluster life time. Can I have quorum=2 while the number of
> >   replicas is 4? Am I allowed to increase the number of replicas
> >   online? What happens when a replica is added,
> >   how exactly and starting from which transaction is the leader
> >   required to collect a bigger quorum?
> 
> Quorum <= number of replicas. It is a parameter, just like
> replication_connect_quorum.

I wrote in a comment to the task that it'd be even better if we
listed node uuids as group members and assigned a group to a space
explicitly, so that it's not just a number of replicas, but specific
replicas identified by their uuids.

The thing is, it's vague in the spec. The spec has to be explicit
about all box.schema API changes, because they will define legacy
that will be hard to deal with later.

> I think you are allowed to add new replicas. When a replica is added,
> it goes through the normal join process.

At what point does it join the group and can ACK, i.e. become part of
a quorum? That's the question I wanted to be written down
explicitly in this document. RAFT has an answer for it.

> > - the same goes for removing a replica. How is the quorum reduced?
> 
> Node is just removed, I guess. If total number of nodes becomes less
> than quorum, obviously no transactions will be served.

Other vendors support 3 different scenarios here:
- it can be down for maintenance. In our terms, it means it is
  simply shut down, without changes to _cluster or space settings
- it can be removed forever, in that case an admin may want to
  reduce the quorum size. 
- it can be replaced.

with box.schema.group API all 3 cases can be translated to API
calls on the group itself. 

e.g. it would be possible to say
box.schema.group.groupname.remove(uuid)
box.schema.group.groupname.replace(old_uuid, new_uuid).

We don't need to implement it right away, but we must provision
for these operations in the spec, and at least have a clue how
they will be handled in the future.

> However what to do with the existing pending transactions, which
> already accounted the removed replica in their quorums? Should they be
> decremented?
> 
> All what I am talking here are guesses. Which should be clarified in the
> RFC in the ideal world, of course.
> 
> Tbh, we discussed the sync replication for may hours in voice, and this
> is a surprise, that all of them fit into such a small update of the RFC.
> Even though it didn't fit. Since we obviously still didn't clarify many
> things. Especially exact API look.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-13 21:39             ` Vladislav Shpilevoy
  2020-05-13 23:54               ` Konstantin Osipov
@ 2020-05-14 20:38               ` Sergey Ostanevich
  2020-05-20 20:59                 ` Sergey Ostanevich
  1 sibling, 1 reply; 53+ messages in thread
From: Sergey Ostanevich @ 2020-05-14 20:38 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

Hi!

> >> Sergey, I understand that RAFT spec is big and with this spec you
> >> try to split it into manageable parts. The question is how useful
> >> is this particular piece. I'm trying to point out that "the leader
> >> should stop" is not a silver bullet - especially since each such
> >> stop may mean a rejoin of some other node. The purpose of sync
> >> replication is to provide consistency without reducing
> >> availability (i.e. make progress as long as the quorum
> >> of nodes make progress). 
> > 
> > I'm not sure if we're talking about the same RAFT - mine is "In Search
> > of an Understandable Consensus Algorithm (Extended Version)" from
> > Stanford as of May 2014. And it is 15 pages - including references,
> > conclusions and intro. Seems not that big.
> 
> 15 pages of tightly packed theory is a big piece of data. And especially
> big, when it comes to application to a real project, with existing
> infrastructure, and all. Just my IMHO. I remember implementing SWIM - it
> is smaller than RAFT. Much smaller and simpler, and yet it took year to
> implement it, and cover all things described in the paper.

I won't object to that, and this was the reason not to take RAFT as is
and implement it in full over the next 2-3 years. That's why we had the
very first part of the RFC describing what it tries to address and what
it does not.

> 
> This is not as simple as it looks, when it comes to edge cases. This is
> why the whole sync replication frustrates me more than anything else
> before, and why I am so reluctant to doing anything with it.
> 
> The RFC mostly covers the normal operation, here I agree with Kostja. But
> the normal operation is not that interesting. Failures are much more
> important.

Definitely, and I expect to follow up with more functionality on top of it.
I believe it will be easier to do if the start is as small a change to the
existing code base as possible, which I also try to follow.

> 
> > Although, most of it is dedicated to the leader election itself, which
> > we intentionally put aside from this RFC. It is written in the very
> > beginning and I empasized this by explicit mentioning of it.
> 
> And still there will be leader election. Even though not ours for now.
> And Tarantool should provide API and instructions so as external
> applications could follow them and do the election.
> 
> Usually in RFCs we describe API. With arguments, behaviour, and all.

That is something I believe should be done after we agree on the whole
idea, such as the confirm entry in the WAL for sync transactions that
appeared there earlier. Otherwise we can get very deep into the details,
spending time on the API definition while the idea itself can turn out to
be wrong.

I believe that was a common ground to start from, but we immediately went
into a discussion of so many details I tried to keep away from before we
agree on the key parts, such as WAL consistency or quorum collection.

> 
> >> The current spec, suggesting there should be a leader stop in case
> >> of most errors, reduces availability significantly, and doesn't
> >> make external coordinator job any easier - it still has to follow to
> >> the letter the prescriptions of RAFT. 
> >>
> >>>
> >>>>
> >>>>>    |               |----------Confirm---------->|              |
> >>>>
> >>>> What happens if peers receive and maybe even write Confirm to their WALs
> >>>> but local WAL write is lost after a restart?
> >>>
> >>> Did you mean WAL write on leader as a local? Then we have a replica with
> >>> a bigger LSN for the leader ID. 
> >>
> >>>> WAL is not synced, 
> >>>> so we can easily lose the tail of the WAL. Tarantool will sync up
> >>>> with all replicas on restart,
> >>>
> >>> But at this point a new leader will be appointed - the old one is
> >>> restarted. Then the Confirm message will arrive to the restarted leader 
> >>> through a regular replication.
> >>
> >> This assumes that restart is guaranteed to be noticed by the
> >> external coordinator and there is an election on every restart.
> > 
> > Sure yes, if it restarted - then connection lost can't be unnoticed by
> > anyone, be it coordinator or cluster.
> 
> Here comes another problem. Disconnect and restart have nothing to do with
> each other. The coordinator can loose connection without the peer leader
> restart. Just because it is network. Anything can happen. Moreover, while
> the coordinator does not have a connection, the leader can restart multiple
> times.

Definitely there should be higher-level functionality to support some
sort of membership protocol, such as SWIM or RAFT itself. But the
introduction of it should not affect the basic principles we have to
agree upon.

> 
> We can't tell the coordinator rely on connectivity as a restart signal.
> 
> >>>> but there will be no "Replication
> >>>> OK" messages from them, so it wouldn't know that the transaction
> >>>> is committed on them. How is this handled? We may end up with some
> >>>> replicas confirming the transaction while the leader will roll it
> >>>> back on restart. Do you suggest there is a human intervention on
> >>>> restart as well?
> >>>>
> >>>>
> >>>>>    |               |              |             |              |
> >>>>>    |<---TXN Ok-----|              |       [TXN undo log        |
> >>>>>    |               |              |         destroyed]         |
> >>>>>    |               |              |             |              |
> >>>>>    |               |              |             |---Confirm--->|
> >>>>>    |               |              |             |              |
> >>>>> ```
> >>>>>
> >>>>> The quorum should be collected as a table for a list of transactions
> >>>>> waiting for quorum. The latest transaction that collects the quorum is
> >>>>> considered as complete, as well as all transactions prior to it, since
> >>>>> all transactions should be applied in order. Leader writes a 'confirm'
> >>>>> message to the WAL that refers to the transaction's [LEADER_ID, LSN] and
> >>>>> the confirm has its own LSN. This confirm message is delivered to all
> >>>>> replicas through the existing replication mechanism.
> >>>>>
> >>>>> Replica should report a TXN application success to the leader via the
> >>>>> IPROTO explicitly to allow leader to collect the quorum for the TXN.
> >>>>> In case of application failure the replica has to disconnect from the
> >>>>> replication the same way as it is done now. The replica also has to
> >>>>> report its disconnection to the orchestrator. Further actions require
> >>>>> human intervention, since failure means either technical problem (such
> >>>>> as not enough space for WAL) that has to be resovled or an inconsistent
> >>>>> state that requires rejoin.
> >>>>
> >>>>> As soon as leader appears in a situation it has not enough replicas
> >>>>> to achieve quorum, the cluster should stop accepting any requests - both
> >>>>> write and read.
> >>>>
> >>>> How does *the cluster* know the state of the leader and if it
> >>>> doesn't, how it can possibly implement this? Did you mean
> >>>> the leader should stop accepting transactions here? But how can
> >>>> the leader know if it has not enough replicas during a read
> >>>> transaction, if it doesn't contact any replica to serve a read?
> >>>
> >>> I expect to have a disconnection trigger assigned to all relays so that
> >>> disconnection will cause the number of replicas decrease. The quorum
> >>> size is static, so we can stop at the very moment the number dives below.
> >>
> >> What happens between the event the leader is partitioned away and
> >> a new leader is elected?
> >>
> >> The leader may be unaware of the events and serve a read just
> >> fine.
> > 
> > As it is stated 20 lines above: 
> >>>>> As soon as leader appears in a situation it has not enough
> >>>>> replicas
> >>>>> to achieve quorum, the cluster should stop accepting any
> >>>>> requests - both
> >>>>> write and read.
> > 
> > So it will not serve. 
> 
> This breaks compatibility, since now an orphan node is perfectly able
> to serve reads. The cluster can't just stop doing everything, if the
> quorum is lost. Stop writes - yes, since the quorum is lost anyway. But
> reads do not need a quorum.
> 
> If you say reads need a quorum, then they would need to go through WAL,
> collect confirmations, and all.

The reads should not be inconsistent - otherwise the cluster will keep
answering either A or B for the same request. And in case we lost the
quorum, we can't say for sure that all instances will answer the same.

As we discussed before, if the leader appears in the minor part of the
cluster, it can't issue a rollback for all unconfirmed txns, since the
majority will re-elect a leader who will collect the quorum for them. That
means we will end up in a state where the cluster is split in two. So the
minor part should stop. Am I wrong here?

> 
> >> So at least you can't say the leader shouldn't be serving reads
> >> without quorum - because the only way to achieve it is to collect
> >> a quorum of responses to reads as well.
> > 
> > The leader lost connection to the (N-Q)+1 repllicas out of the N in
> > cluster with a quorum of Q == it stops serving anything. So the quorum
> > criteria is there: no quorum - no reads. 
> 
> Connection count tells nothing. Network connectivity is not a reliable
> source of information. Only messages and persistent data are reliable
> (to certain extent).

Well, persistent data can't help obtain a quorum if there's no connection
to the replicas that should contribute to the quorum.
Correct me if I'm wrong: in case no quorum is available, we can't guarantee
that the data is stored on at least <quorum> servers. That means the
cluster is not operable.

> 
> >>>>> The reason for this is that replication of transactions
> >>>>> can achieve quorum on replicas not visible to the leader. On the other
> >>>>> hand, leader can't achieve quorum with available minority. Leader has to
> >>>>> report the state and wait for human intervention. There's an option to
> >>>>> ask leader to rollback to the latest transaction that has quorum: leader
> >>>>> issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN
> >>>>> is of the first transaction in the leader's undo log. The rollback
> >>>>> message replicated to the available cluster will put it in a consistent
> >>>>> state. After that configuration of the cluster can be updated to
> >>>>> available quorum and leader can be switched back to write mode.
> >>>>
> >>>> As you should be able to conclude from restart scenario, it is
> >>>> possible a replica has the record in *confirmed* state but the
> >>>> leader has it in pending state. The replica will not be able to
> >>>> roll back then. Do you suggest the replica should abort if it
> >>>> can't rollback? This may lead to an avalanche of rejoins on leader
> >>>> restart, bringing performance to a halt.
> >>>
> >>> No, I declare replica with biggest LSN as a new shining leader. More
> >>> than that, new leader can (so far it will be by default) finalize the
> >>> former leader life's work by replicating txns and appropriate confirms.
> >>
> >> Right, this also assumes the restart is noticed, so it follows the
> >> same logic.
> > 
> > How a restart can be unnoticed, if it causes disconnection?
> 
> Disconnection has nothing to do with restart. The coordinator itself may
> restart. Or it may loose connection to the leader temporarily. Or the
> leader may loose it without any restarts.

But how do we detect it right now in Tarantool? Is there any machinery?
I suppose we can simply rely on the same machinery, at least to test the
minimal - and 'normally operating' - first approach to the problem.


So, thank you for all the comments and please find my updated RFC below.

Sergos.

---

* **Status**: In progress
* **Start date**: 31-03-2020
* **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\>
* **Issues**: https://github.com/tarantool/tarantool/issues/4842

## Summary

The aim of this RFC is to address the following list of problems
formulated at MRG planning meeting:
  - protocol backward compatibility to enable cluster upgrade w/o
    downtime
  - consistency of data on replica and leader
  - switch from leader to replica without data loss
  - up to date replicas to run read-only requests
  - ability to switch async replicas into sync ones and vice versa
  - guarantee of rollback on leader and sync replicas
  - simplicity of cluster orchestration

What this RFC is not:

  - high availability (HA) solution with automated failover, roles
    assignments and so on
  - master-master configuration support

## Background and motivation

There are a number of known implementations of consistent data presence in
a Tarantool cluster. They can be commonly described as the "wait for LSN"
technique. The biggest issue with this technique is the absence of
rollback guarantees at the replica in case of a transaction failure on the
master or some of the replicas in the cluster.

To provide such capabilities, new functionality should be introduced in
the Tarantool core, with the requirements mentioned before - backward
compatibility and ease of cluster orchestration.

## Detailed design

### Quorum commit

The main idea behind the proposal is to reuse the existing machinery as
much as possible. It ensures that functionality well tested and proven
across many instances in MRG and beyond is reused. The transaction rollback
mechanism is in place and works for a WAL write failure. If we substitute
the WAL success with a new condition, named 'quorum' later in this
document, then no changes to the machinery are needed. The same is true
for the snapshot machinery that allows creating a copy of the database in
memory for the whole period of the snapshot file write. Adding the quorum
here also minimizes changes.

Currently replication is represented by the following scheme:
```
Customer        Leader          WAL(L)        Replica        WAL(R)
   |------TXN----->|              |             |              |
   |               |              |             |              |
   |         [TXN undo log        |             |              |
   |            created]          |             |              |
   |               |              |             |              |
   |               |-----TXN----->|             |              |
   |               |              |             |              |
   |               |<---WAL Ok----|             |              |
   |               |              |             |              |
   |         [TXN undo log        |             |              |
   |           destroyed]         |             |              |
   |               |              |             |              |
   |<----TXN Ok----|              |             |              |
   |               |-------Replicate TXN------->|              |
   |               |              |             |              |
   |               |              |       [TXN undo log        |
   |               |              |          created]          |
   |               |              |             |              |
   |               |              |             |-----TXN----->|
   |               |              |             |              |
   |               |              |             |<---WAL Ok----|
   |               |              |             |              |
   |               |              |       [TXN undo log        |
   |               |              |         destroyed]         |
   |               |              |             |              |
```

To introduce the 'quorum' we have to receive confirmations from replicas
to decide whether the quorum is actually reached. The leader collects the
necessary number of replica confirmations plus its own WAL success. This
state is named 'quorum' and gives the leader the right to complete the
customer's request. So the picture changes to:
```
Customer        Leader          WAL(L)        Replica        WAL(R)
   |------TXN----->|              |             |              |
   |               |              |             |              |
   |         [TXN undo log        |             |              |
   |            created]          |             |              |
   |               |              |             |              |
   |               |-----TXN----->|             |              |
   |               |              |             |              |
   |               |-------Replicate TXN------->|              |
   |               |              |             |              |
   |               |              |       [TXN undo log        |
   |               |<---WAL Ok----|          created]          |
   |               |              |             |              |
   |           [Waiting           |             |-----TXN----->|
   |         of a quorum]         |             |              |
   |               |              |             |<---WAL Ok----|
   |               |              |             |              |
   |               |<------Replication Ok-------|              |
   |               |              |             |              |
   |            [Quorum           |             |              |
   |           achieved]          |             |              |
   |               |              |             |              |
   |               |---Confirm--->|             |              |
   |               |              |             |              |
   |               |----------Confirm---------->|              |
   |               |              |             |              |
   |<---TXN Ok-----|              |             |---Confirm--->|
   |               |              |             |              |
   |         [TXN undo log        |       [TXN undo log        |
   |           destroyed]         |         destroyed]         |
   |               |              |             |              |
```

The quorum should be collected as a table for a list of transactions
waiting for the quorum. The latest transaction that collects the quorum is
considered complete, as well as all transactions prior to it, since
all transactions should be applied in order. The leader writes a 'confirm'
message to the WAL that refers to the transaction's [LEADER_ID, LSN], and
the confirm has its own LSN. This confirm message is delivered to all
replicas through the existing replication mechanism.
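
Below is a minimal sketch of how such a table of pending transactions could
work, assuming a plain array indexed by instance id (illustrative C only,
not the actual Tarantool code; all names here are hypothetical). Each ack
advances that instance's acknowledged LSN, and the LSN to write a 'confirm'
for is the highest one acknowledged by at least a quorum of instances,
which implicitly confirms every earlier transaction.

```c
#include <stdint.h>
#include <stdio.h>

enum { MAX_INSTANCES = 32 };

struct quorum_tracker {
	int quorum;                       /* acks needed, leader included */
	int instance_count;
	int64_t acked_lsn[MAX_INSTANCES]; /* highest LSN acked per instance */
};

/* Record an ack: instance `id` has applied everything up to `lsn`. */
static void
tracker_ack(struct quorum_tracker *t, int id, int64_t lsn)
{
	if (lsn > t->acked_lsn[id])
		t->acked_lsn[id] = lsn;
}

/*
 * The LSN to write a 'confirm' for: the highest LSN acknowledged by at
 * least `quorum` instances. Confirming it implicitly confirms all
 * earlier transactions, since they are applied in order.
 */
static int64_t
tracker_confirmed_lsn(const struct quorum_tracker *t)
{
	int64_t best = 0;
	for (int i = 0; i < t->instance_count; i++) {
		int64_t candidate = t->acked_lsn[i];
		int votes = 0;
		for (int j = 0; j < t->instance_count; j++)
			if (t->acked_lsn[j] >= candidate)
				votes++;
		if (votes >= t->quorum && candidate > best)
			best = candidate;
	}
	return best;
}

int
main(void)
{
	struct quorum_tracker t = { .quorum = 2, .instance_count = 3 };
	tracker_ack(&t, 0, 5); /* the leader's own WAL is at LSN 5 */
	tracker_ack(&t, 1, 3); /* replica 1 acked up to LSN 3 */
	printf("confirm up to LSN %lld\n", (long long)tracker_confirmed_lsn(&t));
	return 0;
}
```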

A replica should explicitly report a TXN application success to the leader
via IPROTO to allow the leader to collect the quorum for the TXN.
In case of an application failure the replica has to disconnect from
replication the same way as it is done now. The replica also has to
report its disconnection to the orchestrator. Further actions require
human intervention, since a failure means either a technical problem (such
as not enough space for the WAL) that has to be resolved or an inconsistent
state that requires a rejoin.

As soon as the leader finds itself in a situation where it has not enough
replicas to achieve the quorum, the cluster should stop accepting any
requests - both write and read. The reason for this is that replication of
transactions can achieve the quorum on replicas not visible to the leader.
On the other hand, the leader can't achieve the quorum with the available
minority. The leader has to report the state and wait for human
intervention. There's an option to ask the leader to roll back to the
latest transaction that has the quorum: the leader issues a 'rollback'
message referring to the [LEADER_ID, LSN] where LSN is of the first
transaction in the leader's undo log. The rollback message replicated to
the available part of the cluster will put it into a consistent state.
After that the configuration of the cluster can be updated to the
available quorum and the leader can be switched back to write mode.
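
A hedged sketch of how such a 'rollback' record might be applied on a
receiving instance (illustrative C only, not Tarantool code; all names are
hypothetical): the record carries the LSN of the first transaction in the
leader's undo log, and that transaction together with every later pending
one is rolled back.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Apply a 'rollback' referring to [LEADER_ID, LSN]: the LSN is that of
 * the first transaction in the leader's undo log, so that transaction
 * and all later pending ones are rolled back. Earlier transactions
 * already reached the quorum; their 'confirm' arrives separately.
 */
static void
apply_rollback(const int64_t *pending_lsn, int n, int64_t rollback_lsn)
{
	for (int i = 0; i < n; i++) {
		if (pending_lsn[i] >= rollback_lsn)
			printf("rollback txn with LSN %lld\n",
			       (long long)pending_lsn[i]);
		else
			printf("keep txn with LSN %lld pending\n",
			       (long long)pending_lsn[i]);
	}
}

int
main(void)
{
	/* LSN 6 already has a quorum on the leader; 7 and 8 do not. */
	int64_t pending[] = { 6, 7, 8 };
	apply_rollback(pending, 3, 7);
	return 0;
}
```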

### Leader role assignment.

To assign the leader role to an instance, the following should be performed:
  1. among all available instances pick the one that has the biggest
     vclock element of the former leader ID; an arbitrary instance can be
     selected in case it is the first time the leader is assigned (a
     sketch of this selection is given after the list)
  2. the leader should ensure that the number of available instances in
     the cluster is enough to achieve the quorum and proceed to step 3,
     otherwise the leader should report the situation of an incomplete
     quorum, as in the last paragraph of the previous section
  3. the selected instance has to take the responsibility to replicate
     the former leader's entries from its WAL, obtaining the quorum and
     committing confirm messages referring to [FORMER_LEADER_ID, LSN] in
     its WAL, replicating them to the cluster; after that it can start
     adding its own entries into the WAL
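
The selection in step 1 could look roughly like the following (an
illustrative C sketch under the assumptions above, not the actual
implementation; all names are hypothetical):

```c
#include <stdint.h>
#include <stdio.h>

enum { VCLOCK_MAX = 32 };

struct instance {
	int id;
	int64_t vclock[VCLOCK_MAX]; /* per-replica-id LSNs known to this instance */
};

/*
 * Pick the new leader: the available instance with the biggest vclock
 * element of the former leader's ID, i.e. the one with the longest tail
 * of the old leader's WAL.
 */
static const struct instance *
pick_leader(const struct instance *candidates, int n, int former_leader_id)
{
	const struct instance *best = &candidates[0];
	for (int i = 1; i < n; i++) {
		if (candidates[i].vclock[former_leader_id] >
		    best->vclock[former_leader_id])
			best = &candidates[i];
	}
	return best;
}

int
main(void)
{
	/* Matches the example below: ID2 has Tx5 of ID1, ID3 only Tx3. */
	struct instance nodes[2] = {
		{ .id = 2, .vclock = { [1] = 5 } },
		{ .id = 3, .vclock = { [1] = 3 } },
	};
	printf("new leader: ID%d\n", pick_leader(nodes, 2, 1)->id);
	return 0;
}
```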

### Recovery and failover.

A Tarantool instance, while reading the WAL, should postpone the undo log
deletion until the 'confirm' is read. In case the end of the WAL is reached,
the instance should keep the undo log for all transactions that are waiting
for a confirm entry until the role of the instance is set.
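
A minimal sketch of that recovery pass, assuming the WAL is modelled as a
flat list of records (illustrative C only; not the actual recovery code,
and all names are hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified WAL record: either a transaction or a confirm for one. */
struct wal_record {
	bool is_confirm;
	int64_t lsn;           /* LSN of the txn, or of the confirm itself */
	int64_t confirmed_lsn; /* for a confirm: the [LEADER_ID, LSN] it covers */
};

/*
 * Recovery pass: undo logs of sync transactions are released only when
 * a 'confirm' covering them is read. Everything not covered by a
 * confirm before EOF stays pending until the instance role is set.
 */
static int64_t
recover(const struct wal_record *wal, int n)
{
	int64_t confirmed = 0;
	for (int i = 0; i < n; i++) {
		if (wal[i].is_confirm && wal[i].confirmed_lsn > confirmed)
			confirmed = wal[i].confirmed_lsn;
	}
	/* Transactions with lsn > confirmed keep their undo logs. */
	for (int i = 0; i < n; i++) {
		if (!wal[i].is_confirm && wal[i].lsn > confirmed)
			printf("txn %lld kept pending\n", (long long)wal[i].lsn);
	}
	return confirmed;
}

int
main(void)
{
	struct wal_record wal[] = {
		{ false, 1, 0 }, { false, 2, 0 },
		{ true,  3, 1 }, /* confirm covers txn 1 only */
		{ false, 4, 0 },
	};
	printf("confirmed up to %lld\n", (long long)recover(wal, 4));
	return 0;
}
```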

If this instance is assigned the leader role, then all transactions
that have no corresponding confirm message should be confirmed (see the
leader role assignment).

In case there are not enough replicas to set up a quorum, the cluster can
be switched into a read-only mode. Note, this can't be done by default
since some of the transactions can be in a confirmed state. It is up to
human intervention to force a rollback of all transactions that have no
confirm and to put the cluster into a consistent state.

In case the instance is assigned a replica role, it may end up having
conflicting WAL entries, if it recovered from a leader role and some of
its transactions were not replicated to the current leader. This situation
should be resolved through a rejoin of the instance.

Consider the example below. Originally the instance with ID1 was assigned
the Leader role and the cluster had 2 replicas with the quorum set to 2.

```
+---------------------+---------------------+---------------------+
| ID1                 | ID2                 | ID3                 |
| Leader              | Replica 1           | Replica 2           |
+---------------------+---------------------+---------------------+
| ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
+---------------------+---------------------+---------------------+
| ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
+---------------------+---------------------+---------------------+
| ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |                     |
+---------------------+---------------------+---------------------+
| ID1 Tx4             | ID1 Tx4             |                     |
+---------------------+---------------------+---------------------+
| ID1 Tx5             | ID1 Tx5             |                     |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx2] |                     |                     |
+---------------------+---------------------+---------------------+
| Tx6                 |                     |                     |
+---------------------+---------------------+---------------------+
| Tx7                 |                     |                     |
+---------------------+---------------------+---------------------+
```
Suppose at this moment the ID1 instance crashes. Then the ID2 instance
should be assigned the leader role since its ID1 LSN is the biggest.
Then this new leader will deliver its WAL to all replicas.

As soon as the quorum for Tx4 and Tx5 is obtained, it should write the
corresponding Confirms to its WAL. Note that the Txs still use ID1.
```
+---------------------+---------------------+---------------------+
| ID1                 | ID2                 | ID3                 |
| (dead)              | Leader              | Replica 2           |
+---------------------+---------------------+---------------------+
| ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
+---------------------+---------------------+---------------------+
| ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
+---------------------+---------------------+---------------------+
| ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
+---------------------+---------------------+---------------------+
| ID1 Tx4             | ID1 Tx4             | ID1 Tx4             |
+---------------------+---------------------+---------------------+
| ID1 Tx5             | ID1 Tx5             | ID1 Tx5             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx2] | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] |
+---------------------+---------------------+---------------------+
| ID1 Tx6             |                     |                     |
+---------------------+---------------------+---------------------+
| ID1 Tx7             |                     |                     |
+---------------------+---------------------+---------------------+
```
After rejoining, ID1 will figure out the inconsistency of its WAL: the
last WAL entry it has corresponds to Tx7, while in the Leader's log the
last entry with ID1 is Tx5.

If ID1's WAL contains the corresponding entry, then Replica 1 can stop
reading the WAL as soon as it hits the vclock[ID1] obtained from the
current Leader. This puts ID1 into a consistent state, and it can obtain
the latest data via replication. The WAL should be rotated after a
snapshot creation. The old WAL should be renamed so it will not be reused
in the future and can be kept for postmortem analysis.
```
+---------------------+---------------------+---------------------+
| ID1                 | ID2                 | ID3                 |
| Replica 1           | Leader              | Replica 2           |
+---------------------+---------------------+---------------------+
| ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
+---------------------+---------------------+---------------------+
| ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
+---------------------+---------------------+---------------------+
| ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
+---------------------+---------------------+---------------------+
| ID1 Tx4             | ID1 Tx4             | ID1 Tx4             |
+---------------------+---------------------+---------------------+
| ID1 Tx5             | ID1 Tx5             | ID1 Tx5             |
+---------------------+---------------------+---------------------+
|                     | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] |
+---------------------+---------------------+---------------------+
|                     | ID2 Tx1             | ID2 Tx1             |
+---------------------+---------------------+---------------------+
|                     | ID2 Tx2             | ID2 Tx2             |
+---------------------+---------------------+---------------------+
```
Still, there could be a situation where ID1's WAL begins with an LSN
after the biggest one available in the Leader's WAL. Also, for vinyl,
part of the WAL can be referenced in .run files, hence it can't be evicted
by simply ignoring the WAL. In such a case ID1 needs a complete rejoin.

### Snapshot generation.

We can also reuse the current machinery of snapshot generation. Upon
receiving a request to create a snapshot, an instance should request a
read view for the current commit operation. However, the start of the
snapshot generation should be postponed until this commit operation
receives its confirmation. In case the operation is rolled back, the
snapshot generation should be aborted and restarted using the current
transaction after the rollback is complete.

After the snapshot is created, the WAL should start from the first
operation that follows the commit operation the snapshot is generated for.
That means the WAL will contain 'confirm' messages that refer to
transactions that are not present in the WAL. Apparently, we have to allow
this for the case when a 'confirm' refers to a transaction with an LSN
less than the first entry in the WAL.

In case the master appears unavailable, a replica still has to be able to
create a snapshot. The replica can perform a rollback for all transactions
that are not confirmed and claim its LSN as the latest confirmed txn. Then
it can create a snapshot in the regular way and start with a blank xlog
file. All rolled back transactions will reappear through the regular
replication in case the master comes back later on.

### Asynchronous replication.

Along with synchronous replicas, the cluster can contain asynchronous
replicas. That means an async replica doesn't reply to the leader with
errors, since it is not contributing to the quorum. Still, async replicas
have to follow the new WAL operation, such as keeping the rollback info
until the 'quorum' message is received. This is essential for the case
when a 'rollback' message appears in the WAL. This message assumes the
replica is able to perform all the necessary rollback by itself. The
cluster information should contain an explicit notification of each
replica's operation mode.

### Synchronous replication enabling.

Synchronous operation can be required for a set of spaces in the data
scheme. That means only transactions that modify data in these spaces
should require a quorum. Such transactions are named synchronous. As soon
as the last operation of a synchronous transaction appears in the leader's
WAL, it will cause all following transactions - no matter whether they are
synchronous or not - to wait for the quorum. In case the quorum is not
achieved, the 'rollback' operation will cause a rollback of all
transactions after the synchronous one. This ensures a consistent state of
the data both on the leader and the replicas. If the user doesn't require
synchronous operation for any space, then no changes to the WAL generation
and replication will appear.

The cluster description should contain an explicit attribute for each
replica to denote whether it participates in synchronous activities. Also
the description should contain a criterion on how many replica responses
are needed to achieve the quorum.
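
As an illustration only, such a description could be expressed through the
instance configuration; the option names below (`replication_sync_quorum`,
`replication_sync_participant`) are hypothetical placeholders for the two
attributes, not an existing box.cfg API:

```
-- Hypothetical sketch: the two replication_sync_* options are placeholders,
-- they only illustrate the attributes described above.
box.cfg{
    listen = 3301,
    replication = {
        'replicator:password@host1:3301',
        'replicator:password@host2:3301',
        'replicator:password@host3:3301',
    },
    -- how many replica acknowledgements are needed to declare 'quorum'
    replication_sync_quorum = 2,
    -- whether this instance acknowledges synchronous transactions and
    -- therefore counts towards the quorum (false for an async replica)
    replication_sync_participant = true,
}
```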

## Rationale and alternatives

There is an implementation of synchronous replication as part of the
gh-980 activities, but it is not in a state to get into the product.
Moreover, it intentionally breaks backward compatibility, which is a
prerequisite for this proposal.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-14 20:38               ` Sergey Ostanevich
@ 2020-05-20 20:59                 ` Sergey Ostanevich
  2020-05-25 23:41                   ` Vladislav Shpilevoy
  0 siblings, 1 reply; 53+ messages in thread
From: Sergey Ostanevich @ 2020-05-20 20:59 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

Hi!

I've updated part of recovery and leader election. The latest version is
at the bottom.

Thanks,
Sergos

On 14 May 23:38, Sergey Ostanevich wrote:
> Hi!
> 
> > >> Sergey, I understand that RAFT spec is big and with this spec you
> > >> try to split it into manageable parts. The question is how useful
> > >> is this particular piece. I'm trying to point out that "the leader
> > >> should stop" is not a silver bullet - especially since each such
> > >> stop may mean a rejoin of some other node. The purpose of sync
> > >> replication is to provide consistency without reducing
> > >> availability (i.e. make progress as long as the quorum
> > >> of nodes make progress). 
> > > 
> > > I'm not sure if we're talking about the same RAFT - mine is "In Search
> > > of an Understandable Consensus Algorithm (Extended Version)" from
> > > Stanford as of May 2014. And it is 15 pages - including references,
> > > conclusions and intro. Seems not that big.
> > 
> > 15 pages of tightly packed theory is a big piece of data. And especially
> > big, when it comes to application to a real project, with existing
> > infrastructure, and all. Just my IMHO. I remember implementing SWIM - it
> > is smaller than RAFT. Much smaller and simpler, and yet it took year to
> > implement it, and cover all things described in the paper.
> 
> That I won't object and this was the reason not to take the RAFT as is
> and implement it in full for next 2-3 years. That's why we had the very
> first part of RFC describing what it tries to address and what's not.
> 
> > 
> > This is not as simple as it looks, when it comes to edge cases. This is
> > why the whole sync replication frustrates me more than anything else
> > before, and why I am so reluctant to doing anything with it.
> > 
> > The RFC mostly covers the normal operation, here I agree with Kostja. But
> > the normal operation is not that interesting. Failures are much more
> > important.
> 
> Definitely and I expect to follow with more functionality on top of it.
> I believe it will be easier to do if the start will be as small as
> possible change to the existent code base, which I also try to follow.
> 
> > 
> > > Although, most of it is dedicated to the leader election itself, which
> > > we intentionally put aside from this RFC. It is written in the very
> > > beginning and I empasized this by explicit mentioning of it.
> > 
> > And still there will be leader election. Even though not ours for now.
> > And Tarantool should provide API and instructions so as external
> > applications could follow them and do the election.
> > 
> > Usually in RFCs we describe API. With arguments, behaviour, and all.
> 
> That is something I believe should be done after we agree on the whole
> idea, such as confirm entry in WAL for sync transactions that appeared
> there earlier. Otherwise we can get very deep into the details, spending
> time for API definition while the idea itself can appear wrong.
> 
> I believe that was a common ground to start, but we immediately went to
> discussion of so many details I tried to keep away before we agree on the
> key parts, such as WAL consistency or quorum collection.
> 
> > 
> > >> The current spec, suggesting there should be a leader stop in case
> > >> of most errors, reduces availability significantly, and doesn't
> > >> make external coordinator job any easier - it still has to follow to
> > >> the letter the prescriptions of RAFT. 
> > >>
> > >>>
> > >>>>
> > >>>>>    |               |----------Confirm---------->|              |
> > >>>>
> > >>>> What happens if peers receive and maybe even write Confirm to their WALs
> > >>>> but local WAL write is lost after a restart?
> > >>>
> > >>> Did you mean WAL write on leader as a local? Then we have a replica with
> > >>> a bigger LSN for the leader ID. 
> > >>
> > >>>> WAL is not synced, 
> > >>>> so we can easily lose the tail of the WAL. Tarantool will sync up
> > >>>> with all replicas on restart,
> > >>>
> > >>> But at this point a new leader will be appointed - the old one is
> > >>> restarted. Then the Confirm message will arrive to the restarted leader 
> > >>> through a regular replication.
> > >>
> > >> This assumes that restart is guaranteed to be noticed by the
> > >> external coordinator and there is an election on every restart.
> > > 
> > > Sure yes, if it restarted - then connection lost can't be unnoticed by
> > > anyone, be it coordinator or cluster.
> > 
> > Here comes another problem. Disconnect and restart have nothing to do with
> > each other. The coordinator can loose connection without the peer leader
> > restart. Just because it is network. Anything can happen. Moreover, while
> > the coordinator does not have a connection, the leader can restart multiple
> > times.
> 
> Definitely there should be a higher level functionality to support some
> sort of membership protocol, such as SWIM or RAFT itself. But
> introduction of it should not affect the basic priciples we have to
> agree upon.
> 
> > 
> > We can't tell the coordinator rely on connectivity as a restart signal.
> > 
> > >>>> but there will be no "Replication
> > >>>> OK" messages from them, so it wouldn't know that the transaction
> > >>>> is committed on them. How is this handled? We may end up with some
> > >>>> replicas confirming the transaction while the leader will roll it
> > >>>> back on restart. Do you suggest there is a human intervention on
> > >>>> restart as well?
> > >>>>
> > >>>>
> > >>>>>    |               |              |             |              |
> > >>>>>    |<---TXN Ok-----|              |       [TXN undo log        |
> > >>>>>    |               |              |         destroyed]         |
> > >>>>>    |               |              |             |              |
> > >>>>>    |               |              |             |---Confirm--->|
> > >>>>>    |               |              |             |              |
> > >>>>> ```
> > >>>>>
> > >>>>> The quorum should be collected as a table for a list of transactions
> > >>>>> waiting for quorum. The latest transaction that collects the quorum is
> > >>>>> considered as complete, as well as all transactions prior to it, since
> > >>>>> all transactions should be applied in order. Leader writes a 'confirm'
> > >>>>> message to the WAL that refers to the transaction's [LEADER_ID, LSN] and
> > >>>>> the confirm has its own LSN. This confirm message is delivered to all
> > >>>>> replicas through the existing replication mechanism.
> > >>>>>
> > >>>>> Replica should report a TXN application success to the leader via the
> > >>>>> IPROTO explicitly to allow leader to collect the quorum for the TXN.
> > >>>>> In case of application failure the replica has to disconnect from the
> > >>>>> replication the same way as it is done now. The replica also has to
> > >>>>> report its disconnection to the orchestrator. Further actions require
> > >>>>> human intervention, since failure means either technical problem (such
> > >>>>> as not enough space for WAL) that has to be resovled or an inconsistent
> > >>>>> state that requires rejoin.
> > >>>>
> > >>>>> As soon as leader appears in a situation it has not enough replicas
> > >>>>> to achieve quorum, the cluster should stop accepting any requests - both
> > >>>>> write and read.
> > >>>>
> > >>>> How does *the cluster* know the state of the leader and if it
> > >>>> doesn't, how it can possibly implement this? Did you mean
> > >>>> the leader should stop accepting transactions here? But how can
> > >>>> the leader know if it has not enough replicas during a read
> > >>>> transaction, if it doesn't contact any replica to serve a read?
> > >>>
> > >>> I expect to have a disconnection trigger assigned to all relays so that
> > >>> disconnection will cause the number of replicas decrease. The quorum
> > >>> size is static, so we can stop at the very moment the number dives below.
> > >>
> > >> What happens between the event the leader is partitioned away and
> > >> a new leader is elected?
> > >>
> > >> The leader may be unaware of the events and serve a read just
> > >> fine.
> > > 
> > > As it is stated 20 lines above: 
> > >>>>> As soon as leader appears in a situation it has not enough
> > >>>>> replicas
> > >>>>> to achieve quorum, the cluster should stop accepting any
> > >>>>> requests - both
> > >>>>> write and read.
> > > 
> > > So it will not serve. 
> > 
> > This breaks compatibility, since now an orphan node is perfectly able
> > to serve reads. The cluster can't just stop doing everything, if the
> > quorum is lost. Stop writes - yes, since the quorum is lost anyway. But
> > reads do not need a quorum.
> > 
> > If you say reads need a quorum, then they would need to go through WAL,
> > collect confirmations, and all.
> 
> The reads should not be inconsistent - so that cluster will keep
> answering A or B for the same request. And in case we lost quorum we
> can't say for sure that all instances will answer the same.
> 
> As we discussed it before, if leader appears in minor part of the
> cluster it can't issue rollback for all unconfirmed txns, since the
> majority will re-elect leader who will collect quorum for them. Means,
> we will appear is a state that cluster split in two. So the minor part
> should stop. Am I wrong here?
> 
> > 
> > >> So at least you can't say the leader shouldn't be serving reads
> > >> without quorum - because the only way to achieve it is to collect
> > >> a quorum of responses to reads as well.
> > > 
> > > The leader lost connection to the (N-Q)+1 repllicas out of the N in
> > > cluster with a quorum of Q == it stops serving anything. So the quorum
> > > criteria is there: no quorum - no reads. 
> > 
> > Connection count tells nothing. Network connectivity is not a reliable
> > source of information. Only messages and persistent data are reliable
> > (to certain extent).
> 
> Well, persistent data can't help obtain quorum if there's no connection
> to the replicas who should contribute to quorum.
> Correct me, if I'm wrong: in case no quorum available we can't garantee
> that the data is stored on at least <quorum> number of servers. Means -
> cluster is not operable.
> 
> > 
> > >>>>> The reason for this is that replication of transactions
> > >>>>> can achieve quorum on replicas not visible to the leader. On the other
> > >>>>> hand, leader can't achieve quorum with available minority. Leader has to
> > >>>>> report the state and wait for human intervention. There's an option to
> > >>>>> ask leader to rollback to the latest transaction that has quorum: leader
> > >>>>> issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN
> > >>>>> is of the first transaction in the leader's undo log. The rollback
> > >>>>> message replicated to the available cluster will put it in a consistent
> > >>>>> state. After that configuration of the cluster can be updated to
> > >>>>> available quorum and leader can be switched back to write mode.
> > >>>>
> > >>>> As you should be able to conclude from restart scenario, it is
> > >>>> possible a replica has the record in *confirmed* state but the
> > >>>> leader has it in pending state. The replica will not be able to
> > >>>> roll back then. Do you suggest the replica should abort if it
> > >>>> can't rollback? This may lead to an avalanche of rejoins on leader
> > >>>> restart, bringing performance to a halt.
> > >>>
> > >>> No, I declare replica with biggest LSN as a new shining leader. More
> > >>> than that, new leader can (so far it will be by default) finalize the
> > >>> former leader life's work by replicating txns and appropriate confirms.
> > >>
> > >> Right, this also assumes the restart is noticed, so it follows the
> > >> same logic.
> > > 
> > > How a restart can be unnoticed, if it causes disconnection?
> > 
> > Disconnection has nothing to do with restart. The coordinator itself may
> > restart. Or it may loose connection to the leader temporarily. Or the
> > leader may loose it without any restarts.
> 
> But how we detect it right now in Tarantool? Is there any machinery?
> I suppose we can simply rely on the same at least to test the minimal -
> and 'normally operating' - first approach to the problem.
> 
> 
> So, thank you for all comments and please, find my updated RFC below.
> 
> Sergos.
> 
> ---

* **Status**: In progress
* **Start date**: 31-03-2020
* **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\>
* **Issues**: https://github.com/tarantool/tarantool/issues/4842

## Summary

The aim of this RFC is to address the following list of problems
formulated at MRG planning meeting:
  - protocol backward compatibility to enable cluster upgrade w/o
    downtime
  - consistency of data on replica and leader
  - switch from leader to replica without data loss
  - up-to-date replicas to run read-only requests
  - ability to switch async replicas into sync ones and vice versa
  - guarantee of rollback on leader and sync replicas
  - simplicity of cluster orchestration

What this RFC is not:

  - high availability (HA) solution with automated failover, roles
    assignments and so on
  - master-master configuration support

## Background and motivation

There are a number of known implementations of consistent data presence
in a Tarantool cluster. They can be commonly named the "wait for LSN"
technique. The biggest issue with this technique is the absence of
rollback guarantees at the replica in case of a transaction failure on one
master or some of the replicas in the cluster.

To provide such capabilities, new functionality should be introduced in
the Tarantool core, with the requirements mentioned before - backward
compatibility and ease of cluster orchestration.

The cluster is expected to operate in a full-mesh topology, although
automated topology support is beyond this RFC.

## Detailed design

### Quorum commit

The main idea behind the proposal is to reuse the existing machinery as
much as possible. This ensures that the well-tested and proven
functionality used across many instances in MRG and beyond is kept. The
transaction rollback mechanism is in place and works for a WAL write
failure. If we substitute the WAL success with a new state, named 'quorum'
later in this document, then no changes to the machinery are needed. The
same is true for the snapshot machinery that allows creating a copy of the
database in memory for the whole period of the snapshot file write. Adding
quorum here also minimizes changes.

Currently replication is represented by the following scheme:
```
Customer        Leader          WAL(L)        Replica        WAL(R)
   |------TXN----->|              |             |              |
   |               |              |             |              |
   |         [TXN undo log        |             |              |
   |            created]          |             |              |
   |               |              |             |              |
   |               |-----TXN----->|             |              |
   |               |              |             |              |
   |               |<---WAL Ok----|             |              |
   |               |              |             |              |
   |         [TXN undo log        |             |              |
   |           destroyed]         |             |              |
   |               |              |             |              |
   |<----TXN Ok----|              |             |              |
   |               |-------Replicate TXN------->|              |
   |               |              |             |              |
   |               |              |       [TXN undo log        |
   |               |              |          created]          |
   |               |              |             |              |
   |               |              |             |-----TXN----->|
   |               |              |             |              |
   |               |              |             |<---WAL Ok----|
   |               |              |             |              |
   |               |              |       [TXN undo log        |
   |               |              |         destroyed]         |
   |               |              |             |              |
```

To introduce the 'quorum' we have to receive confirmations from the
replicas to decide whether the quorum is actually present. The leader
collects the necessary number of replica confirmations plus its own WAL
success. This state is named 'quorum' and gives the leader the right to
complete the customer's request. So the picture changes to:
```
Customer        Leader          WAL(L)        Replica        WAL(R)
   |------TXN----->|              |             |              |
   |               |              |             |              |
   |         [TXN undo log        |             |              |
   |            created]          |             |              |
   |               |              |             |              |
   |               |-----TXN----->|             |              |
   |               |              |             |              |
   |               |-------Replicate TXN------->|              |
   |               |              |             |              |
   |               |              |       [TXN undo log        |
   |               |<---WAL Ok----|          created]          |
   |               |              |             |              |
   |           [Waiting           |             |-----TXN----->|
   |         of a quorum]         |             |              |
   |               |              |             |<---WAL Ok----|
   |               |              |             |              |
   |               |<------Replication Ok-------|              |
   |               |              |             |              |
   |            [Quorum           |             |              |
   |           achieved]          |             |              |
   |               |              |             |              |
   |               |---Confirm--->|             |              |
   |               |              |             |              |
   |               |----------Confirm---------->|              |
   |               |              |             |              |
   |<---TXN Ok-----|              |             |---Confirm--->|
   |               |              |             |              |
   |         [TXN undo log        |       [TXN undo log        |
   |           destroyed]         |         destroyed]         |
   |               |              |             |              |
```

The quorum should be collected as a table for the list of transactions
waiting for the quorum. The latest transaction that collects the quorum is
considered complete, as well as all transactions prior to it, since all
transactions should be applied in order. The leader writes a 'confirm'
message to the WAL that refers to the transaction's [LEADER_ID, LSN], and
the confirm has its own LSN. This confirm message is delivered to all
replicas through the existing replication mechanism.
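
A minimal sketch of this quorum table in Lua-like pseudocode; the
bookkeeping and the `write_confirm()` helper are illustrative assumptions
that only mirror the algorithm above (the quorum-th largest acknowledged
LSN confirms itself and, implicitly, everything before it):

```
-- Illustrative sketch only: write_confirm() is hypothetical.
local quorum_size = 2        -- replica acknowledgements required
local acked = {}             -- replica_id -> highest LSN it has applied
local last_confirmed = 0

local function on_replica_ack(replica_id, lsn)
    acked[replica_id] = math.max(acked[replica_id] or 0, lsn)
    -- the LSN applied by at least `quorum_size` replicas is the
    -- quorum_size-th largest acknowledged LSN
    local lsns = {}
    for _, l in pairs(acked) do
        table.insert(lsns, l)
    end
    table.sort(lsns, function(a, b) return a > b end)
    local quorate_lsn = lsns[quorum_size]
    if quorate_lsn ~= nil and quorate_lsn > last_confirmed then
        -- a single 'confirm' covers this transaction and, implicitly,
        -- every transaction before it
        write_confirm({leader_id = box.info.id, lsn = quorate_lsn})
        last_confirmed = quorate_lsn
    end
end
```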

A replica should explicitly report a TXN application success to the
leader via IPROTO to allow the leader to collect the quorum for the TXN.
In case of an application failure, the replica has to disconnect from the
replication the same way as it is done now. The replica also has to report
its disconnection to the orchestrator. Further actions require human
intervention, since a failure means either a technical problem (such as
not enough space for the WAL) that has to be resolved, or an inconsistent
state that requires a rejoin.

As soon as the leader finds itself in a situation where it doesn't have
enough replicas to achieve the quorum, the cluster should stop accepting
any requests - both write and read. The reason for this is that
replication of transactions can achieve the quorum on replicas not visible
to the leader. On the other hand, the leader can't achieve the quorum with
the available minority. The leader has to report the state and wait for
human intervention. There's an option to ask the leader to roll back to
the latest transaction that has the quorum: the leader issues a 'rollback'
message referring to the [LEADER_ID, LSN], where the LSN is of the first
transaction in the leader's undo log. The rollback message replicated to
the available cluster will put it into a consistent state. After that the
configuration of the cluster can be updated to the available quorum and
the leader can be switched back to the write mode.

### Leader role assignment.

Be it a user-initiated assignment or an algorithmic one, it should use
a common interface to assign the leader role. For now we implement
simplified machinery, still it should be feasible in the future to fit in
algorithms such as RAFT or the previously proposed box.ctl.promote.

A system space \_voting can be used to replicate the voting among the
cluster; this space should be writable even for a read-only instance.
This space should contain a CURRENT_LEADER_ID at any time - meaning the
current leader - which can be a zero value at the start. This is needed to
compare the appropriate vclock component below.

All replicas should be subscribed to changes in the space and react as
described below.

 promote(ID) - should be called from a replica with its own ID.
   Writes an entry in the voting space stating that this ID is waiting for
   votes from the cluster. The entry should also contain the current
   vclock[CURRENT_LEADER_ID] of the nominee.

Upon changes in the space each replica should compare its appropriate
vclock component with the submitted one and append its vote to the space:
AYE in case the nominee's vclock is bigger than or equal to the replica's
one, NAY otherwise.

As soon as the nominee collects the quorum for being elected, it claims
itself a Leader by switching into rw mode, writes CURRENT_LEADER_ID as
a FORMER_LEADER_ID in the \_voting space and puts its ID as the
CURRENT_LEADER_ID. In case a NAY appears in the \_voting space or a
timeout predefined in box.cfg is reached, the nominee should remove its
entry from the space.

The leader should ensure that the number of available instances in the
cluster is enough to achieve the quorum and proceed to step 3; otherwise
the leader should report the situation of an incomplete quorum, as
described in the last paragraph of the previous section.

The new Leader has to take the responsibility to replicate the former
Leader's entries from its WAL, obtain a quorum for them and commit confirm
messages referring to [FORMER_LEADER_ID, LSN] to its WAL, replicating
these to the cluster; after that it can start adding its own entries to
the WAL.

 demote(ID) - should be called from the Leader instance.
   The Leader has to switch into ro mode and wait until its undo log is
   empty. This effectively means all transactions are committed in the
   cluster and it is safe to pass the leadership. Then it should write
   CURRENT_LEADER_ID as a FORMER_LEADER_ID and set CURRENT_LEADER_ID
   to 0.
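
A rough Lua sketch of how promote()/demote() could act on the proposed
\_voting space. The space layout, the tuple formats and the
wait_undo_log_empty() helper are illustrative assumptions made for the
example, not part of the proposal:

```
-- Illustrative sketch: the _voting layout (keys, tuple formats) and the
-- wait_undo_log_empty() helper are assumptions made for the example only.
local function current_leader_id()
    local t = box.space._voting:get{'current_leader_id'}
    return t ~= nil and t[2] or 0
end

local function promote()
    -- announce the nomination together with the nominee's vclock component
    -- for the current leader, so every replica can compare it with its own
    local leader_id = current_leader_id()
    box.space._voting:replace{'nominee', box.info.id,
                              box.info.vclock[leader_id] or 0}
end

-- every replica reacts to a new nomination in the space
local function vote(nominee_id, nominee_vclock)
    local my_vclock = box.info.vclock[current_leader_id()] or 0
    local answer = nominee_vclock >= my_vclock and 'AYE' or 'NAY'
    box.space._voting:replace{'vote', box.info.id, nominee_id, answer}
end

local function demote()
    -- called on the Leader: switch to ro, wait until the undo log is
    -- empty, then pass the leadership on
    box.cfg{read_only = true}
    wait_undo_log_empty()                -- hypothetical helper
    box.space._voting:replace{'former_leader_id', box.info.id}
    box.space._voting:replace{'current_leader_id', 0}
end
```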

### Recovery and failover.

While reading the WAL during recovery, a Tarantool instance should
postpone the undo log deletion until the corresponding 'confirm' entry is
read. In case the WAL EOF is reached, the instance should keep the undo
log for all transactions that are still waiting for a confirm entry until
the role of the instance is set.
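
In Lua-like pseudocode (the WAL iterator, apply() and the undo handles
below are assumptions made purely for illustration), the recovery rule
reads roughly as:

```
-- Recovery sketch; wal_rows(), apply() and the undo handles are
-- illustrative, not an existing API.
local unconfirmed = {}          -- [origin_id][lsn] -> undo log handle

for _, row in wal_rows() do
    if row.type == 'confirm' then
        -- the confirm covers [row.leader_id, row.lsn] and everything prior
        local pending = unconfirmed[row.leader_id] or {}
        for lsn, undo in pairs(pending) do
            if lsn <= row.lsn then
                undo:discard()
                pending[lsn] = nil
            end
        end
    else
        local undo = apply(row)
        unconfirmed[row.replica_id] = unconfirmed[row.replica_id] or {}
        unconfirmed[row.replica_id][row.lsn] = undo
    end
end
-- WAL EOF: whatever remains in `unconfirmed` keeps its undo log until the
-- instance's role is known (confirm all if it becomes the leader, possibly
-- roll back or rejoin if it becomes a replica).
```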

If this instance is assigned the leader role, then all transactions that
have no corresponding confirm message should be confirmed (see the
leader role assignment).

In case there are not enough replicas to set up a quorum, the cluster can
be switched into a read-only mode. Note that this can't be done by default
since some of the transactions may already be in a confirmed state. It is
up to human intervention to force a rollback of all transactions that have
no confirm and to put the cluster into a consistent state.

In case the instance is assigned a replica role, it may end up with
conflicting WAL entries, for example if it recovered from a leader role
and some of its transactions were not replicated to the current leader.
This situation should be resolved through a rejoin of the instance.

Consider the example below. Originally the instance with ID1 was assigned
the Leader role and the cluster had 2 replicas with the quorum set to 2.

```
+---------------------+---------------------+---------------------+
| ID1                 | ID2                 | ID3                 |
| Leader              | Replica 1           | Replica 2           |
+---------------------+---------------------+---------------------+
| ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
+---------------------+---------------------+---------------------+
| ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
+---------------------+---------------------+---------------------+
| ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |                     |
+---------------------+---------------------+---------------------+
| ID1 Tx4             | ID1 Tx4             |                     |
+---------------------+---------------------+---------------------+
| ID1 Tx5             | ID1 Tx5             |                     |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx2] |                     |                     |
+---------------------+---------------------+---------------------+
| Tx6                 |                     |                     |
+---------------------+---------------------+---------------------+
| Tx7                 |                     |                     |
+---------------------+---------------------+---------------------+
```
Suppose at this moment the ID1 instance crashes. Then the ID2 instance
should be assigned the leader role, since its LSN for ID1 is the biggest.
This new leader will then deliver its WAL to all replicas.

As soon as the quorum for Tx4 and Tx5 is obtained, it should write the
corresponding Confirms to its WAL. Note that the Txs still use ID1.
```
+---------------------+---------------------+---------------------+
| ID1                 | ID2                 | ID3                 |
| (dead)              | Leader              | Replica 2           |
+---------------------+---------------------+---------------------+
| ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
+---------------------+---------------------+---------------------+
| ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
+---------------------+---------------------+---------------------+
| ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
+---------------------+---------------------+---------------------+
| ID1 Tx4             | ID1 Tx4             | ID1 Tx4             |
+---------------------+---------------------+---------------------+
| ID1 Tx5             | ID1 Tx5             | ID1 Tx5             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx2] | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] |
+---------------------+---------------------+---------------------+
| ID1 Tx6             |                     |                     |
+---------------------+---------------------+---------------------+
| ID1 Tx7             |                     |                     |
+---------------------+---------------------+---------------------+
```
After rejoining, ID1 will figure out the inconsistency of its WAL: the
last WAL entry it has corresponds to Tx7, while in the Leader's log the
last entry with ID1 is Tx5. A Confirm for a Tx can only be issued after
the Tx appears on the majority of replicas, hence there is a good chance
that ID1 will have the inconsistency in its WAL covered by the undo log.
So, by rolling back all the excessive Txs (in the example they are Tx6
and Tx7), ID1 can put its memtx and vinyl engines into a consistent state.

At this point a snapshot can be created at ID1 with an appropriate WAL
rotation. The old WAL should be renamed so it will not be reused in the
future and can be kept for postmortem analysis.
```
+---------------------+---------------------+---------------------+
| ID1                 | ID2                 | ID3                 |
| Replica 1           | Leader              | Replica 2           |
+---------------------+---------------------+---------------------+
| ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
+---------------------+---------------------+---------------------+
| ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
+---------------------+---------------------+---------------------+
| ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
+---------------------+---------------------+---------------------+
| ID1 Tx4             | ID1 Tx4             | ID1 Tx4             |
+---------------------+---------------------+---------------------+
| ID1 Tx5             | ID1 Tx5             | ID1 Tx5             |
+---------------------+---------------------+---------------------+
|                     | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] |
+---------------------+---------------------+---------------------+
|                     | ID2 Tx1             | ID2 Tx1             |
+---------------------+---------------------+---------------------+
|                     | ID2 Tx2             | ID2 Tx2             |
+---------------------+---------------------+---------------------+
```
However, in case the undo log is not enough to cover the WAL
inconsistency with the new leader, ID1 needs a complete rejoin.

### Snapshot generation.

We can also reuse the current machinery of snapshot generation. Upon
receiving a request to create a snapshot, an instance should request a
read view for the current commit operation. However, the start of the
snapshot generation should be postponed until this commit operation
receives its confirmation. In case the operation is rolled back, the
snapshot generation should be aborted and restarted using the current
transaction after the rollback is complete.

After the snapshot is created, the WAL should start from the first
operation that follows the commit operation the snapshot is generated for.
That means the WAL will contain 'confirm' messages that refer to
transactions that are not present in the WAL. Apparently, we have to allow
this for the case when a 'confirm' refers to a transaction with an LSN
less than the first entry in the WAL.

In case the master appears unavailable, a replica still has to be able to
create a snapshot. The replica can perform a rollback for all transactions
that are not confirmed and claim its LSN as the latest confirmed txn. Then
it can create a snapshot in the regular way and start with a blank xlog
file. All rolled back transactions will reappear through the regular
replication in case the master comes back later on.
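
A short sketch of that procedure; rollback_unconfirmed() is a hypothetical
stand-in for the rollback step, while box.snapshot() is the regular
snapshot call:

```
-- Illustrative sketch: rollback_unconfirmed() is hypothetical.
local function snapshot_without_master()
    -- drop every transaction that never collected a quorum; they will
    -- come back through regular replication if the master reappears
    rollback_unconfirmed()
    -- the replica now holds confirmed data only, so a regular snapshot is
    -- safe and the next xlog starts blank
    box.snapshot()
end
```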

### Asynchronous replication.

Along with synchronous replicas, the cluster can contain asynchronous
replicas. That means an async replica doesn't reply to the leader with
errors, since it is not contributing to the quorum. Still, async replicas
have to follow the new WAL operation, such as keeping the rollback info
until the 'quorum' message is received. This is essential for the case
when a 'rollback' message appears in the WAL. This message assumes the
replica is able to perform all the necessary rollback by itself. The
cluster information should contain an explicit notification of each
replica's operation mode.

### Synchronous replication enabling.

Synchronous operation can be required for a set of spaces in the data
scheme. That means only transactions that modify data in these spaces
should require a quorum. Such transactions are named synchronous. As soon
as the last operation of a synchronous transaction appears in the leader's
WAL, it will cause all following transactions - no matter whether they are
synchronous or not - to wait for the quorum. In case the quorum is not
achieved, the 'rollback' operation will cause a rollback of all
transactions after the synchronous one. This ensures a consistent state of
the data both on the leader and the replicas. If the user doesn't require
synchronous operation for any space, then no changes to the WAL generation
and replication will appear.
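
For illustration, marking a space as synchronous could look as follows;
the `sync` space option is a hypothetical name for the per-space attribute
described above:

```
-- Hypothetical 'sync' space option; only the per-space attribute itself
-- is part of the proposal, the option name is illustrative.
local accounts = box.schema.space.create('accounts', {sync = true})
accounts:create_index('pk')

box.begin()
accounts:insert{1, 100}   -- touches a synchronous space
box.commit()              -- returns only after the 'confirm' is collected,
                          -- or fails if a 'rollback' is issued
```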

The cluster description should contain an explicit attribute for each
replica to denote whether it participates in synchronous activities. Also
the description should contain a criterion on how many replica responses
are needed to achieve the quorum.

## Rationale and alternatives

There is an implementation of synchronous replication as part of the
gh-980 activities, but it is not in a state to get into the product.
Moreover, it intentionally breaks backward compatibility, which is a
prerequisite for this proposal.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-20 20:59                 ` Sergey Ostanevich
@ 2020-05-25 23:41                   ` Vladislav Shpilevoy
  2020-05-27 21:17                     ` Sergey Ostanevich
  0 siblings, 1 reply; 53+ messages in thread
From: Vladislav Shpilevoy @ 2020-05-25 23:41 UTC (permalink / raw)
  To: Sergey Ostanevich; +Cc: tarantool-patches

Hi! Thanks for the changes!

>>>>>>>> As soon as leader appears in a situation it has not enough
>>>>>>>> replicas
>>>>>>>> to achieve quorum, the cluster should stop accepting any
>>>>>>>> requests - both
>>>>>>>> write and read.
>>>>
>>>> So it will not serve. 
>>>
>>> This breaks compatibility, since now an orphan node is perfectly able
>>> to serve reads. The cluster can't just stop doing everything, if the
>>> quorum is lost. Stop writes - yes, since the quorum is lost anyway. But
>>> reads do not need a quorum.
>>>
>>> If you say reads need a quorum, then they would need to go through WAL,
>>> collect confirmations, and all.
>>
>> The reads should not be inconsistent - so that cluster will keep
>> answering A or B for the same request. And in case we lost quorum we
>> can't say for sure that all instances will answer the same.
>>
>> As we discussed it before, if leader appears in minor part of the
>> cluster it can't issue rollback for all unconfirmed txns, since the
>> majority will re-elect leader who will collect quorum for them. Means,
>> we will appear is a state that cluster split in two. So the minor part
>> should stop. Am I wrong here?

Yeah, kinda. As long as you allow reading from replicas, you *always* will
have a time slot when you will be able to read different data for the
same key on different replicas. Even with reads going through quorum.

Because it is physically impossible to make nodes A and B start answering
the same data at the same moment of time. To notify them about a confirm you
will send network messages, they will not have the same delay, won't be
processed at the same moment of time, and some of them probably won't even
be delivered.

The only correct way to read the same data is to read from one node only -
from the leader. And since this is not our way, it means we can't beat the
'inconsistent' reads problem. And I don't think we should. Because if somebody
needs to do 'consistent' reads, they should read from the leader only.

In other words, the concept of 'consistency' is highly application dependent
here. If we provide a way to read from replicas, we give flexibility to choose:
read from the leader only and always see the same data, or read from all, and
accept the possibility that requests may see different data on different
replicas sometimes.

> ## Detailed design
> 
> ### Quorum commit
> 
> The main idea behind the proposal is to reuse existent machinery as much
> as possible. It will ensure the well-tested and proven functionality
> across many instances in MRG and beyond is used. The transaction rollback
> mechanism is in place and works for WAL write failure. If we substitute
> the WAL success with a new situation which is named 'quorum' later in
> this document then no changes to the machinery is needed. The same is
> true for snapshot machinery that allows to create a copy of the database
> in memory for the whole period of snapshot file write. Adding quorum here
> also minimizes changes.
> 
> Currently replication represented by the following scheme:
> ```
> Customer        Leader          WAL(L)        Replica        WAL(R)
>    |------TXN----->|              |             |              |
>    |               |              |             |              |
>    |         [TXN undo log        |             |              |
>    |            created]          |             |              |
>    |               |              |             |              |
>    |               |-----TXN----->|             |              |
>    |               |              |             |              |
>    |               |<---WAL Ok----|             |              |
>    |               |              |             |              |
>    |         [TXN undo log        |             |              |
>    |           destroyed]         |             |              |
>    |               |              |             |              |
>    |<----TXN Ok----|              |             |              |
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              |
>    |               |              |       [TXN undo log        |
>    |               |              |          created]          |
>    |               |              |             |              |
>    |               |              |             |-----TXN----->|
>    |               |              |             |              |
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              |
>    |               |              |       [TXN undo log        |
>    |               |              |         destroyed]         |
>    |               |              |             |              |
> ```
> 
> To introduce the 'quorum' we have to receive confirmation from replicas
> to make a decision on whether the quorum is actually present. Leader
> collects necessary amount of replicas confirmation plus its own WAL
> success. This state is named 'quorum' and gives leader the right to
> complete the customers' request. So the picture will change to:
> ```
> Customer        Leader          WAL(L)        Replica        WAL(R)
>    |------TXN----->|              |             |              |
>    |               |              |             |              |
>    |         [TXN undo log        |             |              |
>    |            created]          |             |              |
>    |               |              |             |              |
>    |               |-----TXN----->|             |              |
>    |               |              |             |              |
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              |
>    |               |              |       [TXN undo log        |
>    |               |<---WAL Ok----|          created]          |
>    |               |              |             |              |
>    |           [Waiting           |             |-----TXN----->|
>    |         of a quorum]         |             |              |
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              |
>    |               |<------Replication Ok-------|              |
>    |               |              |             |              |
>    |            [Quorum           |             |              |
>    |           achieved]          |             |              |
>    |               |              |             |              |
>    |               |---Confirm--->|             |              |
>    |               |              |             |              |
>    |               |----------Confirm---------->|              |
>    |               |              |             |              |
>    |<---TXN Ok-----|              |             |---Confirm--->|
>    |               |              |             |              |
>    |         [TXN undo log        |       [TXN undo log        |
>    |           destroyed]         |         destroyed]         |
>    |               |              |             |              |
> ```
> 
> The quorum should be collected as a table for a list of transactions
> waiting for quorum. The latest transaction that collects the quorum is
> considered as complete, as well as all transactions prior to it, since
> all transactions should be applied in order. Leader writes a 'confirm'
> message to the WAL that refers to the transaction's [LEADER_ID, LSN] and
> the confirm has its own LSN. This confirm message is delivered to all
> replicas through the existing replication mechanism.
> 
> Replica should report a TXN application success to the leader via the
> IPROTO explicitly to allow leader to collect the quorum for the TXN.
> In case of application failure the replica has to disconnect from the
> replication the same way as it is done now. The replica also has to
> report its disconnection to the orchestrator. Further actions require
> human intervention, since failure means either technical problem (such
> as not enough space for WAL) that has to be resolved or an inconsistent
> state that requires rejoin.

I don't think a replica should report disconnection. The problem with
disconnection is that it means losing the connection, so the replica may
not be able to reach the orchestrator. Also it would be strange for
tarantool to depend on some external service to which it should report.
How to determine connectivity is the orchestrator's business. The replica
has nothing to do with it from its side.

> As soon as leader appears in a situation it has not enough replicas
> to achieve quorum, the cluster should stop accepting any requests - both
> write and read.

The moment of not having enough replicas can't be determined properly.
You may lose the connection to replicas (they could be powered off), but
TCP won't see that, and the node will continue working. The failure will
be discovered only when a 'write' request tries to collect a quorum,
or after a timeout passes without delivered heartbeats. During this
time reads will be served. And there is no way to prevent them except
collecting a quorum on them too. See my first comment in this email for
more details.

In summary: we can't stop accepting read requests.

Btw, what to do with reads which were *in-progress* when the quorum
was lost? Such as long vinyl reads.

> The reason for this is that replication of transactions
> can achieve quorum on replicas not visible to the leader. On the other
> hand, leader can't achieve quorum with available minority. Leader has to
> report the state and wait for human intervention.

Yeah, but if the leader couldn't achieve a quorum on some transactions,
they are not visible (assuming MVCC works properly). So they can't
be read anyway. And if a leader answered an error, it does not mean that
the transaction wasn't replicated on the majority, as we discussed at some
meeting, I don't remember when exactly. So here allowing reads also works
fine - not having some data visible and getting an error for a sync
transaction does not mean it is not committed. A user should be aware of
that.

> There's an option to
> ask leader to rollback to the latest transaction that has quorum: leader
> issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN
> is of the first transaction in the leader's undo log. The rollback
> message replicated to the available cluster will put it in a consistent
> state. After that configuration of the cluster can be updated to
> available quorum and leader can be switched back to write mode.
> 
> ### Leader role assignment.
> 
> Be it a user-initiated assignment or an algorithmic one, it should use
> a common interface to assign the leader role. By now we implement a
> simplified machinery, still it should be feasible in the future to fit
> the algorithms, such as RAFT or proposed before box.ctl.promote.
> 
> A system space \_voting can be used to replicate the voting among the
> cluster, this space should be writable even for a read-only instance.
> This space should contain a CURRENT_LEADER_ID at any time - means the
> current leader, can be a zero value at the start. This is needed to
> compare the  appropriate vclock component below.
> 
> All replicas should be subscribed to changes in the space and react as
> described below.
> 
>  promote(ID) - should be called from a replica with it's own ID.
>    Writes an entry in the voting space about this ID is waiting for
>    votes from cluster. The entry should also contain the current
>    vclock[CURRENT_LEADER_ID] of the nominee.
> 
> Upon changes in the space each replica should compare its appropriate
> vclock component with submitted one and append its vote to the space:
> AYE in case nominee's vclock is bigger or equal to the replica's one,
> NAY otherwise.
> 
> As soon as nominee collects the quorum for being elected, it claims
> himself a Leader by switching in rw mode, writes CURRENT_LEADER_ID as
> a FORMER_LEADER_ID in the \_voting space and put its ID as a
> CURRENT_LEADER_ID. In case a NAY is appeared in the \_voting or a
> timeout predefined in box.cfg is reached, the nominee should remove
> it's entry from the space.
> 
> The leader should assure that number of available instances in the
> cluster is enough to achieve the quorum and proceed to step 3, otherwise
> the leader should report the situation of incomplete quorum, as
> described in the last paragraph of previous section.
> 
> The new Leader has to take the responsibility to replicate former Leader's
> entries from its WAL, obtain quorum and commit confirm messages referring
> to [FORMER_LEADER_ID, LSN] in its WAL, replicating to the cluster, after
> that it can start adding its own entries into the WAL.
> 
>  demote(ID) - should be called from the Leader instance.
>    The Leader has to switch in ro mode and wait for its' undo log is
>    empty. This effectively means all transactions are committed in the
>    cluster and it is safe pass the leadership. Then it should write
>    CURRENT_LEADER_ID as a FORMER_LEADER_ID and put CURRENT_LEADER_ID
>    into 0.

This looks like the box.ctl.promote() algorithm. Although I thought we decided
not to implement any kind of auto election here, no? Box.ctl.promote()
assumed that it does all the steps automatically, except choosing on which
node to call this function. This is why it was so complicated. It was
basically raft.

But yeah, as discussed verbally, this is a subject for improvement.

The way I see it is that we need to give a vclock-based algorithm of choosing
a new leader; tell how to stop replication from the old leader; allow reading
the vclock from replicas (basically, let the external service read box.info).

Since you said you think we should not provide an API for rolling back all
sync transactions, it looks like there is no need for a special new API. But
if we still want to allow rolling back all pending transactions of the old
leader on a new leader (like Mons wants), then yeah, it seems we would need a
new function. For example, box.ctl.sync_rollback() to roll back all pending
transactions, and box.ctl.sync_confirm() to confirm all pending ones. Perhaps
we could add more admin-level parameters, such as the replica_id with which to
write the 'confirm/rollback' message.
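
For instance, on a freshly promoted node the usage might look like this
(nothing below exists yet; old_leader_id is just a placeholder):

```
-- Sketch of the suggested admin calls; none of this API exists yet.
box.cfg{read_only = false}          -- the node takes the leader role

-- either finalize the old leader's pending synchronous transactions...
box.ctl.sync_confirm({replica_id = old_leader_id})

-- ...or drop them all and let the old leader rejoin later:
-- box.ctl.sync_rollback({replica_id = old_leader_id})
```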

> ### Recovery and failover.
> 
> Tarantool instance during reading WAL should postpone the undo log
> deletion until the 'confirm' is read. In case the WAL eof is achieved,
> the instance should keep undo log for all transactions that are waiting
> for a confirm entry until the role of the instance is set.
> 
> If this instance will be assigned a leader role then all transactions
> that have no corresponding confirm message should be confirmed (see the
> leader role assignment).
> 
> In case there's not enough replicas to set up a quorum the cluster can
> be switched into a read-only mode. Note, this can't be done by default
> since some of transactions can have confirmed state. It is up to human
> intervention to force rollback of all transactions that have no confirm
> and to put the cluster into a consistent state.

Above you said:

>> As soon as leader appears in a situation it has not enough replicas
>> to achieve quorum, the cluster should stop accepting any requests - both
>> write and read.

But here I see that the cluster is "switched into a read-only mode". So there
is a contradiction. And I think it should be resolved in favor of the
'read-only mode'. I explained why in the previous comments.

> In case the instance will be assigned a replica role, it may appear in
> a state that it has conflicting WAL entries, in case it recovered from a
> leader role and some of transactions didn't replicated to the current
> leader. This situation should be resolved through rejoin of the instance.
> 
> Consider an example below. Originally instance with ID1 was assigned a
> Leader role and the cluster had 2 replicas with quorum set to 2.
> 
> ```
> +---------------------+---------------------+---------------------+
> | ID1                 | ID2                 | ID3                 |
> | Leader              | Replica 1           | Replica 2           |
> +---------------------+---------------------+---------------------+
> | ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
> +---------------------+---------------------+---------------------+
> | ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
> +---------------------+---------------------+---------------------+
> | ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
> +---------------------+---------------------+---------------------+
> | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |                     |
> +---------------------+---------------------+---------------------+
> | ID1 Tx4             | ID1 Tx4             |                     |
> +---------------------+---------------------+---------------------+
> | ID1 Tx5             | ID1 Tx5             |                     |
> +---------------------+---------------------+---------------------+
> | ID1 Conf [ID1, Tx2] |                     |                     |
> +---------------------+---------------------+---------------------+
> | Tx6                 |                     |                     |
> +---------------------+---------------------+---------------------+
> | Tx7                 |                     |                     |
> +---------------------+---------------------+---------------------+
> ```
> Suppose at this moment the ID1 instance crashes. Then the ID2 instance
> should be assigned a leader role since its ID1 LSN is the biggest.
> Then this new leader will deliver its WAL to all replicas.
> 
> As soon as quorum for Tx4 and Tx5 will be obtained, it should write the
> corresponding Confirms to its WAL. Note that Tx are still uses ID1.
> ```
> +---------------------+---------------------+---------------------+
> | ID1                 | ID2                 | ID3                 |
> | (dead)              | Leader              | Replica 2           |
> +---------------------+---------------------+---------------------+
> | ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
> +---------------------+---------------------+---------------------+
> | ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
> +---------------------+---------------------+---------------------+
> | ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
> +---------------------+---------------------+---------------------+
> | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
> +---------------------+---------------------+---------------------+
> | ID1 Tx4             | ID1 Tx4             | ID1 Tx4             |
> +---------------------+---------------------+---------------------+
> | ID1 Tx5             | ID1 Tx5             | ID1 Tx5             |
> +---------------------+---------------------+---------------------+
> | ID1 Conf [ID1, Tx2] | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] |

Id1 -> ID1 (typo)

> +---------------------+---------------------+---------------------+
> | ID1 Tx6             |                     |                     |
> +---------------------+---------------------+---------------------+
> | ID1 Tx7             |                     |                     |
> +---------------------+---------------------+---------------------+
> ```
> After rejoining ID1 will figure out the inconsistency of its WAL: the
> last WAL entry it has is corresponding to Tx7, while in Leader's log the
> last entry with ID1 is Tx5. Confirm for a Tx can only be issued after
> appearance of the Tx on the majoirty of replicas, hence there's a good
> chances that ID1 will have inconsistency in its WAL covered with undo
> log. So, by rolling back all excessive Txs (in the example they are Tx6
> and Tx7) the ID1 can put its memtx and vynil in consistent state.

Yeah, but the problem is that the node1 has vclock[ID1] == 'Conf [ID1, Tx2]'.
This row can't be rolled back. So looks like node1 needs a rejoin.

> At this point a snapshot can be created at ID1 with appropriate WAL
> rotation. The old WAL should be renamed so it will not be reused in the
> future and can be kept for postmortem.
> ```
> +---------------------+---------------------+---------------------+
> | ID1                 | ID2                 | ID3                 |
> | Replica 1           | Leader              | Replica 2           |
> +---------------------+---------------------+---------------------+
> | ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
> +---------------------+---------------------+---------------------+
> | ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
> +---------------------+---------------------+---------------------+
> | ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
> +---------------------+---------------------+---------------------+
> | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
> +---------------------+---------------------+---------------------+
> | ID1 Tx4             | ID1 Tx4             | ID1 Tx4             |
> +---------------------+---------------------+---------------------+
> | ID1 Tx5             | ID1 Tx5             | ID1 Tx5             |
> +---------------------+---------------------+---------------------+
> |                     | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] |
> +---------------------+---------------------+---------------------+
> |                     | ID2 Tx1             | ID2 Tx1             |
> +---------------------+---------------------+---------------------+
> |                     | ID2 Tx2             | ID2 Tx2             |
> +---------------------+---------------------+---------------------+
> ```
> Although, in case undo log is not enough to cover the WAL inconsistence
> with the new leader, the ID1 needs a complete rejoin.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-25 23:41                   ` Vladislav Shpilevoy
@ 2020-05-27 21:17                     ` Sergey Ostanevich
  2020-06-09 16:19                       ` Sergey Ostanevich
  0 siblings, 1 reply; 53+ messages in thread
From: Sergey Ostanevich @ 2020-05-27 21:17 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

Hi!

Thanks for review!

Some comments below.
On 26 May 01:41, Vladislav Shpilevoy wrote:
> >>
> >> The reads should not be inconsistent - so that cluster will keep
> >> answering A or B for the same request. And in case we lost quorum we
> >> can't say for sure that all instances will answer the same.
> >>
> >> As we discussed it before, if leader appears in minor part of the
> >> cluster it can't issue rollback for all unconfirmed txns, since the
> >> majority will re-elect leader who will collect quorum for them. Means,
> >> we will appear is a state that cluster split in two. So the minor part
> >> should stop. Am I wrong here?
>
> Yeah, kinda. As long as you allow reading from replicas, you *always* will
> have a time slot, when you will be able to read different data for the
> same key on different replicas. Even with reads going through quorum.
>
> Because it is physically impossible to make nodes A and B start answering
> the same data at the same time moment. To notify them about a confirm you will
> send network messages, they will have not the same delay, won't be processed
> in the same moment of time, and some of them probably won't be even delivered.
>
> The only correct way to read the same - read from one node only. From the
> leader. And since this is not our way, it means we can't beat the 'inconsistent'
> reads problems. And I don't think we should. Because if somebody needs to do
> 'consistent' reads, they should read from leader only.
>
> In other words, the concept of 'consistency' is highly application dependent
> here. If we provide a way to read from replicas, we give flexibility to choose:
> read from leader only and see always the same data, or read from all, and have
> a possibility, that requests may see different data on different replicas
> sometimes.

So, it looks like we will follow the current approach: if a quorum can't
be achieved, the cluster goes into r/o mode. Objections?

> >
> > Replica should report a TXN application success to the leader via the
> > IPROTO explicitly to allow leader to collect the quorum for the TXN.
> > In case of application failure the replica has to disconnect from the
> > replication the same way as it is done now. The replica also has to
> > report its disconnection to the orchestrator. Further actions require
> > human intervention, since failure means either technical problem (such
> > as not enough space for WAL) that has to be resolved or an inconsistent
> > state that requires rejoin.
>
> I don't think a replica should report disconnection. Problem of
> disconnection is that it leads to loosing the connection. So it may be
> not able to connect to the orchestrator. Also it would be strange for
> tarantool to depend on some external service, to which it should report.
> This looks like the orchestrator's business how will it determine
> connectivity. Replica has nothing to do with it from its side.

An external service is something I expect to be useful for the first part
of the implementation - the quorum part. Definitely, we will move onward to
achieve some automation in leader election and failover. I just don't
expect this to be part of this RFC.

Anyway, the orchestrator has to ask the replica to figure out the connectivity
between the replica and the leader.

>
> > As soon as leader appears in a situation it has not enough replicas
> > to achieve quorum, the cluster should stop accepting any requests - both
> > write and read.
>
> The moment of not having enough replicas can't be determined properly.
> You may loose connection to replicas (they could be powered off), but
> TCP won't see that, and the node will continue working. The failure will
> be discovered only when a 'write' request will try to collect a quorum,
> or after a timeout will pass on not delivering heartbeats. During this
> time reads will be served. And there is no way to prevent them except
> collecting a quorum on that. See my first comment in this email for more
> details.
>
> On the summary: we can't stop accepting read requests.
>
> Btw, what to do with reads, which were *in-progress*, when the quorum
> was lost? Such as long vinyl reads.

But the quorum was in place at the start of it? Then, according to the
transaction manager's behavior, only older-version data will be available
for reads - meaning data that has collected a quorum.

>
> > The reason for this is that replication of transactions
> > can achieve quorum on replicas not visible to the leader. On the other
> > hand, leader can't achieve quorum with available minority. Leader has to
> > report the state and wait for human intervention.
>
> Yeah, but if the leader couldn't achieve a quorum on some transactions,
> they are not visible (assuming MVCC will work properly). So they can't
> be read anyway. And if a leader answered an error, it does not mean that
> the transaction wasn't replicated on the majority, as we discussed at some
> meeting, I don't already remember when. So here read allowance also works
> fine - not having some data visible and getting error at a sync transaction
> does not mean it is not committed. A user should be aware of that.

True, we discussed that we should guarantee only that if we answered
'Ok' then the data is present on a quorum of instances.

[...]

> >  demote(ID) - should be called from the Leader instance.
> >    The Leader has to switch in ro mode and wait for its' undo log is
> >    empty. This effectively means all transactions are committed in the
> >    cluster and it is safe pass the leadership. Then it should write
> >    CURRENT_LEADER_ID as a FORMER_LEADER_ID and put CURRENT_LEADER_ID
> >    into 0.
>
> This looks like box.ctl.promote() algorithm. Although I thought we decided
> not to implement any kind of auto election here, no? Box.ctl.promote()
> assumed, that it does all the steps automatically, except choosing on which
> node to call this function. This is what it was so complicated. It was
> basically raft.
>
> But yeah, as discussed verbally, this is a subject for improvement.

I personally would like to postpone the algorithm to the next stage
(Q3-Q4), but for now we should not mess things up so much that we have to
revamp them later. Hence, we have to elaborate the internals - such as the
_voting table I mentioned.

Even with the introduction of terms for each leader - as in Raft, for
example - we can still keep it in a replicated space, can't we?

>
> The way I see it is that we need to give vclock based algorithm of choosing
> a new leader; tell how to stop replication from the old leader; allow to
> read vclock from replicas (basically, let the external service read box.info).

That's the #1 question for me by now: how can a read-only replica quit listening
to a demoted leader, which may not be aware of its demotion? Still, for
efficiency it should be done w/o disconnection.

>
> Since you said you think we should not provide an API for all sync transactions
> rollback, it looks like no need in a special new API. But if we still want
> to allow to rollback all pending transactions of the old leader on a new leader
> (like Mons wants) then yeah, seems like we would need a new function. For example,
> box.ctl.sync_rollback() to rollback all pending. And box.ctl.sync_confirm() to
> confirm all pending. Perhaps we could add more admin-line parameters such as
> replica_id with which to write 'confirm/rollback' message.

I believe it's a good point to keep both approaches and perhaps select one
of the two in the configuration. This should resolve the issue of 'the
rest of the cluster confirms the old leader's transactions and because of
that the leader can't roll back'.

>
> > ### Recovery and failover.
> >
> > Tarantool instance during reading WAL should postpone the undo log
> > deletion until the 'confirm' is read. In case the WAL eof is achieved,
> > the instance should keep undo log for all transactions that are waiting
> > for a confirm entry until the role of the instance is set.
> >
> > If this instance will be assigned a leader role then all transactions
> > that have no corresponding confirm message should be confirmed (see the
> > leader role assignment).
> >
> > In case there's not enough replicas to set up a quorum the cluster can
> > be switched into a read-only mode. Note, this can't be done by default
> > since some of transactions can have confirmed state. It is up to human
> > intervention to force rollback of all transactions that have no confirm
> > and to put the cluster into a consistent state.
>
> Above you said:
>
> >> As soon as leader appears in a situation it has not enough replicas
> >> to achieve quorum, the cluster should stop accepting any requests - both
> >> write and read.
>
> But here I see, that the cluster "switched into a read-only mode". So there
> is a contradiction. And I think it should be resolved in favor of
> 'read-only mode'. I explained why in the previous comments.

My bad, I have been going back and forth on this problem already and tend
to allow r/o. Will update.

>
> > In case the instance will be assigned a replica role, it may appear in
> > a state that it has conflicting WAL entries, in case it recovered from a
> > leader role and some of transactions didn't replicated to the current
> > leader. This situation should be resolved through rejoin of the instance.
> >
> > Consider an example below. Originally instance with ID1 was assigned a
> > Leader role and the cluster had 2 replicas with quorum set to 2.
> >
> > ```
> > +---------------------+---------------------+---------------------+
> > | ID1                 | ID2                 | ID3                 |
> > | Leader              | Replica 1           | Replica 2           |
> > +---------------------+---------------------+---------------------+
> > | ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
> > +---------------------+---------------------+---------------------+
> > | ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
> > +---------------------+---------------------+---------------------+
> > | ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
> > +---------------------+---------------------+---------------------+
> > | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |                     |
> > +---------------------+---------------------+---------------------+
> > | ID1 Tx4             | ID1 Tx4             |                     |
> > +---------------------+---------------------+---------------------+
> > | ID1 Tx5             | ID1 Tx5             |                     |
> > +---------------------+---------------------+---------------------+
> > | ID1 Conf [ID1, Tx2] |                     |                     |
> > +---------------------+---------------------+---------------------+
> > | Tx6                 |                     |                     |
> > +---------------------+---------------------+---------------------+
> > | Tx7                 |                     |                     |
> > +---------------------+---------------------+---------------------+
> > ```
> > Suppose at this moment the ID1 instance crashes. Then the ID2 instance
> > should be assigned a leader role since its ID1 LSN is the biggest.
> > Then this new leader will deliver its WAL to all replicas.
> >
> > As soon as quorum for Tx4 and Tx5 will be obtained, it should write the
> > corresponding Confirms to its WAL. Note that Tx are still uses ID1.
> > ```
> > +---------------------+---------------------+---------------------+
> > | ID1                 | ID2                 | ID3                 |
> > | (dead)              | Leader              | Replica 2           |
> > +---------------------+---------------------+---------------------+
> > | ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
> > +---------------------+---------------------+---------------------+
> > | ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
> > +---------------------+---------------------+---------------------+
> > | ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
> > +---------------------+---------------------+---------------------+
> > | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
> > +---------------------+---------------------+---------------------+
> > | ID1 Tx4             | ID1 Tx4             | ID1 Tx4             |
> > +---------------------+---------------------+---------------------+
> > | ID1 Tx5             | ID1 Tx5             | ID1 Tx5             |
> > +---------------------+---------------------+---------------------+
> > | ID1 Conf [ID1, Tx2] | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] |
>
> Id1 -> ID1 (typo)

Thanks!

>
> > +---------------------+---------------------+---------------------+
> > | ID1 Tx6             |                     |                     |
> > +---------------------+---------------------+---------------------+
> > | ID1 Tx7             |                     |                     |
> > +---------------------+---------------------+---------------------+
> > ```
> > After rejoining ID1 will figure out the inconsistency of its WAL: the
> > last WAL entry it has is corresponding to Tx7, while in Leader's log the
> > last entry with ID1 is Tx5. Confirm for a Tx can only be issued after
> > appearance of the Tx on the majoirty of replicas, hence there's a good
> > chances that ID1 will have inconsistency in its WAL covered with undo
> > log. So, by rolling back all excessive Txs (in the example they are Tx6
> > and Tx7) the ID1 can put its memtx and vynil in consistent state.
>
> Yeah, but the problem is that the node1 has vclock[ID1] == 'Conf [ID1, Tx2]'.
> This row can't be rolled back. So looks like node1 needs a rejoin.

A confirm message is equivalent to a NOP - @sergepetrenko apparently does
the implementation exactly this way. So there's no need to roll it back in
an engine; rather, perform the xlog rotation before it.

>
> > At this point a snapshot can be created at ID1 with appropriate WAL
> > rotation. The old WAL should be renamed so it will not be reused in the
> > future and can be kept for postmortem.
> > ```
> > +---------------------+---------------------+---------------------+
> > | ID1                 | ID2                 | ID3                 |
> > | Replica 1           | Leader              | Replica 2           |
> > +---------------------+---------------------+---------------------+
> > | ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
> > +---------------------+---------------------+---------------------+
> > | ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
> > +---------------------+---------------------+---------------------+
> > | ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
> > +---------------------+---------------------+---------------------+
> > | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
> > +---------------------+---------------------+---------------------+
> > | ID1 Tx4             | ID1 Tx4             | ID1 Tx4             |
> > +---------------------+---------------------+---------------------+
> > | ID1 Tx5             | ID1 Tx5             | ID1 Tx5             |
> > +---------------------+---------------------+---------------------+
> > |                     | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] |
> > +---------------------+---------------------+---------------------+
> > |                     | ID2 Tx1             | ID2 Tx1             |
> > +---------------------+---------------------+---------------------+
> > |                     | ID2 Tx2             | ID2 Tx2             |
> > +---------------------+---------------------+---------------------+
> > ```
> > Although, in case undo log is not enough to cover the WAL inconsistence
> > with the new leader, the ID1 needs a complete rejoin.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-05-27 21:17                     ` Sergey Ostanevich
@ 2020-06-09 16:19                       ` Sergey Ostanevich
  2020-06-11 15:17                         ` Vladislav Shpilevoy
  0 siblings, 1 reply; 53+ messages in thread
From: Sergey Ostanevich @ 2020-06-09 16:19 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

Hi!

Please take a look at the latest changes, which include timeouts for
quorum collection and a heartbeat to ensure the leader is alive.


regards,
Sergos


* **Status**: In progress
* **Start date**: 31-03-2020
* **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\>
* **Issues**: https://github.com/tarantool/tarantool/issues/4842

## Summary

The aim of this RFC is to address the following list of problems
formulated at MRG planning meeting:
  - protocol backward compatibility to enable cluster upgrade w/o
    downtime
  - consistency of data on replica and leader
  - switch from leader to replica without data loss
  - up to date replicas to run read-only requests
  - ability to switch async replicas into sync ones and vice versa
  - guarantee of rollback on leader and sync replicas
  - simplicity of cluster orchestration

What this RFC is not:

  - high availability (HA) solution with automated failover, roles
    assignments and so on
  - master-master configuration support

## Background and motivation

There are a number of known implementations of consistent data presence in
a Tarantool cluster. They can be commonly described as the "wait for LSN"
technique. The biggest issue with this technique is the absence of
rollback guarantees on a replica in case of a transaction failure on the
master or on some of the replicas in the cluster.

To provide such capabilities, new functionality should be introduced in the
Tarantool core, with the requirements mentioned before - backward
compatibility and ease of cluster orchestration.

The cluster is expected to operate in a full-mesh topology, although
automated topology support is beyond the scope of this RFC.

## Detailed design

### Quorum commit

The main idea behind the proposal is to reuse the existing machinery as much
as possible. This ensures that functionality well tested and proven across
many instances in MRG and beyond is used. The transaction rollback
mechanism is in place and works for a WAL write failure. If we substitute
the WAL success with a new condition, named 'quorum' later in this
document, then no changes to the machinery are needed. The same is true
for the snapshot machinery, which allows creating a copy of the database
in memory for the whole period of the snapshot file write. Adding the
quorum here also minimizes changes.

Currently replication is represented by the following scheme:
```
Customer        Leader          WAL(L)        Replica        WAL(R)
   |------TXN----->|              |             |              |
   |               |              |             |              |
   |         [TXN undo log        |             |              |
   |            created]          |             |              |
   |               |              |             |              |
   |               |-----TXN----->|             |              |
   |               |              |             |              |
   |               |<---WAL Ok----|             |              |
   |               |              |             |              |
   |         [TXN undo log        |             |              |
   |           destroyed]         |             |              |
   |               |              |             |              |
   |<----TXN Ok----|              |             |              |
   |               |-------Replicate TXN------->|              |
   |               |              |             |              |
   |               |              |       [TXN undo log        |
   |               |              |          created]          |
   |               |              |             |              |
   |               |              |             |-----TXN----->|
   |               |              |             |              |
   |               |              |             |<---WAL Ok----|
   |               |              |             |              |
   |               |              |       [TXN undo log        |
   |               |              |         destroyed]         |
   |               |              |             |              |
```

To introduce the 'quorum' we have to receive confirmations from replicas
to decide whether the quorum is actually present. The leader collects the
necessary number of replica confirmations plus its own WAL success. This
state is named 'quorum' and gives the leader the right to complete the
customer's request. So the picture changes to:
```
Customer        Leader          WAL(L)        Replica        WAL(R)
   |------TXN----->|              |             |              |
   |               |              |             |              |
   |         [TXN undo log        |             |              |
   |            created]          |             |              |
   |               |              |             |              |
   |               |-----TXN----->|             |              |
   |               |              |             |              |
   |               |-------Replicate TXN------->|              |
   |               |              |             |              |
   |               |              |       [TXN undo log        |
   |               |<---WAL Ok----|          created]          |
   |               |              |             |              |
   |           [Waiting           |             |-----TXN----->|
   |         of a quorum]         |             |              |
   |               |              |             |<---WAL Ok----|
   |               |              |             |              |
   |               |<------Replication Ok-------|              |
   |               |              |             |              |
   |            [Quorum           |             |              |
   |           achieved]          |             |              |
   |               |              |             |              |
   |               |---Confirm--->|             |              |
   |               |              |             |              |
   |               |----------Confirm---------->|              |
   |               |              |             |              |
   |<---TXN Ok-----|              |             |---Confirm--->|
   |               |              |             |              |
   |         [TXN undo log        |       [TXN undo log        |
   |           destroyed]         |         destroyed]         |
   |               |              |             |              |
```

The quorum should be collected as a table of transactions waiting for a
quorum. The latest transaction that collects the quorum is considered
complete, as well as all transactions prior to it, since all transactions
should be applied in order. The leader writes a 'confirm' message to the
WAL that refers to the transaction's [LEADER_ID, LSN], and the confirm has
its own LSN. This confirm message is delivered to all replicas through the
existing replication mechanism.
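
Schematically, the leader-side bookkeeping could look like the Lua sketch
below; 'limbo', 'quorum' and write_confirm() are illustrative names, not
the actual implementation:

```lua
local limbo = {}   -- pending txns as {lsn = ..., acks = ...}, in LSN order
local quorum = 2   -- required number of copies, including the leader's WAL

-- called when a replica reports it has applied everything up to replica_lsn
local function on_replica_ack(replica_lsn)
    local confirmed_up_to
    for _, txn in ipairs(limbo) do
        if txn.lsn <= replica_lsn then
            txn.acks = txn.acks + 1
            if txn.acks >= quorum then
                confirmed_up_to = txn.lsn
            end
        end
    end
    if confirmed_up_to ~= nil then
        -- the latest txn that reached the quorum confirms all prior ones
        -- as well, since replicas apply transactions in order
        while limbo[1] ~= nil and limbo[1].lsn <= confirmed_up_to do
            table.remove(limbo, 1)
        end
        write_confirm(confirmed_up_to) -- writes 'confirm' [LEADER_ID, LSN]
    end
end
```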

A replica should report a TXN application success to the leader via
IPROTO explicitly to allow the leader to collect the quorum for the TXN.
In case of an application failure the replica has to disconnect from
replication the same way it is done now. The replica also has to
report its disconnection to the orchestrator. Further actions require
human intervention, since a failure means either a technical problem (such
as not enough space for the WAL) that has to be resolved, or an inconsistent
state that requires a rejoin.

Currently Tarantool provides no protection from dirty reads from memtx
during the TXN write into the WAL. So there is a chance that a TXN fails
to be written to the WAL while some read requests have already reported
success of the TXN. In this RFC we make no attempt to resolve dirty reads,
so they should be addressed by user code. Although we plan to introduce
MVCC machinery, similar to the one available in the vinyl engine, which
will resolve the dirty read problem.

### Connection liveness

There is a timeout-based mechanism in Tarantool that controls the
asynchronous replication, which uses the following config:
```
* replication_connect_timeout  = 4
* replication_sync_lag         = 10
* replication_sync_timeout     = 300
* replication_timeout          = 1
```
For backward compatibility, and to differentiate from async replication,
we should augment the configuration with the following:
```
* synchro_replication_heartbeat = 4
* synchro_replication_quorum_timeout = 4
```
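
For illustration only, an instance configuration could then look like the
snippet below; the synchro_* options are the proposed ones and do not
exist yet:

```lua
box.cfg{
    listen = 3301,
    replication = {
        'replicator:password@node1:3301',
        'replicator:password@node2:3301',
        'replicator:password@node3:3301',
    },
    replication_timeout = 1,
    -- proposed options:
    synchro_replication_heartbeat = 4,
    synchro_replication_quorum_timeout = 4,
}
```
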
The leader should send a heartbeat every synchro_replication_heartbeat
interval if no other messages were sent. Replicas should respond to the
heartbeat the same way they do now. As soon as the leader gets no response
for another heartbeat interval, it should consider the replica lost.
As soon as the leader appears in a situation where it does not have enough
replicas to achieve a quorum, it should stop accepting write requests.
There is an option for the leader to roll back to the latest transaction
that has a quorum: the leader issues a 'rollback' message referring to the
[LEADER_ID, LSN], where the LSN is that of the first transaction in the
leader's undo log. The rollback message, replicated to the available part
of the cluster, will put it in a consistent state. After that the cluster
configuration can be updated to a new available quorum and the leader can
be switched back to write mode.

During the quorum collection it can happen that some replicas become
unavailable for some reason, so the leader should wait at most
synchro_replication_quorum_timeout, after which it issues a Rollback
pointing to the oldest TXN in the waiting list.

### Leader role assignment.

Be it a user-initiated assignment or an algorithmic one, it should use
a common interface to assign the leader role. For now we implement
simplified machinery, but it should be feasible in the future to fit in
algorithms such as Raft or the previously proposed box.ctl.promote.

A system space \_voting can be used to replicate the voting among the
cluster; this space should be writable even for a read-only instance.
The space should contain a CURRENT_LEADER_ID at any time - meaning the
current leader - which can be zero at the start. This is needed to
compare the appropriate vclock component below.

All replicas should be subscribed to changes in the space and react as
described below.

 promote(ID) - should be called from a replica with its own ID.
   Writes an entry in the voting space stating that this ID is waiting for
   votes from the cluster. The entry should also contain the current
   vclock[CURRENT_LEADER_ID] of the nominee.

Upon a change in the space each replica should compare its appropriate
vclock component with the submitted one and append its vote to the space:
AYE in case the nominee's vclock is greater than or equal to the replica's
one, NAY otherwise.

As soon as the nominee collects the quorum for being elected, it claims
itself the Leader by switching into rw mode, writes CURRENT_LEADER_ID as
FORMER_LEADER_ID in the \_voting space and puts its own ID as
CURRENT_LEADER_ID. In case a NAY appears in \_voting, or a timeout
predefined in box.cfg is reached, the nominee should remove its entry
from the space.
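
To make the vote computation concrete, a replica-side sketch could look as
follows; the \_voting entry layout and append_vote() are made-up names,
not part of the proposed API:

```lua
-- decide how to vote for a nominee, given the nominee's
-- vclock[CURRENT_LEADER_ID] value taken from its promote entry
local function vote_for(leader_id, nominee_lsn)
    local my_lsn = box.info.vclock[leader_id] or 0
    if nominee_lsn >= my_lsn then
        return 'AYE'
    else
        return 'NAY'
    end
end

-- called whenever a new promote entry appears in the _voting space
local function on_promote_entry(entry)
    local vote = vote_for(entry.leader_id, entry.lsn)
    -- append_vote() is a hypothetical helper writing the vote into _voting
    append_vote(entry.nominee_id, box.info.id, vote)
end
```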

The leader should ensure that the number of available instances in the
cluster is enough to achieve the quorum and then proceed to step 3;
otherwise the leader should report the incomplete quorum situation, as
described in the last paragraph of the previous section.

The new Leader has to take responsibility for replicating the former Leader's
entries from its WAL, obtaining a quorum and committing confirm messages
referring to [FORMER_LEADER_ID, LSN] in its WAL, replicating them to the
cluster; after that it can start adding its own entries into the WAL.

 demote(ID) - should be called from the Leader instance.
   The Leader has to switch into ro mode and wait until its undo log is
   empty. This effectively means all transactions are committed in the
   cluster and it is safe to pass the leadership. Then it should write
   CURRENT_LEADER_ID as FORMER_LEADER_ID and set CURRENT_LEADER_ID
   to 0.

### Recovery and failover.

A Tarantool instance, while reading the WAL, should postpone the undo log
deletion until the 'confirm' is read. In case the WAL eof is reached,
the instance should keep the undo log for all transactions that are waiting
for a confirm entry until the role of the instance is set.

If this instance is assigned a leader role, then all transactions
that have no corresponding confirm message should be confirmed (see the
leader role assignment).

In case there are not enough replicas to set up a quorum, the cluster can
be switched into a read-only mode. Note, this can't be done by default
since some of the transactions can be in a confirmed state. It is up to
human intervention to force a rollback of all transactions that have no
confirm and to put the cluster into a consistent state.

In case the instance is assigned a replica role, it may appear in a state
where it has conflicting WAL entries - if it recovered from a leader role
and some of its transactions were not replicated to the current leader.
This situation should be resolved through a rejoin of the instance.

Consider the example below. Originally the instance with ID1 was assigned
the Leader role and the cluster had 2 replicas with the quorum set to 2.

```
+---------------------+---------------------+---------------------+
| ID1                 | ID2                 | ID3                 |
| Leader              | Replica 1           | Replica 2           |
+---------------------+---------------------+---------------------+
| ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
+---------------------+---------------------+---------------------+
| ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
+---------------------+---------------------+---------------------+
| ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |                     |
+---------------------+---------------------+---------------------+
| ID1 Tx4             | ID1 Tx4             |                     |
+---------------------+---------------------+---------------------+
| ID1 Tx5             | ID1 Tx5             |                     |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx2] |                     |                     |
+---------------------+---------------------+---------------------+
| Tx6                 |                     |                     |
+---------------------+---------------------+---------------------+
| Tx7                 |                     |                     |
+---------------------+---------------------+---------------------+
```
Suppose at this moment the ID1 instance crashes. Then the ID2 instance
should be assigned the leader role, since its ID1 LSN is the biggest.
This new leader will then deliver its WAL to all replicas.

As soon as the quorum for Tx4 and Tx5 is obtained, the new leader should
write the corresponding Confirms to its WAL. Note that the Txs still use ID1.
```
+---------------------+---------------------+---------------------+
| ID1                 | ID2                 | ID3                 |
| (dead)              | Leader              | Replica 2           |
+---------------------+---------------------+---------------------+
| ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
+---------------------+---------------------+---------------------+
| ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
+---------------------+---------------------+---------------------+
| ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
+---------------------+---------------------+---------------------+
| ID1 Tx4             | ID1 Tx4             | ID1 Tx4             |
+---------------------+---------------------+---------------------+
| ID1 Tx5             | ID1 Tx5             | ID1 Tx5             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx2] | ID2 Conf [ID1, Tx5] | ID2 Conf [ID1, Tx5] |
+---------------------+---------------------+---------------------+
| ID1 Tx6             |                     |                     |
+---------------------+---------------------+---------------------+
| ID1 Tx7             |                     |                     |
+---------------------+---------------------+---------------------+
```
After rejoining, ID1 will figure out the inconsistency of its WAL: the
last WAL entry it has corresponds to Tx7, while in the Leader's log the
last entry with ID1 is Tx5. A Confirm for a Tx can only be issued after
the Tx appears on the majority of replicas, hence there is a good chance
that the inconsistency in ID1's WAL is covered by the undo log. So, by
rolling back all excessive Txs (in the example they are Tx6 and Tx7), ID1
can put its memtx and vinyl in a consistent state.

At this point a snapshot can be created at ID1 with an appropriate WAL
rotation. The old WAL should be renamed so it will not be reused in the
future and can be kept for a postmortem.
```
+---------------------+---------------------+---------------------+
| ID1                 | ID2                 | ID3                 |
| Replica 1           | Leader              | Replica 2           |
+---------------------+---------------------+---------------------+
| ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
+---------------------+---------------------+---------------------+
| ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
+---------------------+---------------------+---------------------+
| ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
+---------------------+---------------------+---------------------+
| ID1 Tx4             | ID1 Tx4             | ID1 Tx4             |
+---------------------+---------------------+---------------------+
| ID1 Tx5             | ID1 Tx5             | ID1 Tx5             |
+---------------------+---------------------+---------------------+
|                     | ID2 Conf [ID1, Tx5] | ID2 Conf [ID1, Tx5] |
+---------------------+---------------------+---------------------+
|                     | ID2 Tx1             | ID2 Tx1             |
+---------------------+---------------------+---------------------+
|                     | ID2 Tx2             | ID2 Tx2             |
+---------------------+---------------------+---------------------+
```
However, in case the undo log is not enough to cover the WAL inconsistency
with the new leader, ID1 needs a complete rejoin.

### Snapshot generation.

We can also reuse the current machinery of snapshot generation. Upon
receiving a request to create a snapshot, an instance should request a
read view for the current commit operation. However, the start of the
snapshot generation should be postponed until this commit operation
receives its confirmation. In case the operation is rolled back, the
snapshot generation should be aborted and restarted using the current
transaction after the rollback is complete.

After the snapshot is created, the WAL should start from the first
operation that follows the commit operation the snapshot is generated for.
That means the WAL will contain 'confirm' messages that refer to
transactions that are not present in the WAL. Apparently, we have to allow
this for the case where a 'confirm' refers to a transaction with an LSN
less than the first entry in the WAL.

In case the master becomes unavailable, a replica still has to be able to
create a snapshot. The replica can perform a rollback for all transactions
that are not confirmed and claim its LSN as that of the latest confirmed
txn. Then it can create a snapshot in the regular way and start with a
blank xlog file. All rolled back transactions will reappear through
regular replication in case the master comes back later on.
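
In terms of the calls discussed earlier in this thread, the procedure
could boil down to the sketch below; sync_rollback() is the proposed, not
yet existing, call, while box.snapshot() is the existing API:

```lua
-- on a replica that lost its master:
box.ctl.sync_rollback()   -- roll back everything without a 'confirm' (proposed)
box.snapshot()            -- regular snapshot; the WAL continues in a blank xlog
```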

### Asynchronous replication.

Along with synchronous replicas the cluster can contain asynchronous
replicas. That means an async replica doesn't reply to the leader with
errors, since it does not contribute to the quorum. Still, async
replicas have to follow the new WAL operations, such as keeping rollback
info until the 'confirm' message is received. This is essential for the
case of a 'rollback' message appearing in the WAL. This message assumes
the replica is able to perform all the necessary rollback by itself. The
cluster information should contain an explicit notification of each
replica's operation mode.

### Synchronous replication enabling.

Synchronous operation can be required for a set of spaces in the data
scheme. That means only transactions that contain data modifications for
these spaces should require a quorum. Such transactions are named
synchronous. As soon as the last operation of a synchronous transaction
appears in the leader's WAL, it will cause all following transactions - no
matter whether they are synchronous or not - to wait for the quorum. In
case the quorum is not achieved, the 'rollback' operation will cause a
rollback of all transactions after the synchronous one. This will ensure a
consistent state of the data both on the leader and on the replicas. In
case the user doesn't require synchronous operation for any space, then no
changes to the WAL generation and replication will appear.

The cluster description should contain an explicit attribute for each
replica to denote that it participates in synchronous activities. Also the
description should contain a criterion on how many replica responses are
needed to achieve the quorum.
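
For illustration only, the per-space requirement could be expressed as a
space option; the 'is_sync' marker below is hypothetical and not part of
this proposal's API:

```lua
-- transactions touching 'bank_accounts' would wait for the quorum,
-- transactions touching only ordinary spaces would not
box.schema.space.create('bank_accounts', {is_sync = true})
```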

## Rationale and alternatives

There is an implementation of synchronous replication as part of the
gh-980 activities, but it is not in a state to get into the product.
Moreover, it intentionally breaks backward compatibility, which is a
prerequisite for this proposal.

On 28 May 00:17, Sergey Ostanevich wrote:
> Hi!
> 
> Thanks for review!
> 
> Some comments below.
> On 26 May 01:41, Vladislav Shpilevoy wrote:
> > >>
> > >> The reads should not be inconsistent - so that cluster will keep
> > >> answering A or B for the same request. And in case we lost quorum we
> > >> can't say for sure that all instances will answer the same.
> > >>
> > >> As we discussed it before, if leader appears in minor part of the
> > >> cluster it can't issue rollback for all unconfirmed txns, since the
> > >> majority will re-elect leader who will collect quorum for them. Means,
> > >> we will appear is a state that cluster split in two. So the minor part
> > >> should stop. Am I wrong here?
> >
> > Yeah, kinda. As long as you allow reading from replicas, you *always* will
> > have a time slot, when you will be able to read different data for the
> > same key on different replicas. Even with reads going through quorum.
> >
> > Because it is physically impossible to make nodes A and B start answering
> > the same data at the same time moment. To notify them about a confirm you will
> > send network messages, they will have not the same delay, won't be processed
> > in the same moment of time, and some of them probably won't be even delivered.
> >
> > The only correct way to read the same - read from one node only. From the
> > leader. And since this is not our way, it means we can't beat the 'inconsistent'
> > reads problems. And I don't think we should. Because if somebody needs to do
> > 'consistent' reads, they should read from leader only.
> >
> > In other words, the concept of 'consistency' is highly application dependent
> > here. If we provide a way to read from replicas, we give flexibility to choose:
> > read from leader only and see always the same data, or read from all, and have
> > a possibility, that requests may see different data on different replicas
> > sometimes.
> 
> So, it looks like we will follow the current approach: if quorum can't
> be achieved, cluster appears in r/o mode. Objections?
> 
> > >
> > > Replica should report a TXN application success to the leader via the
> > > IPROTO explicitly to allow leader to collect the quorum for the TXN.
> > > In case of application failure the replica has to disconnect from the
> > > replication the same way as it is done now. The replica also has to
> > > report its disconnection to the orchestrator. Further actions require
> > > human intervention, since failure means either technical problem (such
> > > as not enough space for WAL) that has to be resolved or an inconsistent
> > > state that requires rejoin.
> >
> > I don't think a replica should report disconnection. Problem of
> > disconnection is that it leads to loosing the connection. So it may be
> > not able to connect to the orchestrator. Also it would be strange for
> > tarantool to depend on some external service, to which it should report.
> > This looks like the orchestrator's business how will it determine
> > connectivity. Replica has nothing to do with it from its side.
> 
> External service is something I expect to be useful for the first part
> of implementation - the quorum part. Definitely, we will move onward to
> achieve some automation in leader election and failover. I just don't
> expect this to be part of this RFC.
> 
> Anyways, orchestrator has to ask replica to figure out the connectivity
> between replica and leader.
> 
> >
> > > As soon as leader appears in a situation it has not enough replicas
> > > to achieve quorum, the cluster should stop accepting any requests - both
> > > write and read.
> >
> > The moment of not having enough replicas can't be determined properly.
> > You may loose connection to replicas (they could be powered off), but
> > TCP won't see that, and the node will continue working. The failure will
> > be discovered only when a 'write' request will try to collect a quorum,
> > or after a timeout will pass on not delivering heartbeats. During this
> > time reads will be served. And there is no way to prevent them except
> > collecting a quorum on that. See my first comment in this email for more
> > details.
> >
> > On the summary: we can't stop accepting read requests.
> >
> > Btw, what to do with reads, which were *in-progress*, when the quorum
> > was lost? Such as long vinyl reads.
> 
> But the quorum was in place at the start of it? Then according to
> transaction manager behavior only older version data will be available
> for read - means data that collected quorum.
> 
> >
> > > The reason for this is that replication of transactions
> > > can achieve quorum on replicas not visible to the leader. On the other
> > > hand, leader can't achieve quorum with available minority. Leader has to
> > > report the state and wait for human intervention.
> >
> > Yeah, but if the leader couldn't achieve a quorum on some transactions,
> > they are not visible (assuming MVCC will work properly). So they can't
> > be read anyway. And if a leader answered an error, it does not mean that
> > the transaction wasn't replicated on the majority, as we discussed at some
> > meeting, I don't already remember when. So here read allowance also works
> > fine - not having some data visible and getting error at a sync transaction
> > does not mean it is not committed. A user should be aware of that.
> 
> True, we discussed that we should guarantee only that if we answered
> 'Ok' then data is present in quorum number of instances.
> 
> [...]
> 
> > >  demote(ID) - should be called from the Leader instance.
> > >    The Leader has to switch in ro mode and wait for its' undo log is
> > >    empty. This effectively means all transactions are committed in the
> > >    cluster and it is safe pass the leadership. Then it should write
> > >    CURRENT_LEADER_ID as a FORMER_LEADER_ID and put CURRENT_LEADER_ID
> > >    into 0.
> >
> > This looks like box.ctl.promote() algorithm. Although I thought we decided
> > not to implement any kind of auto election here, no? Box.ctl.promote()
> > assumed, that it does all the steps automatically, except choosing on which
> > node to call this function. This is what it was so complicated. It was
> > basically raft.
> >
> > But yeah, as discussed verbally, this is a subject for improvement.
> 
> I personally would like to postpone the algorithm should be postponed
> for the next stage (Q3-Q4) but now we should not mess up too much to
> revamp. Hence, we have to elaborate the internals - such as _voting
> table I mentioned.
> 
> Even with introduction of terms for each leader - as in RAFT for example
> - we still can keep it in a replicated space, isn't it?
> 
> >
> > The way I see it is that we need to give vclock based algorithm of choosing
> > a new leader; tell how to stop replication from the old leader; allow to
> > read vclock from replicas (basically, let the external service read box.info).
> 
> That's the #1 for me by now: how a read-only replica can quit listening
> to a demoted leader, which can be not aware of its demotion? Still, for
> efficiency it should be done w/o disconnection.
> 
> >
> > Since you said you think we should not provide an API for all sync transactions
> > rollback, it looks like no need in a special new API. But if we still want
> > to allow to rollback all pending transactions of the old leader on a new leader
> > (like Mons wants) then yeah, seems like we would need a new function. For example,
> > box.ctl.sync_rollback() to rollback all pending. And box.ctl.sync_confirm() to
> > confirm all pending. Perhaps we could add more admin-line parameters such as
> > replica_id with which to write 'confirm/rollback' message.
> 
> I believe it's a good point to keep two approaches and perhaps set one
> of the two in configuration. This should resolve the issue with 'the
> rest of the cluster confirms old leader's transactions and because of it
> leader can't rollback'.
> 
> >
> > > ### Recovery and failover.
> > >
> > > Tarantool instance during reading WAL should postpone the undo log
> > > deletion until the 'confirm' is read. In case the WAL eof is achieved,
> > > the instance should keep undo log for all transactions that are waiting
> > > for a confirm entry until the role of the instance is set.
> > >
> > > If this instance will be assigned a leader role then all transactions
> > > that have no corresponding confirm message should be confirmed (see the
> > > leader role assignment).
> > >
> > > In case there's not enough replicas to set up a quorum the cluster can
> > > be switched into a read-only mode. Note, this can't be done by default
> > > since some of transactions can have confirmed state. It is up to human
> > > intervention to force rollback of all transactions that have no confirm
> > > and to put the cluster into a consistent state.
> >
> > Above you said:
> >
> > >> As soon as leader appears in a situation it has not enough replicas
> > >> to achieve quorum, the cluster should stop accepting any requests - both
> > >> write and read.
> >
> > But here I see, that the cluster "switched into a read-only mode". So there
> > is a contradiction. And I think it should be resolved in favor of
> > 'read-only mode'. I explained why in the previous comments.
> 
> My bad, I was moving around this problem already and tend to allow r/o.
> Will update.
> 
> >
> > > In case the instance will be assigned a replica role, it may appear in
> > > a state that it has conflicting WAL entries, in case it recovered from a
> > > leader role and some of transactions didn't replicated to the current
> > > leader. This situation should be resolved through rejoin of the instance.
> > >
> > > Consider an example below. Originally instance with ID1 was assigned a
> > > Leader role and the cluster had 2 replicas with quorum set to 2.
> > >
> > > ```
> > > +---------------------+---------------------+---------------------+
> > > | ID1                 | ID2                 | ID3                 |
> > > | Leader              | Replica 1           | Replica 2           |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |                     |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx4             | ID1 Tx4             |                     |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx5             | ID1 Tx5             |                     |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Conf [ID1, Tx2] |                     |                     |
> > > +---------------------+---------------------+---------------------+
> > > | Tx6                 |                     |                     |
> > > +---------------------+---------------------+---------------------+
> > > | Tx7                 |                     |                     |
> > > +---------------------+---------------------+---------------------+
> > > ```
> > > Suppose at this moment the ID1 instance crashes. Then the ID2 instance
> > > should be assigned a leader role since its ID1 LSN is the biggest.
> > > Then this new leader will deliver its WAL to all replicas.
> > >
> > > As soon as quorum for Tx4 and Tx5 will be obtained, it should write the
> > > corresponding Confirms to its WAL. Note that Tx are still uses ID1.
> > > ```
> > > +---------------------+---------------------+---------------------+
> > > | ID1                 | ID2                 | ID3                 |
> > > | (dead)              | Leader              | Replica 2           |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx4             | ID1 Tx4             | ID1 Tx4             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx5             | ID1 Tx5             | ID1 Tx5             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Conf [ID1, Tx2] | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] |
> >
> > Id1 -> ID1 (typo)
> 
> Thanks!
> 
> >
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx6             |                     |                     |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx7             |                     |                     |
> > > +---------------------+---------------------+---------------------+
> > > ```
> > > After rejoining, ID1 will figure out the inconsistency of its WAL: the
> > > last WAL entry it has corresponds to Tx7, while in the Leader's log the
> > > last entry with ID1 is Tx5. A Confirm for a Tx can only be issued after
> > > the Tx appears on the majority of replicas, hence there is a good
> > > chance that the inconsistency in ID1's WAL is still covered by the undo
> > > log. So, by rolling back all excessive Txs (in the example they are Tx6
> > > and Tx7) ID1 can put its memtx and vinyl state back into consistency.
> >
> > Yeah, but the problem is that the node1 has vclock[ID1] == 'Conf [ID1, Tx2]'.
> > This row can't be rolled back. So looks like node1 needs a rejoin.
> 
> A Confirm message is equivalent to a NOP - @sergepetrenko apparently
> implements it exactly this way. So there's no need to roll it back in
> an engine; rather, the xlog rotation is performed before it.
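> 
> To make the decision concrete, here is a minimal sketch in plain Lua
> (the function and argument names are made up for illustration, this is
> not Tarantool code): roll the WAL tail back if the undo log still covers
> the divergence, otherwise fall back to a complete rejoin.
> 
> ```lua
> -- Abstract model of the post-crash decision described above.
> -- own_last_lsn     - last LSN with the old leader's ID in the local WAL (Tx7)
> -- confirmed_lsn    - last LSN of that ID confirmed by the new leader (Tx5)
> -- undo_covers_from - smallest LSN still covered by the in-memory undo log
> local function rejoin_decision(own_last_lsn, confirmed_lsn, undo_covers_from)
>     if own_last_lsn <= confirmed_lsn then
>         return 'follow'             -- nothing excessive in the WAL, just follow
>     elseif undo_covers_from <= confirmed_lsn + 1 then
>         -- every excessive entry (Tx6, Tx7 in the example) can be undone,
>         -- then the xlog is rotated before them and a snapshot is taken
>         return 'rollback-and-rotate'
>     else
>         return 'rejoin'             -- undo log is not enough, complete rejoin
>     end
> end
> 
> print(rejoin_decision(7, 5, 6))     -- rollback-and-rotate
> print(rejoin_decision(7, 5, 7))     -- rejoin
> ```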
> 
> >
> > > At this point a snapshot can be created at ID1 with appropriate WAL
> > > rotation. The old WAL should be renamed so it will not be reused in the
> > > future and can be kept for postmortem.
> > > ```
> > > +---------------------+---------------------+---------------------+
> > > | ID1                 | ID2                 | ID3                 |
> > > | Replica 1           | Leader              | Replica 2           |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx4             | ID1 Tx4             | ID1 Tx4             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx5             | ID1 Tx5             | ID1 Tx5             |
> > > +---------------------+---------------------+---------------------+
> > > |                     | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] |
> > > +---------------------+---------------------+---------------------+
> > > |                     | ID2 Tx1             | ID2 Tx1             |
> > > +---------------------+---------------------+---------------------+
> > > |                     | ID2 Tx2             | ID2 Tx2             |
> > > +---------------------+---------------------+---------------------+
> > > ```
> > > However, in case the undo log is not enough to cover the WAL inconsistency
> > > with the new leader, ID1 needs a complete rejoin.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-06-09 16:19                       ` Sergey Ostanevich
@ 2020-06-11 15:17                         ` Vladislav Shpilevoy
  2020-06-12 20:31                           ` Sergey Ostanevich
  0 siblings, 1 reply; 53+ messages in thread
From: Vladislav Shpilevoy @ 2020-06-11 15:17 UTC (permalink / raw)
  To: Sergey Ostanevich; +Cc: tarantool-patches

Hi! Thanks for the updates!

> ### Connection liveness
> 
> There is a timeout-based mechanism in Tarantool that controls the
> asynchronous replication, which uses the following config:
> ```
> * replication_connect_timeout  = 4
> * replication_sync_lag         = 10
> * replication_sync_timeout     = 300
> * replication_timeout          = 1
> ```
> For backward compatibility and to differentiate the async replication
> we should augment the configuration with the following:
> ```
> * synchro_replication_heartbeat = 4

Heartbeats are already being sent. I don't see any sense in adding a
second heartbeat option.

> * synchro_replication_quorum_timeout = 4

Since this is a replication option, it should start with the replication_
prefix.

> ```
> Leader should send a heartbeat every synchro_replication_heartbeat if
> there were no messages sent. Replicas should respond to the heartbeat
> just the same way as they do it now. As soon as Leader has no response
> for another heartbeat interval, it should consider the replica is lost.

All of that is already done in the regular heartbeats, not related nor
bound to any synchronous activities. Just like failure detection should be.

> As soon as leader appears in a situation it has not enough replicas
> to achieve quorum, it should stop accepting write requests. There's an
> option for leader to rollback to the latest transaction that has quorum:
> leader issues a 'rollback' message referring to the [LEADER_ID, LSN]
> where LSN is of the first transaction in the leader's undo log.

What is that option?

> The rollback message replicated to the available cluster will put it in a
> consistent state. After that configuration of the cluster can be
> updated to a new available quorum and leader can be switched back to
> write mode.
> 
> During the quorum collection it can happen that some of replicas become
> unavailable due to some reason, so leader should wait at most for
> synchro_replication_quorum_timeout after which it issues a Rollback
> pointing to the oldest TXN in the waiting list.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-06-11 15:17                         ` Vladislav Shpilevoy
@ 2020-06-12 20:31                           ` Sergey Ostanevich
  0 siblings, 0 replies; 53+ messages in thread
From: Sergey Ostanevich @ 2020-06-12 20:31 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

Hi!
Thanks for the review, attaching a diff. The full version is available at the
branch
https://github.com/tarantool/tarantool/blob/sergos/quorum-based-synchro/


On 11 Jun 17:17, Vladislav Shpilevoy wrote:
> Hi! Thanks for the updates!
> 
> > ### Connection liveness
> > 
> > There is a timeout-based mechanism in Tarantool that controls the
> > asynchronous replication, which uses the following config:
> > ```
> > * replication_connect_timeout  = 4
> > * replication_sync_lag         = 10
> > * replication_sync_timeout     = 300
> > * replication_timeout          = 1
> > ```
> > For backward compatibility and to differentiate the async replication
> > we should augment the configuration with the following:
> > ```
> > * synchro_replication_heartbeat = 4
> 
> Heartbeats are already being sent. I don't see any sense in adding a
> second heartbeat option.

I had an idea that synchronous replication can co-exist with the async
one, so each would need independent tuning. Now I realize that sending
two types of heartbeats is too much, so I'll drop this one.

> 
> > * synchro_replication_quorum_timeout = 4
> 
> Since this is a replication option, it should start with the replication_
> prefix.

A number of existing options are already very similar in naming, such
as replication_sync_timeout, replication_sync_lag and even
replication_connect_quorum. I expect to resolve the ambiguity by
introducing a new prefix, synchro_replication.
The drawback is that the options reused from the async mode would not
be so clearly linked to the sync one.

> 
> > ```
> > Leader should send a heartbeat every synchro_replication_heartbeat if
> > there were no messages sent. Replicas should respond to the heartbeat
> > just the same way as they do it now. As soon as Leader has no response
> > for another heartbeat interval, it should consider the replica is lost.
> 
> All of that is already done in the regular heartbeats, not related nor
> bound to any synchronous activities. Just like failure detection should be.
> 
> > As soon as leader appears in a situation it has not enough replicas
> > to achieve quorum, it should stop accepting write requests. There's an
> > option for leader to rollback to the latest transaction that has quorum:
> > leader issues a 'rollback' message referring to the [LEADER_ID, LSN]
> > where LSN is of the first transaction in the leader's undo log.
> 
> What is that option?

Good catch, thanks! This option was introduced to get to a consistent
state with the replicas. However, if the Leader waits longer than the
timeout for a quorum, it will roll back anyway, so I will remove the
mention of this.
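
To be explicit about the intended behaviour, here is a toy model in
plain Lua (all names are illustrative, not the actual implementation):
given the times at which replica acknowledgements arrive for one
transaction, the leader either confirms it within the timeout or issues
a Rollback pointing to the oldest TXN in the waiting list.

```lua
-- Toy model: decide the outcome for a single synchronous transaction.
-- ack_times - arrival times (seconds) of replica acks for this TXN
-- quorum    - required quorum size, the leader's own WAL write included
-- timeout   - the proposed quorum collection timeout
local function quorum_outcome(ack_times, quorum, timeout)
    table.sort(ack_times)
    local acks_needed = quorum - 1      -- minus the leader's own WAL write
    local t = ack_times[acks_needed]
    if acks_needed <= 0 or (t ~= nil and t <= timeout) then
        return 'confirm'
    end
    -- Not enough acks in time: Rollback the oldest waiting TXN, which
    -- implicitly rolls back this one and every younger transaction.
    return 'rollback'
end

print(quorum_outcome({0.5, 2.1}, 3, 4))  -- confirm
print(quorum_outcome({0.5},      3, 4))  -- rollback
```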

> 
> > The rollback message replicated to the available cluster will put it in a
> > consistent state. After that configuration of the cluster can be
> > updated to a new available quorum and leader can be switched back to
> > write mode.
> > 
> > During the quorum collection it can happen that some of replicas become
> > unavailable due to some reason, so leader should wait at most for
> > synchro_replication_quorum_timeout after which it issues a Rollback
> > pointing to the oldest TXN in the waiting list.

diff --git a/doc/rfc/quorum-based-synchro.md b/doc/rfc/quorum-based-synchro.md
index c7dcf56b5..0a92642fd 100644
--- a/doc/rfc/quorum-based-synchro.md
+++ b/doc/rfc/quorum-based-synchro.md
@@ -83,9 +83,10 @@ Customer        Leader          WAL(L)        Replica        WAL(R)
 
 To introduce the 'quorum' we have to receive confirmation from replicas
 to make a decision on whether the quorum is actually present. Leader
-collects necessary amount of replicas confirmation plus its own WAL
-success. This state is named 'quorum' and gives leader the right to
-complete the customers' request. So the picture will change to:
+collects replication_synchro_quorum-1 of replicas confirmation and its
+own WAL success. This state is named 'quorum' and gives leader the
+right to complete the customers' request. So the picture will change
+to:
 ```
 Customer        Leader          WAL(L)        Replica        WAL(R)
    |------TXN----->|              |             |              |
@@ -158,26 +159,21 @@ asynchronous replication, which uses the following config:
 For backward compatibility and to differentiate the async replication
 we should augment the configuration with the following:
 ```
-* synchro_replication_heartbeat = 4
-* synchro_replication_quorum_timeout = 4
+* replication_synchro_quorum_timeout = 4
+* replication_synchro_quorum = 4
 ```
-Leader should send a heartbeat every synchro_replication_heartbeat if
-there were no messages sent. Replicas should respond to the heartbeat
-just the same way as they do it now. As soon as Leader has no response
-for another heartbeat interval, it should consider the replica is lost.
-As soon as leader appears in a situation it has not enough replicas
-to achieve quorum, it should stop accepting write requests. There's an
-option for leader to rollback to the latest transaction that has quorum:
-leader issues a 'rollback' message referring to the [LEADER_ID, LSN]
-where LSN is of the first transaction in the leader's undo log. The
-rollback message replicated to the available cluster will put it in a
-consistent state. After that configuration of the cluster can be
-updated to a new available quorum and leader can be switched back to
-write mode.
+Leader should send a heartbeat every replication_timeout if there were
+no messages sent. Replicas should respond to the heartbeat just the
+same way as they do it now. As soon as Leader has no response for
+another heartbeat interval, it should consider the replica is lost. As
+soon as leader appears in a situation it has not enough replicas to
+achieve quorum, it should stop accepting write requests. After that
+configuration of the cluster can be updated to a new available quorum
+and leader can be switched back to write mode.
 
 During the quorum collection it can happen that some of replicas become
 unavailable due to some reason, so leader should wait at most for
-synchro_replication_quorum_timeout after which it issues a Rollback
+replication_synchro_quorum_timeout after which it issues a Rollback
 pointing to the oldest TXN in the waiting list.
 
 ### Leader role assignment.
@@ -274,9 +270,9 @@ Leader role and the cluster had 2 replicas with quorum set to 2.
 +---------------------+---------------------+---------------------+
 | ID1 Conf [ID1, Tx2] |                     |                     |
 +---------------------+---------------------+---------------------+
-| Tx6                 |                     |                     |
+| ID1 Tx              |                     |                     |
 +---------------------+---------------------+---------------------+
-| Tx7                 |                     |                     |
+| ID1 Tx              |                     |                     |
 +---------------------+---------------------+---------------------+
 ```
 Suppose at this moment the ID1 instance crashes. Then the ID2 instance
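
For reference, the new knobs would be set alongside the existing
replication options. The snippet below is only a sketch: the option
names are the ones proposed in the diff above and may still change.

```lua
-- Provisional configuration sketch for quorum-based synchronous replication.
box.cfg{
    replication_timeout                = 1, -- existing heartbeat / failure detection
    replication_synchro_quorum         = 2, -- leader plus one replica must have the TXN
    replication_synchro_quorum_timeout = 4, -- issue a Rollback after 4 seconds without quorum
}
```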

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2020-06-12 20:31 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-03 21:08 [Tarantool-patches] [RFC] Quorum-based synchronous replication Sergey Ostanevich
2020-04-07 13:02 ` Aleksandr Lyapunov
2020-04-08  9:18   ` Sergey Ostanevich
2020-04-08 14:05     ` Konstantin Osipov
2020-04-08 15:06       ` Sergey Ostanevich
2020-04-14 12:58 ` Sergey Bronnikov
2020-04-14 14:43   ` Sergey Ostanevich
2020-04-15 11:09     ` sergos
2020-04-15 14:50       ` sergos
2020-04-16  7:13         ` Aleksandr Lyapunov
2020-04-17 10:10         ` Konstantin Osipov
2020-04-17 13:45           ` Sergey Ostanevich
2020-04-20 11:20         ` Serge Petrenko
2020-04-20 23:32 ` Vladislav Shpilevoy
2020-04-21 10:49   ` Sergey Ostanevich
2020-04-21 22:17     ` Vladislav Shpilevoy
2020-04-22 16:50       ` Sergey Ostanevich
2020-04-22 20:28         ` Vladislav Shpilevoy
2020-04-23  6:58       ` Konstantin Osipov
2020-04-23  9:14         ` Konstantin Osipov
2020-04-23 11:27           ` Sergey Ostanevich
2020-04-23 11:43             ` Konstantin Osipov
2020-04-23 15:11               ` Sergey Ostanevich
2020-04-23 20:39                 ` Konstantin Osipov
2020-04-23 21:38 ` Vladislav Shpilevoy
2020-04-23 22:28   ` Konstantin Osipov
2020-04-30 14:50   ` Sergey Ostanevich
2020-05-06  8:52     ` Konstantin Osipov
2020-05-06 16:39       ` Sergey Ostanevich
2020-05-06 18:44         ` Konstantin Osipov
2020-05-12 15:55           ` Sergey Ostanevich
2020-05-12 16:42             ` Konstantin Osipov
2020-05-13 21:39             ` Vladislav Shpilevoy
2020-05-13 23:54               ` Konstantin Osipov
2020-05-14 20:38               ` Sergey Ostanevich
2020-05-20 20:59                 ` Sergey Ostanevich
2020-05-25 23:41                   ` Vladislav Shpilevoy
2020-05-27 21:17                     ` Sergey Ostanevich
2020-06-09 16:19                       ` Sergey Ostanevich
2020-06-11 15:17                         ` Vladislav Shpilevoy
2020-06-12 20:31                           ` Sergey Ostanevich
2020-05-13 21:36         ` Vladislav Shpilevoy
2020-05-13 23:45           ` Konstantin Osipov
2020-05-06 18:55     ` Konstantin Osipov
2020-05-06 19:10       ` Konstantin Osipov
2020-05-12 16:03         ` Sergey Ostanevich
2020-05-13 21:42       ` Vladislav Shpilevoy
2020-05-14  0:05         ` Konstantin Osipov
2020-05-07 23:01     ` Konstantin Osipov
2020-05-12 16:40       ` Sergey Ostanevich
2020-05-12 17:47         ` Konstantin Osipov
2020-05-13 21:34           ` Vladislav Shpilevoy
2020-05-13 23:31             ` Konstantin Osipov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox