Tarantool discussions archive
* [Tarantool-discussions] [rfc] Quorum-based synchronous replication (draft)
@ 2020-03-25  9:44 Sergey Ostanevich
  2020-04-23  7:13 ` Konstantin Osipov
  0 siblings, 1 reply; 3+ messages in thread
From: Sergey Ostanevich @ 2020-03-25  9:44 UTC (permalink / raw)
  To: tarantool-discussions,
	Николай
	Карлов,
	k.nazarov, Mons Anderson

[-- Attachment #1: Type: text/plain, Size: 6330 bytes --]


The aim of this RFC is to address the following list of problems
formulated at MRG planning meeting:
  - protocol backward compatibility to enable cluster upgrade w/o
    downtime
  - consistency of data on replica and leader
  - switch from leader to replica without data loss
  - up-to-date replicas to serve read-only requests
  - ability to switch async replicas into sync ones
  - guarantee of rollback on leader and sync replicas
  - simplicity of cluster orchestration
 
What this RFC is not:
 
  - high availability (HA) solution with automated failover, role
    assignment and so on
  - master-master configuration support
 
Quorum-confirmed commit details
 
The main idea behind the proposal is to reuse the existing machinery as
much as possible, to make sure the functionality that is well tested and
proven across many instances in MRG and beyond is used. The transaction
rollback mechanism is in place and works for a WAL write failure. If we
substitute the WAL success with a new state, named 'quorum' later in this
document, then no changes to this machinery are needed. The same is true
for the snapshot machinery, which allows keeping a copy of the database in
memory for the whole period of the snapshot file write. Adding quorum here
also minimizes changes.
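
A minimal sketch of this substitution (Python, with hypothetical names;
this is not Tarantool's internal API): the completion path of a
transaction stays exactly the same, only the event that fires it changes
from local WAL success to collected quorum.

    # Hypothetical sketch: the txn completion path is unchanged, only
    # the trigger differs.
    class Txn:
        def __init__(self, lsn):
            self.lsn = lsn

        def complete(self, ok):
            # The same path the WAL-write callback drives today:
            # commit on success, rollback on failure.
            print("txn %d %s" % (self.lsn,
                                 "committed" if ok else "rolled back"))

    def on_wal_write(txn, ok):   # current trigger: local WAL result
        txn.complete(ok)

    def on_quorum(txn, ok):      # proposed trigger: quorum collected
        txn.complete(ok)

    on_quorum(Txn(42), True)     # prints: txn 42 committed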
 
Currently, replication is represented by the following scheme:

[image attachment #2.2: current replication scheme]

To introduce the 'quorum' we have to receive confirmations from Replicas
to decide whether the quorum is actually reached. The Leader collects the
necessary number of Replica confirmations plus its own WAL success. This
state is named 'quorum' and gives the Leader the right to complete the
customer's request. So the picture changes to:

[image attachment #2.3: replication scheme with quorum collection]

The quorum should be collected in a table that lists the transactions
waiting for the quorum. The latest transaction that collects the quorum is
considered complete, as well as all transactions prior to it, since all
transactions should be applied in order. The Leader writes a 'quorum'
message to the WAL, and it is delivered to the Replicas.
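
A sketch of this pending-transaction table, under the assumption that a
Replica's confirmation carries the LSN it has applied so far; all names
are illustrative, not Tarantool internals:

    from collections import OrderedDict

    class QuorumTable:
        """Leader-side table of txns waiting for quorum (sketch)."""

        def __init__(self, quorum_size):
            self.quorum = quorum_size     # confirmations needed, the
                                          # leader's own WAL success included
            self.pending = OrderedDict()  # lsn -> set of confirming instances

        def submit(self, lsn):
            self.pending[lsn] = set()

        def confirm(self, lsn, instance_id):
            # A confirmation for `lsn` also covers every earlier pending
            # txn, because replicas apply transactions strictly in order.
            top = None
            for pending_lsn in self.pending:
                if pending_lsn > lsn:
                    break
                self.pending[pending_lsn].add(instance_id)
                if len(self.pending[pending_lsn]) >= self.quorum:
                    top = pending_lsn
            if top is None:
                return []
            # The latest txn with a quorum completes all txns before it.
            done = [l for l in self.pending if l <= top]
            for l in done:
                del self.pending[l]
            return done  # the leader now writes a 'quorum' message for top

    t = QuorumTable(quorum_size=2)
    t.submit(1); t.submit(2)
    t.confirm(2, "leader")             # leader's own WAL success
    print(t.confirm(2, "replica_a"))   # -> [1, 2]: quorum for 2 covers 1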
 
A Replica should explicitly report a positive or a negative result of the
TXN to the Leader via IPROTO, to allow the Leader to collect the quorum or
the anti-quorum for the TXN. If a negative result for the TXN is received
from a minority of Replicas, the Leader has to send an error message to
each of the failed Replicas, which in turn have to disconnect from
replication the same way as is done now in case of a conflict.
 
If the Leader receives enough error messages to make the quorum
unreachable, it should write a 'rollback' message to the WAL. After that
the Leader and the Replicas perform a rollback of all TXNs that didn't
receive the quorum.
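
A sketch of the negative path, under an assumed counting rule: once so
many voters have reported an error that the quorum can no longer be
collected, the Leader writes 'rollback'; a minority error only gets the
failed Replica disconnected. The helper below is hypothetical:

    def on_txn_error(n_voters, quorum_size, nack_instances, instance_id):
        """n_voters: leader + sync replicas; nack_instances: a set of
        failed instances, mutated in place. Illustration only."""
        nack_instances.add(instance_id)
        if n_voters - len(nack_instances) < quorum_size:
            return "write-rollback"   # anti-quorum: undo the txn everywhere
        return "disconnect-replica"   # minority failure: drop this replica

    nacks = set()
    print(on_txn_error(3, 2, nacks, "replica_b"))  # disconnect-replica
    print(on_txn_error(3, 2, nacks, "replica_c"))  # write-rollback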
 
Recovery and failover.
 
A Tarantool instance, while reading the WAL, should postpone the commit
until the quorum is read. If WAL EOF is reached, the instance should keep
the rollback pending for all transactions that are waiting for a quorum
entry until the role of the instance is set. If this instance becomes a
Replica, no additional actions are needed, since all info about
quorum/rollback will arrive via replication. If this instance is assigned
the Leader role, it should write 'rollback' to its WAL and perform a
rollback of all transactions waiting for a quorum.
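
A sketch of this recovery rule, assuming three record kinds, 'txn',
'quorum' and 'rollback' (the names and the WAL iteration are
illustrative):

    def recover(wal_records, role):
        """wal_records: iterable of (kind, lsn) pairs in WAL order."""
        limbo = []                            # txns read but not yet decided
        for kind, lsn in wal_records:
            if kind == "txn":
                limbo.append(lsn)
            elif kind == "quorum":            # confirms everything up to lsn
                limbo = [l for l in limbo if l > lsn]
            elif kind == "rollback":          # undoes the undecided tail
                limbo.clear()
        # WAL EOF: the undecided tail is resolved by the instance's role.
        if limbo and role == "leader":
            return "write-rollback", limbo    # new leader rolls the tail back
        return "wait-for-replication", limbo  # replica: decision will arrive

    wal = [("txn", 1), ("quorum", 1), ("txn", 2), ("txn", 3)]
    print(recover(wal, "leader"))    # ('write-rollback', [2, 3])
    print(recover(wal, "replica"))   # ('wait-for-replication', [2, 3])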
 
In case of a Leader failure, the Replica with the biggest LSN under the
former Leader's ID is elected as the new Leader. This Replica should
record 'rollback' in its WAL, which effectively means that all
transactions without a quorum are rolled back. The rollback will be
delivered to all Replicas, and they will perform rollbacks of all
transactions waiting for a quorum.
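
A sketch of the election criterion, assuming each candidate reports its
vclock, a map from instance id to the LSN it holds for that instance:

    def elect_new_leader(candidates, old_leader_id):
        """candidates: {instance_id: vclock}, vclock: {instance_id: lsn}.
        Picks the replica with the biggest LSN under the former leader's
        ID. Purely illustrative."""
        return max(candidates,
                   key=lambda inst: candidates[inst].get(old_leader_id, 0))

    candidates = {"a": {1: 100}, "b": {1: 120}, "c": {1: 90}}
    print(elect_new_leader(candidates, old_leader_id=1))  # 'b'

The winner then records 'rollback' in its WAL, exactly as in the recovery
sketch above.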
 
Snapshot generation.
 
We can also reuse the current machinery of snapshot generation. Upon
receiving a request to create a snapshot, an instance should acquire a
read view for the current commit operation. However, the start of the
snapshot generation should be postponed until this commit operation
receives its quorum. If the operation is rolled back, the snapshot
generation should be aborted and restarted from the current transaction
after the rollback is complete.
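
A sketch of the ordering: the read view is taken at the current commit,
but writing the snapshot only starts once that commit is decided, and a
rollback restarts the whole attempt. The three helpers are assumptions,
not real API:

    def make_snapshot(open_read_view, wait_decision, write_snapshot):
        """open_read_view() -> lsn of the current commit operation;
        wait_decision(lsn) -> 'quorum' or 'rollback' (blocks);
        write_snapshot(lsn) writes the snapshot file. All assumed."""
        while True:
            lsn = open_read_view()           # in-memory copy, as today
            if wait_decision(lsn) == "quorum":
                return write_snapshot(lsn)   # safe: the read view is durable
            # rolled back: drop this read view and retry from the
            # current transaction once the rollback is complete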
 
After the snapshot is created, the WAL should start from the first
operation that follows the commit operation the snapshot is generated
for. That means the WAL will contain a 'quorum' message that refers to a
transaction that is not present in the WAL. Apparently, we have to allow
this, but only for the case when the quorum message refers to a
transaction with an LSN less than that of the first entry in the WAL, and
only once.
 
Asynchronous replication.
 
Along with synchronous Replicas, the cluster can contain asynchronous
Replicas. An async Replica doesn't reply to the Leader with errors, since
it doesn't contribute to the quorum. Still, async Replicas have to follow
the new WAL operations, such as keeping the rollback info until a
'quorum' message is received. This is essential in case a 'rollback'
message appears in the WAL: that message assumes the Replica is able to
perform all the necessary rollback by itself. The cluster information
should contain an explicit notification of each Replica's operation mode.
 
Enabling synchronous replication.
 
Synchronous operation can be required for a subset of spaces in the data
schema. That means only transactions that modify data in these spaces
require a quorum; such transactions are named synchronous. As soon as the
last operation of a synchronous transaction appears in the Leader's WAL,
it causes all following transactions, no matter whether they are
synchronous or not, to wait for the quorum. If the quorum is not
achieved, the 'rollback' operation causes a rollback of all transactions
after the synchronous one. This ensures a consistent state of the data
both on the Leader and on the Replicas. If the user doesn't require
synchronous operation for any space, then no changes to WAL generation
and replication appear.
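
A sketch of the classification rule (the space names and the limbo flag
are hypothetical):

    SYNC_SPACES = {"accounts"}         # spaces the user marked synchronous

    def is_synchronous(modified_spaces):
        return bool(set(modified_spaces) & SYNC_SPACES)

    def needs_quorum(modified_spaces, sync_txn_in_limbo):
        # A txn waits for the quorum if it is synchronous itself, or if
        # an earlier synchronous txn is still undecided: commit order
        # must be preserved, so everything behind it waits too.
        return is_synchronous(modified_spaces) or sync_txn_in_limbo

    print(needs_quorum(["accounts"], False))  # True: touches a sync space
    print(needs_quorum(["logs"], True))       # True: waits behind a sync txn
    print(needs_quorum(["logs"], False))      # False: fully asynchronous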
 
The cluster description should contain an explicit attribute for each
Replica denoting whether it participates in synchronous activities. The
description should also contain a criterion for how many Replica
responses are needed to achieve the quorum.
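
A hypothetical shape of such a description (the key names are invented
for illustration and are not an actual Tarantool option set):

    cluster = {
        "replication_synchro_quorum": 2,   # confirmations needed, the
                                           # leader's own WAL success included
        "instances": {
            "replica_a": {"sync": True},   # participates in quorum collection
            "replica_b": {"sync": True},
            "replica_c": {"sync": False},  # async: follows WAL, never acks
        },
    }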
 
 
   
 
   

[-- Attachment #2.1: Type: text/html, Size: 80667 bytes --]

[-- Attachment #2.2: image.png --]
[-- Type: image/png, Size: 121718 bytes --]

[-- Attachment #2.3: image.png --]
[-- Type: image/png, Size: 163684 bytes --]


* Re: [Tarantool-discussions] [rfc] Quorum-based synchronous replication (draft)
  2020-03-25  9:44 [Tarantool-discussions] [rfc] Quorum-based synchronous replication (draft) Sergey Ostanevich
@ 2020-04-23  7:13 ` Konstantin Osipov
  2020-04-23  7:26   ` Konstantin Osipov
  0 siblings, 1 reply; 3+ messages in thread
From: Konstantin Osipov @ 2020-04-23  7:13 UTC (permalink / raw)
  To: Sergey Ostanevich
  Cc: Mons Anderson,
	Николай
	Карлов,
	tarantool-discussions, k.nazarov

* Sergey Ostanevich <sergos@tarantool.org> [20/03/25 13:20]:

I've just realized that what I am missing here is the user-visible
changes.

How is this feature enabled in the configuration? 

The leader election is external and is based on WAL length. What's the
API for that? How does the external election entity switch the old
leader to read-only, so that the implementation doesn't have to deal
with leader resurrection? Without these details it's useless to claim
that this is single-master, and all kinds of multi-master issues come
up.

The spec should explain how the external leader election doesn't
violate serializability.

Re external leader election, I've just recalled that the leader works in
a different mode for the first few commits before the election, since it
needs to "repair" the most recent commit by covering it up with a new
one at the majority of replicas; otherwise a double election in a short
period of time may lead to loss of acknowledged transactions. Please
check out the Raft paper for the details. How is this mode communicated
to the new leader, and how is it handled, i.e. how does the new leader
leave it?
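
For reference, a minimal sketch of the Raft rule being referred to (the
Raft paper, section 5.4.2: a leader never commits entries from previous
terms by counting replicas; it commits an entry of its own term,
typically a no-op appended right at election, which covers the older
tail). The signature below is invented for illustration:

    def on_become_leader(current_term, append_to_log):
        """append_to_log(term, op) appends an entry the new leader will
        replicate. Hypothetical signature, illustration only."""
        # The no-op "covers" the most recent commits of earlier terms:
        # once it reaches a majority, everything before it is committed
        # too, and a double election can no longer lose acknowledged txns.
        append_to_log(current_term, "no-op")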

In fact, I believe it would be impossible to make this implementation
work in single-master mode; we have to assume the external entity will
be failing or buggy, so a multi-master mode should be assumed.

At first I was thinking this is a skeleton of the next step you plan to
take after the in-memory relay, and that it can be refined along the
way, so I looked at this spec very favourably. I also expected that we
still have a lot of time to refine it, because the in-memory relay is
not progressing.
Yesterday I learned that the plan is to build this implementation based
on the existing WAL-relay design. It saddens me that such crucial
decisions are not specified first, and it changes my attitude towards
this spec. I don't think it is detailed enough to start the
implementation: every time I look at it I find some new issues with it.


-- 
Konstantin Osipov, Moscow, Russia


* Re: [Tarantool-discussions] [rfc] Quorum-based synchronous replication (draft)
  2020-04-23  7:13 ` Konstantin Osipov
@ 2020-04-23  7:26   ` Konstantin Osipov
  0 siblings, 0 replies; 3+ messages in thread
From: Konstantin Osipov @ 2020-04-23  7:26 UTC (permalink / raw)
  To: Sergey Ostanevich
  Cc: Mons Anderson,
	Николай
	Карлов,
	tarantool-discussions, k.nazarov

* Konstantin Osipov <kostja.osipov@gmail.com> [20/04/23 10:13]:
> Re external leader election, I've just recalled that the leader works in
> a different mode for the first few commits before the election, since it

after


-- 
Konstantin Osipov, Moscow, Russia

