From: Sergey Ostanevich <sergos@tarantool.org>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
Date: Tue, 9 Jun 2020 19:19:29 +0300	[thread overview]
Message-ID: <20200609161929.GF50@tarantool.org> (raw)
In-Reply-To: <20200527211735.GA50@tarantool.org>

Hi!

Please take a look at the latest changes, which include timeouts for
quorum collection and a heartbeat to ensure the leader is alive.


regards,
Sergos


* **Status**: In progress
* **Start date**: 31-03-2020
* **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\>
* **Issues**: https://github.com/tarantool/tarantool/issues/4842

## Summary

The aim of this RFC is to address the following list of problems
formulated at the MRG planning meeting:
  - protocol backward compatibility to enable cluster upgrade w/o
    downtime
  - consistency of data on replica and leader
  - switch from leader to replica without data loss
  - up-to-date replicas to run read-only requests
  - ability to switch async replicas into sync ones and vice versa
  - guarantee of rollback on leader and sync replicas
  - simplicity of cluster orchestration

What this RFC is not:

  - high availability (HA) solution with automated failover, role
    assignment and so on
  - master-master configuration support

## Background and motivation

There are a number of known implementations of consistent data presence
in a Tarantool cluster. They can be commonly described as the "wait for
LSN" technique. The biggest issue with this technique is the absence of
rollback guarantees on a replica in case of transaction failure on the
master or on some of the replicas in the cluster.

To provide such capabilities, new functionality should be introduced in
the Tarantool core, meeting the requirements mentioned before: backward
compatibility and ease of cluster orchestration.

The cluster is expected to operate in a full-mesh topology, although
automated topology support is beyond the scope of this RFC.

## Detailed design

### Quorum commit

The main idea behind the proposal is to reuse the existing machinery as
much as possible. This ensures that functionality well-tested and proven
across many instances in MRG and beyond is reused. The transaction
rollback mechanism is already in place and works for WAL write failures.
If we substitute WAL success with a new condition, named 'quorum' later
in this document, then no changes to the machinery are needed. The same
is true for the snapshot machinery, which allows creating a copy of the
database in memory for the whole period of the snapshot file write.
Adding the quorum here also minimizes changes.

Currently replication is represented by the following scheme:
```
Customer        Leader          WAL(L)        Replica        WAL(R)
   |------TXN----->|              |             |              |
   |               |              |             |              |
   |         [TXN undo log        |             |              |
   |            created]          |             |              |
   |               |              |             |              |
   |               |-----TXN----->|             |              |
   |               |              |             |              |
   |               |<---WAL Ok----|             |              |
   |               |              |             |              |
   |         [TXN undo log        |             |              |
   |           destroyed]         |             |              |
   |               |              |             |              |
   |<----TXN Ok----|              |             |              |
   |               |-------Replicate TXN------->|              |
   |               |              |             |              |
   |               |              |       [TXN undo log        |
   |               |              |          created]          |
   |               |              |             |              |
   |               |              |             |-----TXN----->|
   |               |              |             |              |
   |               |              |             |<---WAL Ok----|
   |               |              |             |              |
   |               |              |       [TXN undo log        |
   |               |              |         destroyed]         |
   |               |              |             |              |
```

To introduce the 'quorum' we have to receive confirmations from replicas
to decide whether the quorum is actually reached. The leader collects the
necessary number of replica confirmations plus its own WAL success. This
state is named 'quorum' and gives the leader the right to complete the
customer's request. So the picture changes to:
```
Customer        Leader          WAL(L)        Replica        WAL(R)
   |------TXN----->|              |             |              |
   |               |              |             |              |
   |         [TXN undo log        |             |              |
   |            created]          |             |              |
   |               |              |             |              |
   |               |-----TXN----->|             |              |
   |               |              |             |              |
   |               |-------Replicate TXN------->|              |
   |               |              |             |              |
   |               |              |       [TXN undo log        |
   |               |<---WAL Ok----|          created]          |
   |               |              |             |              |
   |           [Waiting           |             |-----TXN----->|
   |         of a quorum]         |             |              |
   |               |              |             |<---WAL Ok----|
   |               |              |             |              |
   |               |<------Replication Ok-------|              |
   |               |              |             |              |
   |            [Quorum           |             |              |
   |           achieved]          |             |              |
   |               |              |             |              |
   |               |---Confirm--->|             |              |
   |               |              |             |              |
   |               |----------Confirm---------->|              |
   |               |              |             |              |
   |<---TXN Ok-----|              |             |---Confirm--->|
   |               |              |             |              |
   |         [TXN undo log        |       [TXN undo log        |
   |           destroyed]         |         destroyed]         |
   |               |              |             |              |
```

The quorum should be collected as a table holding the list of
transactions waiting for the quorum. The latest transaction that collects
the quorum is considered complete, as well as all transactions prior to
it, since all transactions must be applied in order. The leader writes a
'confirm' message to the WAL that refers to the transaction's
[LEADER_ID, LSN], and the confirm has its own LSN. This confirm message
is delivered to all replicas through the existing replication mechanism.
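
A minimal sketch of how the leader could maintain such a waiting list is
shown below. The names (on_ack, write_confirm, quorum_size) and the
table layout are illustrative assumptions of this sketch, not the actual
implementation.

```
-- Sketch only: leader-side list of TXNs awaiting the quorum.
-- write_confirm() stands in for writing a 'confirm' entry to the WAL.
local function write_confirm(lsn)
    print(('confirm up to LSN %d'):format(lsn))
end

local quorum_size = 2  -- replica acks plus the leader's own WAL success
local waiting = {}     -- ordered list of {lsn = ..., acks = ...}

-- Called when a replica (or the leader's own WAL) reports success
-- for everything up to 'lsn'.
local function on_ack(lsn)
    local confirmed_up_to
    for _, txn in ipairs(waiting) do
        if txn.lsn <= lsn then
            txn.acks = txn.acks + 1
            if txn.acks >= quorum_size then
                confirmed_up_to = txn.lsn
            end
        end
    end
    if confirmed_up_to ~= nil then
        -- One 'confirm' covers every TXN up to this LSN, since all
        -- transactions are applied strictly in order.
        write_confirm(confirmed_up_to)
        while waiting[1] ~= nil and waiting[1].lsn <= confirmed_up_to do
            table.remove(waiting, 1)
        end
    end
end
```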

A replica should explicitly report TXN application success to the leader
via IPROTO to allow the leader to collect the quorum for the TXN. In
case of application failure the replica has to disconnect from
replication the same way as it is done now. The replica also has to
report its disconnection to the orchestrator. Further actions require
human intervention, since a failure means either a technical problem
(such as not enough space for the WAL) that has to be resolved, or an
inconsistent state that requires a rejoin.

Currently Tarantool provides no protection from dirty reads from memtx
while the TXN is being written to the WAL. So there is a chance that a
TXN fails to be written to the WAL while some read requests have already
reported its success. In this RFC we make no attempt to resolve the
dirty read problem, so it should be addressed by user code. However, we
plan to introduce MVCC machinery, similar to the one available in the
vinyl engine, which will resolve the dirty read problem.

### Connection liveness

There is a timeout-based mechanism in Tarantool that controls the
asynchronous replication, which uses the following config:
```
* replication_connect_timeout  = 4
* replication_sync_lag         = 10
* replication_sync_timeout     = 300
* replication_timeout          = 1
```
For backward compatibility and to differentiate synchronous from
asynchronous replication, we should augment the configuration with the
following options:
```
* synchro_replication_heartbeat = 4
* synchro_replication_quorum_timeout = 4
```
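For illustration, the leader's combined configuration could look like
the box.cfg call below. The two synchro_* options are only the ones
proposed above; they do not exist in current Tarantool.

```
-- Existing async replication options plus the options proposed by this
-- RFC (the synchro_* names are the proposal, not an existing API).
box.cfg{
    replication_connect_timeout = 4,
    replication_sync_lag        = 10,
    replication_sync_timeout    = 300,
    replication_timeout         = 1,
    synchro_replication_heartbeat      = 4,
    synchro_replication_quorum_timeout = 4,
}
```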
The leader should send a heartbeat every synchro_replication_heartbeat
seconds if no other messages were sent. Replicas should respond to the
heartbeat the same way they do now. As soon as the leader gets no
response for another heartbeat interval, it should consider the replica
lost. As soon as the leader finds itself with not enough replicas to
achieve the quorum, it should stop accepting write requests. There is an
option for the leader to roll back to the latest transaction that has a
quorum: the leader issues a 'rollback' message referring to the
[LEADER_ID, LSN], where LSN is that of the first transaction in the
leader's undo log. The rollback message, replicated to the available
part of the cluster, will put it into a consistent state. After that the
cluster configuration can be updated to a new available quorum and the
leader can be switched back to write mode.

During quorum collection some of the replicas can become unavailable
for some reason, so the leader should wait at most
synchro_replication_quorum_timeout, after which it issues a 'rollback'
pointing to the oldest TXN in the waiting list.
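
A sketch of this leader-side timeout handling follows, reusing the
hypothetical waiting list from the earlier sketch; wait_for_quorum() and
rollback_to() are assumptions for illustration only.

```
-- Sketch only: wait for the quorum with a timeout; confirmation itself
-- happens in on_ack() above, otherwise a 'rollback' is issued that
-- points to the oldest TXN in the waiting list.
local fiber = require('fiber')

-- rollback_to() stands in for writing a 'rollback' entry to the WAL
-- referring to [LEADER_ID, LSN] of the oldest waiting TXN.
local function rollback_to(lsn)
    print(('rollback starting from LSN %d'):format(lsn))
end

local function wait_for_quorum(waiting, quorum_timeout)
    local deadline = fiber.clock() + quorum_timeout
    while waiting[1] ~= nil and fiber.clock() < deadline do
        fiber.sleep(0.01)
    end
    if waiting[1] ~= nil then
        rollback_to(waiting[1].lsn)
    end
end
```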

### Leader role assignment.

Be it a user-initiated assignment or an algorithmic one, it should use
a common interface to assign the leader role. For now we implement
simplified machinery, yet it should remain feasible in the future to fit
in algorithms such as Raft or the previously proposed box.ctl.promote.

A system space \_voting can be used to replicate the voting across the
cluster; this space should be writable even on a read-only instance.
The space should contain a CURRENT_LEADER_ID at any time, meaning the
current leader; it can be zero at the start. This ID is needed to
compare the appropriate vclock component below.

All replicas should be subscribed to changes in the space and react as
described below.

 promote(ID) - should be called from a replica with its own ID.
   Writes an entry to the voting space stating that this ID is waiting
   for votes from the cluster. The entry should also contain the current
   vclock[CURRENT_LEADER_ID] of the nominee.

Upon a change in the space each replica should compare its appropriate
vclock component with the submitted one and append its vote to the
space: AYE in case the nominee's vclock is greater than or equal to the
replica's, NAY otherwise.
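
A vote could be cast by comparing the relevant vclock component, as in
the sketch below; the cast_vote() name and the _voting tuple layout
{nominee_id, voter_id, vote} are assumptions of this sketch.

```
-- Sketch only: react to a new nominee entry in the proposed _voting
-- space by comparing its vclock[CURRENT_LEADER_ID] with our own.
local function cast_vote(nominee_id, nominee_component, current_leader_id)
    local my_component = box.info.vclock[current_leader_id] or 0
    local vote = nominee_component >= my_component and 'AYE' or 'NAY'
    box.space._voting:insert{nominee_id, box.info.id, vote}
end
```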

As soon as the nominee collects the quorum for being elected, it claims
itself the Leader by switching into rw mode, writes CURRENT_LEADER_ID as
FORMER_LEADER_ID in the \_voting space and puts its own ID as
CURRENT_LEADER_ID. In case a NAY appears in \_voting or a timeout
predefined in box.cfg is reached, the nominee should remove its entry
from the space.

The leader should ensure that the number of available instances in the
cluster is enough to achieve the quorum and only then proceed; otherwise
the leader should report an incomplete quorum, as described in the last
paragraph of the previous section.

The new Leader has to take responsibility for replicating the former
Leader's entries from its WAL, obtaining the quorum and committing
confirm messages referring to [FORMER_LEADER_ID, LSN] in its WAL,
replicating them to the cluster; after that it can start adding its own
entries to the WAL.

 demote(ID) - should be called from the Leader instance.
   The Leader has to switch into ro mode and wait until its undo log is
   empty. This effectively means all transactions are committed in the
   cluster and it is safe to pass the leadership. Then it should write
   CURRENT_LEADER_ID as FORMER_LEADER_ID and set CURRENT_LEADER_ID
   to 0.

### Recovery and failover.

While reading the WAL, a Tarantool instance should postpone undo log
deletion until the corresponding 'confirm' is read. In case WAL EOF is
reached, the instance should keep the undo log for all transactions that
are still waiting for a confirm entry until the role of the instance is
set.

If the instance is assigned the leader role, then all transactions that
have no corresponding confirm message should be confirmed (see the
leader role assignment).

In case there are not enough replicas to set up a quorum, the cluster
can be switched into read-only mode. Note that this can't be done by
default, since some of the transactions may already be in a confirmed
state. It is up to human intervention to force a rollback of all
transactions that have no confirm and to put the cluster into a
consistent state.

In case the instance is assigned a replica role, it may find itself
with conflicting WAL entries, if it recovered from a leader role and
some of its transactions were not replicated to the current leader.
This situation should be resolved through a rejoin of the instance.

Consider the example below. Originally the instance with ID1 was
assigned the Leader role and the cluster had 2 replicas with the quorum
set to 2.

```
+---------------------+---------------------+---------------------+
| ID1                 | ID2                 | ID3                 |
| Leader              | Replica 1           | Replica 2           |
+---------------------+---------------------+---------------------+
| ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
+---------------------+---------------------+---------------------+
| ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
+---------------------+---------------------+---------------------+
| ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |                     |
+---------------------+---------------------+---------------------+
| ID1 Tx4             | ID1 Tx4             |                     |
+---------------------+---------------------+---------------------+
| ID1 Tx5             | ID1 Tx5             |                     |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx2] |                     |                     |
+---------------------+---------------------+---------------------+
| Tx6                 |                     |                     |
+---------------------+---------------------+---------------------+
| Tx7                 |                     |                     |
+---------------------+---------------------+---------------------+
```
Suppose at this moment the ID1 instance crashes. Then the ID2 instance
should be assigned the leader role, since its ID1 LSN is the biggest.
This new leader will then deliver its WAL to all replicas.

As soon as the quorum for Tx4 and Tx5 is obtained, it should write the
corresponding Confirms to its WAL. Note that the Txs still use ID1.
```
+---------------------+---------------------+---------------------+
| ID1                 | ID2                 | ID3                 |
| (dead)              | Leader              | Replica 2           |
+---------------------+---------------------+---------------------+
| ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
+---------------------+---------------------+---------------------+
| ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
+---------------------+---------------------+---------------------+
| ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
+---------------------+---------------------+---------------------+
| ID1 Tx4             | ID1 Tx4             | ID1 Tx4             |
+---------------------+---------------------+---------------------+
| ID1 Tx5             | ID1 Tx5             | ID1 Tx5             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx2] | ID2 Conf [ID1, Tx5] | ID2 Conf [ID1, Tx5] |
+---------------------+---------------------+---------------------+
| ID1 Tx6             |                     |                     |
+---------------------+---------------------+---------------------+
| ID1 Tx7             |                     |                     |
+---------------------+---------------------+---------------------+
```
After rejoining, ID1 will figure out the inconsistency of its WAL: the
last WAL entry it has corresponds to Tx7, while in the Leader's log the
last entry with ID1 is Tx5. A Confirm for a Tx can only be issued after
the Tx appears on the majority of replicas, hence there is a good chance
that the inconsistent part of ID1's WAL is covered by its undo log. So,
by rolling back all excessive Txs (in the example, Tx6 and Tx7) ID1 can
put its memtx and vinyl engines into a consistent state.

At this point a snapshot can be created on ID1 with an appropriate WAL
rotation. The old WAL should be renamed so it will not be reused in the
future and can be kept for a postmortem.
```
+---------------------+---------------------+---------------------+
| ID1                 | ID2                 | ID3                 |
| Replica 1           | Leader              | Replica 2           |
+---------------------+---------------------+---------------------+
| ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
+---------------------+---------------------+---------------------+
| ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
+---------------------+---------------------+---------------------+
| ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
+---------------------+---------------------+---------------------+
| ID1 Tx4             | ID1 Tx4             | ID1 Tx4             |
+---------------------+---------------------+---------------------+
| ID1 Tx5             | ID1 Tx5             | ID1 Tx5             |
+---------------------+---------------------+---------------------+
|                     | ID2 Conf [ID1, Tx5] | ID2 Conf [ID1, Tx5] |
+---------------------+---------------------+---------------------+
|                     | ID2 Tx1             | ID2 Tx1             |
+---------------------+---------------------+---------------------+
|                     | ID2 Tx2             | ID2 Tx2             |
+---------------------+---------------------+---------------------+
```
However, in case the undo log is not enough to cover the WAL
inconsistency with the new leader, ID1 needs a complete rejoin.

### Snapshot generation.

We can also reuse the current snapshot generation machinery. Upon
receiving a request to create a snapshot, an instance should request a
read view for the current commit operation. However, the start of the
snapshot generation should be postponed until this commit operation
receives its confirmation. In case the operation is rolled back, the
snapshot generation should be aborted and restarted using the current
transaction after the rollback is complete.

After the snapshot is created, the WAL should start from the first
operation that follows the commit operation the snapshot is generated
for. That means the WAL will contain 'confirm' messages that refer to
transactions not present in the WAL. Apparently, we have to allow this
for the case where a 'confirm' refers to a transaction with an LSN less
than that of the first entry in the WAL.

In case the master appears unavailable, a replica still has to be able
to create a snapshot. The replica can perform a rollback of all
transactions that are not confirmed and claim its LSN as that of the
latest confirmed TXN. Then it can create a snapshot in the regular way
and start with a blank xlog file. All rolled-back transactions will
reappear through regular replication in case the master comes back
later on.

### Asynchronous replication.

Along with synchronous replicas, the cluster can contain asynchronous
replicas. Async replicas don't reply to the leader, since they are not
contributing to the quorum. Still, async replicas have to follow the new
WAL operations, such as keeping the rollback info until the 'confirm'
message is received. This is essential for the case of a 'rollback'
message appearing in the WAL: this message assumes the replica is able
to perform all the necessary rollback by itself. The cluster information
should contain an explicit notification of each replica's operation
mode.

### Synchronous replication enabling.

Synchronous operation can be required for a set of spaces in the data
scheme. That means only transactions that modify data in these spaces
require the quorum; such transactions are named synchronous. As soon as
the last operation of a synchronous transaction appears in the leader's
WAL, it causes all following transactions - no matter whether they are
synchronous or not - to wait for the quorum. In case the quorum is not
achieved, the 'rollback' operation causes a rollback of all transactions
after the synchronous one. This ensures a consistent state of the data
both on the leader and on the replicas. In case the user doesn't require
synchronous operation for any space, no changes to WAL generation and
replication appear.

The cluster description should contain an explicit attribute for each
replica denoting whether it participates in synchronous activities. The
description should also contain a criterion on how many replica
responses are needed to achieve the quorum.
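
For illustration only, the per-space attribute and the quorum criterion
could be exposed roughly as below; the is_sync space option and the
synchro_quorum parameter are placeholders of this sketch, not a settled
interface.

```
-- Placeholders, not a final API: mark a space as requiring the quorum
-- and declare how many responses form the quorum.
box.schema.space.create('accounts', {is_sync = true})
box.cfg{synchro_quorum = 2}   -- the leader plus one replica response
```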

## Rationale and alternatives

There is an implementation of synchronous replication as part of the
gh-980 activities, but it is not in a state to get into the product.
Moreover, it intentionally breaks backward compatibility, which is a
prerequisite for this proposal.

On 28 May 00:17, Sergey Ostanevich wrote:
> Hi!
> 
> Thanks for review!
> 
> Some comments below.
> On 26 May 01:41, Vladislav Shpilevoy wrote:
> > >>
> > >> The reads should not be inconsistent - so that cluster will keep
> > >> answering A or B for the same request. And in case we lost quorum we
> > >> can't say for sure that all instances will answer the same.
> > >>
> > >> As we discussed it before, if leader appears in minor part of the
> > >> cluster it can't issue rollback for all unconfirmed txns, since the
> > >> majority will re-elect leader who will collect quorum for them. Means,
> > >> we will appear is a state that cluster split in two. So the minor part
> > >> should stop. Am I wrong here?
> >
> > Yeah, kinda. As long as you allow reading from replicas, you *always* will
> > have a time slot, when you will be able to read different data for the
> > same key on different replicas. Even with reads going through quorum.
> >
> > Because it is physically impossible to make nodes A and B start answering
> > the same data at the same time moment. To notify them about a confirm you will
> > send network messages, they will have not the same delay, won't be processed
> > in the same moment of time, and some of them probably won't be even delivered.
> >
> > The only correct way to read the same - read from one node only. From the
> > leader. And since this is not our way, it means we can't beat the 'inconsistent'
> > reads problems. And I don't think we should. Because if somebody needs to do
> > 'consistent' reads, they should read from leader only.
> >
> > In other words, the concept of 'consistency' is highly application dependent
> > here. If we provide a way to read from replicas, we give flexibility to choose:
> > read from leader only and see always the same data, or read from all, and have
> > a possibility, that requests may see different data on different replicas
> > sometimes.
> 
> So, it looks like we will follow the current approach: if quorum can't
> be achieved, cluster appears in r/o mode. Objections?
> 
> > >
> > > Replica should report a TXN application success to the leader via the
> > > IPROTO explicitly to allow leader to collect the quorum for the TXN.
> > > In case of application failure the replica has to disconnect from the
> > > replication the same way as it is done now. The replica also has to
> > > report its disconnection to the orchestrator. Further actions require
> > > human intervention, since failure means either technical problem (such
> > > as not enough space for WAL) that has to be resolved or an inconsistent
> > > state that requires rejoin.
> >
> > I don't think a replica should report disconnection. Problem of
> > disconnection is that it leads to loosing the connection. So it may be
> > not able to connect to the orchestrator. Also it would be strange for
> > tarantool to depend on some external service, to which it should report.
> > This looks like the orchestrator's business how will it determine
> > connectivity. Replica has nothing to do with it from its side.
> 
> External service is something I expect to be useful for the first part
> of implementation - the quorum part. Definitely, we will move onward to
> achieve some automation in leader election and failover. I just don't
> expect this to be part of this RFC.
> 
> Anyways, orchestrator has to ask replica to figure out the connectivity
> between replica and leader.
> 
> >
> > > As soon as leader appears in a situation it has not enough replicas
> > > to achieve quorum, the cluster should stop accepting any requests - both
> > > write and read.
> >
> > The moment of not having enough replicas can't be determined properly.
> > You may loose connection to replicas (they could be powered off), but
> > TCP won't see that, and the node will continue working. The failure will
> > be discovered only when a 'write' request will try to collect a quorum,
> > or after a timeout will pass on not delivering heartbeats. During this
> > time reads will be served. And there is no way to prevent them except
> > collecting a quorum on that. See my first comment in this email for more
> > details.
> >
> > On the summary: we can't stop accepting read requests.
> >
> > Btw, what to do with reads, which were *in-progress*, when the quorum
> > was lost? Such as long vinyl reads.
> 
> But the quorum was in place at the start of it? Then according to
> transaction manager behavior only older version data will be available
> for read - means data that collected quorum.
> 
> >
> > > The reason for this is that replication of transactions
> > > can achieve quorum on replicas not visible to the leader. On the other
> > > hand, leader can't achieve quorum with available minority. Leader has to
> > > report the state and wait for human intervention.
> >
> > Yeah, but if the leader couldn't achieve a quorum on some transactions,
> > they are not visible (assuming MVCC will work properly). So they can't
> > be read anyway. And if a leader answered an error, it does not mean that
> > the transaction wasn't replicated on the majority, as we discussed at some
> > meeting, I don't already remember when. So here read allowance also works
> > fine - not having some data visible and getting error at a sync transaction
> > does not mean it is not committed. A user should be aware of that.
> 
> True, we discussed that we should guarantee only that if we answered
> 'Ok' then data is present in quorum number of instances.
> 
> [...]
> 
> > >  demote(ID) - should be called from the Leader instance.
> > >    The Leader has to switch in ro mode and wait for its' undo log is
> > >    empty. This effectively means all transactions are committed in the
> > >    cluster and it is safe pass the leadership. Then it should write
> > >    CURRENT_LEADER_ID as a FORMER_LEADER_ID and put CURRENT_LEADER_ID
> > >    into 0.
> >
> > This looks like box.ctl.promote() algorithm. Although I thought we decided
> > not to implement any kind of auto election here, no? Box.ctl.promote()
> > assumed, that it does all the steps automatically, except choosing on which
> > node to call this function. This is what it was so complicated. It was
> > basically raft.
> >
> > But yeah, as discussed verbally, this is a subject for improvement.
> 
> I personally would like to postpone the algorithm should be postponed
> for the next stage (Q3-Q4) but now we should not mess up too much to
> revamp. Hence, we have to elaborate the internals - such as _voting
> table I mentioned.
> 
> Even with introduction of terms for each leader - as in RAFT for example
> - we still can keep it in a replicated space, isn't it?
> 
> >
> > The way I see it is that we need to give vclock based algorithm of choosing
> > a new leader; tell how to stop replication from the old leader; allow to
> > read vclock from replicas (basically, let the external service read box.info).
> 
> That's the #1 for me by now: how a read-only replica can quit listening
> to a demoted leader, which can be not aware of its demotion? Still, for
> efficiency it should be done w/o disconnection.
> 
> >
> > Since you said you think we should not provide an API for all sync transactions
> > rollback, it looks like no need in a special new API. But if we still want
> > to allow to rollback all pending transactions of the old leader on a new leader
> > (like Mons wants) then yeah, seems like we would need a new function. For example,
> > box.ctl.sync_rollback() to rollback all pending. And box.ctl.sync_confirm() to
> > confirm all pending. Perhaps we could add more admin-line parameters such as
> > replica_id with which to write 'confirm/rollback' message.
> 
> I believe it's a good point to keep two approaches and perhaps set one
> of the two in configuration. This should resolve the issue with 'the
> rest of the cluster confirms old leader's transactions and because of it
> leader can't rollback'.
> 
> >
> > > ### Recovery and failover.
> > >
> > > Tarantool instance during reading WAL should postpone the undo log
> > > deletion until the 'confirm' is read. In case the WAL eof is achieved,
> > > the instance should keep undo log for all transactions that are waiting
> > > for a confirm entry until the role of the instance is set.
> > >
> > > If this instance will be assigned a leader role then all transactions
> > > that have no corresponding confirm message should be confirmed (see the
> > > leader role assignment).
> > >
> > > In case there's not enough replicas to set up a quorum the cluster can
> > > be switched into a read-only mode. Note, this can't be done by default
> > > since some of transactions can have confirmed state. It is up to human
> > > intervention to force rollback of all transactions that have no confirm
> > > and to put the cluster into a consistent state.
> >
> > Above you said:
> >
> > >> As soon as leader appears in a situation it has not enough replicas
> > >> to achieve quorum, the cluster should stop accepting any requests - both
> > >> write and read.
> >
> > But here I see, that the cluster "switched into a read-only mode". So there
> > is a contradiction. And I think it should be resolved in favor of
> > 'read-only mode'. I explained why in the previous comments.
> 
> My bad, I was moving around this problem already and tend to allow r/o.
> Will update.
> 
> >
> > > In case the instance will be assigned a replica role, it may appear in
> > > a state that it has conflicting WAL entries, in case it recovered from a
> > > leader role and some of transactions didn't replicated to the current
> > > leader. This situation should be resolved through rejoin of the instance.
> > >
> > > Consider an example below. Originally instance with ID1 was assigned a
> > > Leader role and the cluster had 2 replicas with quorum set to 2.
> > >
> > > ```
> > > +---------------------+---------------------+---------------------+
> > > | ID1                 | ID2                 | ID3                 |
> > > | Leader              | Replica 1           | Replica 2           |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |                     |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx4             | ID1 Tx4             |                     |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx5             | ID1 Tx5             |                     |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Conf [ID1, Tx2] |                     |                     |
> > > +---------------------+---------------------+---------------------+
> > > | Tx6                 |                     |                     |
> > > +---------------------+---------------------+---------------------+
> > > | Tx7                 |                     |                     |
> > > +---------------------+---------------------+---------------------+
> > > ```
> > > Suppose at this moment the ID1 instance crashes. Then the ID2 instance
> > > should be assigned a leader role since its ID1 LSN is the biggest.
> > > Then this new leader will deliver its WAL to all replicas.
> > >
> > > As soon as quorum for Tx4 and Tx5 will be obtained, it should write the
> > > corresponding Confirms to its WAL. Note that Tx are still uses ID1.
> > > ```
> > > +---------------------+---------------------+---------------------+
> > > | ID1                 | ID2                 | ID3                 |
> > > | (dead)              | Leader              | Replica 2           |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx4             | ID1 Tx4             | ID1 Tx4             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx5             | ID1 Tx5             | ID1 Tx5             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Conf [ID1, Tx2] | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] |
> >
> > Id1 -> ID1 (typo)
> 
> Thanks!
> 
> >
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx6             |                     |                     |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx7             |                     |                     |
> > > +---------------------+---------------------+---------------------+
> > > ```
> > > After rejoining ID1 will figure out the inconsistency of its WAL: the
> > > last WAL entry it has is corresponding to Tx7, while in Leader's log the
> > > last entry with ID1 is Tx5. Confirm for a Tx can only be issued after
> > > appearance of the Tx on the majoirty of replicas, hence there's a good
> > > chances that ID1 will have inconsistency in its WAL covered with undo
> > > log. So, by rolling back all excessive Txs (in the example they are Tx6
> > > and Tx7) the ID1 can put its memtx and vynil in consistent state.
> >
> > Yeah, but the problem is that the node1 has vclock[ID1] == 'Conf [ID1, Tx2]'.
> > This row can't be rolled back. So looks like node1 needs a rejoin.
> 
> Confirm message is equivalent to a NOP - @sergepetrenko apparently does
> implementation exactly this way. So there's no need to roll it back in
> an engine, rather perform the xlog rotation before it.
> 
> >
> > > At this point a snapshot can be created at ID1 with appropriate WAL
> > > rotation. The old WAL should be renamed so it will not be reused in the
> > > future and can be kept for postmortem.
> > > ```
> > > +---------------------+---------------------+---------------------+
> > > | ID1                 | ID2                 | ID3                 |
> > > | Replica 1           | Leader              | Replica 2           |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx4             | ID1 Tx4             | ID1 Tx4             |
> > > +---------------------+---------------------+---------------------+
> > > | ID1 Tx5             | ID1 Tx5             | ID1 Tx5             |
> > > +---------------------+---------------------+---------------------+
> > > |                     | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] |
> > > +---------------------+---------------------+---------------------+
> > > |                     | ID2 Tx1             | ID2 Tx1             |
> > > +---------------------+---------------------+---------------------+
> > > |                     | ID2 Tx2             | ID2 Tx2             |
> > > +---------------------+---------------------+---------------------+
> > > ```
> > > Although, in case undo log is not enough to cover the WAL inconsistence
> > > with the new leader, the ID1 needs a complete rejoin.
