* [Tarantool-patches] [RFC] Quorum-based synchronous replication @ 2020-04-03 21:08 Sergey Ostanevich 2020-04-07 13:02 ` Aleksandr Lyapunov ` (3 more replies) 0 siblings, 4 replies; 53+ messages in thread From: Sergey Ostanevich @ 2020-04-03 21:08 UTC (permalink / raw) To: tarantool-patches * **Status**: In progress * **Start date**: 31-03-2020 * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\> * **Issues**: ## Summary The aim of this RFC is to address the following list of problems formulated at MRG planning meeting: - protocol backward compatibility to enable cluster upgrade w/o downtime - consistency of data on replica and leader - switch from leader to replica without data loss - up to date replicas to run read-only requests - ability to switch async replicas into sync ones - guarantee of rollback on leader and sync replicas - simplicity of cluster orchestration What this RFC is not: - high availability (HA) solution with automated failover, roles assignments an so on - master-master configuration support ## Background and motivation There are number of known implemenatation of consistent data presence in a cluster. They can be commonly named as "wait for LSN" technique. The biggest issue with this technique is the abscence of rollback gauarantees at replica in case of transaction failure on one master or some of the replics in the cluster. To provide such capabilities a new functionality should be introduced in Tarantool core, with limitation mentioned before - backward compatilibity and ease of cluster orchestration. ## Detailed design ### Quorum commit The main idea behind the proposal is to reuse existent machinery as much as possible. It will ensure the well-tested and proven functionality across many instances in MRG and beyond is used. The transaction rollback mechanism is in place and works for WAL write failure. If we substitute the WAL success with a new situation which is named 'quorum' later in this document then no changes to the machinery is needed. The same is true for snapshot machinery that allows to create a copy of the database in memory for the whole period of snapshot file write. Adding quorum here also minimizes changes. Currently replication represented by the following scheme: ``` Customer Leader WAL(L) Replica WAL(R) |------TXN----->| | | | | | | | | | [TXN Rollback | | | | created] | | | | | | | | | |-----TXN----->| | | | | | | | | |<---WAL Ok----| | | | | | | | | [TXN Rollback | | | | destroyed] | | | | | | | | |<----TXN Ok----| | | | | |-------Replicate TXN------->| | | | | | | | | | [TXN Rollback | | | | created] | | | | | | | | | |-----TXN----->| | | | | | | | | |<---WAL Ok----| | | | | | | | | [TXN Rollback | | | | destroyed] | | | | | | ``` To introduce the 'quorum' we have to receive confirmation from replicas to make a decision on whether the quorum is actually present. Leader collects necessary amount of replicas confirmation plus its own WAL success. This state is named 'quorum' and gives leader the right to complete the customers' request. 
So the picture will change to: ``` Customer Leader WAL(L) Replica WAL(R) |------TXN----->| | | | | | | | | | [TXN Rollback | | | | created] | | | | | | | | | |-----TXN----->| | | | | | | | | |-------Replicate TXN------->| | | | | | | | | | [TXN Rollback | | |<---WAL Ok----| created] | | | | | | | [Waiting | |-----TXN----->| | of a quorum] | | | | | | |<---WAL Ok----| | | | | | | |<------Replication Ok-------| | | | | | | | [Quorum | | | | achieved] | | | | | | | | | [TXN Rollback | | | | destroyed] | | | | | | | | | |----Quorum--->| | | | | | | | | |-----------Quorum---------->| | | | | | | |<---TXN Ok-----| | [TXN Rollback | | | | destroyed] | | | | | | | | | |----Quorum--->| | | | | | ``` The quorum should be collected as a table for a list of transactions waiting for quorum. The latest transaction that collects the quorum is considered as complete, as well as all transactions prior to it, since all transactions should be applied in order. Leader writes a 'quorum' message to the WAL and it is delivered to Replicas. Replica should report a positive or a negative result of the TXN to the Leader via the IPROTO explicitly to allow Leader to collect the quorum or anti-quorum for the TXN. In case negative result for the TXN received from minor number of Replicas, then Leader has to send an error message to each Replica, which in turn has to disconnect from the replication the same way as it is done now in case of conflict. In case Leader receives enough error messages to do not achieve the quorum it should write the 'rollback' message in the WAL. After that Leader and Replicas will perform the rollback for all TXN that didn't receive quorum. ### Recovery and failover. Tarantool instance during reading WAL should postpone the commit until the quorum is read. In case the WAL eof is achieved, the instance should keep rollback for all transactions that are waiting for a quorum entry until the role of the instance is set. In case this instance become a Replica there are no additional actions needed, sine all info about quorum/rollback will arrive via replication. In case this instance is assigned a Leader role, it should write 'rollback' in its WAL and perform rollback for all transactions waiting for a quorum. In case of a Leader failure a Replica with the biggest LSN with former leader's ID is elected as a new leader. The replica should record 'rollback' in its WAL which effectively means that all transactions without quorum should be rolled back. This rollback will be delivered to all replicas and they will perform rollbacks of all transactions waiting for quorum. ### Snapshot generation. We also can reuse current machinery of snapshot generation. Upon receiving a request to create a snapshot an instance should request a readview for the current commit operation. Although start of the snapshot generation should be postponed until this commit operation receives its quorum. In case operation is rolled back, the snapshot generation should be aborted and restarted using current transaction after rollback is complete. After snapshot is created the WAL should start from the first operation that follows the commit operation snapshot is generated for. That means WAL will contain a quorum message that refers to a transaction that is not present in the WAL. Apparently, we have to allow this for the case quorum refers to a transaction with LSN less than the first entry in the WAL and only once. ### Asynchronous replication. Along with synchronous Replicas the cluster can contain asynchronous Replicas. 
That means an async Replica doesn't reply to the Leader with errors, since it is not contributing to the quorum. Still, async Replicas have to follow the new WAL operations, such as keeping rollback info until the 'quorum' message is received. This is essential for the case of a 'rollback' message appearing in the WAL: this message assumes the Replica is able to perform all the necessary rollback by itself. Cluster information should contain an explicit notification of each Replica's operation mode.

### Synchronous replication enabling.

Synchronous operation can be required for a set of spaces in the data scheme. That means only transactions that contain data modifications for these spaces should require a quorum. Such transactions are named synchronous. As soon as the last operation of a synchronous transaction appears in the Leader's WAL, it will make all following transactions - no matter whether they are synchronous or not - wait for the quorum. In case the quorum is not achieved, the 'rollback' operation will cause a rollback of all transactions after the synchronous one. It will ensure a consistent state of the data both on the Leader and the Replicas. In case the user doesn't require synchronous operation for any space, then no changes to the WAL generation and replication will appear.

The cluster description should contain an explicit attribute for each Replica to denote whether it participates in synchronous activities. The description should also contain a criterion for how many Replica responses are needed to achieve the quorum.

## Rationale and alternatives

There is an implementation of synchronous replication as part of the gh-980 activities, but it is not in a state to get into the product. Moreover, it intentionally breaks backward compatibility, which is a prerequisite for this proposal.

^ permalink raw reply [flat|nested] 53+ messages in thread
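To make the quorum bookkeeping described in the RFC above concrete, here is a minimal C sketch of the "table for a list of transactions waiting for quorum" on the leader. All names (struct pending_txn, quorum_table_ack, and so on) are illustrative assumptions rather than Tarantool source code, and replica acknowledgements are treated as cumulative, i.e. an ack for an LSN also acknowledges every earlier pending LSN.

```
#include <stdint.h>
#include <stddef.h>

/*
 * Illustrative leader-side bookkeeping for transactions waiting for a
 * quorum: a FIFO ordered by LSN. Names are assumptions, not the actual
 * Tarantool data structures.
 */
struct pending_txn {
	int64_t lsn;                /* LSN of the txn in the leader's WAL */
	int ack_count;              /* leader's own WAL ok + replica acks */
	struct pending_txn *next;
};

struct quorum_table {
	struct pending_txn *first;  /* oldest pending txn (lowest LSN)    */
	int quorum;                 /* acks required to confirm a txn     */
};

/*
 * Register a cumulative acknowledgement from one replica: an ack for
 * `lsn` also acknowledges every earlier pending txn. Returns the
 * highest LSN that has reached the quorum, or -1 if none did; all
 * transactions up to that LSN can be completed at once, since they are
 * applied strictly in order.
 */
static int64_t
quorum_table_ack(struct quorum_table *qt, int64_t lsn)
{
	int64_t confirmed = -1;
	for (struct pending_txn *p = qt->first;
	     p != NULL && p->lsn <= lsn; p = p->next) {
		if (++p->ack_count >= qt->quorum)
			confirmed = p->lsn;
	}
	return confirmed;
}
```

Once quorum_table_ack() returns a non-negative LSN, the leader would write the 'quorum' record for it to the WAL and drop the confirmed entries from the head of the list.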
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-03 21:08 [Tarantool-patches] [RFC] Quorum-based synchronous replication Sergey Ostanevich @ 2020-04-07 13:02 ` Aleksandr Lyapunov 2020-04-08 9:18 ` Sergey Ostanevich 2020-04-14 12:58 ` Sergey Bronnikov ` (2 subsequent siblings) 3 siblings, 1 reply; 53+ messages in thread From: Aleksandr Lyapunov @ 2020-04-07 13:02 UTC (permalink / raw) To: Sergey Ostanevich, tarantool-patches On 4/4/20 12:08 AM, Sergey Ostanevich wrote: > * **Status**: In progress > * **Start date**: 31-03-2020 > * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\> > * **Issues**: > > ## Summary > > The aim of this RFC is to address the following list of problems > formulated at MRG planning meeting: > - protocol backward compatibility to enable cluster upgrade w/o > downtime > - consistency of data on replica and leader > - switch from leader to replica without data loss > - up to date replicas to run read-only requests > - ability to switch async replicas into sync ones > - guarantee of rollback on leader and sync replicas > - simplicity of cluster orchestration > > What this RFC is not: > > - high availability (HA) solution with automated failover, roles > assignments an so on > - master-master configuration support > > > ## Background and motivation > > There are number of known implemenatation of consistent data presence in > a cluster. They can be commonly named as "wait for LSN" technique. The > biggest issue with this technique is the abscence of rollback gauarantees > at replica in case of transaction failure on one master or some of the > replics in the cluster. > > To provide such capabilities a new functionality should be introduced in > Tarantool core, with limitation mentioned before - backward compatilibity > and ease of cluster orchestration. > > ## Detailed design > > ### Quorum commit > The main idea behind the proposal is to reuse existent machinery as much > as possible. It will ensure the well-tested and proven functionality > across many instances in MRG and beyond is used. The transaction rollback > mechanism is in place and works for WAL write failure. If we substitute > the WAL success with a new situation which is named 'quorum' later in > this document then no changes to the machinery is needed. The same is > true for snapshot machinery that allows to create a copy of the database > in memory for the whole period of snapshot file write. Adding quorum here > also minimizes changes. > > Currently replication represented by the following scheme: > ``` > Customer Leader WAL(L) Replica WAL(R) > |------TXN----->| | | | > | | | | | > | [TXN Rollback | | | > | created] | | | > | | | | | > | |-----TXN----->| | | > | | | | | > | |<---WAL Ok----| | | > | | | | | > | [TXN Rollback | | | > | destroyed] | | | > | | | | | > |<----TXN Ok----| | | | > | |-------Replicate TXN------->| | > | | | | | > | | | [TXN Rollback | > | | | created] | > | | | | | > | | | |-----TXN----->| > | | | | | > | | | |<---WAL Ok----| > | | | | | > | | | [TXN Rollback | > | | | destroyed] | > | | | | | > ``` > > > To introduce the 'quorum' we have to receive confirmation from replicas > to make a decision on whether the quorum is actually present. Leader > collects necessary amount of replicas confirmation plus its own WAL > success. This state is named 'quorum' and gives leader the right to > complete the customers' request. 
So the picture will change to: > ``` > Customer Leader WAL(L) Replica WAL(R) > |------TXN----->| | | | > | | | | | > | [TXN Rollback | | | > | created] | | | > | | | | | > | |-----TXN----->| | | > | | | | | > | |-------Replicate TXN------->| | > | | | | | > | | | [TXN Rollback | > | |<---WAL Ok----| created] | > | | | | | > | [Waiting | |-----TXN----->| > | of a quorum] | | | > | | | |<---WAL Ok----| > | | | | | > | |<------Replication Ok-------| | > | | | | | > | [Quorum | | | > | achieved] | | | > | | | | | > | [TXN Rollback | | | > | destroyed] | | | > | | | | | > | |----Quorum--->| | | > | | | | | > | |-----------Quorum---------->| | > | | | | | > |<---TXN Ok-----| | [TXN Rollback | > | | | destroyed] | > | | | | | > | | | |----Quorum--->| > | | | | | > ``` > > The quorum should be collected as a table for a list of transactions > waiting for quorum. The latest transaction that collects the quorum is > considered as complete, as well as all transactions prior to it, since > all transactions should be applied in order. Leader writes a 'quorum' > message to the WAL and it is delivered to Replicas. I think we should cal the the message something like 'confirm' (not 'quorum'), and mention here that it has its own LSN. Besides, it's very similar to phase two of two-phase-commit, we'll need it later. > > Replica should report a positive or a negative result of the TXN to the > Leader via the IPROTO explicitly to allow Leader to collect the quorum > or anti-quorum for the TXN. In case negative result for the TXN received > from minor number of Replicas, then Leader has to send an error message > to each Replica, which in turn has to disconnect from the replication > the same way as it is done now in case of conflict. I'm sure that unconfirmed transactions must not be visible both on master and on replica since the could be aborted. We need read-committed. > > In case Leader receives enough error messages to do not achieve the > quorum it should write the 'rollback' message in the WAL. After that > Leader and Replicas will perform the rollback for all TXN that didn't > receive quorum. > > ### Recovery and failover. > > Tarantool instance during reading WAL should postpone the commit until > the quorum is read. In case the WAL eof is achieved, the instance should > keep rollback for all transactions that are waiting for a quorum entry > until the role of the instance is set. In case this instance become a > Replica there are no additional actions needed, sine all info about > quorum/rollback will arrive via replication. In case this instance is > assigned a Leader role, it should write 'rollback' in its WAL and > perform rollback for all transactions waiting for a quorum. > > In case of a Leader failure a Replica with the biggest LSN with former > leader's ID is elected as a new leader. The replica should record > 'rollback' in its WAL which effectively means that all transactions > without quorum should be rolled back. This rollback will be delivered to > all replicas and they will perform rollbacks of all transactions waiting > for quorum. > > ### Snapshot generation. > > We also can reuse current machinery of snapshot generation. Upon > receiving a request to create a snapshot an instance should request a > readview for the current commit operation. Although start of the > snapshot generation should be postponed until this commit operation > receives its quorum. 
In case operation is rolled back, the snapshot > generation should be aborted and restarted using current transaction > after rollback is complete. There is no guarantee that the replica will ever receive 'confirm' ('quorum') message, for example when the master is dead forever. That means that in some cases we are unable to make a snapshot.. But if we make unconfirmed transactions invisible, the current read view will give us exactly what we need, but I have no idea how to handle WAL rotation ('restart') in this case. > > After snapshot is created the WAL should start from the first operation > that follows the commit operation snapshot is generated for. That means > WAL will contain a quorum message that refers to a transaction that is > not present in the WAL. Apparently, we have to allow this for the case > quorum refers to a transaction with LSN less than the first entry in the > WAL and only once. Not 'only once', there could be several unconfirmed transactions and thus several 'confirm' messages. > > ### Asynchronous replication. > > Along with synchronous Replicas the cluster can contain asynchronous > Replicas. That means async Replica doesn't reply to the Leader with > errors since they're not contributing into quorum. Still, async > Replicas have to follow the new WAL operation, such as keep rollback > info until 'quorum' message is received. This is essential for the case > of 'rollback' message appearance in the WAL. This message assumes > Replica is able to perform all necessary rollback by itself. Cluster > information should contain explicit notification of each Replica > operation mode. > > ### Synchronous replication enabling. > > Synchronous operation can be required for a set of spaces in the data > scheme. That means only transactions that contain data modification for > these spaces should require quorum. Such transactions named synchronous. > As soon as last operation of synchronous transaction appeared in Leader's > WAL, it will cause all following transactions - matter if they are > synchronous or not - wait for the quorum. In case quorum is not achieved > the 'rollback' operation will cause rollback of all transactions after > the synchronous one. It will ensure the consistent state of the data both > on Leader and Replicas. In case user doesn't require synchronous operation > for any space then no changes to the WAL generation and replication will > appear. > > Cluster description should contain explicit attribute for each Replica > to denote it participates in synchronous activities. Also the description > should contain criterion on how many Replicas responses are needed to > achieve the quorum. > > > ## Rationale and alternatives > > There is an implementation of synchronous replication as part of gh-980 > activities, still it is not in a state to get into the product. More > than that it intentionally breaks backward compatibility which is a > prerequisite for this proposal. ^ permalink raw reply [flat|nested] 53+ messages in thread
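As an illustration of the 'confirm' record suggested in the review above, a possible payload could look like the sketch below. The field names are assumptions, not the actual xrow/IPROTO format; the point is that the record carries its own LSN and the LSN up to which transactions are committed, which also gives a natural read-committed visibility rule.

```
#include <stdbool.h>
#include <stdint.h>

/*
 * Possible payload of a 'confirm' WAL/replication record. Illustrative
 * field names only; the real row format would be defined in IPROTO.
 */
struct confirm_entry {
	int64_t lsn;            /* the confirm record's own LSN            */
	uint32_t replica_id;    /* id of the leader that issued the record */
	int64_t confirmed_lsn;  /* txns with lsn <= this one are committed */
};

/* With read-committed visibility only confirmed transactions are shown. */
static bool
txn_is_visible(int64_t txn_lsn, int64_t confirmed_lsn)
{
	return txn_lsn <= confirmed_lsn;
}
```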
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-07 13:02 ` Aleksandr Lyapunov @ 2020-04-08 9:18 ` Sergey Ostanevich 2020-04-08 14:05 ` Konstantin Osipov 0 siblings, 1 reply; 53+ messages in thread From: Sergey Ostanevich @ 2020-04-08 9:18 UTC (permalink / raw) To: Aleksandr Lyapunov; +Cc: tarantool-patches

Hi! Thanks for review!

Latest version is available at
https://github.com/tarantool/tarantool/blob/sergos/quorum-based-synchro/doc/rfc/quorum-based-synchro.md

> > The quorum should be collected as a table for a list of transactions
> > waiting for quorum. The latest transaction that collects the quorum is
> > considered as complete, as well as all transactions prior to it, since
> > all transactions should be applied in order. Leader writes a 'quorum'
> > message to the WAL and it is delivered to Replicas.
> I think we should cal the the message something like 'confirm'
> (not 'quorum'), and mention here that it has its own LSN.

I believe it was clear from the mention that it goes to WAL. Updated.

The quorum should be collected as a table for a list of transactions waiting for quorum. The latest transaction that collects the quorum is considered as complete, as well as all transactions prior to it, since all transactions should be applied in order. Leader writes a 'confirm' message to the WAL that refers to the transaction's LSN and it has its own LSN. This confirm message is delivered to all replicas through the existing replication mechanism.

> Besides, it's very similar to phase two of two-phase-commit,
> we'll need it later.

We already discussed this: the similarity ends as soon as one quorum means confirmation of the whole bunch of transactions before it, not just the one.

> > Replica should report a positive or a negative result of the TXN to the
> > Leader via the IPROTO explicitly to allow Leader to collect the quorum
> > or anti-quorum for the TXN. In case negative result for the TXN received
> > from minor number of Replicas, then Leader has to send an error message
> > to each Replica, which in turn has to disconnect from the replication
> > the same way as it is done now in case of conflict.
> I'm sure that unconfirmed transactions must not be visible both
> on master and on replica since the could be aborted.
> We need read-committed.

So far I don't envision any problems with read-committed after we enable a transaction manager similar to vinyl's. From the standpoint of replication the rollback message will cancel all transactions that are later than the confirmed one, no matter whether they are visible or not.

> > ### Snapshot generation.
> > We also can reuse current machinery of snapshot generation. Upon
> > receiving a request to create a snapshot an instance should request a
> > readview for the current commit operation. Although start of the
> > snapshot generation should be postponed until this commit operation
> > receives its quorum. In case operation is rolled back, the snapshot
> > generation should be aborted and restarted using current transaction
> > after rollback is complete.
> There is no guarantee that the replica will ever receive 'confirm'
> ('quorum') message, for example when the master is dead forever.
> That means that in some cases we are unable to make a snapshot..
> But if we make unconfirmed transactions invisible, the current
> read view will give us exactly what we need, but I have no idea
> how to handle WAL rotation ('restart') in this case.

Updated.
In case the master appears unavailable, a replica still has to be able to create a snapshot. The replica can perform a rollback for all transactions that are not confirmed and claim its LSN as the latest confirmed txn. Then it can create a snapshot in a regular way and start with a blank xlog file. All rolled back transactions will appear through the regular replication in case the master reappears later on.

> > After snapshot is created the WAL should start from the first operation
> > that follows the commit operation snapshot is generated for. That means
> > WAL will contain a quorum message that refers to a transaction that is
> > not present in the WAL. Apparently, we have to allow this for the case
> > quorum refers to a transaction with LSN less than the first entry in the
> > WAL and only once.
> Not 'only once', there could be several unconfirmed transactions
> and thus several 'confirm' messages.

Updated.

After snapshot is created the WAL should start from the first operation that follows the commit operation the snapshot is generated for. That means the WAL will contain 'confirm' messages that refer to transactions that are not present in the WAL. Apparently, we have to allow this for the case a 'confirm' refers to a transaction with an LSN less than the first entry in the WAL.

^ permalink raw reply [flat|nested] 53+ messages in thread
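A sketch of the relaxed recovery check implied by the updated text above: a 'confirm' record may legitimately refer to an LSN that precedes the first entry of the current WAL, because those transactions already went into the snapshot. The helper name and signature below are assumptions for illustration only.

```
#include <stdbool.h>
#include <stdint.h>

/*
 * Recovery-time sanity check for a 'confirm' record. A target LSN below
 * the first LSN stored in the WAL is acceptable: the confirmed
 * transactions are already part of the snapshot. Otherwise the record
 * must not confirm transactions that have not been read from the WAL yet.
 */
static bool
confirm_is_acceptable(int64_t confirmed_lsn, int64_t wal_first_lsn,
		      int64_t last_applied_lsn)
{
	if (confirmed_lsn < wal_first_lsn)
		return true;  /* refers to txns that went into the snapshot */
	return confirmed_lsn <= last_applied_lsn;
}
```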
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-08 9:18 ` Sergey Ostanevich @ 2020-04-08 14:05 ` Konstantin Osipov 2020-04-08 15:06 ` Sergey Ostanevich 0 siblings, 1 reply; 53+ messages in thread From: Konstantin Osipov @ 2020-04-08 14:05 UTC (permalink / raw) To: Sergey Ostanevich; +Cc: tarantool-patches

* Sergey Ostanevich <sergos@tarantool.org> [20/04/08 12:23]:

One thing I still don't understand is why settle on the RFC now, when the in-memory WAL is not in yet? There is an unpleasant risk of committing to something that turns out not to work out in the best possible way.

--
Konstantin Osipov, Moscow, Russia

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-08 14:05 ` Konstantin Osipov @ 2020-04-08 15:06 ` Sergey Ostanevich 0 siblings, 0 replies; 53+ messages in thread From: Sergey Ostanevich @ 2020-04-08 15:06 UTC (permalink / raw) To: Konstantin Osipov, Aleksandr Lyapunov, tarantool-patches

Hi! Thanks for review!

On 08 Apr 17:05, Konstantin Osipov wrote:
> * Sergey Ostanevich <sergos@tarantool.org> [20/04/08 12:23]:
>
> One thing I still don't understand is why settle on the RFC
> now, when the in-memory WAL is not in yet?

Does this RFC depend on the in-memory WAL after all? The principles formulated in the RFC neither rely on nor preclude any optimizations of the underlying infrastructure. I believe the in-memory WAL can be introduced independently. Correct me if I'm wrong.

> There is an unpleasant risk of committing to something that turns
> out not to work out in the best possible way.

That is the maxim of current MRG management: instead of perpetually inventing the 'best possible' without a clear roadmap - and never finishing it - identify what is needed and deliver it in a timely manner. Again, if you see any conflicts between the RFC and the technologies being developed - name them, let's try to resolve them.

Regards,
Sergos

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-03 21:08 [Tarantool-patches] [RFC] Quorum-based synchronous replication Sergey Ostanevich 2020-04-07 13:02 ` Aleksandr Lyapunov @ 2020-04-14 12:58 ` Sergey Bronnikov 2020-04-14 14:43 ` Sergey Ostanevich 2020-04-20 23:32 ` Vladislav Shpilevoy 2020-04-23 21:38 ` Vladislav Shpilevoy 3 siblings, 1 reply; 53+ messages in thread From: Sergey Bronnikov @ 2020-04-14 12:58 UTC (permalink / raw) To: Sergey Ostanevich; +Cc: tarantool-patches Hi, see 5 comments inline On 00:08 Sat 04 Apr , Sergey Ostanevich wrote: > > * **Status**: In progress > * **Start date**: 31-03-2020 > * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\> > * **Issues**: 1. Just for convenience, please add https://github.com/tarantool/tarantool/issues/4842 > ## Summary > > The aim of this RFC is to address the following list of problems > formulated at MRG planning meeting: > - protocol backward compatibility to enable cluster upgrade w/o > downtime > - consistency of data on replica and leader > - switch from leader to replica without data loss > - up to date replicas to run read-only requests > - ability to switch async replicas into sync ones 2. Ability to switch async replicas into sync ones and vice-versa? Or not? > - guarantee of rollback on leader and sync replicas > - simplicity of cluster orchestration > > What this RFC is not: > > - high availability (HA) solution with automated failover, roles > assignments an so on > - master-master configuration support > > > ## Background and motivation > > There are number of known implemenatation of consistent data presence in > a cluster. They can be commonly named as "wait for LSN" technique. The > biggest issue with this technique is the abscence of rollback gauarantees 3. typo: gauarantees -> guarantees > at replica in case of transaction failure on one master or some of the > replics in the cluster. 4. typo: replics -> replicas > > To provide such capabilities a new functionality should be introduced in > Tarantool core, with limitation mentioned before - backward compatilibity > and ease of cluster orchestration. 5. but there is nothing mentioned before about these limitations. > ## Detailed design > > ### Quorum commit > The main idea behind the proposal is to reuse existent machinery as much > as possible. It will ensure the well-tested and proven functionality > across many instances in MRG and beyond is used. The transaction rollback > mechanism is in place and works for WAL write failure. If we substitute > the WAL success with a new situation which is named 'quorum' later in > this document then no changes to the machinery is needed. The same is > true for snapshot machinery that allows to create a copy of the database > in memory for the whole period of snapshot file write. Adding quorum here > also minimizes changes. 
> > Currently replication represented by the following scheme: > ``` > Customer Leader WAL(L) Replica WAL(R) > |------TXN----->| | | | > | | | | | > | [TXN Rollback | | | > | created] | | | > | | | | | > | |-----TXN----->| | | > | | | | | > | |<---WAL Ok----| | | > | | | | | > | [TXN Rollback | | | > | destroyed] | | | > | | | | | > |<----TXN Ok----| | | | > | |-------Replicate TXN------->| | > | | | | | > | | | [TXN Rollback | > | | | created] | > | | | | | > | | | |-----TXN----->| > | | | | | > | | | |<---WAL Ok----| > | | | | | > | | | [TXN Rollback | > | | | destroyed] | > | | | | | > ``` > > > To introduce the 'quorum' we have to receive confirmation from replicas > to make a decision on whether the quorum is actually present. Leader > collects necessary amount of replicas confirmation plus its own WAL > success. This state is named 'quorum' and gives leader the right to > complete the customers' request. So the picture will change to: > ``` > Customer Leader WAL(L) Replica WAL(R) > |------TXN----->| | | | > | | | | | > | [TXN Rollback | | | > | created] | | | > | | | | | > | |-----TXN----->| | | > | | | | | > | |-------Replicate TXN------->| | > | | | | | > | | | [TXN Rollback | > | |<---WAL Ok----| created] | > | | | | | > | [Waiting | |-----TXN----->| > | of a quorum] | | | > | | | |<---WAL Ok----| > | | | | | > | |<------Replication Ok-------| | > | | | | | > | [Quorum | | | > | achieved] | | | > | | | | | > | [TXN Rollback | | | > | destroyed] | | | > | | | | | > | |----Quorum--->| | | > | | | | | > | |-----------Quorum---------->| | > | | | | | > |<---TXN Ok-----| | [TXN Rollback | > | | | destroyed] | > | | | | | > | | | |----Quorum--->| > | | | | | > ``` > > The quorum should be collected as a table for a list of transactions > waiting for quorum. The latest transaction that collects the quorum is > considered as complete, as well as all transactions prior to it, since > all transactions should be applied in order. Leader writes a 'quorum' > message to the WAL and it is delivered to Replicas. > > Replica should report a positive or a negative result of the TXN to the > Leader via the IPROTO explicitly to allow Leader to collect the quorum > or anti-quorum for the TXN. In case negative result for the TXN received > from minor number of Replicas, then Leader has to send an error message > to each Replica, which in turn has to disconnect from the replication > the same way as it is done now in case of conflict. > > In case Leader receives enough error messages to do not achieve the > quorum it should write the 'rollback' message in the WAL. After that > Leader and Replicas will perform the rollback for all TXN that didn't > receive quorum. > > ### Recovery and failover. > > Tarantool instance during reading WAL should postpone the commit until > the quorum is read. In case the WAL eof is achieved, the instance should > keep rollback for all transactions that are waiting for a quorum entry > until the role of the instance is set. In case this instance become a > Replica there are no additional actions needed, sine all info about > quorum/rollback will arrive via replication. In case this instance is > assigned a Leader role, it should write 'rollback' in its WAL and > perform rollback for all transactions waiting for a quorum. > > In case of a Leader failure a Replica with the biggest LSN with former > leader's ID is elected as a new leader. The replica should record > 'rollback' in its WAL which effectively means that all transactions > without quorum should be rolled back. 
This rollback will be delivered to > all replicas and they will perform rollbacks of all transactions waiting > for quorum. > > ### Snapshot generation. > > We also can reuse current machinery of snapshot generation. Upon > receiving a request to create a snapshot an instance should request a > readview for the current commit operation. Although start of the > snapshot generation should be postponed until this commit operation > receives its quorum. In case operation is rolled back, the snapshot > generation should be aborted and restarted using current transaction > after rollback is complete. > > After snapshot is created the WAL should start from the first operation > that follows the commit operation snapshot is generated for. That means > WAL will contain a quorum message that refers to a transaction that is > not present in the WAL. Apparently, we have to allow this for the case > quorum refers to a transaction with LSN less than the first entry in the > WAL and only once. > > ### Asynchronous replication. > > Along with synchronous Replicas the cluster can contain asynchronous > Replicas. That means async Replica doesn't reply to the Leader with > errors since they're not contributing into quorum. Still, async > Replicas have to follow the new WAL operation, such as keep rollback > info until 'quorum' message is received. This is essential for the case > of 'rollback' message appearance in the WAL. This message assumes > Replica is able to perform all necessary rollback by itself. Cluster > information should contain explicit notification of each Replica > operation mode. > > ### Synchronous replication enabling. > > Synchronous operation can be required for a set of spaces in the data > scheme. That means only transactions that contain data modification for > these spaces should require quorum. Such transactions named synchronous. > As soon as last operation of synchronous transaction appeared in Leader's > WAL, it will cause all following transactions - matter if they are > synchronous or not - wait for the quorum. In case quorum is not achieved > the 'rollback' operation will cause rollback of all transactions after > the synchronous one. It will ensure the consistent state of the data both > on Leader and Replicas. In case user doesn't require synchronous operation > for any space then no changes to the WAL generation and replication will > appear. > > Cluster description should contain explicit attribute for each Replica > to denote it participates in synchronous activities. Also the description > should contain criterion on how many Replicas responses are needed to > achieve the quorum. > > > ## Rationale and alternatives > > There is an implementation of synchronous replication as part of gh-980 > activities, still it is not in a state to get into the product. More > than that it intentionally breaks backward compatibility which is a > prerequisite for this proposal. -- sergeyb@ ^ permalink raw reply [flat|nested] 53+ messages in thread
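The "synchronous spaces" rule quoted at the end of the review above boils down to a per-transaction check: a transaction needs a quorum if at least one of its statements touches a space marked as synchronous. Below is a hedged C sketch; the types and the is_sync flag are simplified stand-ins, not necessarily the real schema options.

```
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the schema objects; not the real definitions. */
struct space_opts { bool is_sync; };
struct space { struct space_opts opts; };
struct txn_stmt {
	struct space *space;     /* space the statement modifies */
	struct txn_stmt *next;
};

/*
 * A transaction is synchronous - and therefore must wait for a quorum -
 * if at least one of its statements modifies a synchronous space.
 */
static bool
txn_requires_quorum(const struct txn_stmt *stmts)
{
	for (const struct txn_stmt *s = stmts; s != NULL; s = s->next) {
		if (s->space != NULL && s->space->opts.is_sync)
			return true;
	}
	return false;
}
```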
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-14 12:58 ` Sergey Bronnikov @ 2020-04-14 14:43 ` Sergey Ostanevich 2020-04-15 11:09 ` sergos 0 siblings, 1 reply; 53+ messages in thread From: Sergey Ostanevich @ 2020-04-14 14:43 UTC (permalink / raw) To: Sergey Bronnikov; +Cc: tarantool-patches Hi! Thanks for review! On 14 апр 15:58, Sergey Bronnikov wrote: > Hi, > > see 5 comments inline > > On 00:08 Sat 04 Apr , Sergey Ostanevich wrote: > > > > * **Status**: In progress > > * **Start date**: 31-03-2020 > > * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\> > > * **Issues**: > > 1. Just for convenience, please add https://github.com/tarantool/tarantool/issues/4842 > Done. > > ## Summary > > > > The aim of this RFC is to address the following list of problems > > formulated at MRG planning meeting: > > - protocol backward compatibility to enable cluster upgrade w/o > > downtime > > - consistency of data on replica and leader > > - switch from leader to replica without data loss > > - up to date replicas to run read-only requests > > - ability to switch async replicas into sync ones > > 2. Ability to switch async replicas into sync ones and vice-versa? Or not? > Both ways, updated. > > - guarantee of rollback on leader and sync replicas > > - simplicity of cluster orchestration > > > > What this RFC is not: > > > > - high availability (HA) solution with automated failover, roles > > assignments an so on > > - master-master configuration support > > > > > > ## Background and motivation > > > > There are number of known implemenatation of consistent data presence in > > a cluster. They can be commonly named as "wait for LSN" technique. The > > biggest issue with this technique is the abscence of rollback gauarantees > > 3. typo: gauarantees -> guarantees > done > > at replica in case of transaction failure on one master or some of the > > replics in the cluster. > > 4. typo: replics -> replicas > > done > > To provide such capabilities a new functionality should be introduced in > > Tarantool core, with limitation mentioned before - backward compatilibity > > and ease of cluster orchestration. > > 5. but there is nothing mentioned before about these limitations. > They were named as problems to address, so I renamed them as requirements. [cut] Pushed updated version to the branch. Thanks, Sergos ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-14 14:43 ` Sergey Ostanevich @ 2020-04-15 11:09 ` sergos 2020-04-15 14:50 ` sergos 0 siblings, 1 reply; 53+ messages in thread From: sergos @ 2020-04-15 11:09 UTC (permalink / raw) To: tarantool-patches Cc: Николай Карлов, Тимур Сафин Hi! The latest version is below, also available at https://github.com/tarantool/tarantool/blob/sergos/quorum-based-synchro/doc/rfc/quorum-based-synchro.md --- * **Status**: In progress * **Start date**: 31-03-2020 * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\> * **Issues**: https://github.com/tarantool/tarantool/issues/4842 ## Summary The aim of this RFC is to address the following list of problems formulated at MRG planning meeting: - protocol backward compatibility to enable cluster upgrade w/o downtime - consistency of data on replica and leader - switch from leader to replica without data loss - up to date replicas to run read-only requests - ability to switch async replicas into sync ones and vice versa - guarantee of rollback on leader and sync replicas - simplicity of cluster orchestration What this RFC is not: - high availability (HA) solution with automated failover, roles assignments an so on - master-master configuration support ## Background and motivation There are number of known implementation of consistent data presence in a cluster. They can be commonly named as "wait for LSN" technique. The biggest issue with this technique is thecompatibility absence of rollback guarantees at replica in case of transaction failure on one master or some of the replicas in the cluster. To provide such capabilities a new functionality should be introduced in Tarantool core, with requirements mentioned before - backward compatilibity and ease of cluster orchestration. ## Detailed design ### Quorum commit The main idea behind the proposal is to reuse existent machinery as much as possible. It will ensure the well-tested and proven functionality across many instances in MRG and beyond is used. The transaction rollback mechanism is in place and works for WAL write failure. If we substitute the WAL success with a new situation which is named 'quorum' later in this document then no changes to the machinery is needed. The same is true for snapshot machinery that allows to create a copy of the database in memory for the whole period of snapshot file write. Adding quorum here also minimizes changes. Currently replication represented by the following scheme: ``` Customer Leader WAL(L) Replica WAL(R) |------TXN----->| | | | | | | | | | [TXN Rollback | | | | created] | | | | | | | | | |-----TXN----->| | | | | | | | | |<---WAL Ok----| | | | | | | | | [TXN Rollback | | | | destroyed] | | | | | | | | |<----TXN Ok----| | | | | |-------Replicate TXN------->| | | | | | | | | | [TXN Rollback | | | | created] | | | | | | | | | |-----TXN----->| | | | | | | | | |<---WAL Ok----| | | | | | | | | [TXN Rollback | | | | destroyed] | | | | | | ``` To introduce the 'quorum' we have to receive confirmation from replicas to make a decision on whether the quorum is actually present. Leader collects necessary amount of replicas confirmation plus its own WAL success. This state is named 'quorum' and gives leader the right to complete the customers' request. 
So the picture will change to: ``` Customer Leader WAL(L) Replica WAL(R) |------TXN----->| | | | | | | | | | [TXN Rollback | | | | created] | | | | | | | | | |-----TXN----->| | | | | | | | | |-------Replicate TXN------->| | | | | | | | | | [TXN Rollback | | |<---WAL Ok----| created] | | | | | | | [Waiting | |-----TXN----->| | of a quorum] | | | | | | |<---WAL Ok----| | | | | | | |<------Replication Ok-------| | | | | | | | [Quorum | | | | achieved] | | | | | | | | | [TXN Rollback | | | | destroyed] | | | | | | | | | |---Confirm--->| | | | | | | | | |----------Confirm---------->| | | | | | | |<---TXN Ok-----| | [TXN Rollback | | | | destroyed] | | | | | | | | | |---Confirm--->| | | | | | ``` The quorum should be collected as a table for a list of transactions waiting for quorum. The latest transaction that collects the quorum is considered as complete, as well as all transactions prior to it, since all transactions should be applied in order. Leader writes a 'confirm' message to the WAL that refers to the transaction's LSN and it has its own LSN. This confirm message is delivered to all replicas through the existing replication mechanism. Replica should report a positive or a negative result of the TXN to the leader via the IPROTO explicitly to allow leader to collect the quorum or anti-quorum for the TXN. In case a negative result for the TXN is received from minor number of replicas, then leader has to send an error message to the replicas, which in turn have to disconnect from the replication the same way as it is done now in case of conflict. In case leader receives enough error messages to do not achieve the quorum it should write the 'rollback' message in the WAL. After that leader and replicas will perform the rollback for all TXN that didn't receive quorum. ### Recovery and failover. Tarantool instance during reading WAL should postpone the commit until the 'confirm' is read. In case the WAL eof is achieved, the instance should keep rollback for all transactions that are waiting for a confirm entry until the role of the instance is set. In case this instance become a replica there are no additional actions needed, since all info about quorum/rollback will arrive via replication. In case this instance is assigned a leader role, it should write 'rollback' in its WAL and perform rollback for all transactions waiting for a quorum. In case of a leader failure a replica with the biggest LSN with former leader's ID is elected as a new leader. The replica should record 'rollback' in its WAL which effectively means that all transactions without quorum should be rolled back. This rollback will be delivered to all replicas and they will perform rollbacks of all transactions waiting for quorum. An interface to force apply pending transactions by issuing a confirm entry for them have to be introduced for manual recovery. ### Snapshot generation. We also can reuse current machinery of snapshot generation. Upon receiving a request to create a snapshot an instance should request a readview for the current commit operation. Although start of the snapshot generation should be postponed until this commit operation receives its confirmation. In case operation is rolled back, the snapshot generation should be aborted and restarted using current transaction after rollback is complete. After snapshot is created the WAL should start from the first operation that follows the commit operation snapshot is generated for. 
That means WAL will contain 'confirm' messages that refer to transactions that are not present in the WAL. Apparently, we have to allow this for the case 'confirm' refers to a transaction with LSN less than the first entry in the WAL. In case master appears unavailable a replica still have to be able to create a snapshot. Replica can perform rollback for all transactions that are not confirmed and claim its LSN as the latest confirmed txn. Then it can create a snapshot in a regular way and start with blank xlog file. All rolled back transactions will appear through the regular replication in case master reappears later on. ### Asynchronous replication. Along with synchronous replicas the cluster can contain asynchronous replicas. That means async replica doesn't reply to the leader with errors since they're not contributing into quorum. Still, async replicas have to follow the new WAL operation, such as keep rollback info until 'quorum' message is received. This is essential for the case of 'rollback' message appearance in the WAL. This message assumes replica is able to perform all necessary rollback by itself. Cluster information should contain explicit notification of each replica operation mode. ### Synchronous replication enabling. Synchronous operation can be required for a set of spaces in the data scheme. That means only transactions that contain data modification for these spaces should require quorum. Such transactions named synchronous. As soon as last operation of synchronous transaction appeared in leader's WAL, it will cause all following transactions - matter if they are synchronous or not - wait for the quorum. In case quorum is not achieved the 'rollback' operation will cause rollback of all transactions after the synchronous one. It will ensure the consistent state of the data both on leader and replicas. In case user doesn't require synchronous operation for any space then no changes to the WAL generation and replication will appear. Cluster description should contain explicit attribute for each replica to denote it participates in synchronous activities. Also the description should contain criterion on how many replicas responses are needed to achieve the quorum. ## Rationale and alternatives There is an implementation of synchronous replication as part of gh-980 activities, still it is not in a state to get into the product. More than that it intentionally breaks backward compatibility which is a prerequisite for this proposal. ^ permalink raw reply [flat|nested] 53+ messages in thread
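For the failover rule in the updated RFC above ("a replica with the biggest LSN with former leader's ID is elected as a new leader"), a minimal sketch could look like the code below; the candidate structure is a simplified stand-in for the real vclock bookkeeping.

```
#include <stdint.h>
#include <stddef.h>

/*
 * Simplified per-replica election state: the LSN each candidate has in
 * the vclock component of the failed leader. Illustrative only.
 */
struct candidate {
	uint32_t id;            /* candidate replica id                   */
	int64_t old_leader_lsn; /* LSN known for the former leader's id   */
};

/* Pick the replica that has replicated the most of the old leader's WAL. */
static const struct candidate *
elect_new_leader(const struct candidate *c, size_t n)
{
	const struct candidate *best = NULL;
	for (size_t i = 0; i < n; i++) {
		if (best == NULL || c[i].old_leader_lsn > best->old_leader_lsn)
			best = &c[i];
	}
	return best;
}
```

The elected replica would then write the 'rollback' record to its WAL, which rolls back every transaction that had not collected a quorum, on itself and, via replication, on the other replicas.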
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-15 11:09 ` sergos @ 2020-04-15 14:50 ` sergos 2020-04-16 7:13 ` Aleksandr Lyapunov ` (2 more replies) 0 siblings, 3 replies; 53+ messages in thread From: sergos @ 2020-04-15 14:50 UTC (permalink / raw) To: Николай Карлов, Тимур Сафин, Mons Anderson, Aleksandr Lyapunov, Sergey Bronnikov Cc: tarantool-patches Sorry for mess introduced by mail client in previous message. Here’s the correct version with 3 more misprints fixed. The version is available here https://github.com/tarantool/tarantool/blob/sergos/quorum-based-synchro/doc/rfc/quorum-based-synchro.md Please, reply all with your comments/blessings today. Regards, Sergos --- * **Status**: In progress * **Start date**: 31-03-2020 * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\> * **Issues**: https://github.com/tarantool/tarantool/issues/4842 ## Summary The aim of this RFC is to address the following list of problems formulated at MRG planning meeting: - protocol backward compatibility to enable cluster upgrade w/o downtime - consistency of data on replica and leader - switch from leader to replica without data loss - up to date replicas to run read-only requests - ability to switch async replicas into sync ones and vice versa - guarantee of rollback on leader and sync replicas - simplicity of cluster orchestration What this RFC is not: - high availability (HA) solution with automated failover, roles assignments an so on - master-master configuration support ## Background and motivation There are number of known implementation of consistent data presence in a cluster. They can be commonly named as "wait for LSN" technique. The biggest issue with this technique is the absence of rollback guarantees at replica in case of transaction failure on one master or some of the replicas in the cluster. To provide such capabilities a new functionality should be introduced in Tarantool core, with requirements mentioned before - backward compatibility and ease of cluster orchestration. ## Detailed design ### Quorum commit The main idea behind the proposal is to reuse existent machinery as much as possible. It will ensure the well-tested and proven functionality across many instances in MRG and beyond is used. The transaction rollback mechanism is in place and works for WAL write failure. If we substitute the WAL success with a new situation which is named 'quorum' later in this document then no changes to the machinery is needed. The same is true for snapshot machinery that allows to create a copy of the database in memory for the whole period of snapshot file write. Adding quorum here also minimizes changes. Currently replication represented by the following scheme: ``` Customer Leader WAL(L) Replica WAL(R) |------TXN----->| | | | | | | | | | [TXN Rollback | | | | created] | | | | | | | | | |-----TXN----->| | | | | | | | | |<---WAL Ok----| | | | | | | | | [TXN Rollback | | | | destroyed] | | | | | | | | |<----TXN Ok----| | | | | |-------Replicate TXN------->| | | | | | | | | | [TXN Rollback | | | | created] | | | | | | | | | |-----TXN----->| | | | | | | | | |<---WAL Ok----| | | | | | | | | [TXN Rollback | | | | destroyed] | | | | | | ``` To introduce the 'quorum' we have to receive confirmation from replicas to make a decision on whether the quorum is actually present. Leader collects necessary amount of replicas confirmation plus its own WAL success. This state is named 'quorum' and gives leader the right to complete the customers' request. 
So the picture will change to: ``` Customer Leader WAL(L) Replica WAL(R) |------TXN----->| | | | | | | | | | [TXN Rollback | | | | created] | | | | | | | | | |-----TXN----->| | | | | | | | | |-------Replicate TXN------->| | | | | | | | | | [TXN Rollback | | |<---WAL Ok----| created] | | | | | | | [Waiting | |-----TXN----->| | of a quorum] | | | | | | |<---WAL Ok----| | | | | | | |<------Replication Ok-------| | | | | | | | [Quorum | | | | achieved] | | | | | | | | | [TXN Rollback | | | | destroyed] | | | | | | | | | |---Confirm--->| | | | | | | | | |----------Confirm---------->| | | | | | | |<---TXN Ok-----| | [TXN Rollback | | | | destroyed] | | | | | | | | | |---Confirm--->| | | | | | ``` The quorum should be collected as a table for a list of transactions waiting for quorum. The latest transaction that collects the quorum is considered as complete, as well as all transactions prior to it, since all transactions should be applied in order. Leader writes a 'confirm' message to the WAL that refers to the transaction's LSN and it has its own LSN. This confirm message is delivered to all replicas through the existing replication mechanism. Replica should report a positive or a negative result of the TXN to the leader via the IPROTO explicitly to allow leader to collect the quorum or anti-quorum for the TXN. In case a negative result for the TXN is received from minor number of replicas, then leader has to send an error message to the replicas, which in turn have to disconnect from the replication the same way as it is done now in case of conflict. In case leader receives enough error messages to do not achieve the quorum it should write the 'rollback' message in the WAL. After that leader and replicas will perform the rollback for all TXN that didn't receive quorum. ### Recovery and failover. Tarantool instance during reading WAL should postpone the commit until the 'confirm' is read. In case the WAL eof is achieved, the instance should keep rollback for all transactions that are waiting for a confirm entry until the role of the instance is set. In case this instance become a replica there are no additional actions needed, since all info about quorum/rollback will arrive via replication. In case this instance is assigned a leader role, it should write 'rollback' in its WAL and perform rollback for all transactions waiting for a quorum. In case of a leader failure a replica with the biggest LSN with former leader's ID is elected as a new leader. The replica should record 'rollback' in its WAL which effectively means that all transactions without quorum should be rolled back. This rollback will be delivered to all replicas and they will perform rollbacks of all transactions waiting for quorum. An interface to force apply pending transactions by issuing a confirm entry for them have to be introduced for manual recovery. ### Snapshot generation. We also can reuse current machinery of snapshot generation. Upon receiving a request to create a snapshot an instance should request a readview for the current commit operation. Although start of the snapshot generation should be postponed until this commit operation receives its confirmation. In case operation is rolled back, the snapshot generation should be aborted and restarted using current transaction after rollback is complete. After snapshot is created the WAL should start from the first operation that follows the commit operation snapshot is generated for. 
That means the WAL will contain 'confirm' messages that refer to transactions that are not present in the WAL. Apparently, we have to allow this for the case a 'confirm' refers to a transaction with an LSN less than the first entry in the WAL.

In case the master appears unavailable, a replica still has to be able to create a snapshot. The replica can perform a rollback for all transactions that are not confirmed and claim its LSN as the latest confirmed txn. Then it can create a snapshot in a regular way and start with a blank xlog file. All rolled back transactions will appear through the regular replication in case the master reappears later on.

### Asynchronous replication.

Along with synchronous replicas the cluster can contain asynchronous replicas. That means an async replica doesn't reply to the leader with errors, since it is not contributing to the quorum. Still, async replicas have to follow the new WAL operations, such as keeping rollback info until the 'quorum' message is received. This is essential for the case of a 'rollback' message appearing in the WAL: this message assumes the replica is able to perform all the necessary rollback by itself. Cluster information should contain an explicit notification of each replica's operation mode.

### Synchronous replication enabling.

Synchronous operation can be required for a set of spaces in the data scheme. That means only transactions that contain data modifications for these spaces should require a quorum. Such transactions are named synchronous. As soon as the last operation of a synchronous transaction appears in the leader's WAL, it will make all following transactions - no matter whether they are synchronous or not - wait for the quorum. In case the quorum is not achieved, the 'rollback' operation will cause a rollback of all transactions after the synchronous one. It will ensure a consistent state of the data both on the leader and the replicas. In case the user doesn't require synchronous operation for any space, then no changes to the WAL generation and replication will appear.

The cluster description should contain an explicit attribute for each replica to denote whether it participates in synchronous activities. The description should also contain a criterion for how many replica responses are needed to achieve the quorum.

## Rationale and alternatives

There is an implementation of synchronous replication as part of the gh-980 activities, but it is not in a state to get into the product. Moreover, it intentionally breaks backward compatibility, which is a prerequisite for this proposal.

^ permalink raw reply [flat|nested] 53+ messages in thread
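To summarize the confirm/rollback decision the RFC above describes ("in case leader receives enough error messages to not achieve the quorum it should write the 'rollback' message"), here is a small C sketch of the leader-side rule, assuming `total` voting instances (leader included) and a configured quorum `q`; the names and the enum are illustrative, not the actual implementation.

```
/*
 * Leader-side confirm/rollback decision for one pending transaction,
 * given `acks` positive and `errs` negative responses received so far.
 */
enum txn_decision { TXN_WAIT, TXN_CONFIRM, TXN_ROLLBACK };

static enum txn_decision
txn_decide(int total, int q, int acks, int errs)
{
	if (acks >= q)
		return TXN_CONFIRM;   /* write 'confirm' to the WAL       */
	if (total - errs < q)
		return TXN_ROLLBACK;  /* the quorum can no longer be met  */
	return TXN_WAIT;              /* keep waiting for more responses  */
}
```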
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-15 14:50 ` sergos @ 2020-04-16 7:13 ` Aleksandr Lyapunov 2020-04-17 10:10 ` Konstantin Osipov 2020-04-20 11:20 ` Serge Petrenko 2 siblings, 0 replies; 53+ messages in thread From: Aleksandr Lyapunov @ 2020-04-16 7:13 UTC (permalink / raw) To: sergos, Николай Карлов, Тимур Сафин, Mons Anderson, Sergey Bronnikov Cc: tarantool-patches lgtm On 4/15/20 5:50 PM, sergos@tarantool.org wrote: > Sorry for mess introduced by mail client in previous message. > Here’s the correct version with 3 more misprints fixed. > > The version is available here > https://github.com/tarantool/tarantool/blob/sergos/quorum-based-synchro/doc/rfc/quorum-based-synchro.md > > Please, reply all with your comments/blessings today. > > Regards, > Sergos > > --- > * **Status**: In progress > * **Start date**: 31-03-2020 > * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\> > * **Issues**: https://github.com/tarantool/tarantool/issues/4842 > > ## Summary > > The aim of this RFC is to address the following list of problems > formulated at MRG planning meeting: > - protocol backward compatibility to enable cluster upgrade w/o > downtime > - consistency of data on replica and leader > - switch from leader to replica without data loss > - up to date replicas to run read-only requests > - ability to switch async replicas into sync ones and vice versa > - guarantee of rollback on leader and sync replicas > - simplicity of cluster orchestration > > What this RFC is not: > > - high availability (HA) solution with automated failover, roles > assignments an so on > - master-master configuration support > > ## Background and motivation > > There are number of known implementation of consistent data presence in > a cluster. They can be commonly named as "wait for LSN" technique. The > biggest issue with this technique is the absence of rollback guarantees > at replica in case of transaction failure on one master or some of the > replicas in the cluster. > > To provide such capabilities a new functionality should be introduced in > Tarantool core, with requirements mentioned before - backward > compatibility and ease of cluster orchestration. > > ## Detailed design > > ### Quorum commit > > The main idea behind the proposal is to reuse existent machinery as much > as possible. It will ensure the well-tested and proven functionality > across many instances in MRG and beyond is used. The transaction rollback > mechanism is in place and works for WAL write failure. If we substitute > the WAL success with a new situation which is named 'quorum' later in > this document then no changes to the machinery is needed. The same is > true for snapshot machinery that allows to create a copy of the database > in memory for the whole period of snapshot file write. Adding quorum here > also minimizes changes. 
> > Currently replication represented by the following scheme: > ``` > Customer Leader WAL(L) Replica WAL(R) > |------TXN----->| | | | > | | | | | > | [TXN Rollback | | | > | created] | | | > | | | | | > | |-----TXN----->| | | > | | | | | > | |<---WAL Ok----| | | > | | | | | > | [TXN Rollback | | | > | destroyed] | | | > | | | | | > |<----TXN Ok----| | | | > | |-------Replicate TXN------->| | > | | | | | > | | | [TXN Rollback | > | | | created] | > | | | | | > | | | |-----TXN----->| > | | | | | > | | | |<---WAL Ok----| > | | | | | > | | | [TXN Rollback | > | | | destroyed] | > | | | | | > ``` > > To introduce the 'quorum' we have to receive confirmation from replicas > to make a decision on whether the quorum is actually present. Leader > collects necessary amount of replicas confirmation plus its own WAL > success. This state is named 'quorum' and gives leader the right to > complete the customers' request. So the picture will change to: > ``` > Customer Leader WAL(L) Replica WAL(R) > |------TXN----->| | | | > | | | | | > | [TXN Rollback | | | > | created] | | | > | | | | | > | |-----TXN----->| | | > | | | | | > | |-------Replicate TXN------->| | > | | | | | > | | | [TXN Rollback | > | |<---WAL Ok----| created] | > | | | | | > | [Waiting | |-----TXN----->| > | of a quorum] | | | > | | | |<---WAL Ok----| > | | | | | > | |<------Replication Ok-------| | > | | | | | > | [Quorum | | | > | achieved] | | | > | | | | | > | [TXN Rollback | | | > | destroyed] | | | > | | | | | > | |---Confirm--->| | | > | | | | | > | |----------Confirm---------->| | > | | | | | > |<---TXN Ok-----| | [TXN Rollback | > | | | destroyed] | > | | | | | > | | | |---Confirm--->| > | | | | | > ``` > > The quorum should be collected as a table for a list of transactions > waiting for quorum. The latest transaction that collects the quorum is > considered as complete, as well as all transactions prior to it, since > all transactions should be applied in order. Leader writes a 'confirm' > message to the WAL that refers to the transaction's LSN and it has its > own LSN. This confirm message is delivered to all replicas through the > existing replication mechanism. > > Replica should report a positive or a negative result of the TXN to the > leader via the IPROTO explicitly to allow leader to collect the quorum > or anti-quorum for the TXN. In case a negative result for the TXN is > received from minor number of replicas, then leader has to send an error > message to the replicas, which in turn have to disconnect from the > replication the same way as it is done now in case of conflict. > > In case leader receives enough error messages to do not achieve the > quorum it should write the 'rollback' message in the WAL. After that > leader and replicas will perform the rollback for all TXN that didn't > receive quorum. > > ### Recovery and failover. > > Tarantool instance during reading WAL should postpone the commit until > the 'confirm' is read. In case the WAL eof is achieved, the instance > should keep rollback for all transactions that are waiting for a confirm > entry until the role of the instance is set. In case this instance > become a replica there are no additional actions needed, since all info > about quorum/rollback will arrive via replication. In case this instance > is assigned a leader role, it should write 'rollback' in its WAL and > perform rollback for all transactions waiting for a quorum. > > In case of a leader failure a replica with the biggest LSN with former > leader's ID is elected as a new leader. 
The replica should record > 'rollback' in its WAL which effectively means that all transactions > without quorum should be rolled back. This rollback will be delivered to > all replicas and they will perform rollbacks of all transactions waiting > for quorum. > > An interface to force apply pending transactions by issuing a confirm > entry for them have to be introduced for manual recovery. > > ### Snapshot generation. > > We also can reuse current machinery of snapshot generation. Upon > receiving a request to create a snapshot an instance should request a > readview for the current commit operation. Although start of the > snapshot generation should be postponed until this commit operation > receives its confirmation. In case operation is rolled back, the snapshot > generation should be aborted and restarted using current transaction > after rollback is complete. > > After snapshot is created the WAL should start from the first operation > that follows the commit operation snapshot is generated for. That means > WAL will contain 'confirm' messages that refer to transactions that are > not present in the WAL. Apparently, we have to allow this for the case > 'confirm' refers to a transaction with LSN less than the first entry in > the WAL. > > In case master appears unavailable a replica still have to be able to > create a snapshot. Replica can perform rollback for all transactions that > are not confirmed and claim its LSN as the latest confirmed txn. Then it > can create a snapshot in a regular way and start with blank xlog file. > All rolled back transactions will appear through the regular replication > in case master reappears later on. > > ### Asynchronous replication. > > Along with synchronous replicas the cluster can contain asynchronous > replicas. That means async replica doesn't reply to the leader with > errors since they're not contributing into quorum. Still, async > replicas have to follow the new WAL operation, such as keep rollback > info until 'quorum' message is received. This is essential for the case > of 'rollback' message appearance in the WAL. This message assumes > replica is able to perform all necessary rollback by itself. Cluster > information should contain explicit notification of each replica > operation mode. > > ### Synchronous replication enabling. > > Synchronous operation can be required for a set of spaces in the data > scheme. That means only transactions that contain data modification for > these spaces should require quorum. Such transactions named synchronous. > As soon as last operation of synchronous transaction appeared in leader's > WAL, it will cause all following transactions - matter if they are > synchronous or not - wait for the quorum. In case quorum is not achieved > the 'rollback' operation will cause rollback of all transactions after > the synchronous one. It will ensure the consistent state of the data both > on leader and replicas. In case user doesn't require synchronous operation > for any space then no changes to the WAL generation and replication will > appear. > > Cluster description should contain explicit attribute for each replica > to denote it participates in synchronous activities. Also the description > should contain criterion on how many replicas responses are needed to > achieve the quorum. > > ## Rationale and alternatives > > There is an implementation of synchronous replication as part of gh-980 > activities, still it is not in a state to get into the product. 
More > than that it intentionally breaks backward compatibility which is a > prerequisite for this proposal. > > > ^ permalink raw reply [flat|nested] 53+ messages in thread
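The 'confirm' semantics quoted above - the entry has its own LSN, refers to
the newest confirmed transaction, and completes everything pending up to
that LSN - can be sketched as follows. The `limbo` and `pending_txn`
structures are illustrative only and are not the actual Tarantool data
structures.

```c
/*
 * Sketch of how a replica could apply a 'confirm' entry from the
 * replication stream: the entry refers to the LSN of the newest
 * confirmed transaction; everything pending up to that LSN is
 * committed, later entries keep waiting.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_PENDING 16

struct pending_txn {
	int64_t lsn;
	int committed;		/* 0 = waiting for confirm, 1 = confirmed */
};

struct limbo {
	struct pending_txn txns[MAX_PENDING];
	int count;
};

/* Commit every pending transaction with lsn <= confirm_lsn. */
static void
limbo_apply_confirm(struct limbo *l, int64_t confirm_lsn)
{
	for (int i = 0; i < l->count; i++) {
		if (l->txns[i].lsn <= confirm_lsn)
			l->txns[i].committed = 1;
	}
}

int
main(void)
{
	struct limbo l = {{{101, 0}, {102, 0}, {103, 0}}, 3};
	/* 'confirm' referring to lsn 102 commits 101 and 102, not 103. */
	limbo_apply_confirm(&l, 102);
	for (int i = 0; i < l.count; i++)
		printf("lsn %lld committed=%d\n",
		       (long long)l.txns[i].lsn, l.txns[i].committed);
	return 0;
}
```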
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-15 14:50 ` sergos 2020-04-16 7:13 ` Aleksandr Lyapunov @ 2020-04-17 10:10 ` Konstantin Osipov 2020-04-17 13:45 ` Sergey Ostanevich 2020-04-20 11:20 ` Serge Petrenko 2 siblings, 1 reply; 53+ messages in thread From: Konstantin Osipov @ 2020-04-17 10:10 UTC (permalink / raw) To: sergos Cc: Николай Карлов, Mons Anderson, tarantool-patches, Тимур Сафин * sergos@tarantool.org <sergos@tarantool.org> [20/04/15 17:51]: > ### Quorum commit This part looks correct. It only describes two paths out of many though: - leader is able to collect the majority - leader is not able to collect the majority What happens when a leader receives a message for a round which is complete? How does a replica which missed a round catch up? What happens if replica fails to apply txn 1 (e.g. because of a duplciate key), but confirms txn 2? What happens if txn1 gets no majority at the leader, but txn 2 gets a majority? How are the followers rolled back? > The main idea behind the proposal is to reuse existent machinery as much > as possible. It will ensure the well-tested and proven functionality > across many instances in MRG and beyond is used. The transaction rollback > mechanism is in place and works for WAL write failure. If we substitute > the WAL success with a new situation which is named 'quorum' later in > this document then no changes to the machinery is needed. The same is > true for snapshot machinery that allows to create a copy of the database > in memory for the whole period of snapshot file write. Adding quorum here > also minimizes changes. > > Currently replication represented by the following scheme: > ``` > Customer Leader WAL(L) Replica WAL(R) > |------TXN----->| | | | > | | | | | > | [TXN Rollback | | | > | created] | | | > | | | | | > | |-----TXN----->| | | > | | | | | > | |<---WAL Ok----| | | > | | | | | > | [TXN Rollback | | | > | destroyed] | | | > | | | | | > |<----TXN Ok----| | | | > | |-------Replicate TXN------->| | > | | | | | > | | | [TXN Rollback | > | | | created] | > | | | | | > | | | |-----TXN----->| > | | | | | > | | | |<---WAL Ok----| > | | | | | > | | | [TXN Rollback | > | | | destroyed] | > | | | | | > ``` > > To introduce the 'quorum' we have to receive confirmation from replicas > to make a decision on whether the quorum is actually present. Leader > collects necessary amount of replicas confirmation plus its own WAL > success. This state is named 'quorum' and gives leader the right to > complete the customers' request. So the picture will change to: > ``` > Customer Leader WAL(L) Replica WAL(R) > |------TXN----->| | | | > | | | | | > | [TXN Rollback | | | > | created] | | | > | | | | | > | |-----TXN----->| | | > | | | | | > | |-------Replicate TXN------->| | > | | | | | > | | | [TXN Rollback | > | |<---WAL Ok----| created] | > | | | | | > | [Waiting | |-----TXN----->| > | of a quorum] | | | > | | | |<---WAL Ok----| > | | | | | > | |<------Replication Ok-------| | > | | | | | > | [Quorum | | | > | achieved] | | | > | | | | | > | [TXN Rollback | | | > | destroyed] | | | > | | | | | > | |---Confirm--->| | | > | | | | | > | |----------Confirm---------->| | > | | | | | > |<---TXN Ok-----| | [TXN Rollback | > | | | destroyed] | > | | | | | > | | | |---Confirm--->| > | | | | | > ``` > > The quorum should be collected as a table for a list of transactions > waiting for quorum. 
The latest transaction that collects the quorum is > considered as complete, as well as all transactions prior to it, since > all transactions should be applied in order. Leader writes a 'confirm' > message to the WAL that refers to the transaction's LSN and it has its > own LSN. This confirm message is delivered to all replicas through the > existing replication mechanism. > > Replica should report a positive or a negative result of the TXN to the > leader via the IPROTO explicitly to allow leader to collect the quorum > or anti-quorum for the TXN. In case a negative result for the TXN is > received from minor number of replicas, then leader has to send an error > message to the replicas, which in turn have to disconnect from the > replication the same way as it is done now in case of conflict. > > In case leader receives enough error messages to do not achieve the > quorum it should write the 'rollback' message in the WAL. After that > leader and replicas will perform the rollback for all TXN that didn't > receive quorum. > > ### Recovery and failover. > > Tarantool instance during reading WAL should postpone the commit until > the 'confirm' is read. In case the WAL eof is achieved, the instance > should keep rollback for all transactions that are waiting for a confirm > entry until the role of the instance is set. In case this instance > become a replica there are no additional actions needed, since all info > about quorum/rollback will arrive via replication. In case this instance > is assigned a leader role, it should write 'rollback' in its WAL and > perform rollback for all transactions waiting for a quorum. > > In case of a leader failure a replica with the biggest LSN with former > leader's ID is elected as a new leader. As long as multi-master is not banned, there may be multiple leaders. Does this proposal suggest multi-master is banned? Then it should describe the implementation of this, and in absense of transparent query forwarding it will break all clients. > The replica should record > 'rollback' in its WAL which effectively means that all transactions > without quorum should be rolled back. This rollback will be delivered to > all replicas and they will perform rollbacks of all transactions waiting > for quorum. > > An interface to force apply pending transactions by issuing a confirm > entry for them have to be introduced for manual recovery. > > ### Snapshot generation. > > We also can reuse current machinery of snapshot generation. Upon > receiving a request to create a snapshot an instance should request a > readview for the current commit operation. Although start of the > snapshot generation should be postponed until this commit operation > receives its confirmation. In case operation is rolled back, the snapshot > generation should be aborted and restarted using current transaction > after rollback is complete. > > After snapshot is created the WAL should start from the first operation > that follows the commit operation snapshot is generated for. That means > WAL will contain 'confirm' messages that refer to transactions that are > not present in the WAL. Apparently, we have to allow this for the case > 'confirm' refers to a transaction with LSN less than the first entry in > the WAL. > > In case master appears unavailable a replica still have to be able to > create a snapshot. Replica can perform rollback for all transactions that > are not confirmed and claim its LSN as the latest confirmed txn. 
Then it > can create a snapshot in a regular way and start with blank xlog file. > All rolled back transactions will appear through the regular replication > in case master reappears later on. > > ### Asynchronous replication. > > Along with synchronous replicas the cluster can contain asynchronous > replicas. That means async replica doesn't reply to the leader with > errors since they're not contributing into quorum. Still, async > replicas have to follow the new WAL operation, such as keep rollback > info until 'quorum' message is received. This is essential for the case > of 'rollback' message appearance in the WAL. This message assumes > replica is able to perform all necessary rollback by itself. Cluster > information should contain explicit notification of each replica > operation mode. > > ### Synchronous replication enabling. > > Synchronous operation can be required for a set of spaces in the data > scheme. That means only transactions that contain data modification for > these spaces should require quorum. Such transactions named synchronous. > As soon as last operation of synchronous transaction appeared in leader's > WAL, it will cause all following transactions - matter if they are > synchronous or not - wait for the quorum. In case quorum is not achieved > the 'rollback' operation will cause rollback of all transactions after > the synchronous one. It will ensure the consistent state of the data both > on leader and replicas. In case user doesn't require synchronous operation > for any space then no changes to the WAL generation and replication will > appear. > > Cluster description should contain explicit attribute for each replica > to denote it participates in synchronous activities. Also the description > should contain criterion on how many replicas responses are needed to > achieve the quorum. > > ## Rationale and alternatives > > There is an implementation of synchronous replication as part of gh-980 > activities, still it is not in a state to get into the product. More > than that it intentionally breaks backward compatibility which is a > prerequisite for this proposal. > > -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 53+ messages in thread
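The quorum/anti-quorum decision discussed in the review above can be
summarized in a small sketch: positive replies move a transaction towards
'confirm', negative replies towards 'rollback', which is written as soon as
the quorum becomes unreachable. The counters and the `quorum_check()`
helper are assumptions made for illustration, not an existing API.

```c
/*
 * Sketch of the leader-side decision: acks move a transaction towards
 * the quorum, nacks towards an "anti-quorum"; once the quorum becomes
 * impossible the leader writes 'rollback' instead of 'confirm'.
 */
#include <stdio.h>

enum txn_outcome { TXN_WAITING, TXN_CONFIRM, TXN_ROLLBACK };

struct quorum_state {
	int total_sync;		/* sync replicas plus the leader itself */
	int quorum;		/* required number of successful writes */
	int acks;		/* includes the leader's own WAL success */
	int nacks;
};

static enum txn_outcome
quorum_check(const struct quorum_state *q)
{
	if (q->acks >= q->quorum)
		return TXN_CONFIRM;
	/* Even if every remaining replica acks, the quorum is unreachable. */
	if (q->total_sync - q->nacks < q->quorum)
		return TXN_ROLLBACK;
	return TXN_WAITING;
}

int
main(void)
{
	struct quorum_state q = {.total_sync = 5, .quorum = 3,
				 .acks = 1, .nacks = 0};
	q.nacks = 3;	/* three replicas failed to apply the txn */
	printf("outcome: %d\n", quorum_check(&q)); /* 2 == TXN_ROLLBACK */
	return 0;
}
```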
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-17 10:10     ` Konstantin Osipov
@ 2020-04-17 13:45       ` Sergey Ostanevich
  0 siblings, 0 replies; 53+ messages in thread
From: Sergey Ostanevich @ 2020-04-17 13:45 UTC (permalink / raw)
  To: Konstantin Osipov, Николай Карлов, Тимур Сафин, Mons Anderson,
	Aleksandr Lyapunov, Sergey Bronnikov, tarantool-patches

Hi, thanks for review!

On 17 апр 13:10, Konstantin Osipov wrote:
> * sergos@tarantool.org <sergos@tarantool.org> [20/04/15 17:51]:
> > ### Quorum commit
>
> This part looks correct. It only describes two paths out of many
> though:
> - leader is able to collect the majority
> - leader is not able to collect the majority
>
> What happens when a leader receives a message for a round which is
> complete?

It just ignores it, the reason - see next comment.

> How does a replica which missed a round catch up?
> What happens if replica fails to apply txn 1 (e.g. because of a
> duplicate key), but confirms txn 2?

This should never happen, since each replica applies txns in strict
order, which means a failure of txn 1 will happen before the confirmation
of txn 2. As soon as a replica fails to apply a txn it should report an
error, disconnect and roll back all txns in its pipeline. After that the
replica will be in a consistent state with the Leader's lsn before txn 1.

> What happens if txn1 gets no majority at the leader, but txn 2
> gets a majority? How are the followers rolled back?

This situation means that some of the ACKs from replicas didn't arrive,
which doesn't mean they failed to apply txn 1. Moreover, success of txn 2
means that txn 1 was also applied - hence, receiving an ACK for txn N
from a replica means an ACK for each txn M: M < N.

> > In case of a leader failure a replica with the biggest LSN with former
> > leader's ID is elected as a new leader.
>
> As long as multi-master is not banned, there may be multiple
> leaders. Does this proposal suggest multi-master is banned? Then
> it should describe the implementation of this, and in absence of
> transparent query forwarding it will break all clients.

It was mentioned at the top of the RFC:

> What this RFC is not:
>
> - high availability (HA) solution with automated failover, roles
>   assignments and so on
> - master-master configuration support

Which I tend to describe as 'do not recommend'. Similar to what we have
in the documentation about the cascading replication configuration.
Although, I heard from some users that they successfully use such a
config.

Regards,
Sergos

^ permalink raw reply	[flat|nested] 53+ messages in thread
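The cumulative-ACK rule from the answer above ('receiving an ACK for txn N
from a replica means an ACK for each txn M: M < N') implies the leader only
needs to keep a single 'acked up to' LSN per replica. A minimal sketch,
with hypothetical names:

```c
/*
 * Sketch of cumulative ACK accounting: an ACK for txn N from a replica
 * implies ACKs for every txn M < N, so one watermark per replica is
 * enough to count the quorum for any pending transaction.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_REPLICAS 8

struct ack_table {
	int64_t acked_lsn[MAX_REPLICAS]; /* highest LSN acked by each replica */
	int replica_count;
};

/* Record an ACK; it can only move the per-replica watermark forward. */
static void
ack_table_update(struct ack_table *t, int replica, int64_t lsn)
{
	if (lsn > t->acked_lsn[replica])
		t->acked_lsn[replica] = lsn;
}

/* How many replicas have confirmed a transaction with this LSN. */
static int
ack_table_count(const struct ack_table *t, int64_t lsn)
{
	int count = 0;
	for (int i = 0; i < t->replica_count; i++)
		if (t->acked_lsn[i] >= lsn)
			count++;
	return count;
}

int
main(void)
{
	struct ack_table t = {{0}, 3};
	ack_table_update(&t, 0, 102);	/* implies ACK of 101 as well */
	ack_table_update(&t, 1, 101);
	printf("acks for lsn 101: %d\n", ack_table_count(&t, 101)); /* 2 */
	printf("acks for lsn 102: %d\n", ack_table_count(&t, 102)); /* 1 */
	return 0;
}
```

With such a watermark table, a gap like 'txn 2 acked but txn 1 not' simply
cannot be expressed, which matches the strict apply order described in the
reply above.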
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-15 14:50 ` sergos 2020-04-16 7:13 ` Aleksandr Lyapunov 2020-04-17 10:10 ` Konstantin Osipov @ 2020-04-20 11:20 ` Serge Petrenko 2 siblings, 0 replies; 53+ messages in thread From: Serge Petrenko @ 2020-04-20 11:20 UTC (permalink / raw) To: Sergey Ostanevich Cc: Николай Карлов, Mons Anderson, tarantool-patches, Тимур Сафин LGTM. -- Serge Petrenko sergepetrenko@tarantool.org > 15 апр. 2020 г., в 17:50, sergos@tarantool.org написал(а): > > Sorry for mess introduced by mail client in previous message. > Here’s the correct version with 3 more misprints fixed. > > The version is available here > https://github.com/tarantool/tarantool/blob/sergos/quorum-based-synchro/doc/rfc/quorum-based-synchro.md > > Please, reply all with your comments/blessings today. > > Regards, > Sergos > > --- > * **Status**: In progress > * **Start date**: 31-03-2020 > * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\> > * **Issues**: https://github.com/tarantool/tarantool/issues/4842 > > ## Summary > > The aim of this RFC is to address the following list of problems > formulated at MRG planning meeting: > - protocol backward compatibility to enable cluster upgrade w/o > downtime > - consistency of data on replica and leader > - switch from leader to replica without data loss > - up to date replicas to run read-only requests > - ability to switch async replicas into sync ones and vice versa > - guarantee of rollback on leader and sync replicas > - simplicity of cluster orchestration > > What this RFC is not: > > - high availability (HA) solution with automated failover, roles > assignments an so on > - master-master configuration support > > ## Background and motivation > > There are number of known implementation of consistent data presence in > a cluster. They can be commonly named as "wait for LSN" technique. The > biggest issue with this technique is the absence of rollback guarantees > at replica in case of transaction failure on one master or some of the > replicas in the cluster. > > To provide such capabilities a new functionality should be introduced in > Tarantool core, with requirements mentioned before - backward > compatibility and ease of cluster orchestration. > > ## Detailed design > > ### Quorum commit > > The main idea behind the proposal is to reuse existent machinery as much > as possible. It will ensure the well-tested and proven functionality > across many instances in MRG and beyond is used. The transaction rollback > mechanism is in place and works for WAL write failure. If we substitute > the WAL success with a new situation which is named 'quorum' later in > this document then no changes to the machinery is needed. The same is > true for snapshot machinery that allows to create a copy of the database > in memory for the whole period of snapshot file write. Adding quorum here > also minimizes changes. 
> > Currently replication represented by the following scheme: > ``` > Customer Leader WAL(L) Replica WAL(R) > |------TXN----->| | | | > | | | | | > | [TXN Rollback | | | > | created] | | | > | | | | | > | |-----TXN----->| | | > | | | | | > | |<---WAL Ok----| | | > | | | | | > | [TXN Rollback | | | > | destroyed] | | | > | | | | | > |<----TXN Ok----| | | | > | |-------Replicate TXN------->| | > | | | | | > | | | [TXN Rollback | > | | | created] | > | | | | | > | | | |-----TXN----->| > | | | | | > | | | |<---WAL Ok----| > | | | | | > | | | [TXN Rollback | > | | | destroyed] | > | | | | | > ``` > > To introduce the 'quorum' we have to receive confirmation from replicas > to make a decision on whether the quorum is actually present. Leader > collects necessary amount of replicas confirmation plus its own WAL > success. This state is named 'quorum' and gives leader the right to > complete the customers' request. So the picture will change to: > ``` > Customer Leader WAL(L) Replica WAL(R) > |------TXN----->| | | | > | | | | | > | [TXN Rollback | | | > | created] | | | > | | | | | > | |-----TXN----->| | | > | | | | | > | |-------Replicate TXN------->| | > | | | | | > | | | [TXN Rollback | > | |<---WAL Ok----| created] | > | | | | | > | [Waiting | |-----TXN----->| > | of a quorum] | | | > | | | |<---WAL Ok----| > | | | | | > | |<------Replication Ok-------| | > | | | | | > | [Quorum | | | > | achieved] | | | > | | | | | > | [TXN Rollback | | | > | destroyed] | | | > | | | | | > | |---Confirm--->| | | > | | | | | > | |----------Confirm---------->| | > | | | | | > |<---TXN Ok-----| | [TXN Rollback | > | | | destroyed] | > | | | | | > | | | |---Confirm--->| > | | | | | > ``` > > The quorum should be collected as a table for a list of transactions > waiting for quorum. The latest transaction that collects the quorum is > considered as complete, as well as all transactions prior to it, since > all transactions should be applied in order. Leader writes a 'confirm' > message to the WAL that refers to the transaction's LSN and it has its > own LSN. This confirm message is delivered to all replicas through the > existing replication mechanism. > > Replica should report a positive or a negative result of the TXN to the > leader via the IPROTO explicitly to allow leader to collect the quorum > or anti-quorum for the TXN. In case a negative result for the TXN is > received from minor number of replicas, then leader has to send an error > message to the replicas, which in turn have to disconnect from the > replication the same way as it is done now in case of conflict. > > In case leader receives enough error messages to do not achieve the > quorum it should write the 'rollback' message in the WAL. After that > leader and replicas will perform the rollback for all TXN that didn't > receive quorum. > > ### Recovery and failover. > > Tarantool instance during reading WAL should postpone the commit until > the 'confirm' is read. In case the WAL eof is achieved, the instance > should keep rollback for all transactions that are waiting for a confirm > entry until the role of the instance is set. In case this instance > become a replica there are no additional actions needed, since all info > about quorum/rollback will arrive via replication. In case this instance > is assigned a leader role, it should write 'rollback' in its WAL and > perform rollback for all transactions waiting for a quorum. > > In case of a leader failure a replica with the biggest LSN with former > leader's ID is elected as a new leader. 
The replica should record > 'rollback' in its WAL which effectively means that all transactions > without quorum should be rolled back. This rollback will be delivered to > all replicas and they will perform rollbacks of all transactions waiting > for quorum. > > An interface to force apply pending transactions by issuing a confirm > entry for them have to be introduced for manual recovery. > > ### Snapshot generation. > > We also can reuse current machinery of snapshot generation. Upon > receiving a request to create a snapshot an instance should request a > readview for the current commit operation. Although start of the > snapshot generation should be postponed until this commit operation > receives its confirmation. In case operation is rolled back, the snapshot > generation should be aborted and restarted using current transaction > after rollback is complete. > > After snapshot is created the WAL should start from the first operation > that follows the commit operation snapshot is generated for. That means > WAL will contain 'confirm' messages that refer to transactions that are > not present in the WAL. Apparently, we have to allow this for the case > 'confirm' refers to a transaction with LSN less than the first entry in > the WAL. > > In case master appears unavailable a replica still have to be able to > create a snapshot. Replica can perform rollback for all transactions that > are not confirmed and claim its LSN as the latest confirmed txn. Then it > can create a snapshot in a regular way and start with blank xlog file. > All rolled back transactions will appear through the regular replication > in case master reappears later on. > > ### Asynchronous replication. > > Along with synchronous replicas the cluster can contain asynchronous > replicas. That means async replica doesn't reply to the leader with > errors since they're not contributing into quorum. Still, async > replicas have to follow the new WAL operation, such as keep rollback > info until 'quorum' message is received. This is essential for the case > of 'rollback' message appearance in the WAL. This message assumes > replica is able to perform all necessary rollback by itself. Cluster > information should contain explicit notification of each replica > operation mode. > > ### Synchronous replication enabling. > > Synchronous operation can be required for a set of spaces in the data > scheme. That means only transactions that contain data modification for > these spaces should require quorum. Such transactions named synchronous. > As soon as last operation of synchronous transaction appeared in leader's > WAL, it will cause all following transactions - matter if they are > synchronous or not - wait for the quorum. In case quorum is not achieved > the 'rollback' operation will cause rollback of all transactions after > the synchronous one. It will ensure the consistent state of the data both > on leader and replicas. In case user doesn't require synchronous operation > for any space then no changes to the WAL generation and replication will > appear. > > Cluster description should contain explicit attribute for each replica > to denote it participates in synchronous activities. Also the description > should contain criterion on how many replicas responses are needed to > achieve the quorum. > > ## Rationale and alternatives > > There is an implementation of synchronous replication as part of gh-980 > activities, still it is not in a state to get into the product. 
More > than that it intentionally breaks backward compatibility which is a > prerequisite for this proposal. > > > ^ permalink raw reply [flat|nested] 53+ messages in thread
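The recovery rule from the RFC quoted above - hold transactions read from
the WAL until a matching 'confirm', and at EOF roll the leftovers back only
if the instance is assigned the leader role - could look roughly like the
following sketch. The entry types and the `recover()` helper are
illustrative assumptions, not the actual recovery code.

```c
/*
 * Sketch of WAL recovery with 'confirm'/'rollback' entries: a txn read
 * from the WAL is not committed until a 'confirm' covering it is read;
 * at EOF the leftovers are rolled back only by a leader, a replica
 * keeps them pending and waits for the decision via replication.
 */
#include <stdint.h>
#include <stdio.h>

enum entry_type { ENTRY_TXN, ENTRY_CONFIRM, ENTRY_ROLLBACK };

struct wal_entry {
	enum entry_type type;
	int64_t lsn;		/* for CONFIRM/ROLLBACK: referenced txn LSN */
};

static void
recover(const struct wal_entry *wal, int len, int is_leader)
{
	int64_t committed_lsn = 0;	/* everything <= this is committed */
	int64_t last_txn_lsn = 0;

	for (int i = 0; i < len; i++) {
		switch (wal[i].type) {
		case ENTRY_TXN:
			last_txn_lsn = wal[i].lsn;	/* stays pending */
			break;
		case ENTRY_CONFIRM:
			committed_lsn = wal[i].lsn;
			break;
		case ENTRY_ROLLBACK:
			last_txn_lsn = committed_lsn;	/* pending txns dropped */
			break;
		}
	}
	if (last_txn_lsn > committed_lsn) {
		if (is_leader)
			printf("leader: roll back txns (%lld, %lld]\n",
			       (long long)committed_lsn, (long long)last_txn_lsn);
		else
			printf("replica: keep txns (%lld, %lld] pending\n",
			       (long long)committed_lsn, (long long)last_txn_lsn);
	}
}

int
main(void)
{
	const struct wal_entry wal[] = {
		{ENTRY_TXN, 101}, {ENTRY_TXN, 102}, {ENTRY_CONFIRM, 101},
		{ENTRY_TXN, 103},
	};
	recover(wal, 4, /*is_leader=*/1);
	return 0;
}
```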
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-03 21:08 [Tarantool-patches] [RFC] Quorum-based synchronous replication Sergey Ostanevich 2020-04-07 13:02 ` Aleksandr Lyapunov 2020-04-14 12:58 ` Sergey Bronnikov @ 2020-04-20 23:32 ` Vladislav Shpilevoy 2020-04-21 10:49 ` Sergey Ostanevich 2020-04-23 21:38 ` Vladislav Shpilevoy 3 siblings, 1 reply; 53+ messages in thread From: Vladislav Shpilevoy @ 2020-04-20 23:32 UTC (permalink / raw) To: Sergey Ostanevich, tarantool-patches Hi! This is the latest version I found on the branch. I give my comments for it. Keep in mind I didn't read other reviews before writing my own, assuming that all questions were fixed, and by idea I should have understood everything after reading this now. Nonetheless see 12 comments below. > * **Status**: In progress > * **Start date**: 31-03-2020 > * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\> > * **Issues**: https://github.com/tarantool/tarantool/issues/4842 > > ## Summary > > The aim of this RFC is to address the following list of problems > formulated at MRG planning meeting: > - protocol backward compatibility to enable cluster upgrade w/o > downtime > - consistency of data on replica and leader > - switch from leader to replica without data loss > - up to date replicas to run read-only requests > - ability to switch async replicas into sync ones and vice versa > - guarantee of rollback on leader and sync replicas > - simplicity of cluster orchestration > > What this RFC is not: > > - high availability (HA) solution with automated failover, roles > assignments an so on 1. So no leader election? That essentially makes single failure point for RW requests, is it correct? On the other hand I see section 'Recovery and failover.' below. And it seems to be automated, with selecting a replica with the biggest LSN. Where is the truth? > - master-master configuration support > > ## Background and motivation > > There are number of known implementation of consistent data presence in > a cluster. They can be commonly named as "wait for LSN" technique. The > biggest issue with this technique is the absence of rollback guarantees > at replica in case of transaction failure on one master or some of the > replicas in the cluster. > > To provide such capabilities a new functionality should be introduced in > Tarantool core, with requirements mentioned before - backward > compatibility and ease of cluster orchestration. > > ## Detailed design > > ### Quorum commit > > The main idea behind the proposal is to reuse existent machinery as much > as possible. It will ensure the well-tested and proven functionality > across many instances in MRG and beyond is used. The transaction rollback > mechanism is in place and works for WAL write failure. If we substitute > the WAL success with a new situation which is named 'quorum' later in > this document then no changes to the machinery is needed. 2. The problem here is that you create dependency on WAL. According to your words, replication is inside WAL, and if WAL gave ok, then all is replicated and applied. But that makes current code structure even worse than it is. Now WAL, GC, and replication code is spaghetti, basically. All depends on all. I was rather thinking, that we should fix that first. Not aggravate. WAL should provide API for writing to disk. Replication should not bother about WAL. GC should not bother about replication. 
All should be independent, and linked in one place by some kind of a manager, which would just use their APIs. I believe Cyrill G. would agree with me here, I remember him complaining about replication-wal-gc code inter-dependencies too. Please, request his review on this, if you didn't yet. > The same is > true for snapshot machinery that allows to create a copy of the database > in memory for the whole period of snapshot file write. Adding quorum here > also minimizes changes. > Currently replication represented by the following scheme: > ``` > Customer Leader WAL(L) Replica WAL(R) 3. Are you saying 'leader' === 'master'? > |------TXN----->| | | | > | | | | | > | [TXN Rollback | | | > | created] | | | 4. What is 'txn rollback', and why is it created before even a transaction is started? At least, rollback is a verb. Maybe you meant 'undo log'? > | | | | | > | |-----TXN----->| | | > | | | | | > | |<---WAL Ok----| | | > | | | | | > | [TXN Rollback | | | > | destroyed] | | | > | | | | | > |<----TXN Ok----| | | | > | |-------Replicate TXN------->| | > | | | | | > | | | [TXN Rollback | > | | | created] | > | | | | | > | | | |-----TXN----->| > | | | | | > | | | |<---WAL Ok----| > | | | | | > | | | [TXN Rollback | > | | | destroyed] | > | | | | | > ``` > > To introduce the 'quorum' we have to receive confirmation from replicas > to make a decision on whether the quorum is actually present. Leader > collects necessary amount of replicas confirmation plus its own WAL 5. Please, define 'necessary amount'? > success. This state is named 'quorum' and gives leader the right to > complete the customers' request. So the picture will change to: > ``` > Customer Leader WAL(L) Replica WAL(R) > |------TXN----->| | | | > | | | | | > | [TXN Rollback | | | > | created] | | | > | | | | | > | |-----TXN----->| | | 6. Are we going to replicate transaction after user writes commit()? Or will we replicate it while it is in progress? So called 'presumed commit'. I remember I read some papers explaining how it significantly speeds up synchronous transactions. Probably that was a paper about 2-phase commit, can't remember already. But the idea is still applicable for the replication too. > | | | | | > | |-------Replicate TXN------->| | > | | | | | > | | | [TXN Rollback | > | |<---WAL Ok----| created] | > | | | | | > | [Waiting | |-----TXN----->| > | of a quorum] | | | > | | | |<---WAL Ok----| > | | | | | > | |<------Replication Ok-------| | > | | | | | > | [Quorum | | | > | achieved] | | | > | | | | | > | [TXN Rollback | | | > | destroyed] | | | > | | | | | > | |---Confirm--->| | | > | | | | | > | |----------Confirm---------->| | > | | | | | > |<---TXN Ok-----| | [TXN Rollback | > | | | destroyed] | > | | | | | > | | | |---Confirm--->| > | | | | | > ``` > > The quorum should be collected as a table for a list of transactions > waiting for quorum. The latest transaction that collects the quorum is > considered as complete, as well as all transactions prior to it, since > all transactions should be applied in order. Leader writes a 'confirm' > message to the WAL that refers to the transaction's LSN and it has its > own LSN. This confirm message is delivered to all replicas through the > existing replication mechanism. > > Replica should report a positive or a negative result of the TXN to the > leader via the IPROTO explicitly to allow leader to collect the quorum > or anti-quorum for the TXN. 
In case a negative result for the TXN is > received from minor number of replicas, then leader has to send an error > message to the replicas, which in turn have to disconnect from the > replication the same way as it is done now in case of conflict. > In case leader receives enough error messages to do not achieve the > quorum it should write the 'rollback' message in the WAL. After that > leader and replicas will perform the rollback for all TXN that didn't > receive quorum. > > ### Recovery and failover. > > Tarantool instance during reading WAL should postpone the commit until > the 'confirm' is read. In case the WAL eof is achieved, the instance > should keep rollback for all transactions that are waiting for a confirm > entry until the role of the instance is set. In case this instance > become a replica there are no additional actions needed, since all info > about quorum/rollback will arrive via replication. In case this instance > is assigned a leader role, it should write 'rollback' in its WAL and > perform rollback for all transactions waiting for a quorum. > > In case of a leader failure a replica with the biggest LSN with former > leader's ID is elected as a new leader. The replica should record > 'rollback' in its WAL which effectively means that all transactions > without quorum should be rolled back. This rollback will be delivered to > all replicas and they will perform rollbacks of all transactions waiting > for quorum. 7. Please, elaborate leader election. It is not as trivial as just 'elect'. What if the replica with the biggest LSN is temporary not available, but it knows that it has the biggest LSN? Will it become a leader without asking other nodes? What will do the other nodes? Will they wait for the new leader node to become available? Do they have a timeout on that? Basically, it would be nice to see the split-brain problem description here, and its solution for us. How leader failure is detected? Do you rely on our heartbeat messages? Are you going to adapt SWIM for this? Raft has a dedicated subsystem for election, it is not that simple. It involves voting, randomized algorithms. Am I missing something obvious in this RFC, which makes the leader election much simpler specifically for Tarantool? > An interface to force apply pending transactions by issuing a confirm > entry for them have to be introduced for manual recovery. > > ### Snapshot generation. > > We also can reuse current machinery of snapshot generation. Upon > receiving a request to create a snapshot an instance should request a > readview for the current commit operation. Although start of the > snapshot generation should be postponed until this commit operation > receives its confirmation. In case operation is rolled back, the snapshot > generation should be aborted and restarted using current transaction > after rollback is complete. 8. This section highly depends on transaction manager for memtx. If you have a transaction manager, you always have a ready-to-use read-view of the latest committed data. At least this is my understanding. After all, the manager should provide transaction isolation. And it means, that all non-committed transactions are not visible. And for that we need a read-view. Therefore, it could be used to make a snapshot. > After snapshot is created the WAL should start from the first operation > that follows the commit operation snapshot is generated for. That means > WAL will contain 'confirm' messages that refer to transactions that are > not present in the WAL. 
Apparently, we have to allow this for the case > 'confirm' refers to a transaction with LSN less than the first entry in > the WAL. 9. I couldn't understand that. Why confirm is in WAL for data stored in the snap? I thought you said above, that snapshot should be done for all confirmed data. Besides, having confirm out of snap means the snap is not self-sufficient anymore. > In case master appears unavailable a replica still have to be able to > create a snapshot. Replica can perform rollback for all transactions that > are not confirmed and claim its LSN as the latest confirmed txn. Then it > can create a snapshot in a regular way and start with blank xlog file. > All rolled back transactions will appear through the regular replication > in case master reappears later on. 10. You should be able to make a snapshot without rollback. Read-views are available anyway. At least it is so in Vinyl, from what I remember. And this is going to be similar in memtx. > > ### Asynchronous replication. > > Along with synchronous replicas the cluster can contain asynchronous > replicas. That means async replica doesn't reply to the leader with > errors since they're not contributing into quorum. Still, async > replicas have to follow the new WAL operation, such as keep rollback > info until 'quorum' message is received. This is essential for the case > of 'rollback' message appearance in the WAL. This message assumes > replica is able to perform all necessary rollback by itself. Cluster > information should contain explicit notification of each replica > operation mode. > > ### Synchronous replication enabling. > > Synchronous operation can be required for a set of spaces in the data > scheme. That means only transactions that contain data modification for > these spaces should require quorum. Such transactions named synchronous. > As soon as last operation of synchronous transaction appeared in leader's > WAL, it will cause all following transactions - matter if they are > synchronous or not - wait for the quorum. In case quorum is not achieved > the 'rollback' operation will cause rollback of all transactions after > the synchronous one. It will ensure the consistent state of the data both > on leader and replicas. In case user doesn't require synchronous operation > for any space then no changes to the WAL generation and replication will > appear. > > Cluster description should contain explicit attribute for each replica > to denote it participates in synchronous activities. Also the description > should contain criterion on how many replicas responses are needed to > achieve the quorum. 11. Aha, I see 'necessary amount' from above is a manually set value. Ok. > > ## Rationale and alternatives > > There is an implementation of synchronous replication as part of gh-980 > activities, still it is not in a state to get into the product. More > than that it intentionally breaks backward compatibility which is a > prerequisite for this proposal. 12. How are we going to deal with fsync()? Will it be forcefully enabled on sync replicas and the leader? ^ permalink raw reply [flat|nested] 53+ messages in thread
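Comment 11 above notes that the 'necessary amount' is a manually set value.
A sketch of what the cluster description attributes could carry - a
per-replica sync flag plus the quorum criterion - is shown below; the field
names are hypothetical and do not correspond to an existing Tarantool
configuration API.

```c
/*
 * Sketch of the cluster description attributes: each replica carries an
 * explicit flag saying whether it takes part in synchronous replication,
 * and the quorum size is a manually set value checked against the number
 * of sync participants.
 */
#include <stdbool.h>
#include <stdio.h>

struct replica_cfg {
	const char *uri;
	bool is_sync;		/* contributes to the quorum */
};

struct cluster_cfg {
	const struct replica_cfg *replicas;
	int replica_count;
	int quorum;		/* confirmations required, incl. the leader */
};

/* The quorum cannot exceed the number of sync participants plus the leader. */
static bool
cluster_cfg_is_valid(const struct cluster_cfg *cfg)
{
	int sync_count = 1;	/* the leader always counts */
	for (int i = 0; i < cfg->replica_count; i++)
		if (cfg->replicas[i].is_sync)
			sync_count++;
	return cfg->quorum >= 1 && cfg->quorum <= sync_count;
}

int
main(void)
{
	const struct replica_cfg replicas[] = {
		{"replica1:3301", true},
		{"replica2:3301", true},
		{"replica3:3301", false},	/* async, never acks */
	};
	struct cluster_cfg cfg = {replicas, 3, /*quorum=*/3};
	printf("config valid: %d\n", cluster_cfg_is_valid(&cfg));
	return 0;
}
```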
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-20 23:32 ` Vladislav Shpilevoy @ 2020-04-21 10:49 ` Sergey Ostanevich 2020-04-21 22:17 ` Vladislav Shpilevoy 0 siblings, 1 reply; 53+ messages in thread From: Sergey Ostanevich @ 2020-04-21 10:49 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches Hi! Thanks for review! > > - high availability (HA) solution with automated failover, roles > > assignments an so on > > 1. So no leader election? That essentially makes single failure point > for RW requests, is it correct? > > On the other hand I see section 'Recovery and failover.' below. And > it seems to be automated, with selecting a replica with the biggest > LSN. Where is the truth? > The failover can be manual or implemented independnetnly. By no means this means we should not explain how this should be done according to the replication schema discussed. And yes, the SPOF is the leader of the cluster. This is expected and is Ok according to all MRG planning meeting participants. > > - master-master configuration support > > > > ## Background and motivation > > > > There are number of known implementation of consistent data presence in > > a cluster. They can be commonly named as "wait for LSN" technique. The > > biggest issue with this technique is the absence of rollback guarantees > > at replica in case of transaction failure on one master or some of the > > replicas in the cluster. > > > > To provide such capabilities a new functionality should be introduced in > > Tarantool core, with requirements mentioned before - backward > > compatibility and ease of cluster orchestration. > > > > ## Detailed design > > > > ### Quorum commit > > > > The main idea behind the proposal is to reuse existent machinery as much > > as possible. It will ensure the well-tested and proven functionality > > across many instances in MRG and beyond is used. The transaction rollback > > mechanism is in place and works for WAL write failure. If we substitute > > the WAL success with a new situation which is named 'quorum' later in > > this document then no changes to the machinery is needed. > > 2. The problem here is that you create dependency on WAL. According to > your words, replication is inside WAL, and if WAL gave ok, then all is 'Replication is inside WAL' - what do you mean by that? The replication in its current state works from WAL, although it's an exaggregation to say it is 'inside WAL'. Why it means a new dependency after that? > replicated and applied. But that makes current code structure even worse > than it is. Now WAL, GC, and replication code is spaghetti, basically. > All depends on all. I was rather thinking, that we should fix that first. > Not aggravate. > > WAL should provide API for writing to disk. Replication should not bother > about WAL. GC should not bother about replication. All should be independent, > and linked in one place by some kind of a manager, which would just use their So you want to introduce a single point that will translate all messages between all participants? I believe current state was introduced exactly to avoid this situation. Each participant can be subscribed for a particular trigger inside another participant and take it into account in its activities - at the right time for itself. > APIs. I believe Cyrill G. would agree with me here, I remember him > complaining about replication-wal-gc code inter-dependencies too. Please, > request his review on this, if you didn't yet. 
> I personally have the same problem trying to implement a trivial test, by just figuring out the layers and dependencies of participants. This is about poor documentation im my understanding, not poor design. > > The same is > > true for snapshot machinery that allows to create a copy of the database > > in memory for the whole period of snapshot file write. Adding quorum here > > also minimizes changes. > > Currently replication represented by the following scheme: > > ``` > > Customer Leader WAL(L) Replica WAL(R) > > 3. Are you saying 'leader' === 'master'? According to international polite naming. Mark Twain nowadays goes obscene with his 'Mars Tom'. > > > |------TXN----->| | | | > > | | | | | > > | [TXN Rollback | | | > > | created] | | | > > 4. What is 'txn rollback', and why is it created before even a transaction > is started? At least, rollback is a verb. Maybe you meant 'undo log'? No, rollback has both verb and noun meaning. Nevertheless, if you tripped over this - it should be fixed. > > > | | | | | > > | |-----TXN----->| | | > > | | | | | > > | |<---WAL Ok----| | | > > | | | | | > > | [TXN Rollback | | | > > | destroyed] | | | > > | | | | | > > |<----TXN Ok----| | | | > > | |-------Replicate TXN------->| | > > | | | | | > > | | | [TXN Rollback | > > | | | created] | > > | | | | | > > | | | |-----TXN----->| > > | | | | | > > | | | |<---WAL Ok----| > > | | | | | > > | | | [TXN Rollback | > > | | | destroyed] | > > | | | | | > > ``` > > > > To introduce the 'quorum' we have to receive confirmation from replicas > > to make a decision on whether the quorum is actually present. Leader > > collects necessary amount of replicas confirmation plus its own WAL > > 5. Please, define 'necessary amount'? Apparently, resolved with comment #11 > > > success. This state is named 'quorum' and gives leader the right to > > complete the customers' request. So the picture will change to: > > ``` > > Customer Leader WAL(L) Replica WAL(R) > > |------TXN----->| | | | > > | | | | | > > | [TXN Rollback | | | > > | created] | | | > > | | | | | > > | |-----TXN----->| | | > > 6. Are we going to replicate transaction after user writes commit()? Does your 'user' means customer from the picture? In such a case do you expect to have an interactive transaction? Definitely, we do not consider it here in any form, since replication happens only for the comlpete transaction. > Or will we replicate it while it is in progress? So called 'presumed > commit'. I remember I read some papers explaining how it significantly > speeds up synchronous transactions. Probably that was a paper about > 2-phase commit, can't remember already. But the idea is still applicable > for the replication too. This can be considered only after MVCC is introduced - currently running as a separate activity. Then we can replicate transaction 'in the fly' into a separate blob/readview/better_name. By now this means we will be too much interwoven to correctly rollback afterwards, at the time of quorum is failed. > > In case of a leader failure a replica with the biggest LSN with former > > leader's ID is elected as a new leader. The replica should record > > 'rollback' in its WAL which effectively means that all transactions > > without quorum should be rolled back. This rollback will be delivered to > > all replicas and they will perform rollbacks of all transactions waiting > > for quorum. > > 7. Please, elaborate leader election. It is not as trivial as just 'elect'. 
> What if the replica with the biggest LSN is temporary not available, but > it knows that it has the biggest LSN? Will it become a leader without > asking other nodes? What will do the other nodes? Will they wait for the > new leader node to become available? Do they have a timeout on that? > By now I do not plan any activities on HA - including automated failover and leader re-election. In case leader sees insufficient number of replicas to achieve quorum - it stops, reporting the problem to the external orchestrator. > Basically, it would be nice to see the split-brain problem description here, > and its solution for us. > I believe the split-brain is under orchestrator control either - we should provide API to switch leader in the cluster, so that when a former leader came back it will not get quorum for any txn it has, replying to customers with failure as a result. > How leader failure is detected? Do you rely on our heartbeat messages? > Are you going to adapt SWIM for this? > > Raft has a dedicated subsystem for election, it is not that simple. It > involves voting, randomized algorithms. Am I missing something obvious in > this RFC, which makes the leader election much simpler specifically for > Tarantool? > All of these I assume as HA features, when Tarantool can automate the failover and leader re-election. Out of the scope by now. > > An interface to force apply pending transactions by issuing a confirm > > entry for them have to be introduced for manual recovery. > > > > ### Snapshot generation. > > > > We also can reuse current machinery of snapshot generation. Upon > > receiving a request to create a snapshot an instance should request a > > readview for the current commit operation. Although start of the > > snapshot generation should be postponed until this commit operation > > receives its confirmation. In case operation is rolled back, the snapshot > > generation should be aborted and restarted using current transaction > > after rollback is complete. > > 8. This section highly depends on transaction manager for memtx. If you > have a transaction manager, you always have a ready-to-use read-view > of the latest committed data. At least this is my understanding. > > After all, the manager should provide transaction isolation. And it means, > that all non-committed transactions are not visible. And for that we need > a read-view. Therefore, it could be used to make a snapshot. > Currently there's no such manager for memtx. So I proposed this workaround with minimal impact on our current machinery. Alexander Lyapunov is working on the manager in parallel, he reviewed and blessed this RFC, so apparently there's no contradiction with his plans. > > After snapshot is created the WAL should start from the first operation > > that follows the commit operation snapshot is generated for. That means > > WAL will contain 'confirm' messages that refer to transactions that are > > not present in the WAL. Apparently, we have to allow this for the case > > 'confirm' refers to a transaction with LSN less than the first entry in > > the WAL. > > 9. I couldn't understand that. Why confirm is in WAL for data stored in > the snap? I thought you said above, that snapshot should be done for all > confirmed data. Besides, having confirm out of snap means the snap is > not self-sufficient anymore. > Snap waits for confirm message to start. During this wait the WAL keep growing. At the moment confirm arrived the snap will be created - say, for txn #10. 
The WAL will be started with lsn #11 and the confirm can be somewhere at
lsn #30. So, starting with this snap the data appears consistent for
lsn #10 - it is guaranteed by the wait for the confirm message. Then replay
of the WAL will come to a confirm message at lsn #30 - referring to
lsn #10 - which is actually ignored, since it refers to a point before the
WAL start. There could be confirm messages for even earlier txns if the
wait takes sufficient time - all of them will refer to lsns before the
start of the WAL. And it is Ok.

> > In case master appears unavailable a replica still have to be able to
> > create a snapshot. Replica can perform rollback for all transactions that
> > are not confirmed and claim its LSN as the latest confirmed txn. Then it
> > can create a snapshot in a regular way and start with blank xlog file.
> > All rolled back transactions will appear through the regular replication
> > in case master reappears later on.
>
> 10. You should be able to make a snapshot without rollback. Read-views are
> available anyway. At least it is so in Vinyl, from what I remember. And this
> is going to be similar in memtx.

You have to make a snapshot for a consistent data state. Unless we have a
transaction manager in memtx - this is the way to do so. And as I
mentioned, this is a different activity.

> > Cluster description should contain explicit attribute for each replica
> > to denote it participates in synchronous activities. Also the description
> > should contain criterion on how many replicas responses are needed to
> > achieve the quorum.
>
> 11. Aha, I see 'necessary amount' from above is a manually set value. Ok.
>
> > ## Rationale and alternatives
> >
> > There is an implementation of synchronous replication as part of gh-980
> > activities, still it is not in a state to get into the product. More
> > than that it intentionally breaks backward compatibility which is a
> > prerequisite for this proposal.
>
> 12. How are we going to deal with fsync()? Will it be forcefully enabled
> on sync replicas and the leader?

To my understanding - it's up to the user. I was considering a cluster that
has no WAL at all - relying on synchronous replication and a sufficient
number of replicas. Everyone I asked about it told me I'm nuts. To my great
surprise Alexander Lyapunov brought exactly the same idea to discuss.

All of this is for one resolution: I would keep it for the user to decide.
Obviously, to speed up the processing the leader can disable the WAL
completely, but to do so we have to re-work the relay to work from memory.
Replicas can use the WAL the way the user wants: 2 replicas with a slow HDD
shouldn't wait for fsync(), while a super-fast Intel DCPMM one can enable
it. Balancing is up to the user.

^ permalink raw reply	[flat|nested] 53+ messages in thread
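The lsn #10 / #11 / #30 example above boils down to a single rule during
replay: a 'confirm' whose referenced LSN is below the first LSN of the WAL
describes data already covered by the snapshot and is skipped. A minimal
sketch under that assumption, with illustrative names:

```c
/*
 * Sketch of the replay rule for 'confirm' entries after a snapshot:
 * if the referenced LSN is below the first LSN of the WAL, the data it
 * confirms is already in the snapshot, so the entry is skipped.
 */
#include <stdint.h>
#include <stdio.h>

/* Returns 1 if the confirm must be applied, 0 if it is safely skipped. */
static int
confirm_is_applicable(int64_t wal_start_lsn, int64_t confirmed_lsn)
{
	return confirmed_lsn >= wal_start_lsn;
}

int
main(void)
{
	int64_t wal_start_lsn = 11;	/* snapshot covers everything <= 10 */

	/* Confirm written at lsn 30 but referring to txn lsn 10. */
	printf("confirm(10): apply=%d\n",
	       confirm_is_applicable(wal_start_lsn, 10));	/* skipped */
	/* A later confirm referring to txn lsn 25 present in this WAL. */
	printf("confirm(25): apply=%d\n",
	       confirm_is_applicable(wal_start_lsn, 25));	/* applied */
	return 0;
}
```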
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-21 10:49 ` Sergey Ostanevich @ 2020-04-21 22:17 ` Vladislav Shpilevoy 2020-04-22 16:50 ` Sergey Ostanevich 2020-04-23 6:58 ` Konstantin Osipov 0 siblings, 2 replies; 53+ messages in thread From: Vladislav Shpilevoy @ 2020-04-21 22:17 UTC (permalink / raw) To: Sergey Ostanevich; +Cc: tarantool-patches >>> - high availability (HA) solution with automated failover, roles >>> assignments an so on >> >> 1. So no leader election? That essentially makes single failure point >> for RW requests, is it correct? >> >> On the other hand I see section 'Recovery and failover.' below. And >> it seems to be automated, with selecting a replica with the biggest >> LSN. Where is the truth? >> > > The failover can be manual or implemented independnetnly. By no means > this means we should not explain how this should be done according to > the replication schema discussed. But it is not explained. You just said 'the biggest LSN owner is chosen'. This looks very good in theory and on the paper, but it is not that simple. If you are going to explain how it works. You said 'by no means we should not explain'. Talking of the election in scope of our replication schema, I don't see where it is discussed. Is there a separate RFC I am missing? I am asking exactly about that - where are pings, healthchecks, timeouts, voting on top of our replication schema? If you don't want to make the election a part of this RFC at all, then why is there a section, which literally says that the election is present and it is 'the biggest LSN owner is chosen'? In case the election is out of this for now, did you think about how a possible future leader election algorithm could be implemented on top of this sync replication? Just to be sure we are not ruining some things which will be necessary for auto election later on. > And yes, the SPOF is the leader of the cluster. This is expected and is > Ok according to all MRG planning meeting participants. That is not about MRG only, Tarantool is not a closed-source MRG-only DB. I am not against making it non-automated for now, but I want to be sure it will be possible to implement this as an enhancement. >>> - master-master configuration support >>> >>> ## Background and motivation >>> >>> There are number of known implementation of consistent data presence in >>> a cluster. They can be commonly named as "wait for LSN" technique. The >>> biggest issue with this technique is the absence of rollback guarantees >>> at replica in case of transaction failure on one master or some of the >>> replicas in the cluster. >>> >>> To provide such capabilities a new functionality should be introduced in >>> Tarantool core, with requirements mentioned before - backward >>> compatibility and ease of cluster orchestration. >>> >>> ## Detailed design >>> >>> ### Quorum commit >>> >>> The main idea behind the proposal is to reuse existent machinery as much >>> as possible. It will ensure the well-tested and proven functionality >>> across many instances in MRG and beyond is used. The transaction rollback >>> mechanism is in place and works for WAL write failure. If we substitute >>> the WAL success with a new situation which is named 'quorum' later in >>> this document then no changes to the machinery is needed. >> >> 2. The problem here is that you create dependency on WAL. According to >> your words, replication is inside WAL, and if WAL gave ok, then all is > > 'Replication is inside WAL' - what do you mean by that? 
The replication in > its current state works from WAL, although it's an exaggregation to say > it is 'inside WAL'. Why it means a new dependency after that? You said 'WAL success' is substituted with a new situation 'quorum'. It means, strictly interpreting your words, that 'wal_write()' function won't return 0 until quorum is collected. This is what I mean by moving replication into WAL subsystem. >> replicated and applied. But that makes current code structure even worse >> than it is. Now WAL, GC, and replication code is spaghetti, basically. >> All depends on all. I was rather thinking, that we should fix that first. >> Not aggravate. >> >> WAL should provide API for writing to disk. Replication should not bother >> about WAL. GC should not bother about replication. All should be independent, >> and linked in one place by some kind of a manager, which would just use their > > So you want to introduce a single point that will translate all messages > between all participants? Well, this is called 'cbus', we already have it. This is a separate headache, which no one understands except Georgy. However I was not talking about it. I am wrong, it should not be a single manager. But all the subsystems should be as independent as possible still. > I believe current state was introduced exactly > to avoid this situation. Each participant can be subscribed for a > particular trigger inside another participant and take it into account > in its activities - at the right time for itself. Whatever. I don't know the code good enough, so I am probably wrong somewhere here. But every time looking at these numerous triggers depending on each other and called at arbitrary moments was enough. I tried to fix these trigger-dependencies in scope of different tasks, but cleaning this code appears to be a huge task by itself. This can be done without obscure triggers called from arbitrary threads at atribtrary moments of time. That in fact was the main blocker for the in-memory WAL, when I tried to finish it after Georgy in January. We have fibers exactly to avoid triggers. To be able to write linear and simple code. The triggers can be replaced by dedicated fibers, and fiber condition variables can be used to wait for exact moments of time of needed events where functionality is event based. >> APIs. I believe Cyrill G. would agree with me here, I remember him >> complaining about replication-wal-gc code inter-dependencies too. Please, >> request his review on this, if you didn't yet. >> > I personally have the same problem trying to implement a trivial test, > by just figuring out the layers and dependencies of participants. This > is about poor documentation im my understanding, not poor design. There is almost more documentation than the code. Just look at the number of comments and level of their details. And still it does not help. So looks like just a bad API and dependency design. Does not matter how hard it is documented, it just becomes more interdepending and harder to wrap a mind around it. >>> In case of a leader failure a replica with the biggest LSN with former >>> leader's ID is elected as a new leader. The replica should record >>> 'rollback' in its WAL which effectively means that all transactions >>> without quorum should be rolled back. This rollback will be delivered to >>> all replicas and they will perform rollbacks of all transactions waiting >>> for quorum. >> >> 7. Please, elaborate leader election. It is not as trivial as just 'elect'. 
>> What if the replica with the biggest LSN is temporary not available, but >> it knows that it has the biggest LSN? Will it become a leader without >> asking other nodes? What will do the other nodes? Will they wait for the >> new leader node to become available? Do they have a timeout on that? >> > By now I do not plan any activities on HA - including automated failover > and leader re-election. In case leader sees insufficient number of > replicas to achieve quorum - it stops, reporting the problem to the > external orchestrator. From the text in this section it does not look like a not planned activity, but like an already made decision. It is not even in the 'Plans' section. You just said 'is elected'. By whom? How? If the election is not a part of the RFC, I would suggest moving this out, or into a separate section 'Plans' or something. Or reformulate this by saying like 'the cluster stops serving write requests until external intervention sets a new leader'. And 'it is *advised* to use the one with the biggest LSN in the old leader's vclock component'. Something like that. >> Basically, it would be nice to see the split-brain problem description here, >> and its solution for us. >> > I believe the split-brain is under orchestrator control either - we > should provide API to switch leader in the cluster, so that when a > former leader came back it will not get quorum for any txn it has, > replying to customers with failure as a result. Exactly. We should provide something for this from inside. But are there any details? How should that work? Should all the healthy replicas reject everything from the false-leader? Should the false-leader somehow realize, that it is not considered a leader anymore, and should stop itself? If we choose the former way, how a replica defines who is the true leader? For example, some replicas still may consider the old leader as a true master. If we choose the latter way, what is the algorithm of determining that we are not a leader anymore? >>> After snapshot is created the WAL should start from the first operation >>> that follows the commit operation snapshot is generated for. That means >>> WAL will contain 'confirm' messages that refer to transactions that are >>> not present in the WAL. Apparently, we have to allow this for the case >>> 'confirm' refers to a transaction with LSN less than the first entry in >>> the WAL. >> >> 9. I couldn't understand that. Why confirm is in WAL for data stored in >> the snap? I thought you said above, that snapshot should be done for all >> confirmed data. Besides, having confirm out of snap means the snap is >> not self-sufficient anymore. >> > Snap waits for confirm message to start. During this wait the WAL keep > growing. At the moment confirm arrived the snap will be created - say, > for txn #10. The WAL will be started with lsn #11 and commit can be > somewhere lsn #30. > So, starting with this snap data appears consistent for lsn #10 - it is > guaranteed by the wait of commit message. Then replay of WAL will come > to a confirm message lsn #30 - referring to lsn #10 - that actually > ignored, since it looks beyond the WAL start. There could be confirm > messages for even earlier txns if wait takes sufficient time - all of > them will refer to lsn beyond the WAL. And it is Ok. What is 'commit message'? I don't see it on the schema above. I see only confirms. So the problem is that some data may be written to WAL after we started committing our transactions going to the snap, but before we received a quorum. 
And we can't truncate the WAL by the quorum, because there is already newer data, which was not included into the snap. Because WAL is not stopped, it still accepts new transactions. Now I understand. Would be good to have this example in the RFC. >>> Cluster description should contain explicit attribute for each replica >>> to denote it participates in synchronous activities. Also the description >>> should contain criterion on how many replicas responses are needed to >>> achieve the quorum. >> >> 11. Aha, I see 'necessary amount' from above is a manually set value. Ok. >> >>> >>> ## Rationale and alternatives >>> >>> There is an implementation of synchronous replication as part of gh-980 >>> activities, still it is not in a state to get into the product. More >>> than that it intentionally breaks backward compatibility which is a >>> prerequisite for this proposal. >> >> 12. How are we going to deal with fsync()? Will it be forcefully enabled >> on sync replicas and the leader? > > To my understanding - it's up to user. I was considering a cluster that > has no WAL at all - relying on sychro replication and sufficient number > of replicas. Everyone who I asked about it told me I'm nuts. To my great > surprise Alexander Lyapunov brought exactly the same idea to discuss. I didn't see an RFC on that, and this can become easily possible, when in-memory relay is implemented. If it is implemented in a clean way. We just can turn off the disk backoff, and it will work from memory-only. > All of these is for one resolution: I would keep it for user to decide. > Obviously, to speed up the processing leader can disable wal completely, > but to do so we have to re-work the relay to work from memory. Replicas > can use WAL in a way user wants: 2 replicas with slow HDD should'n wait > for fsync(), while super-fast Intel DCPMM one can enable it. Balancing > is up to user. Possibility of omitting fsync means that it is possible, that all nodes write confirm, which is reported to the client, then the nodes restart, and the data is lost. I would say it somewhere. ^ permalink raw reply [flat|nested] 53+ messages in thread
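For illustration, the per-node fsync() choice discussed above maps onto the
existing `box.cfg` option `wal_mode` ('write' appends to the WAL without
waiting for fsync(), 'fsync' waits for it). A minimal sketch of the two kinds
of replicas mentioned above; the URIs, ports and credentials are placeholders,
and each block belongs to a different instance's init file.

```
-- Replica on a slow HDD: append to the WAL, do not wait for fsync().
box.cfg{
    listen      = 3302,
    replication = {'replicator:secret@leader.example:3301'},
    wal_mode    = 'write',
}

-- Replica on fast persistent storage: fsync() every WAL write.
box.cfg{
    listen      = 3303,
    replication = {'replicator:secret@leader.example:3301'},
    wal_mode    = 'fsync',
}
```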
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-21 22:17 ` Vladislav Shpilevoy @ 2020-04-22 16:50 ` Sergey Ostanevich 2020-04-22 20:28 ` Vladislav Shpilevoy 2020-04-23 6:58 ` Konstantin Osipov 1 sibling, 1 reply; 53+ messages in thread From: Sergey Ostanevich @ 2020-04-22 16:50 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches Hi! On 22 апр 00:17, Vladislav Shpilevoy wrote: > >>> - high availability (HA) solution with automated failover, roles > >>> assignments an so on > >> > >> 1. So no leader election? That essentially makes single failure point > >> for RW requests, is it correct? > >> > >> On the other hand I see section 'Recovery and failover.' below. And > >> it seems to be automated, with selecting a replica with the biggest > >> LSN. Where is the truth? > >> > > > > The failover can be manual or implemented independnetnly. By no means > > this means we should not explain how this should be done according to > > the replication schema discussed. > > But it is not explained. You just said 'the biggest LSN owner is chosen'. > This looks very good in theory and on the paper, but it is not that simple. > If you are going to explain how it works. You said 'by no means we should not > explain'. > I expect this to be a reference to what is currently implemented by Tarantool users in many ways. I think I have to rephrase that one 'can keep current election apporach using biggest LSN' since proposed solution does not change current semantics of WAL generation, just 'confirm' and 'rollback' operations that are regular entries in the WAL. > Talking of the election in scope of our replication schema, I don't see > where it is discussed. Is there a separate RFC I am missing? I am asking exactly > about that - where are pings, healthchecks, timeouts, voting on top of our > replication schema? If you don't want to make the election a part of this RFC > at all, then why is there a section, which literally says that the election > is present and it is 'the biggest LSN owner is chosen'? > > In case the election is out of this for now, did you think about how a possible > future leader election algorithm could be implemented on top of this sync > replication? Just to be sure we are not ruining some things which will be > necessary for auto election later on. > My answer will be the same - no changes to WAL from this point of view, just replicas has their respective undo logs and can rollback to the consistent state. > > And yes, the SPOF is the leader of the cluster. This is expected and is > > Ok according to all MRG planning meeting participants. > > That is not about MRG only, Tarantool is not a closed-source MRG-only DB. > I am not against making it non-automated for now, but I want to be sure it > will be possible to implement this as an enhancement. > Sure we want to - using the existing SWIM module for membership and to elaborate something close to RAFT, still it is not in our immediate plans. > >>> - master-master configuration support > >>> > >>> ## Background and motivation > >>> > >>> There are number of known implementation of consistent data presence in > >>> a cluster. They can be commonly named as "wait for LSN" technique. The > >>> biggest issue with this technique is the absence of rollback guarantees > >>> at replica in case of transaction failure on one master or some of the > >>> replicas in the cluster. 
> >>> > >>> To provide such capabilities a new functionality should be introduced in > >>> Tarantool core, with requirements mentioned before - backward > >>> compatibility and ease of cluster orchestration. > >>> > >>> ## Detailed design > >>> > >>> ### Quorum commit > >>> > >>> The main idea behind the proposal is to reuse existent machinery as much > >>> as possible. It will ensure the well-tested and proven functionality > >>> across many instances in MRG and beyond is used. The transaction rollback > >>> mechanism is in place and works for WAL write failure. If we substitute > >>> the WAL success with a new situation which is named 'quorum' later in > >>> this document then no changes to the machinery is needed. > >> > >> 2. The problem here is that you create dependency on WAL. According to > >> your words, replication is inside WAL, and if WAL gave ok, then all is > > > > 'Replication is inside WAL' - what do you mean by that? The replication in > > its current state works from WAL, although it's an exaggregation to say > > it is 'inside WAL'. Why it means a new dependency after that? > > You said 'WAL success' is substituted with a new situation 'quorum'. It > means, strictly interpreting your words, that 'wal_write()' function > won't return 0 until quorum is collected. > > This is what I mean by moving replication into WAL subsystem. > The wal_write() should report the result of WAL operation. It should not return quorum - the WAL result should be used along with quorum messages from replicas to claim txn is complete. This shouldn't be part of WAL. > >> replicated and applied. But that makes current code structure even worse > >> than it is. Now WAL, GC, and replication code is spaghetti, basically. > >> All depends on all. I was rather thinking, that we should fix that first. > >> Not aggravate. > >> > >> WAL should provide API for writing to disk. Replication should not bother > >> about WAL. GC should not bother about replication. All should be independent, > >> and linked in one place by some kind of a manager, which would just use their > > > > So you want to introduce a single point that will translate all messages > > between all participants? > > Well, this is called 'cbus', we already have it. This is a separate headache, > which no one understands except Georgy. However I was not talking about it. > I am wrong, it should not be a single manager. But all the subsystems should > be as independent as possible still. > > > I believe current state was introduced exactly > > to avoid this situation. Each participant can be subscribed for a > > particular trigger inside another participant and take it into account > > in its activities - at the right time for itself. > > Whatever. I don't know the code good enough, so I am probably wrong > somewhere here. But every time looking at these numerous triggers depending > on each other and called at arbitrary moments was enough. I tried to fix > these trigger-dependencies in scope of different tasks, but cleaning this > code appears to be a huge task by itself. > > This can be done without obscure triggers called from arbitrary threads at > atribtrary moments of time. That in fact was the main blocker for the in-memory > WAL, when I tried to finish it after Georgy in January. We have fibers exactly > to avoid triggers. To be able to write linear and simple code. 
The triggers
> can be replaced by dedicated fibers, and fiber condition variables can be used
> to wait for exact moments of time of needed events where functionality is
> event based.
> 
I totally agree with you that it is a big task in itself. I believe we won't
introduce too many extra dependencies between the parties - just tweak some of
them. So far I want to start an activity - Cyrill Gorcunov supports me in
this - to draw a mutual dependency map of all participants: their threads,
fibers, triggers and how they are connected. I believe it will help us to
prepare a first step to redesign the system - or to make a thoughtful decision
to keep it as is.

> >> APIs. I believe Cyrill G. would agree with me here, I remember him
> >> complaining about replication-wal-gc code inter-dependencies too. Please,
> >> request his review on this, if you didn't yet.
> >>
> > I personally have the same problem trying to implement a trivial test,
> > by just figuring out the layers and dependencies of participants. This
> > is about poor documentation im my understanding, not poor design.
> 
> There is almost more documentation than the code. Just look at the number
> of comments and level of their details. And still it does not help. So
> looks like just a bad API and dependency design. Does not matter how hard
> it is documented, it just becomes more interdepending and harder to wrap a
> mind around it.
> 
You can document every line of code with an explanation of what it does, but
there is no single line that requires drawing the 'big picture'. I believe
that is the problem. Design is not the code, rather a guideline for the code.
To decipher it back from the (perhaps not always good) code is a big task in
itself. This should help to understand - and only then improve - the
implementation.

> >>> In case of a leader failure a replica with the biggest LSN with former
> >>> leader's ID is elected as a new leader. The replica should record
> >>> 'rollback' in its WAL which effectively means that all transactions
> >>> without quorum should be rolled back. This rollback will be delivered to
> >>> all replicas and they will perform rollbacks of all transactions waiting
> >>> for quorum.
> >>
> >> 7. Please, elaborate leader election. It is not as trivial as just 'elect'.
> >> What if the replica with the biggest LSN is temporary not available, but
> >> it knows that it has the biggest LSN? Will it become a leader without
> >> asking other nodes? What will do the other nodes? Will they wait for the
> >> new leader node to become available? Do they have a timeout on that?
> >>
> > By now I do not plan any activities on HA - including automated failover
> > and leader re-election. In case leader sees insufficient number of
> > replicas to achieve quorum - it stops, reporting the problem to the
> > external orchestrator.
> 
> From the text in this section it does not look like a not planned activity,
> but like an already made decision. It is not even in the 'Plans' section.
> You just said 'is elected'. By whom? How?
> 
Again, I want to address this to current users - I would rephrase it.

> If the election is not a part of the RFC, I would suggest moving this out,
> or into a separate section 'Plans' or something. Or reformulate this by
> saying like 'the cluster stops serving write requests until external
> intervention sets a new leader'. And 'it is *advised* to use the one with
> the biggest LSN in the old leader's vclock component'. Something like that.
> 
I don't think automated election should even be a plan for SR.
It is a feature on top of it, shouldn't be a prerequisite in any form. > >> Basically, it would be nice to see the split-brain problem description here, > >> and its solution for us. > >> > > I believe the split-brain is under orchestrator control either - we > > should provide API to switch leader in the cluster, so that when a > > former leader came back it will not get quorum for any txn it has, > > replying to customers with failure as a result. > > Exactly. We should provide something for this from inside. But are there > any details? How should that work? Should all the healthy replicas reject > everything from the false-leader? Should the false-leader somehow realize, > that it is not considered a leader anymore, and should stop itself? If we > choose the former way, how a replica defines who is the true leader? For > example, some replicas still may consider the old leader as a true master. > If we choose the latter way, what is the algorithm of determining that we > are not a leader anymore? > It is all about external orchestration - if replica can't get ping from leader it stops, reporting its status to orchestrator. If leader lost number of replicas that makes quorum impossible - it stops replication, reporting to the orchestrator. Will it be sufficient to cover the question? > >>> After snapshot is created the WAL should start from the first operation > >>> that follows the commit operation snapshot is generated for. That means > >>> WAL will contain 'confirm' messages that refer to transactions that are > >>> not present in the WAL. Apparently, we have to allow this for the case > >>> 'confirm' refers to a transaction with LSN less than the first entry in > >>> the WAL. > >> > >> 9. I couldn't understand that. Why confirm is in WAL for data stored in > >> the snap? I thought you said above, that snapshot should be done for all > >> confirmed data. Besides, having confirm out of snap means the snap is > >> not self-sufficient anymore. > >> > > Snap waits for confirm message to start. During this wait the WAL keep > > growing. At the moment confirm arrived the snap will be created - say, > > for txn #10. The WAL will be started with lsn #11 and commit can be > > somewhere lsn #30. > > So, starting with this snap data appears consistent for lsn #10 - it is > > guaranteed by the wait of commit message. Then replay of WAL will come > > to a confirm message lsn #30 - referring to lsn #10 - that actually > > ignored, since it looks beyond the WAL start. There could be confirm > > messages for even earlier txns if wait takes sufficient time - all of > > them will refer to lsn beyond the WAL. And it is Ok. > > What is 'commit message'? I don't see it on the schema above. I see only > confirms. > Sorry, it is a misprint - I meant 'confirm'. > So the problem is that some data may be written to WAL after we started > committing our transactions going to the snap, but before we received a > quorum. And we can't truncate the WAL by the quorum, because there is > already newer data, which was not included into the snap. Because WAL is > not stopped, it still accepts new transactions. Now I understand. > > Would be good to have this example in the RFC. > Ok, I will try to elaborate on this. > >>> Cluster description should contain explicit attribute for each replica > >>> to denote it participates in synchronous activities. Also the description > >>> should contain criterion on how many replicas responses are needed to > >>> achieve the quorum. > >> > >> 11. 
Aha, I see 'necessary amount' from above is a manually set value. Ok. > >> > >>> > >>> ## Rationale and alternatives > >>> > >>> There is an implementation of synchronous replication as part of gh-980 > >>> activities, still it is not in a state to get into the product. More > >>> than that it intentionally breaks backward compatibility which is a > >>> prerequisite for this proposal. > >> > >> 12. How are we going to deal with fsync()? Will it be forcefully enabled > >> on sync replicas and the leader? > > > > To my understanding - it's up to user. I was considering a cluster that > > has no WAL at all - relying on sychro replication and sufficient number > > of replicas. Everyone who I asked about it told me I'm nuts. To my great > > surprise Alexander Lyapunov brought exactly the same idea to discuss. > > I didn't see an RFC on that, and this can become easily possible, when > in-memory relay is implemented. If it is implemented in a clean way. We > just can turn off the disk backoff, and it will work from memory-only. > It is not in RFC and we had no support from the customers in question. > > All of these is for one resolution: I would keep it for user to decide. > > Obviously, to speed up the processing leader can disable wal completely, > > but to do so we have to re-work the relay to work from memory. Replicas > > can use WAL in a way user wants: 2 replicas with slow HDD should'n wait > > for fsync(), while super-fast Intel DCPMM one can enable it. Balancing > > is up to user. > > Possibility of omitting fsync means that it is possible, that all nodes > write confirm, which is reported to the client, then the nodes restart, > and the data is lost. I would say it somewhere. The data will not be lost, unless _all_ nodes fail at the same time - including leader. Otherwise the data will be propagated from the survivor through the regular replication. No changes here to what we have currently. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-22 16:50 ` Sergey Ostanevich @ 2020-04-22 20:28 ` Vladislav Shpilevoy 0 siblings, 0 replies; 53+ messages in thread From: Vladislav Shpilevoy @ 2020-04-22 20:28 UTC (permalink / raw) To: Sergey Ostanevich; +Cc: tarantool-patches >>>> Basically, it would be nice to see the split-brain problem description here, >>>> and its solution for us. >>>> >>> I believe the split-brain is under orchestrator control either - we >>> should provide API to switch leader in the cluster, so that when a >>> former leader came back it will not get quorum for any txn it has, >>> replying to customers with failure as a result. >> >> Exactly. We should provide something for this from inside. But are there >> any details? How should that work? Should all the healthy replicas reject >> everything from the false-leader? Should the false-leader somehow realize, >> that it is not considered a leader anymore, and should stop itself? If we >> choose the former way, how a replica defines who is the true leader? For >> example, some replicas still may consider the old leader as a true master. >> If we choose the latter way, what is the algorithm of determining that we >> are not a leader anymore? >> > It is all about external orchestration - if replica can't get ping from > leader it stops, reporting its status to orchestrator. > If leader lost number of replicas that makes quorum impossible - it > stops replication, reporting to the orchestrator. > Will it be sufficient to cover the question? Perhaps. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-21 22:17   ` Vladislav Shpilevoy
  2020-04-22 16:50     ` Sergey Ostanevich
@ 2020-04-23  6:58     ` Konstantin Osipov
  2020-04-23  9:14       ` Konstantin Osipov
  1 sibling, 1 reply; 53+ messages in thread
From: Konstantin Osipov @ 2020-04-23 6:58 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches

* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/04/22 01:21]:
> > To my understanding - it's up to user. I was considering a cluster that
> > has no WAL at all - relying on sychro replication and sufficient number
> > of replicas. Everyone who I asked about it told me I'm nuts. To my great
> > surprise Alexander Lyapunov brought exactly the same idea to discuss.
> 
> I didn't see an RFC on that, and this can become easily possible, when
> in-memory relay is implemented. If it is implemented in a clean way. We
> just can turn off the disk backoff, and it will work from memory-only.

Sync replication must work from the in-memory relay only. It works as
a natural failure detector: a replica which is slow or unavailable
is first removed from the subscribers of the in-memory relay, and only
then (possibly much, much later) is marked as down.

By looking at the in-memory relay you have a clear idea what peers
are available and can abort a transaction right away if the cluster is
in a degraded state. You never wait for impossible events.

If you do have to wait, and say your wait timeout is 1 second, you
quickly run out of fibers in the fiber pool for any work,
because all of them will be waiting on the sync transactions they
picked up from iproto to finish. The system will lose its
throttling capability.

There are other reasons, too: the protocol will eventually be quite
tricky and the logic has to reside in a single place and not require
inter-thread communication. Committing a transaction anywhere outside
the WAL will require inter-thread communication, which is costly and
should be avoided.

I am surprised I have to explain this again and again - I never assumed
this spec to be a precursor for a half-baked implementation, only a
high-level description of the next steps after the in-memory relay is in.

> > All of these is for one resolution: I would keep it for user to decide.
> > Obviously, to speed up the processing leader can disable wal completely,
> > but to do so we have to re-work the relay to work from memory. Replicas
> > can use WAL in a way user wants: 2 replicas with slow HDD should'n wait
> > for fsync(), while super-fast Intel DCPMM one can enable it. Balancing
> > is up to user.
> 
> Possibility of omitting fsync means that it is possible, that all nodes
> write confirm, which is reported to the client, then the nodes restart,
> and the data is lost. I would say it somewhere.

Worse yet, you can elect a leader "based on WAL length" and then it is no
longer the leader, because it lost its long WAL after a restart.
fsync() is mandatory during election; in other cases it shouldn't
impact durability in our case.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-23 6:58 ` Konstantin Osipov @ 2020-04-23 9:14 ` Konstantin Osipov 2020-04-23 11:27 ` Sergey Ostanevich 0 siblings, 1 reply; 53+ messages in thread From: Konstantin Osipov @ 2020-04-23 9:14 UTC (permalink / raw) To: Vladislav Shpilevoy, Sergey Ostanevich, tarantool-patches * Konstantin Osipov <kostja.osipov@gmail.com> [20/04/23 09:58]: > > > To my understanding - it's up to user. I was considering a cluster that > > > has no WAL at all - relying on sychro replication and sufficient number > > > of replicas. Everyone who I asked about it told me I'm nuts. To my great > > > surprise Alexander Lyapunov brought exactly the same idea to discuss. > > > > I didn't see an RFC on that, and this can become easily possible, when > > in-memory relay is implemented. If it is implemented in a clean way. We > > just can turn off the disk backoff, and it will work from memory-only. > > Sync replication must work from in-memory relay only. It works as > a natural failure detector: a replica which is slow or unavailable > is first removed from the subscribers of in-memory relay, and only > then (possibly much much later) is marked as down. > > By looking at the in-memory relay you have a clear idea what peers > are available and can abort a transaction if a cluster is in the > downgraded state right away. You never wait for impossible events. > > If you do have to wait, and say your wait timeout is 1 second, you > quickly run out of any fibers in the fiber pool for any work, > because all of them will be waiting on the sync transactions they > picked up from iproto to finish. The system will loose its > throttling capability. The other issue is that if your replicas are alive but slow/lagging behind, you can't let too many undo records to pile up unacknowledged in tx thread. The in-memory relay solves this nicely too, because it kicks out replicas from memory to file mode if they are unable to keep up with the speed of change. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-23  9:14       ` Konstantin Osipov
@ 2020-04-23 11:27         ` Sergey Ostanevich
  2020-04-23 11:43           ` Konstantin Osipov
  0 siblings, 1 reply; 53+ messages in thread
From: Sergey Ostanevich @ 2020-04-23 11:27 UTC (permalink / raw)
To: Konstantin Osipov, Vladislav Shpilevoy, tarantool-patches

Hi!

Thanks for review!

On 23 Apr 12:14, Konstantin Osipov wrote:
> * Konstantin Osipov <kostja.osipov@gmail.com> [20/04/23 09:58]:
> > > > To my understanding - it's up to user. I was considering a cluster that
> > > > has no WAL at all - relying on sychro replication and sufficient number
> > > > of replicas. Everyone who I asked about it told me I'm nuts. To my great
> > > > surprise Alexander Lyapunov brought exactly the same idea to discuss.
> > > 
> > > I didn't see an RFC on that, and this can become easily possible, when
> > > in-memory relay is implemented. If it is implemented in a clean way. We
> > > just can turn off the disk backoff, and it will work from memory-only.
> > 
> > Sync replication must work from in-memory relay only. It works as
> > a natural failure detector: a replica which is slow or unavailable
> > is first removed from the subscribers of in-memory relay, and only
> > then (possibly much much later) is marked as down.
> > 
> > By looking at the in-memory relay you have a clear idea what peers
> > are available and can abort a transaction if a cluster is in the
> > downgraded state right away. You never wait for impossible events.
> > 
> > If you do have to wait, and say your wait timeout is 1 second, you
> > quickly run out of any fibers in the fiber pool for any work,
> > because all of them will be waiting on the sync transactions they
> > picked up from iproto to finish. The system will loose its
> > throttling capability.
> 
There's no need to explain it to the customer: sync replication is not
expected to be as fast as pure in-memory. By no means. We have network
communication, disk operations, a quorum of multiple entities - all of these
can't be as fast. No need to try to cram more than the network can push
through, obviously.
The quality one buys for this price: consistency of data in multiple
instances distributed across different locations.

> The other issue is that if your replicas are alive but
> slow/lagging behind, you can't let too many undo records to
> pile up unacknowledged in tx thread.
> The in-memory relay solves this nicely too, because it kicks out
> replicas from memory to file mode if they are unable to keep up
> with the speed of change.
> 
That is the same problem - the leader's resources, hence a natural limit for
throughput. I bet Tarantool faces similar limitations even now,
although different ones.

The in-memory relay is supposed to keep the same interface, so we expect to
hop easily onto this new shiny express as soon as it appears. This will be
an optimization: we're trying to implement something first and then speed
it up.

^ permalink raw reply	[flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-23 11:27 ` Sergey Ostanevich @ 2020-04-23 11:43 ` Konstantin Osipov 2020-04-23 15:11 ` Sergey Ostanevich 0 siblings, 1 reply; 53+ messages in thread From: Konstantin Osipov @ 2020-04-23 11:43 UTC (permalink / raw) To: Sergey Ostanevich; +Cc: tarantool-patches, Vladislav Shpilevoy * Sergey Ostanevich <sergos@tarantool.org> [20/04/23 14:29]: > Hi! > > Thanks for review! > > On 23 апр 12:14, Konstantin Osipov wrote: > > * Konstantin Osipov <kostja.osipov@gmail.com> [20/04/23 09:58]: > > > > > To my understanding - it's up to user. I was considering a cluster that > > > > > has no WAL at all - relying on sychro replication and sufficient number > > > > > of replicas. Everyone who I asked about it told me I'm nuts. To my great > > > > > surprise Alexander Lyapunov brought exactly the same idea to discuss. > > > > > > > > I didn't see an RFC on that, and this can become easily possible, when > > > > in-memory relay is implemented. If it is implemented in a clean way. We > > > > just can turn off the disk backoff, and it will work from memory-only. > > > > > > Sync replication must work from in-memory relay only. It works as > > > a natural failure detector: a replica which is slow or unavailable > > > is first removed from the subscribers of in-memory relay, and only > > > then (possibly much much later) is marked as down. > > > > > > By looking at the in-memory relay you have a clear idea what peers > > > are available and can abort a transaction if a cluster is in the > > > downgraded state right away. You never wait for impossible events. > > > > > > If you do have to wait, and say your wait timeout is 1 second, you > > > quickly run out of any fibers in the fiber pool for any work, > > > because all of them will be waiting on the sync transactions they > > > picked up from iproto to finish. The system will loose its > > > throttling capability. > > > There's no need to explain it to customer: sync replication is not > expected to be as fast as pure in-memory. By no means. We have network > communication, disk operation, multiple entities quorum - all of these > can't be as fast. No need to try cramp more than network can push > through, obvoiusly. This expected performance overhead is not a grant to run out of memory or available fibers on a node failure or network partitioning. > The quality one buys for this price: consistency of data in multiple > instances distributed across different locations. The spec should demonstrate the consistency is guaranteed: right now it can easily be violated during a leader change, and this is left out of scope of the spec. My take is that any implementation which is not close enough to a TLA+ proven spec is not trustworthy, so I would not claim myself or trust any one elses claims that it is consistent. At best this RFC could achieve durability, by ensuring that no transaction is committed unless it is delivered to a majority of replicas. Consistency requires implementing RAFT spec in full and showing that leader changes preserve the write ahead log linearizability. > > The other issue is that if your replicas are alive but > > slow/lagging behind, you can't let too many undo records to > > pile up unacknowledged in tx thread. > > The in-memory relay solves this nicely too, because it kicks out > > replicas from memory to file mode if they are unable to keep up > > with the speed of change. > > > That is the same problem - resources of leader, so natural limit for > throughput. 
I bet Tarantool faces similar limitations even now, > although different ones. > > The in-memory relay supposed to keep the same interface, so we expect to > hop easily to this new shiny express as soon as it appears. This will be > an optimization and we're trying to implement something and then speed > it up. It is pretty clear that the implementation will be different. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-23 11:43           ` Konstantin Osipov
@ 2020-04-23 15:11             ` Sergey Ostanevich
  2020-04-23 20:39               ` Konstantin Osipov
  0 siblings, 1 reply; 53+ messages in thread
From: Sergey Ostanevich @ 2020-04-23 15:11 UTC (permalink / raw)
To: Konstantin Osipov, Vladislav Shpilevoy, tarantool-patches

On 23 Apr 14:43, Konstantin Osipov wrote:
> > The quality one buys for this price: consistency of data in multiple
> > instances distributed across different locations.
> 
> The spec should demonstrate the consistency is guaranteed: right
> now it can easily be violated during a leader change, and this is
> left out of scope of the spec.
> 
> My take is that any implementation which is not close enough to a
> TLA+ proven spec is not trustworthy, so I would not claim myself
> or trust any one elses claims that it is consistent. At best this
> RFC could achieve durability, by ensuring that no transaction is
> committed unless it is delivered to a majority of replicas.

Which is exactly what is mentioned in the RFC goals.

> Consistency requires implementing RAFT spec in full and showing
> that leader changes preserve the write ahead log linearizability.
> 
So the leader should stop accepting transactions and wait until all txns in
the queue are resolved into confirmed ones, or else issue a rollback - after
a timeout, as a last resort.
Since there is no automation in leader election, the cluster will appear in a
consistent state after this. Now a new leader can be appointed with
all circumstances taken into account - node availability, ping from
the proxy, lsn, etc.
Again, this RFC is not about any HA features, such as auto-failover.

> > > The other issue is that if your replicas are alive but
> > > slow/lagging behind, you can't let too many undo records to
> > > pile up unacknowledged in tx thread.
> > > The in-memory relay solves this nicely too, because it kicks out
> > > replicas from memory to file mode if they are unable to keep up
> > > with the speed of change.
> > > 
> > That is the same problem - resources of leader, so natural limit for
> > throughput. I bet Tarantool faces similar limitations even now,
> > although different ones.
> > 
> > The in-memory relay supposed to keep the same interface, so we expect to
> > hop easily to this new shiny express as soon as it appears. This will be
> > an optimization and we're trying to implement something and then speed
> > it up.
> 
> It is pretty clear that the implementation will be different.
> 
Which contradicts the interface preservation, right?

> -- 
> Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread
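A minimal sketch of the drain procedure described above - stop accepting
writes, wait for the pending queue to resolve, roll back the tail after a
timeout. The `pending` object with its `count()` and `rollback_all()` is a
hypothetical placeholder, not an existing Tarantool API; only
`box.cfg{read_only = true}` and the `fiber`/`clock` modules are real.

```
-- Illustrative only: draining unconfirmed txns before a leader switch.
local fiber = require('fiber')
local clock = require('clock')

local function drain_before_switch(pending, timeout)
    box.cfg{read_only = true}              -- stop accepting write requests
    local deadline = clock.monotonic() + timeout
    while pending:count() > 0 and clock.monotonic() < deadline do
        fiber.sleep(0.1)                   -- let quorums/confirms arrive
    end
    if pending:count() == 0 then
        return 'confirmed'                 -- every queued txn got its quorum
    end
    pending:rollback_all()                 -- last resort after the timeout
    return 'rolled_back'
end
```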
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-23 15:11 ` Sergey Ostanevich @ 2020-04-23 20:39 ` Konstantin Osipov 0 siblings, 0 replies; 53+ messages in thread From: Konstantin Osipov @ 2020-04-23 20:39 UTC (permalink / raw) To: Sergey Ostanevich; +Cc: tarantool-patches, Vladislav Shpilevoy * Sergey Ostanevich <sergos@tarantool.org> [20/04/23 18:11]: > > The spec should demonstrate the consistency is guaranteed: right > > now it can easily be violated during a leader change, and this is > > left out of scope of the spec. > > > > My take is that any implementation which is not close enough to a > > TLA+ proven spec is not trustworthy, so I would not claim myself > > or trust any one elses claims that it is consistent. At best this > > RFC could achieve durability, by ensuring that no transaction is > > committed unless it is delivered to a majority of replicas. > > What is exactly mentioned in RFC goals. This is durability, though, not consistency. My point is: if consistency can not be guaranteed anyway, why assume single leader. Let's consider what happens if all replicas are allowed to collect acks, define for it the same semantics as we do today in case of async multi-master. Then add the remaining bits of RAFT. > > > Consistency requires implementing RAFT spec in full and showing > > that leader changes preserve the write ahead log linearizability. > > > So the leader should stop accepting transactions, wait for all txn in > queue resolved into confirmed either issue a rollback - after a > timeout as a last resort. > Since no automation in leader election the cluster will appear in a > consistent state after this. Now a new leader can be appointed with > all circumstances taken into account - nodes availability, ping from > the proxy, lsn, etc. > Again, this RFC is not about any HA features, such as auto-failover. > > > > > The other issue is that if your replicas are alive but > > > > slow/lagging behind, you can't let too many undo records to > > > > pile up unacknowledged in tx thread. > > > > The in-memory relay solves this nicely too, because it kicks out > > > > replicas from memory to file mode if they are unable to keep up > > > > with the speed of change. > > > > > > > That is the same problem - resources of leader, so natural limit for > > > throughput. I bet Tarantool faces similar limitations even now, > > > although different ones. > > > > > > The in-memory relay supposed to keep the same interface, so we expect to > > > hop easily to this new shiny express as soon as it appears. This will be > > > an optimization and we're trying to implement something and then speed > > > it up. > > > > It is pretty clear that the implementation will be different. > > > Which contradicts to the interface preservance, right? I don't believe internals and API can be so disconnected. I think in-memory relay is such a significant change that the implementation has to build upon it. The trigger-based implementation was contributed back in 2015 and went nowhere, in fact it was an inspiration to create a backlog of such items as parallel applier, applier in iproto, in-memory relay, and so on - all of these are "review items" for the trigger-based syncrep: https://github.com/Alexey-Ivanensky/tarantool/tree/bsync -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-03 21:08 [Tarantool-patches] [RFC] Quorum-based synchronous replication Sergey Ostanevich ` (2 preceding siblings ...) 2020-04-20 23:32 ` Vladislav Shpilevoy @ 2020-04-23 21:38 ` Vladislav Shpilevoy 2020-04-23 22:28 ` Konstantin Osipov 2020-04-30 14:50 ` Sergey Ostanevich 3 siblings, 2 replies; 53+ messages in thread From: Vladislav Shpilevoy @ 2020-04-23 21:38 UTC (permalink / raw) To: Sergey Ostanevich, tarantool-patches, Timur Safin, Mons Anderson Hi! Here is a short summary of our late night discussion and the questions it brought up, while I was trying to design a draft plan of an implementation. Since the RFC is too far from the code, and I needed a more 'pedestrian' and detailed plan. The question is about 'confirm' message and quorum collection. Here is the schema presented in the RFC: > Customer Leader WAL(L) Replica WAL(R) > |------TXN----->| | | | > | | | | | > | [TXN undo log | | | > | created] | | | > | | | | | > | |-----TXN----->| | | > | | | | | > | |-------Replicate TXN------->| | > | | | | | > | | | [TXN undo log | > | |<---WAL Ok----| created] | > | | | | | > | [Waiting | |-----TXN----->| > | of a quorum] | | | > | | | |<---WAL Ok----| > | | | | | > | |<------Replication Ok-------| | > | | | | | > | [Quorum | | | > | achieved] | | | > | | | | | > | [TXN undo log | | | > | destroyed] | | | > | | | | | > | |---Confirm--->| | | > | | | | | > | |----------Confirm---------->| | > | | | | | > |<---TXN Ok-----| | [TXN undo log | > | | | destroyed] | > | | | | | > | | | |---Confirm--->| > | | | | | It says, that once the quorum is collected, and 'confirm' is written to local leader's WAL, it is considered committed and is reported to the client as successful. On the other hand it is said, that in case of leader change the new leader will rollback all not confirmed transactions. That leads to the following bug: Assume we have 4 instances: i1, i2, i3, i4. Leader is i1. It writes a transaction with LSN1. The LSN1 is sent to other nodes, they apply it ok, and send acks to the leader. The leader sees i2-i4 all applied the transaction (propagated their LSNs to LSN1). It writes 'confirm' to its local WAL, reports it to the client as success, the client's request is over, it is returned back to some remote node, etc. The transaction is officially synchronously committed. Then the leader's machine dies - disk is dead. The confirm was not sent to any of the other nodes. For example, it started having problems with network connection to the replicas recently before the death. Or it just didn't manage to hand the confirm out. From now on if any of the other nodes i2-i4 becomes a leader, it will rollback the officially confirmed transaction, even if it has it, and all the other nodes too. That basically means, this sync replication gives exactly the same guarantees as the async replication - 'confirm' on the leader tells nothing about replicas except that they *are able to apply the transaction*, but still may not apply it. Am I missing something? Another issue is with failure detection. Lets assume, that we wait for 'confirm' to be propagated on quorum of replicas too. Assume some replicas responded with an error. So they first said they can apply the transaction, and saved it into their WALs, and then they couldn't apply confirm. That could happen because of 2 reasons: replica has problems with WAL, or the replica becomes unreachable from the master. 
WAL-problematic replicas can be disconnected forcefully, since they
are clearly not able to work properly anymore. But what to do with
disconnected replicas? 'Confirm' can't wait for them forever - we
will run out of fibers if we have even just hundreds of RPS of
sync transactions and wait for, let's say, a few minutes. On the
other hand we can't roll them back, because 'confirm' has been
written to the local WAL already.

Note for those who are concerned: this has nothing to do with the
in-memory relay. It has the same problems, which are in the protocol,
not in the implementation.

^ permalink raw reply	[flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-23 21:38 ` Vladislav Shpilevoy
@ 2020-04-23 22:28   ` Konstantin Osipov
  0 siblings, 0 replies; 53+ messages in thread
From: Konstantin Osipov @ 2020-04-23 22:28 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches

* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/04/24 00:42]:
> It says, that once the quorum is collected, and 'confirm' is written
> to local leader's WAL, it is considered committed and is reported
> to the client as successful.
> 
> On the other hand it is said, that in case of leader change the
> new leader will rollback all not confirmed transactions. That leads
> to the following bug:
> 
> Assume we have 4 instances: i1, i2, i3, i4. Leader is i1. It
> writes a transaction with LSN1. The LSN1 is sent to other nodes,
> they apply it ok, and send acks to the leader. The leader sees
> i2-i4 all applied the transaction (propagated their LSNs to LSN1).
> It writes 'confirm' to its local WAL, reports it to the client as
> success, the client's request is over, it is returned back to
> some remote node, etc. The transaction is officially synchronously
> committed.
> 
> Then the leader's machine dies - disk is dead. The confirm was
> not sent to any of the other nodes. For example, it started having
> problems with network connection to the replicas recently before
> the death. Or it just didn't manage to hand the confirm out.
> 
> From now on if any of the other nodes i2-i4 becomes a leader, it
> will rollback the officially confirmed transaction, even if it
> has it, and all the other nodes too.
> 
> That basically means, this sync replication gives exactly the same
> guarantees as the async replication - 'confirm' on the leader tells
> nothing about replicas except that they *are able to apply the
> transaction*, but still may not apply it.
> 
> Am I missing something?

This video explains what a leader has to do after it's been elected:

https://www.youtube.com/watch?v=YbZ3zDzDnrw

In short, the transactions in the leader's WAL have to be committed,
not rolled back.

The Raft paper at https://raft.github.io/raft.pdf has the answers in a
concise single-page summary. Why have this discussion at all? Any
ambiguity or discrepancy between this document and the Raft paper should
be treated as a mistake in this document. Or do you actually think it's
possible to invent a new consensus algorithm this way?

> Note for those who is concerned: this has nothing to do with
> in-memory relay. It has the same problems, which are in the protocol,
> not in the implementation.

No, the issues are distinct:

1) there may be cases where this paper doesn't follow Raft. It should be
obvious to everyone that, with the exception of external leader election
and failure detection, it has to if correctness is of any concern, so
it's simply a matter of fixing this doc to match Raft.

As to the leader election, there are two alternatives: either spec out in
this paper how the external election interacts with the cluster,
including finishing up old transactions and neutralizing old leaders, or
allow multi-master, so forget about consistency for now.

2) an implementation based on triggers will be complicated and will have
performance/stability implications. This is what I hope I was able to
convey, and in this case we can put the matter to rest.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
  2020-04-23 21:38 ` Vladislav Shpilevoy
  2020-04-23 22:28   ` Konstantin Osipov
@ 2020-04-30 14:50   ` Sergey Ostanevich
  2020-05-06  8:52     ` Konstantin Osipov
                       ` (2 more replies)
  1 sibling, 3 replies; 53+ messages in thread
From: Sergey Ostanevich @ 2020-04-30 14:50 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches

Hi!

Thanks for the review! After a long discussion we agreed to rework the RFC.

On 23 Apr 23:38, Vladislav Shpilevoy wrote:
> It says, that once the quorum is collected, and 'confirm' is written
> to local leader's WAL, it is considered committed and is reported
> to the client as successful.
> 
> On the other hand it is said, that in case of leader change the
> new leader will rollback all not confirmed transactions. That leads

This is no longer the case: we decided to follow Raft's approach in which
the leader rules the world, hence it commits all changes present in its WAL.

> 
> Another issue is with failure detection. Lets assume, that we wait
> for 'confirm' to be propagated on quorum of replicas too. Assume
> some replicas responded with an error. So they first said they can
> apply the transaction, and saved it into their WALs, and then they
> couldn't apply confirm. That could happen because of 2 reasons:
> replica has problems with WAL, or the replica becomes unreachable
> from the master.
> 
> WAL-problematic replicas can be disconnected forcefully, since they
> are clearly not able to work properly anymore. But what to do with
> disconnected replicas? 'Confirm' can't wait for them forever - we
> will run out of fibers, if we have even just hundreds of RPS of
> sync transactions, and wait for, lets say, a few minutes. On the
> other hand we can't roll them back, because 'confirm' has been
> written to the local WAL already.

Here we agreed that such a replica will be kicked out of the cluster and
will wait for human intervention to fix the problem - probably with a
rejoin. In case the available replicas are not enough to achieve the
quorum, the leader also reports the problem and stops cluster operation
until the cluster is reconfigured or the number of replicas becomes
sufficient.

Below is the new RFC, available at
https://github.com/tarantool/tarantool/blob/sergos/quorum-based-synchro/doc/rfc/quorum-based-synchro.md

---

* **Status**: In progress
* **Start date**: 31-03-2020
* **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\>
* **Issues**: https://github.com/tarantool/tarantool/issues/4842

## Summary

The aim of this RFC is to address the following list of problems
formulated at MRG planning meeting:
- protocol backward compatibility to enable cluster upgrade w/o downtime
- consistency of data on replica and leader
- switch from leader to replica without data loss
- up to date replicas to run read-only requests
- ability to switch async replicas into sync ones and vice versa
- guarantee of rollback on leader and sync replicas
- simplicity of cluster orchestration

What this RFC is not:

- high availability (HA) solution with automated failover, role
  assignments and so on
- master-master configuration support

## Background and motivation

There are a number of known implementations of consistent data presence in
a Tarantool cluster. They can be commonly named the "wait for LSN"
technique. The biggest issue with this technique is the absence of
rollback guarantees at a replica in case of transaction failure on one
master or on some of the replicas in the cluster.
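For reference, a minimal sketch of the "wait for LSN" technique mentioned
above, built only on the existing `box.info` fields and the `fiber` module;
the `master_id`, `target_lsn` and timeout values are illustrative.

```
-- Classic "wait for LSN": the client remembers the LSN its write produced
-- on the master and polls a replica until the replica's vclock component
-- for that master catches up. Illustrative sketch only.
local fiber = require('fiber')

local function wait_for_lsn(master_id, target_lsn, timeout)
    local deadline = fiber.clock() + timeout
    while (box.info.vclock[master_id] or 0) < target_lsn do
        if fiber.clock() > deadline then
            return false            -- the replica did not catch up in time
        end
        fiber.sleep(0.01)
    end
    return true
end

-- Usage on a replica: wait_for_lsn(1, lsn_reported_by_master, 5)
```

As noted above, this only tells the caller that the data reached the
replica; it gives no rollback guarantee if the transaction later fails
elsewhere in the cluster.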
To provide such capabilities a new functionality should be introduced in
Tarantool core, with requirements mentioned before - backward
compatibility and ease of cluster orchestration.

## Detailed design

### Quorum commit

The main idea behind the proposal is to reuse the existing machinery as
much as possible. It will ensure the well-tested and proven functionality
across many instances in MRG and beyond is used. The transaction rollback
mechanism is in place and works for WAL write failure. If we substitute
the WAL success with a new situation which is named 'quorum' later in
this document then no changes to the machinery are needed. The same is
true for the snapshot machinery that allows creating a copy of the
database in memory for the whole period of the snapshot file write.
Adding a quorum here also minimizes changes.

Currently replication is represented by the following scheme:
```
Customer        Leader          WAL(L)          Replica         WAL(R)
   |------TXN----->|              |                |              |
   |               |              |                |              |
   |         [TXN undo log        |                |              |
   |            created]          |                |              |
   |               |              |                |              |
   |               |-----TXN----->|                |              |
   |               |              |                |              |
   |               |<---WAL Ok----|                |              |
   |               |              |                |              |
   |         [TXN undo log        |                |              |
   |           destroyed]         |                |              |
   |               |              |                |              |
   |<----TXN Ok----|              |                |              |
   |               |-------Replicate TXN---------->|              |
   |               |              |                |              |
   |               |              |          [TXN undo log        |
   |               |              |             created]          |
   |               |              |                |              |
   |               |              |                |-----TXN----->|
   |               |              |                |              |
   |               |              |                |<---WAL Ok----|
   |               |              |                |              |
   |               |              |          [TXN undo log        |
   |               |              |            destroyed]         |
   |               |              |                |              |
```

To introduce the 'quorum' we have to receive confirmation from replicas
to make a decision on whether the quorum is actually present. The leader
collects the necessary number of replica confirmations plus its own WAL
success. This state is named 'quorum' and gives the leader the right to
complete the customer's request. So the picture will change to:
```
Customer        Leader          WAL(L)          Replica         WAL(R)
   |------TXN----->|              |                |              |
   |               |              |                |              |
   |         [TXN undo log        |                |              |
   |            created]          |                |              |
   |               |              |                |              |
   |               |-----TXN----->|                |              |
   |               |              |                |              |
   |               |-------Replicate TXN---------->|              |
   |               |              |                |              |
   |               |              |          [TXN undo log        |
   |               |<---WAL Ok----|             created]          |
   |               |              |                |              |
   |           [Waiting           |                |-----TXN----->|
   |         of a quorum]         |                |              |
   |               |              |                |<---WAL Ok----|
   |               |              |                |              |
   |               |<------Replication Ok----------|              |
   |               |              |                |              |
   |            [Quorum           |                |              |
   |           achieved]          |                |              |
   |               |              |                |              |
   |         [TXN undo log        |                |              |
   |           destroyed]         |                |              |
   |               |              |                |              |
   |               |---Confirm--->|                |              |
   |               |              |                |              |
   |               |----------Confirm------------->|              |
   |               |              |                |              |
   |<---TXN Ok-----|              |          [TXN undo log        |
   |               |              |            destroyed]         |
   |               |              |                |              |
   |               |              |                |---Confirm--->|
   |               |              |                |              |
```

The quorum should be collected as a table of transactions waiting for the
quorum. The latest transaction that collects the quorum is considered
complete, as well as all transactions prior to it, since all transactions
should be applied in order. The leader writes a 'confirm' message to the
WAL that refers to the transaction's [LEADER_ID, LSN], and the confirm has
its own LSN. This confirm message is delivered to all replicas through the
existing replication mechanism.

A replica should explicitly report a TXN application success to the leader
via IPROTO to allow the leader to collect the quorum for the TXN.
In case of application failure the replica has to disconnect from the
replication the same way as it is done now. The replica also has to report
its disconnection to the orchestrator. Further actions require human
intervention, since the failure means either a technical problem (such as
not enough space for the WAL) that has to be resolved or an inconsistent
state that requires a rejoin.

As soon as the leader finds itself in a situation where it does not have
enough replicas to achieve the quorum, the cluster should stop accepting
any requests - both write and read. The reason for this is that
replication of transactions can achieve the quorum on replicas not visible
to the leader. On the other hand, the leader can't achieve the quorum with
the available minority. The leader has to report the state and wait for
human intervention. There's an option to ask the leader to roll back to
the latest transaction that has a quorum: the leader issues a 'rollback'
message referring to the [LEADER_ID, LSN] where LSN is that of the first
transaction in the leader's undo log. The rollback message, replicated to
the available cluster, will put it in a consistent state. After that the
configuration of the cluster can be updated to the available quorum and
the leader can be switched back to write mode.
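To make the 'table of transactions waiting for the quorum' above concrete,
here is a minimal Lua sketch of the leader-side bookkeeping. It is only an
illustration of the idea, not Tarantool code; `write_confirm_to_wal` is a
hypothetical hook supplied by the caller, and the leader's own WAL success
is simply an ack with its own instance id.

```
-- Illustrative leader-side quorum bookkeeping. Acks are counted per LSN;
-- once some LSN reaches the quorum, it and everything before it is
-- confirmed, since transactions are applied strictly in order.
local function new_quorum_table(quorum_size, write_confirm_to_wal)
    return {
        waiting = {},          -- lsn -> set of instance ids that acked it
        confirmed_lsn = 0,
        add_txn = function(self, lsn)
            self.waiting[lsn] = {}
        end,
        ack = function(self, lsn, instance_id)
            local acks = self.waiting[lsn]
            if acks == nil then return end      -- already confirmed
            acks[instance_id] = true
            local count = 0
            for _ in pairs(acks) do count = count + 1 end
            if count >= quorum_size then
                -- Release this txn and every earlier one still waiting.
                for l in pairs(self.waiting) do
                    if l <= lsn then self.waiting[l] = nil end
                end
                self.confirmed_lsn = math.max(self.confirmed_lsn, lsn)
                write_confirm_to_wal(lsn)       -- hypothetical hook
            end
        end,
    }
end
```

Note how a single ack that completes the quorum for some LSN releases every
earlier transaction as well - exactly the in-order property this section
relies on.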
The reason for this is that replication of transactions can achieve quorum on replicas not visible to the leader. On the other hand, the leader can't achieve quorum with the available minority. The leader has to report the state and wait for human intervention. There's an option to ask the leader to roll back to the latest transaction that has quorum: the leader issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN is of the first transaction in the leader's undo log. The rollback message replicated to the available cluster will put it in a consistent state. After that the configuration of the cluster can be updated to the available quorum and the leader can be switched back to write mode. ### Leader role assignment. To assign a leader role to an instance the following should be performed: 1. among all available instances pick the one that has the biggest vclock element of the former leader ID; an arbitrary instance can be selected if it is the first time the leader is assigned 2. the leader should assure that the number of available instances in the cluster is enough to achieve the quorum and proceed to step 3, otherwise the leader should report the situation of incomplete quorum, as in the last paragraph of the previous section 3. the selected instance has to take the responsibility to replicate former leader entries from its WAL, obtaining the quorum and committing confirm messages referring to [FORMER_LEADER_ID, LSN] in its WAL, replicating to the cluster, after that it can start adding its own entries into the WAL ### Recovery and failover. A Tarantool instance, while reading the WAL, should postpone the undo log deletion until the 'confirm' is read. In case the WAL EOF is reached, the instance should keep the undo log for all transactions that are waiting for a confirm entry until the role of the instance is set. If this instance is assigned a leader role then all transactions that have no corresponding confirm message should be confirmed (see the leader role assignment). In case there are not enough replicas to set up a quorum the cluster can be switched into a read-only mode. Note, this can't be done by default since some of the transactions can be in a confirmed state. It is up to human intervention to force a rollback of all transactions that have no confirm and to put the cluster into a consistent state. In case the instance is assigned a replica role, it may appear in a state where it has conflicting WAL entries, in case it recovered from a leader role and some of its transactions were not replicated to the current leader. This situation should be resolved through a rejoin of the instance. ### Snapshot generation. We also can reuse the current machinery of snapshot generation. Upon receiving a request to create a snapshot an instance should request a readview for the current commit operation. However, the start of the snapshot generation should be postponed until this commit operation receives its confirmation. In case the operation is rolled back, the snapshot generation should be aborted and restarted using the current transaction after the rollback is complete. After the snapshot is created the WAL should start from the first operation that follows the commit operation the snapshot is generated for. That means the WAL will contain 'confirm' messages that refer to transactions that are not present in the WAL. Apparently, we have to allow this for the case a 'confirm' refers to a transaction with an LSN less than the first entry in the WAL. In case the master appears unavailable a replica still has to be able to create a snapshot. 
The replica can perform rollback for all transactions that are not confirmed and claim its LSN as the latest confirmed txn. Then it can create a snapshot in a regular way and start with a blank xlog file. All rolled back transactions will reappear through the regular replication in case the master reappears later on. ### Asynchronous replication. Along with synchronous replicas the cluster can contain asynchronous replicas. Async replicas don't reply to the leader with errors since they don't contribute to the quorum. Still, async replicas have to follow the new WAL operations, such as keeping the rollback info until a 'confirm' message is received. This is essential for the case a 'rollback' message appears in the WAL: this message assumes the replica is able to perform all the necessary rollback by itself. Cluster information should contain an explicit notification of each replica's operation mode. ### Synchronous replication enabling. Synchronous operation can be required for a set of spaces in the data scheme. That means only transactions that contain data modifications for these spaces require the quorum. Such transactions are named synchronous. As soon as the last operation of a synchronous transaction appears in the leader's WAL, it will cause all following transactions - no matter if they are synchronous or not - to wait for the quorum. In case the quorum is not achieved, the 'rollback' operation will cause a rollback of all transactions after the synchronous one. This ensures a consistent state of the data both on the leader and the replicas. In case the user doesn't require synchronous operation for any space then no changes to the WAL generation and replication will appear. The cluster description should contain an explicit attribute for each replica to denote whether it participates in synchronous activities. Also the description should contain the criterion on how many replica responses are needed to achieve the quorum. ## Rationale and alternatives There is an implementation of synchronous replication as part of gh-980 activities, still it is not in a state to get into the product. Moreover, it intentionally breaks backward compatibility, which is a prerequisite for this proposal. ^ permalink raw reply [flat|nested] 53+ messages in thread
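To make the "table for a list of transactions waiting for quorum" from the RFC above more concrete, here is a minimal C sketch of such a table. The names, the fixed-size array and the bare ack counter are illustrative assumptions, not the actual Tarantool implementation; the point is the property the RFC relies on: a quorum collected by the latest transaction confirms it together with every pending transaction before it, so a single 'confirm' record covers a whole prefix of the WAL.
```
#include <stdint.h>
#include <stdio.h>

#define QUORUM      2     /* leader's own WAL write + one replica ack */
#define MAX_PENDING 128

/* One entry of the "waiting for quorum" table. */
struct pending_txn {
	int64_t lsn;   /* LSN of the transaction in the leader's WAL */
	int acks;      /* leader WAL success + replica confirmations  */
};

static struct pending_txn pending[MAX_PENDING];
static int pending_count;

/* Would be a 'confirm' record written to the leader's WAL. */
static void
write_confirm(int64_t lsn)
{
	printf("confirm [LEADER_ID, %lld]\n", (long long)lsn);
}

/* Leader's local WAL write succeeded: the txn starts waiting for quorum. */
static void
pending_add(int64_t lsn)
{
	pending[pending_count++] = (struct pending_txn){ .lsn = lsn, .acks = 1 };
}

/* A replica reported that it applied everything up to @lsn. */
static void
pending_ack(int64_t lsn)
{
	int last = -1;
	for (int i = 0; i < pending_count; i++) {
		if (pending[i].lsn <= lsn && ++pending[i].acks >= QUORUM)
			last = i;
	}
	if (last < 0)
		return;
	/* One confirm covers the quorum'ed txn and all txns before it. */
	write_confirm(pending[last].lsn);
	for (int i = last + 1; i < pending_count; i++)
		pending[i - last - 1] = pending[i];
	pending_count -= last + 1;
}

int
main(void)
{
	pending_add(10);
	pending_add(11);
	pending_ack(11);   /* with QUORUM == 2 this confirms LSN 10 and 11 */
	return 0;
}
```
A real implementation would track which replica acknowledged which LSN (e.g. a per-replica vclock) instead of a plain counter, so repeated acknowledgements from the same replica are not double-counted.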
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-30 14:50 ` Sergey Ostanevich @ 2020-05-06 8:52 ` Konstantin Osipov 2020-05-06 16:39 ` Sergey Ostanevich 2020-05-06 18:55 ` Konstantin Osipov 2020-05-07 23:01 ` Konstantin Osipov 2 siblings, 1 reply; 53+ messages in thread From: Konstantin Osipov @ 2020-05-06 8:52 UTC (permalink / raw) To: Sergey Ostanevich; +Cc: tarantool-patches, Vladislav Shpilevoy * Sergey Ostanevich <sergos@tarantool.org> [20/04/30 17:51]: > Hi! > > Thanks for the review! > > After a long discussion we agreed to rework the RFC. > > On 23 апр 23:38, Vladislav Shpilevoy wrote: > > It says, that once the quorum is collected, and 'confirm' is written > > to local leader's WAL, it is considered committed and is reported > > to the client as successful. > > > > On the other hand it is said, that in case of leader change the > > new leader will rollback all not confirmed transactions. That leads > > This is no longer right, we decided to follow the RAFT's approach that > leader rules the world, hence committing all changes in it's WAL. > > > > > Another issue is with failure detection. Lets assume, that we wait > > for 'confirm' to be propagated on quorum of replicas too. Assume > > some replicas responded with an error. So they first said they can > > apply the transaction, and saved it into their WALs, and then they > > couldn't apply confirm. That could happen because of 2 reasons: > > replica has problems with WAL, or the replica becomes unreachable > > from the master. > > > > WAL-problematic replicas can be disconnected forcefully, since they > > are clearly not able to work properly anymore. But what to do with > > disconnected replicas? 'Confirm' can't wait for them forever - we > > will run out of fibers, if we have even just hundreds of RPS of > > sync transactions, and wait for, lets say, a few minutes. On the > > other hand we can't roll them back, because 'confirm' has been > > written to the local WAL already. > > Here we agreed that replica will be kicked out of cluster and wait for > human intervention to fix the problems - probably with rejoin. In case > available replics are not enough to achieve the quorum leader also > reports the problem and stop the cluster operation until cluster > reconfigured or number of replicas will become sufficient. > > Below is the new RFC, available at > https://github.com/tarantool/tarantool/blob/sergos/quorum-based-synchro/doc/rfc/quorum-based-synchro.md > > --- > * **Status**: In progress > * **Start date**: 31-03-2020 > * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\> > * **Issues**: https://github.com/tarantool/tarantool/issues/4842 > > ## Summary > > The aim of this RFC is to address the following list of problems > formulated at MRG planning meeting: > - protocol backward compatibility to enable cluster upgrade w/o > downtime > - consistency of data on replica and leader > - switch from leader to replica without data loss > - up to date replicas to run read-only requests > - ability to switch async replicas into sync ones and vice versa > - guarantee of rollback on leader and sync replicas > - simplicity of cluster orchestration > > What this RFC is not: > > - high availability (HA) solution with automated failover, roles > assignments an so on > - master-master configuration support > > ## Background and motivation > > There are number of known implementation of consistent data presence in > a Tarantool cluster. They can be commonly named as "wait for LSN" > technique. 
The biggest issue with this technique is the absence of > rollback guarantees at replica in case of transaction failure on one > master or some of the replicas in the cluster. > > To provide such capabilities a new functionality should be introduced in > Tarantool core, with requirements mentioned before - backward > compatibility and ease of cluster orchestration. > > ## Detailed design > > ### Quorum commit > > The main idea behind the proposal is to reuse existent machinery as much > as possible. It will ensure the well-tested and proven functionality > across many instances in MRG and beyond is used. The transaction rollback > mechanism is in place and works for WAL write failure. If we substitute > the WAL success with a new situation which is named 'quorum' later in > this document then no changes to the machinery is needed. The same is > true for snapshot machinery that allows to create a copy of the database > in memory for the whole period of snapshot file write. Adding quorum here > also minimizes changes. > > Currently replication represented by the following scheme: > ``` > Customer Leader WAL(L) Replica WAL(R) > |------TXN----->| | | | > | | | | | > | [TXN undo log | | | > | created] | | | > | | | | | > | |-----TXN----->| | | > | | | | | > | |<---WAL Ok----| | | > | | | | | > | [TXN undo log | | | > | destroyed] | | | > | | | | | > |<----TXN Ok----| | | | > | |-------Replicate TXN------->| | > | | | | | > | | | [TXN undo log | > | | | created] | > | | | | | > | | | |-----TXN----->| > | | | | | > | | | |<---WAL Ok----| > | | | | | > | | | [TXN undo log | > | | | destroyed] | > | | | | | > ``` > > To introduce the 'quorum' we have to receive confirmation from replicas > to make a decision on whether the quorum is actually present. Leader > collects necessary amount of replicas confirmation plus its own WAL > success. This state is named 'quorum' and gives leader the right to > complete the customers' request. So the picture will change to: > ``` > Customer Leader WAL(L) Replica WAL(R) > |------TXN----->| | | | > | | | | | > | [TXN undo log | | | > | created] | | | > | | | | | > | |-----TXN----->| | | > | | | | | > | |-------Replicate TXN------->| | > | | | | | > | | | [TXN undo log | > | |<---WAL Ok----| created] | > | | | | | > | [Waiting | |-----TXN----->| > | of a quorum] | | | > | | | |<---WAL Ok----| > | | | | | > | |<------Replication Ok-------| | > | | | | | > | [Quorum | | | > | achieved] | | | > | | | | | > | [TXN undo log | | | > | destroyed] | | | > | | | | | > | |---Confirm--->| | | > | | | | | What happens if writing Confirm to WAL fails? TXN und log record is destroyed already. Will the server panic now on WAL failure, even if it is intermittent? > | |----------Confirm---------->| | What happens if peers receive and maybe even write Confirm to their WALs but local WAL write is lost after a restart? WAL is not synced, so we can easily lose the tail of the WAL. Tarantool will sync up with all replicas on restart, but there will be no "Replication OK" messages from them, so it wouldn't know that the transaction is committed on them. How is this handled? We may end up with some replicas confirming the transaction while the leader will roll it back on restart. Do you suggest there is a human intervention on restart as well? > | | | | | > |<---TXN Ok-----| | [TXN undo log | > | | | destroyed] | > | | | | | > | | | |---Confirm--->| > | | | | | > ``` > > The quorum should be collected as a table for a list of transactions > waiting for quorum. 
The latest transaction that collects the quorum is > considered as complete, as well as all transactions prior to it, since > all transactions should be applied in order. Leader writes a 'confirm' > message to the WAL that refers to the transaction's [LEADER_ID, LSN] and > the confirm has its own LSN. This confirm message is delivered to all > replicas through the existing replication mechanism. > > Replica should report a TXN application success to the leader via the > IPROTO explicitly to allow leader to collect the quorum for the TXN. > In case of application failure the replica has to disconnect from the > replication the same way as it is done now. The replica also has to > report its disconnection to the orchestrator. Further actions require > human intervention, since failure means either technical problem (such > as not enough space for WAL) that has to be resovled or an inconsistent > state that requires rejoin. > As soon as leader appears in a situation it has not enough replicas > to achieve quorum, the cluster should stop accepting any requests - both > write and read. How does *the cluster* know the state of the leader and if it doesn't, how it can possibly implement this? Did you mean the leader should stop accepting transactions here? But how can the leader know if it has not enough replicas during a read transaction, if it doesn't contact any replica to serve a read? > The reason for this is that replication of transactions > can achieve quorum on replicas not visible to the leader. On the other > hand, leader can't achieve quorum with available minority. Leader has to > report the state and wait for human intervention. There's an option to > ask leader to rollback to the latest transaction that has quorum: leader > issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN > is of the first transaction in the leader's undo log. The rollback > message replicated to the available cluster will put it in a consistent > state. After that configuration of the cluster can be updated to > available quorum and leader can be switched back to write mode. As you should be able to conclude from restart scenario, it is possible a replica has the record in *confirmed* state but the leader has it in pending state. The replica will not be able to roll back then. Do you suggest the replica should abort if it can't rollback? This may lead to an avalanche of rejoins on leader restart, bringing performance to a halt. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 53+ messages in thread
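For reference, the recovery rule stated in the RFC ("postpone the undo log deletion until the 'confirm' is read") can be sketched as below; the record types and the fixed-size pending list are assumptions for illustration, not Tarantool internals. At WAL EOF, whatever is still in the pending list stays in the undo log until the instance learns its role - which is exactly the window the restart questions above are about.
```
#include <stdint.h>

enum wal_rec_type { WAL_TXN, WAL_CONFIRM, WAL_ROLLBACK };

struct wal_rec {
	enum wal_rec_type type;
	int64_t lsn;   /* own LSN for TXN; referenced LSN for confirm/rollback */
};

/* LSNs whose undo logs are still kept - not committed, not rolled back. */
struct recovery {
	int64_t pending[128];
	int pending_count;
};

static void
recover_record(struct recovery *r, const struct wal_rec *rec)
{
	int kept = 0;
	switch (rec->type) {
	case WAL_TXN:
		/* Apply the change, but keep the undo log: no commit yet. */
		r->pending[r->pending_count++] = rec->lsn;
		return;
	case WAL_CONFIRM:
		/* Commit every pending txn up to and including rec->lsn. */
		for (int i = 0; i < r->pending_count; i++)
			if (r->pending[i] > rec->lsn)
				r->pending[kept++] = r->pending[i];
		break;
	case WAL_ROLLBACK:
		/* Undo every pending txn starting from rec->lsn. */
		for (int i = 0; i < r->pending_count; i++)
			if (r->pending[i] < rec->lsn)
				r->pending[kept++] = r->pending[i];
		break;
	}
	r->pending_count = kept;
}

/*
 * At WAL EOF: if pending_count > 0 the instance waits for its role.
 * A replica waits for replicated confirm/rollback records; an instance
 * promoted to leader finalizes the pending txns itself.
 */
```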
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-06 8:52 ` Konstantin Osipov @ 2020-05-06 16:39 ` Sergey Ostanevich 2020-05-06 18:44 ` Konstantin Osipov 2020-05-13 21:36 ` Vladislav Shpilevoy 0 siblings, 2 replies; 53+ messages in thread From: Sergey Ostanevich @ 2020-05-06 16:39 UTC (permalink / raw) To: Konstantin Osipov, Vladislav Shpilevoy, tarantool-patches Hi! Thanks for review! > > | | | | | > > | [Quorum | | | > > | achieved] | | | > > | | | | | > > | [TXN undo log | | | > > | destroyed] | | | > > | | | | | > > | |---Confirm--->| | | > > | | | | | > > What happens if writing Confirm to WAL fails? TXN und log record > is destroyed already. Will the server panic now on WAL failure, > even if it is intermittent? I would like to have an example of intermittent WAL failure. Can it be other than problem with disc - be it space/availability/malfunction? For all of those it should be resolved outside the DBMS anyways. So, leader should stop and report its problems to orchestrator/admins. I would agree that undo log can be destroyed *after* the Confirm is landed to WAL - same is for replica. > > > | |----------Confirm---------->| | > > What happens if peers receive and maybe even write Confirm to their WALs > but local WAL write is lost after a restart? Did you mean WAL write on leader as a local? Then we have a replica with a bigger LSN for the leader ID. > WAL is not synced, > so we can easily lose the tail of the WAL. Tarantool will sync up > with all replicas on restart, But at this point a new leader will be appointed - the old one is restarted. Then the Confirm message will arrive to the restarted leader through a regular replication. > but there will be no "Replication > OK" messages from them, so it wouldn't know that the transaction > is committed on them. How is this handled? We may end up with some > replicas confirming the transaction while the leader will roll it > back on restart. Do you suggest there is a human intervention on > restart as well? > > > > | | | | | > > |<---TXN Ok-----| | [TXN undo log | > > | | | destroyed] | > > | | | | | > > | | | |---Confirm--->| > > | | | | | > > ``` > > > > The quorum should be collected as a table for a list of transactions > > waiting for quorum. The latest transaction that collects the quorum is > > considered as complete, as well as all transactions prior to it, since > > all transactions should be applied in order. Leader writes a 'confirm' > > message to the WAL that refers to the transaction's [LEADER_ID, LSN] and > > the confirm has its own LSN. This confirm message is delivered to all > > replicas through the existing replication mechanism. > > > > Replica should report a TXN application success to the leader via the > > IPROTO explicitly to allow leader to collect the quorum for the TXN. > > In case of application failure the replica has to disconnect from the > > replication the same way as it is done now. The replica also has to > > report its disconnection to the orchestrator. Further actions require > > human intervention, since failure means either technical problem (such > > as not enough space for WAL) that has to be resovled or an inconsistent > > state that requires rejoin. > > > As soon as leader appears in a situation it has not enough replicas > > to achieve quorum, the cluster should stop accepting any requests - both > > write and read. > > How does *the cluster* know the state of the leader and if it > doesn't, how it can possibly implement this? 
Did you mean > the leader should stop accepting transactions here? But how can > the leader know if it has not enough replicas during a read > transaction, if it doesn't contact any replica to serve a read? I expect to have a disconnection trigger assigned to all relays so that disconnection will cause the number of replicas decrease. The quorum size is static, so we can stop at the very moment the number dives below. > > > The reason for this is that replication of transactions > > can achieve quorum on replicas not visible to the leader. On the other > > hand, leader can't achieve quorum with available minority. Leader has to > > report the state and wait for human intervention. There's an option to > > ask leader to rollback to the latest transaction that has quorum: leader > > issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN > > is of the first transaction in the leader's undo log. The rollback > > message replicated to the available cluster will put it in a consistent > > state. After that configuration of the cluster can be updated to > > available quorum and leader can be switched back to write mode. > > As you should be able to conclude from restart scenario, it is > possible a replica has the record in *confirmed* state but the > leader has it in pending state. The replica will not be able to > roll back then. Do you suggest the replica should abort if it > can't rollback? This may lead to an avalanche of rejoins on leader > restart, bringing performance to a halt. No, I declare replica with biggest LSN as a new shining leader. More than that, new leader can (so far it will be by default) finalize the former leader life's work by replicating txns and appropriate confirms. Sergos. ^ permalink raw reply [flat|nested] 53+ messages in thread
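A minimal sketch of the selection rule referred to here and in the RFC's "Leader role assignment" section: among the reachable instances, pick the one with the biggest vclock component for the former leader's ID, i.e. the one that applied the most of the old leader's WAL. The structures are hypothetical; this only illustrates the comparison, not how the instances are discovered.
```
#include <stdint.h>

#define VCLOCK_MAX 32

/* A reachable cluster member as seen by whoever runs the election. */
struct member {
	int id;
	int64_t vclock[VCLOCK_MAX];   /* LSN applied per replica id */
};

/* Return the index of the member to promote, or -1 if the list is empty. */
static int
pick_new_leader(const struct member *members, int count, int former_leader_id)
{
	int best = -1;
	int64_t best_lsn = -1;
	for (int i = 0; i < count; i++) {
		if (members[i].vclock[former_leader_id] > best_lsn) {
			best_lsn = members[i].vclock[former_leader_id];
			best = i;
		}
	}
	return best;
}
```
Per the RFC, the chosen instance then re-replicates the former leader's unconfirmed entries, collects the quorum for them and writes 'confirm' messages referring to [FORMER_LEADER_ID, LSN] before it starts appending its own transactions.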
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-06 16:39 ` Sergey Ostanevich @ 2020-05-06 18:44 ` Konstantin Osipov 2020-05-12 15:55 ` Sergey Ostanevich 2020-05-13 21:36 ` Vladislav Shpilevoy 1 sibling, 1 reply; 53+ messages in thread From: Konstantin Osipov @ 2020-05-06 18:44 UTC (permalink / raw) To: Sergey Ostanevich; +Cc: tarantool-patches, Vladislav Shpilevoy * Sergey Ostanevich <sergos@tarantool.org> [20/05/06 19:41]: > > > | | | | | > > > | [Quorum | | | > > > | achieved] | | | > > > | | | | | > > > | [TXN undo log | | | > > > | destroyed] | | | > > > | | | | | > > > | |---Confirm--->| | | > > > | | | | | > > > > What happens if writing Confirm to WAL fails? TXN und log record > > is destroyed already. Will the server panic now on WAL failure, > > even if it is intermittent? > > I would like to have an example of intermittent WAL failure. Can it be > other than problem with disc - be it space/availability/malfunction? For SAN disks it can simply be a networking issue. The same is true for any virtual filesystem in the cloud. For local disks it is most often out of space, but this is not an impossible event. > For all of those it should be resolved outside the DBMS anyways. So, > leader should stop and report its problems to orchestrator/admins. Sergey, I understand that RAFT spec is big and with this spec you try to split it into manageable parts. The question is how useful is this particular piece. I'm trying to point out that "the leader should stop" is not a silver bullet - especially since each such stop may mean a rejoin of some other node. The purpose of sync replication is to provide consistency without reducing availability (i.e. make progress as long as the quorum of nodes make progress). The current spec, suggesting there should be a leader stop in case of most errors, reduces availability significantly, and doesn't make external coordinator job any easier - it still has to follow to the letter the prescriptions of RAFT. > landed to WAL - same is for replica. > > > > > > | |----------Confirm---------->| | > > > > What happens if peers receive and maybe even write Confirm to their WALs > > but local WAL write is lost after a restart? > > Did you mean WAL write on leader as a local? Then we have a replica with > a bigger LSN for the leader ID. > > WAL is not synced, > > so we can easily lose the tail of the WAL. Tarantool will sync up > > with all replicas on restart, > > But at this point a new leader will be appointed - the old one is > restarted. Then the Confirm message will arrive to the restarted leader > through a regular replication. This assumes that restart is guaranteed to be noticed by the external coordinator and there is an election on every restart. > > but there will be no "Replication > > OK" messages from them, so it wouldn't know that the transaction > > is committed on them. How is this handled? We may end up with some > > replicas confirming the transaction while the leader will roll it > > back on restart. Do you suggest there is a human intervention on > > restart as well? > > > > > > > | | | | | > > > |<---TXN Ok-----| | [TXN undo log | > > > | | | destroyed] | > > > | | | | | > > > | | | |---Confirm--->| > > > | | | | | > > > ``` > > > > > > The quorum should be collected as a table for a list of transactions > > > waiting for quorum. The latest transaction that collects the quorum is > > > considered as complete, as well as all transactions prior to it, since > > > all transactions should be applied in order. 
Leader writes a 'confirm' > > > message to the WAL that refers to the transaction's [LEADER_ID, LSN] and > > > the confirm has its own LSN. This confirm message is delivered to all > > > replicas through the existing replication mechanism. > > > > > > Replica should report a TXN application success to the leader via the > > > IPROTO explicitly to allow leader to collect the quorum for the TXN. > > > In case of application failure the replica has to disconnect from the > > > replication the same way as it is done now. The replica also has to > > > report its disconnection to the orchestrator. Further actions require > > > human intervention, since failure means either technical problem (such > > > as not enough space for WAL) that has to be resovled or an inconsistent > > > state that requires rejoin. > > > > > As soon as leader appears in a situation it has not enough replicas > > > to achieve quorum, the cluster should stop accepting any requests - both > > > write and read. > > > > How does *the cluster* know the state of the leader and if it > > doesn't, how it can possibly implement this? Did you mean > > the leader should stop accepting transactions here? But how can > > the leader know if it has not enough replicas during a read > > transaction, if it doesn't contact any replica to serve a read? > > I expect to have a disconnection trigger assigned to all relays so that > disconnection will cause the number of replicas decrease. The quorum > size is static, so we can stop at the very moment the number dives below. What happens between the event the leader is partitioned away and a new leader is elected? The leader may be unaware of the events and serve a read just fine. So at least you can't say the leader shouldn't be serving reads without quorum - because the only way to achieve it is to collect a quorum of responses to reads as well. > > > The reason for this is that replication of transactions > > > can achieve quorum on replicas not visible to the leader. On the other > > > hand, leader can't achieve quorum with available minority. Leader has to > > > report the state and wait for human intervention. There's an option to > > > ask leader to rollback to the latest transaction that has quorum: leader > > > issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN > > > is of the first transaction in the leader's undo log. The rollback > > > message replicated to the available cluster will put it in a consistent > > > state. After that configuration of the cluster can be updated to > > > available quorum and leader can be switched back to write mode. > > > > As you should be able to conclude from restart scenario, it is > > possible a replica has the record in *confirmed* state but the > > leader has it in pending state. The replica will not be able to > > roll back then. Do you suggest the replica should abort if it > > can't rollback? This may lead to an avalanche of rejoins on leader > > restart, bringing performance to a halt. > > No, I declare replica with biggest LSN as a new shining leader. More > than that, new leader can (so far it will be by default) finalize the > former leader life's work by replicating txns and appropriate confirms. Right, this also assumes the restart is noticed, so it follows the same logic. -- Konstantin Osipov, Moscow, Russia https://scylladb.com ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-06 18:44 ` Konstantin Osipov @ 2020-05-12 15:55 ` Sergey Ostanevich 2020-05-12 16:42 ` Konstantin Osipov 2020-05-13 21:39 ` Vladislav Shpilevoy 0 siblings, 2 replies; 53+ messages in thread From: Sergey Ostanevich @ 2020-05-12 15:55 UTC (permalink / raw) To: Konstantin Osipov, Vladislav Shpilevoy, tarantool-patches On 06 мая 21:44, Konstantin Osipov wrote: > * Sergey Ostanevich <sergos@tarantool.org> [20/05/06 19:41]: > > > > | | | | | > > > > | [Quorum | | | > > > > | achieved] | | | > > > > | | | | | > > > > | [TXN undo log | | | > > > > | destroyed] | | | > > > > | | | | | > > > > | |---Confirm--->| | | > > > > | | | | | > > > > > > What happens if writing Confirm to WAL fails? TXN und log record > > > is destroyed already. Will the server panic now on WAL failure, > > > even if it is intermittent? > > > > I would like to have an example of intermittent WAL failure. Can it be > > other than problem with disc - be it space/availability/malfunction? > > For SAN disks it can simply be a networking issue. The same is > true for any virtual filesystem in the cloud. For local disks it > is most often out of space, but this is not an impossible event. The SANDisk is an SSD vendor. I bet you mean NAS - network array storage, isn't it? Then I see no difference in WAL write into NAS in current schema - you will catch a timeout, WAL will report failure, replica stops. > > > For all of those it should be resolved outside the DBMS anyways. So, > > leader should stop and report its problems to orchestrator/admins. > > Sergey, I understand that RAFT spec is big and with this spec you > try to split it into manageable parts. The question is how useful > is this particular piece. I'm trying to point out that "the leader > should stop" is not a silver bullet - especially since each such > stop may mean a rejoin of some other node. The purpose of sync > replication is to provide consistency without reducing > availability (i.e. make progress as long as the quorum > of nodes make progress). I'm not sure if we're talking about the same RAFT - mine is "In Search of an Understandable Consensus Algorithm (Extended Version)" from Stanford as of May 2014. And it is 15 pages - including references, conclusions and intro. Seems not that big. Although, most of it is dedicated to the leader election itself, which we intentionally put aside from this RFC. It is written in the very beginning and I empasized this by explicit mentioning of it. > > The current spec, suggesting there should be a leader stop in case > of most errors, reduces availability significantly, and doesn't > make external coordinator job any easier - it still has to follow to > the letter the prescriptions of RAFT. So, the postponing of a commit until quorum collection is the most useful part of this RFC, also to some point I'm trying to address the WAL insconsistency. Although, it can be covered only partly: if a leader's log diverge in unconfirmed transactions only, then they can be rolled back easiy. Technically, it should be enough if leader changed for a replica from the cluster majority at the moment of failure. Otherwise it will require pre-parsing of the WAL and it can well happens that WAL is not long enough, hence ex-leader still need a complete bootstrap. > > > landed to WAL - same is for replica. 
> > > > > > > > > > | |----------Confirm---------->| | > > > > > > What happens if peers receive and maybe even write Confirm to their WALs > > > but local WAL write is lost after a restart? > > > > Did you mean WAL write on leader as a local? Then we have a replica with > > a bigger LSN for the leader ID. > > > > WAL is not synced, > > > so we can easily lose the tail of the WAL. Tarantool will sync up > > > with all replicas on restart, > > > > But at this point a new leader will be appointed - the old one is > > restarted. Then the Confirm message will arrive to the restarted leader > > through a regular replication. > > This assumes that restart is guaranteed to be noticed by the > external coordinator and there is an election on every restart. Sure yes, if it restarted - then connection lost can't be unnoticed by anyone, be it coordinator or cluster. > > > > but there will be no "Replication > > > OK" messages from them, so it wouldn't know that the transaction > > > is committed on them. How is this handled? We may end up with some > > > replicas confirming the transaction while the leader will roll it > > > back on restart. Do you suggest there is a human intervention on > > > restart as well? > > > > > > > > > > | | | | | > > > > |<---TXN Ok-----| | [TXN undo log | > > > > | | | destroyed] | > > > > | | | | | > > > > | | | |---Confirm--->| > > > > | | | | | > > > > ``` > > > > > > > > The quorum should be collected as a table for a list of transactions > > > > waiting for quorum. The latest transaction that collects the quorum is > > > > considered as complete, as well as all transactions prior to it, since > > > > all transactions should be applied in order. Leader writes a 'confirm' > > > > message to the WAL that refers to the transaction's [LEADER_ID, LSN] and > > > > the confirm has its own LSN. This confirm message is delivered to all > > > > replicas through the existing replication mechanism. > > > > > > > > Replica should report a TXN application success to the leader via the > > > > IPROTO explicitly to allow leader to collect the quorum for the TXN. > > > > In case of application failure the replica has to disconnect from the > > > > replication the same way as it is done now. The replica also has to > > > > report its disconnection to the orchestrator. Further actions require > > > > human intervention, since failure means either technical problem (such > > > > as not enough space for WAL) that has to be resovled or an inconsistent > > > > state that requires rejoin. > > > > > > > As soon as leader appears in a situation it has not enough replicas > > > > to achieve quorum, the cluster should stop accepting any requests - both > > > > write and read. > > > > > > How does *the cluster* know the state of the leader and if it > > > doesn't, how it can possibly implement this? Did you mean > > > the leader should stop accepting transactions here? But how can > > > the leader know if it has not enough replicas during a read > > > transaction, if it doesn't contact any replica to serve a read? > > > > I expect to have a disconnection trigger assigned to all relays so that > > disconnection will cause the number of replicas decrease. The quorum > > size is static, so we can stop at the very moment the number dives below. > > What happens between the event the leader is partitioned away and > a new leader is elected? > > The leader may be unaware of the events and serve a read just > fine. 
As it is stated 20 lines above: > > > > As soon as leader appears in a situation it has not enough > > > > replicas > > > > to achieve quorum, the cluster should stop accepting any > > > > requests - both > > > > write and read. So it will not serve. > > So at least you can't say the leader shouldn't be serving reads > without quorum - because the only way to achieve it is to collect > a quorum of responses to reads as well. The leader lost connection to the (N-Q)+1 repllicas out of the N in cluster with a quorum of Q == it stops serving anything. So the quorum criteria is there: no quorum - no reads. > > > > > The reason for this is that replication of transactions > > > > can achieve quorum on replicas not visible to the leader. On the other > > > > hand, leader can't achieve quorum with available minority. Leader has to > > > > report the state and wait for human intervention. There's an option to > > > > ask leader to rollback to the latest transaction that has quorum: leader > > > > issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN > > > > is of the first transaction in the leader's undo log. The rollback > > > > message replicated to the available cluster will put it in a consistent > > > > state. After that configuration of the cluster can be updated to > > > > available quorum and leader can be switched back to write mode. > > > > > > As you should be able to conclude from restart scenario, it is > > > possible a replica has the record in *confirmed* state but the > > > leader has it in pending state. The replica will not be able to > > > roll back then. Do you suggest the replica should abort if it > > > can't rollback? This may lead to an avalanche of rejoins on leader > > > restart, bringing performance to a halt. > > > > No, I declare replica with biggest LSN as a new shining leader. More > > than that, new leader can (so far it will be by default) finalize the > > former leader life's work by replicating txns and appropriate confirms. > > Right, this also assumes the restart is noticed, so it follows the > same logic. How a restart can be unnoticed, if it causes disconnection? > > -- > Konstantin Osipov, Moscow, Russia > https://scylladb.com ^ permalink raw reply [flat|nested] 53+ messages in thread
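The check described here - a disconnection trigger on every relay plus a static quorum size - could look roughly like the following sketch. The counters and callbacks are assumptions rather than an existing trigger API; it only illustrates when the leader would stop serving, under the assumption that the quorum counts the leader's own WAL write.
```
#include <stdbool.h>

struct leader {
	int quorum;              /* configured, static quorum size         */
	int connected_replicas;  /* maintained by relay connect/disconnect */
	bool serving;
};

/* The leader's own WAL write counts towards the quorum, hence the +1. */
static void
leader_check_quorum(struct leader *l)
{
	l->serving = l->connected_replicas + 1 >= l->quorum;
}

/* Hypothetical trigger fired when the relay to a replica is lost. */
static void
on_relay_disconnect(struct leader *l)
{
	l->connected_replicas--;
	leader_check_quorum(l);   /* may stop accepting requests right here */
}

/* And the symmetric trigger when a replica (re)subscribes. */
static void
on_relay_connect(struct leader *l)
{
	l->connected_replicas++;
	leader_check_quorum(l);
}
```
As the rest of the thread points out, a TCP disconnect is an asynchronous and unreliable failure signal, so there is a window between a partition and the moment this counter actually drops.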
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-12 15:55 ` Sergey Ostanevich @ 2020-05-12 16:42 ` Konstantin Osipov 2020-05-13 21:39 ` Vladislav Shpilevoy 1 sibling, 0 replies; 53+ messages in thread From: Konstantin Osipov @ 2020-05-12 16:42 UTC (permalink / raw) To: Sergey Ostanevich; +Cc: tarantool-patches, Vladislav Shpilevoy * Sergey Ostanevich <sergos@tarantool.org> [20/05/12 18:56]: > On 06 мая 21:44, Konstantin Osipov wrote: > > * Sergey Ostanevich <sergos@tarantool.org> [20/05/06 19:41]: > > > > > | | | | | > > > > > | [Quorum | | | > > > > > | achieved] | | | > > > > > | | | | | > > > > > | [TXN undo log | | | > > > > > | destroyed] | | | > > > > > | | | | | > > > > > | |---Confirm--->| | | > > > > > | | | | | > > > > > > > > What happens if writing Confirm to WAL fails? TXN und log record > > > > is destroyed already. Will the server panic now on WAL failure, > > > > even if it is intermittent? > > > > > > I would like to have an example of intermittent WAL failure. Can it be > > > other than problem with disc - be it space/availability/malfunction? > > > > For SAN disks it can simply be a networking issue. The same is > > true for any virtual filesystem in the cloud. For local disks it > > is most often out of space, but this is not an impossible event. > > The SANDisk is an SSD vendor. I bet you mean NAS - network array > storage, isn't it? Then I see no difference in WAL write into NAS in > current schema - you will catch a timeout, WAL will report failure, > replica stops. SAN stands for storage area network. There is no timeout in wal tx bus and no timeout in WAL I/O. A replica doesn't stop on an intermittent failure. Stopping a replica on an intermittent failure reduces availability of non-sync writes. It seems you have some assumptions in mind which are not in the document - e.g. that some timeouts are added. They are not in the POC either. I suppose the document is expected to explain quite accurately what has to be done, e.g. how these new timeouts work? > > > For all of those it should be resolved outside the DBMS anyways. So, > > > leader should stop and report its problems to orchestrator/admins. > > > > Sergey, I understand that RAFT spec is big and with this spec you > > try to split it into manageable parts. The question is how useful > > is this particular piece. I'm trying to point out that "the leader > > should stop" is not a silver bullet - especially since each such > > stop may mean a rejoin of some other node. The purpose of sync > > replication is to provide consistency without reducing > > availability (i.e. make progress as long as the quorum > > of nodes make progress). > > I'm not sure if we're talking about the same RAFT - mine is "In Search > of an Understandable Consensus Algorithm (Extended Version)" from > Stanford as of May 2014. And it is 15 pages - including references, > conclusions and intro. Seems not that big. > > Although, most of it is dedicated to the leader election itself, which > we intentionally put aside from this RFC. It is written in the very > beginning and I empasized this by explicit mentioning of it. I conclude that it is big from the state of this document. It provides some coverage of the normal operation. Leader election, failure detection, recovery/restart, replication configuration changes are either barely mentioned or not covered at all. I find no other reason to not cover them except to be able to come up with a MVP quicker. Do you? 
> > The current spec, suggesting there should be a leader stop in case > > of most errors, reduces availability significantly, and doesn't > > make external coordinator job any easier - it still has to follow to > > the letter the prescriptions of RAFT. > > So, the postponing of a commit until quorum collection is the most > useful part of this RFC, also to some point I'm trying to address the > WAL insconsistency. > Although, it can be covered only partly: if a > leader's log diverge in unconfirmed transactions only, then they can be > rolled back easiy. Technically, it should be enough if leader changed > for a replica from the cluster majority at the moment of failure. > Otherwise it will require pre-parsing of the WAL and it can well happens > that WAL is not long enough, hence ex-leader still need a complete > bootstrap. I don't understand what's pre-parsing and how what you write is relevant to the fact that reduced availability of non-raft writes is bad. > > > But at this point a new leader will be appointed - the old one is > > > restarted. Then the Confirm message will arrive to the restarted leader > > > through a regular replication. > > > > This assumes that restart is guaranteed to be noticed by the > > external coordinator and there is an election on every restart. > > Sure yes, if it restarted - then connection lost can't be unnoticed by > anyone, be it coordinator or cluster. Well, the spec doesn't say anywhere that the external coordinator has to establish a TCP connection to every participant. Could you please add a chapter where this is clarified? It seems you have a specific coordinator in mind ? > > > > > The quorum should be collected as a table for a list of transactions > > > > > waiting for quorum. The latest transaction that collects the quorum is > > > > > considered as complete, as well as all transactions prior to it, since > > > > > all transactions should be applied in order. Leader writes a 'confirm' > > > > > message to the WAL that refers to the transaction's [LEADER_ID, LSN] and > > > > > the confirm has its own LSN. This confirm message is delivered to all > > > > > replicas through the existing replication mechanism. > > > > > > > > > > Replica should report a TXN application success to the leader via the > > > > > IPROTO explicitly to allow leader to collect the quorum for the TXN. > > > > > In case of application failure the replica has to disconnect from the > > > > > replication the same way as it is done now. The replica also has to > > > > > report its disconnection to the orchestrator. Further actions require > > > > > human intervention, since failure means either technical problem (such > > > > > as not enough space for WAL) that has to be resovled or an inconsistent > > > > > state that requires rejoin. > > > > > > > > > As soon as leader appears in a situation it has not enough replicas > > > > > to achieve quorum, the cluster should stop accepting any requests - both > > > > > write and read. > > > > > > > > How does *the cluster* know the state of the leader and if it > > > > doesn't, how it can possibly implement this? Did you mean > > > > the leader should stop accepting transactions here? But how can > > > > the leader know if it has not enough replicas during a read > > > > transaction, if it doesn't contact any replica to serve a read? > > > > > > I expect to have a disconnection trigger assigned to all relays so that > > > disconnection will cause the number of replicas decrease. 
The quorum > > > size is static, so we can stop at the very moment the number dives below. > > > > What happens between the event the leader is partitioned away and > > a new leader is elected? > > > > The leader may be unaware of the events and serve a read just > > fine. > > As it is stated 20 lines above: > > > > > As soon as leader appears in a situation it has not enough > > > > > replicas > > > > > to achieve quorum, the cluster should stop accepting any > > > > > requests - both > > > > > write and read. > > So it will not serve. Sergey, this is recursion. I'm asking you to clarify exactly this point. Do you assume that replicas perform some kind of failure detection? What kind? Is it *in addition* to the failure detection performed by the external coordinator? Any failure detector imaginable would be asynchronous. What happens between the failure and the time it's detected? > > So at least you can't say the leader shouldn't be serving reads > > without quorum - because the only way to achieve it is to collect > > a quorum of responses to reads as well. > > The leader lost connection to the (N-Q)+1 repllicas out of the N in > cluster with a quorum of Q == it stops serving anything. So the quorum > criteria is there: no quorum - no reads. OK, so you assume that TCP connection *is* the failure detector? Failure detection in TCP is optional, asynchronous, and worst of all, unreliable. Why do think it can be used? > > > > > The reason for this is that replication of transactions > > > > > can achieve quorum on replicas not visible to the leader. On the other > > > > > hand, leader can't achieve quorum with available minority. Leader has to > > > > > report the state and wait for human intervention. There's an option to > > > > > ask leader to rollback to the latest transaction that has quorum: leader > > > > > issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN > > > > > is of the first transaction in the leader's undo log. The rollback > > > > > message replicated to the available cluster will put it in a consistent > > > > > state. After that configuration of the cluster can be updated to > > > > > available quorum and leader can be switched back to write mode. > > > > > > > > As you should be able to conclude from restart scenario, it is > > > > possible a replica has the record in *confirmed* state but the > > > > leader has it in pending state. The replica will not be able to > > > > roll back then. Do you suggest the replica should abort if it > > > > can't rollback? This may lead to an avalanche of rejoins on leader > > > > restart, bringing performance to a halt. > > > > > > No, I declare replica with biggest LSN as a new shining leader. More > > > than that, new leader can (so far it will be by default) finalize the > > > former leader life's work by replicating txns and appropriate confirms. > > > > Right, this also assumes the restart is noticed, so it follows the > > same logic. > > How a restart can be unnoticed, if it causes disconnection? Honestly, I'm baffled. It's like we speak different languages. I can't imagine you are unaware of the fallacies of distributed computing, but I see no other explanation to you question. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-12 15:55 ` Sergey Ostanevich 2020-05-12 16:42 ` Konstantin Osipov @ 2020-05-13 21:39 ` Vladislav Shpilevoy 2020-05-13 23:54 ` Konstantin Osipov 2020-05-14 20:38 ` Sergey Ostanevich 1 sibling, 2 replies; 53+ messages in thread From: Vladislav Shpilevoy @ 2020-05-13 21:39 UTC (permalink / raw) To: Sergey Ostanevich, Konstantin Osipov, tarantool-patches Hi! Thanks for the discussion! On 12/05/2020 17:55, Sergey Ostanevich wrote: > On 06 мая 21:44, Konstantin Osipov wrote: >> * Sergey Ostanevich <sergos@tarantool.org> [20/05/06 19:41]: >>>>> | | | | | >>>>> | [Quorum | | | >>>>> | achieved] | | | >>>>> | | | | | >>>>> | [TXN undo log | | | >>>>> | destroyed] | | | >>>>> | | | | | >>>>> | |---Confirm--->| | | >>>>> | | | | | >>>> >>>> What happens if writing Confirm to WAL fails? TXN und log record >>>> is destroyed already. Will the server panic now on WAL failure, >>>> even if it is intermittent? >>> >>> I would like to have an example of intermittent WAL failure. Can it be >>> other than problem with disc - be it space/availability/malfunction? >> >> For SAN disks it can simply be a networking issue. The same is >> true for any virtual filesystem in the cloud. For local disks it >> is most often out of space, but this is not an impossible event. > > The SANDisk is an SSD vendor. I bet you mean NAS - network array > storage, isn't it? Then I see no difference in WAL write into NAS in > current schema - you will catch a timeout, WAL will report failure, > replica stops. > >> >>> For all of those it should be resolved outside the DBMS anyways. So, >>> leader should stop and report its problems to orchestrator/admins. >> >> Sergey, I understand that RAFT spec is big and with this spec you >> try to split it into manageable parts. The question is how useful >> is this particular piece. I'm trying to point out that "the leader >> should stop" is not a silver bullet - especially since each such >> stop may mean a rejoin of some other node. The purpose of sync >> replication is to provide consistency without reducing >> availability (i.e. make progress as long as the quorum >> of nodes make progress). > > I'm not sure if we're talking about the same RAFT - mine is "In Search > of an Understandable Consensus Algorithm (Extended Version)" from > Stanford as of May 2014. And it is 15 pages - including references, > conclusions and intro. Seems not that big. 15 pages of tightly packed theory is a big piece of data. And especially big, when it comes to application to a real project, with existing infrastructure, and all. Just my IMHO. I remember implementing SWIM - it is smaller than RAFT. Much smaller and simpler, and yet it took year to implement it, and cover all things described in the paper. This is not as simple as it looks, when it comes to edge cases. This is why the whole sync replication frustrates me more than anything else before, and why I am so reluctant to doing anything with it. The RFC mostly covers the normal operation, here I agree with Kostja. But the normal operation is not that interesting. Failures are much more important. > Although, most of it is dedicated to the leader election itself, which > we intentionally put aside from this RFC. It is written in the very > beginning and I empasized this by explicit mentioning of it. And still there will be leader election. Even though not ours for now. 
And Tarantool should provide API and instructions so as external applications could follow them and do the election. Usually in RFCs we describe API. With arguments, behaviour, and all. >> The current spec, suggesting there should be a leader stop in case >> of most errors, reduces availability significantly, and doesn't >> make external coordinator job any easier - it still has to follow to >> the letter the prescriptions of RAFT. >> >>> >>>> >>>>> | |----------Confirm---------->| | >>>> >>>> What happens if peers receive and maybe even write Confirm to their WALs >>>> but local WAL write is lost after a restart? >>> >>> Did you mean WAL write on leader as a local? Then we have a replica with >>> a bigger LSN for the leader ID. >> >>>> WAL is not synced, >>>> so we can easily lose the tail of the WAL. Tarantool will sync up >>>> with all replicas on restart, >>> >>> But at this point a new leader will be appointed - the old one is >>> restarted. Then the Confirm message will arrive to the restarted leader >>> through a regular replication. >> >> This assumes that restart is guaranteed to be noticed by the >> external coordinator and there is an election on every restart. > > Sure yes, if it restarted - then connection lost can't be unnoticed by > anyone, be it coordinator or cluster. Here comes another problem. Disconnect and restart have nothing to do with each other. The coordinator can loose connection without the peer leader restart. Just because it is network. Anything can happen. Moreover, while the coordinator does not have a connection, the leader can restart multiple times. We can't tell the coordinator rely on connectivity as a restart signal. >>>> but there will be no "Replication >>>> OK" messages from them, so it wouldn't know that the transaction >>>> is committed on them. How is this handled? We may end up with some >>>> replicas confirming the transaction while the leader will roll it >>>> back on restart. Do you suggest there is a human intervention on >>>> restart as well? >>>> >>>> >>>>> | | | | | >>>>> |<---TXN Ok-----| | [TXN undo log | >>>>> | | | destroyed] | >>>>> | | | | | >>>>> | | | |---Confirm--->| >>>>> | | | | | >>>>> ``` >>>>> >>>>> The quorum should be collected as a table for a list of transactions >>>>> waiting for quorum. The latest transaction that collects the quorum is >>>>> considered as complete, as well as all transactions prior to it, since >>>>> all transactions should be applied in order. Leader writes a 'confirm' >>>>> message to the WAL that refers to the transaction's [LEADER_ID, LSN] and >>>>> the confirm has its own LSN. This confirm message is delivered to all >>>>> replicas through the existing replication mechanism. >>>>> >>>>> Replica should report a TXN application success to the leader via the >>>>> IPROTO explicitly to allow leader to collect the quorum for the TXN. >>>>> In case of application failure the replica has to disconnect from the >>>>> replication the same way as it is done now. The replica also has to >>>>> report its disconnection to the orchestrator. Further actions require >>>>> human intervention, since failure means either technical problem (such >>>>> as not enough space for WAL) that has to be resovled or an inconsistent >>>>> state that requires rejoin. >>>> >>>>> As soon as leader appears in a situation it has not enough replicas >>>>> to achieve quorum, the cluster should stop accepting any requests - both >>>>> write and read. 
>>>> >>>> How does *the cluster* know the state of the leader and if it >>>> doesn't, how it can possibly implement this? Did you mean >>>> the leader should stop accepting transactions here? But how can >>>> the leader know if it has not enough replicas during a read >>>> transaction, if it doesn't contact any replica to serve a read? >>> >>> I expect to have a disconnection trigger assigned to all relays so that >>> disconnection will cause the number of replicas decrease. The quorum >>> size is static, so we can stop at the very moment the number dives below. >> >> What happens between the event the leader is partitioned away and >> a new leader is elected? >> >> The leader may be unaware of the events and serve a read just >> fine. > > As it is stated 20 lines above: >>>>> As soon as leader appears in a situation it has not enough >>>>> replicas >>>>> to achieve quorum, the cluster should stop accepting any >>>>> requests - both >>>>> write and read. > > So it will not serve. This breaks compatibility, since now an orphan node is perfectly able to serve reads. The cluster can't just stop doing everything, if the quorum is lost. Stop writes - yes, since the quorum is lost anyway. But reads do not need a quorum. If you say reads need a quorum, then they would need to go through WAL, collect confirmations, and all. >> So at least you can't say the leader shouldn't be serving reads >> without quorum - because the only way to achieve it is to collect >> a quorum of responses to reads as well. > > The leader lost connection to the (N-Q)+1 repllicas out of the N in > cluster with a quorum of Q == it stops serving anything. So the quorum > criteria is there: no quorum - no reads. Connection count tells nothing. Network connectivity is not a reliable source of information. Only messages and persistent data are reliable (to certain extent). >>>>> The reason for this is that replication of transactions >>>>> can achieve quorum on replicas not visible to the leader. On the other >>>>> hand, leader can't achieve quorum with available minority. Leader has to >>>>> report the state and wait for human intervention. There's an option to >>>>> ask leader to rollback to the latest transaction that has quorum: leader >>>>> issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN >>>>> is of the first transaction in the leader's undo log. The rollback >>>>> message replicated to the available cluster will put it in a consistent >>>>> state. After that configuration of the cluster can be updated to >>>>> available quorum and leader can be switched back to write mode. >>>> >>>> As you should be able to conclude from restart scenario, it is >>>> possible a replica has the record in *confirmed* state but the >>>> leader has it in pending state. The replica will not be able to >>>> roll back then. Do you suggest the replica should abort if it >>>> can't rollback? This may lead to an avalanche of rejoins on leader >>>> restart, bringing performance to a halt. >>> >>> No, I declare replica with biggest LSN as a new shining leader. More >>> than that, new leader can (so far it will be by default) finalize the >>> former leader life's work by replicating txns and appropriate confirms. >> >> Right, this also assumes the restart is noticed, so it follows the >> same logic. > > How a restart can be unnoticed, if it causes disconnection? Disconnection has nothing to do with restart. The coordinator itself may restart. Or it may loose connection to the leader temporarily. 
Or the leader may lose it without any restarts.

^ permalink raw reply	[flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-13 21:39 ` Vladislav Shpilevoy @ 2020-05-13 23:54 ` Konstantin Osipov 2020-05-14 20:38 ` Sergey Ostanevich 1 sibling, 0 replies; 53+ messages in thread From: Konstantin Osipov @ 2020-05-13 23:54 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/05/14 00:42]: > > Sure yes, if it restarted - then connection lost can't be unnoticed by > > anyone, be it coordinator or cluster. > > Here comes another problem. Disconnect and restart have nothing to do with > each other. The coordinator can loose connection without the peer leader > restart. Just because it is network. Anything can happen. Moreover, while > the coordinator does not have a connection, the leader can restart multiple > times. yes. > We can't tell the coordinator rely on connectivity as a restart signal. Well, we could demand that the leader always demotes itself after restart. But the spec should be explicit about it and explain how the election happens in this case, because it still may have the longest WAL (but with some junk in it, thanks to lost confirms), so after restart the leader may need to reconcile its wal with the majority, fetching missing records back. Once again, RAFT is very explicit about this. By default it requires that the leader commit log is durable, i.e. wal_mode=sync. This would kill performance. Implementations exist which run in wal_mode=write (cassandra is one of them), but they know how to repair the log at the leader before proceeding with the next transaction. The reason I brought this up is that it's extremely tricky, and confusing as hell if the election is external (agree there should be an API, or better yet, abandon the idea of external election, just have no election for now at all, assume the leader never changes, and we only provide durability in multi-master config), with no consistency guarantees (but eventual one). > > How a restart can be unnoticed, if it causes disconnection? > > Disconnection has nothing to do with restart. The coordinator itself may > restart. Or it may loose connection to the leader temporarily. Or the > leader may loose it without any restarts. and yes. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-13 21:39 ` Vladislav Shpilevoy 2020-05-13 23:54 ` Konstantin Osipov @ 2020-05-14 20:38 ` Sergey Ostanevich 2020-05-20 20:59 ` Sergey Ostanevich 1 sibling, 1 reply; 53+ messages in thread From: Sergey Ostanevich @ 2020-05-14 20:38 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches Hi! > >> Sergey, I understand that RAFT spec is big and with this spec you > >> try to split it into manageable parts. The question is how useful > >> is this particular piece. I'm trying to point out that "the leader > >> should stop" is not a silver bullet - especially since each such > >> stop may mean a rejoin of some other node. The purpose of sync > >> replication is to provide consistency without reducing > >> availability (i.e. make progress as long as the quorum > >> of nodes make progress). > > > > I'm not sure if we're talking about the same RAFT - mine is "In Search > > of an Understandable Consensus Algorithm (Extended Version)" from > > Stanford as of May 2014. And it is 15 pages - including references, > > conclusions and intro. Seems not that big. > > 15 pages of tightly packed theory is a big piece of data. And especially > big, when it comes to application to a real project, with existing > infrastructure, and all. Just my IMHO. I remember implementing SWIM - it > is smaller than RAFT. Much smaller and simpler, and yet it took year to > implement it, and cover all things described in the paper. That I won't object and this was the reason not to take the RAFT as is and implement it in full for next 2-3 years. That's why we had the very first part of RFC describing what it tries to address and what's not. > > This is not as simple as it looks, when it comes to edge cases. This is > why the whole sync replication frustrates me more than anything else > before, and why I am so reluctant to doing anything with it. > > The RFC mostly covers the normal operation, here I agree with Kostja. But > the normal operation is not that interesting. Failures are much more > important. Definitely and I expect to follow with more functionality on top of it. I believe it will be easier to do if the start will be as small as possible change to the existent code base, which I also try to follow. > > > Although, most of it is dedicated to the leader election itself, which > > we intentionally put aside from this RFC. It is written in the very > > beginning and I empasized this by explicit mentioning of it. > > And still there will be leader election. Even though not ours for now. > And Tarantool should provide API and instructions so as external > applications could follow them and do the election. > > Usually in RFCs we describe API. With arguments, behaviour, and all. That is something I believe should be done after we agree on the whole idea, such as confirm entry in WAL for sync transactions that appeared there earlier. Otherwise we can get very deep into the details, spending time for API definition while the idea itself can appear wrong. I believe that was a common ground to start, but we immediately went to discussion of so many details I tried to keep away before we agree on the key parts, such as WAL consistency or quorum collection. > > >> The current spec, suggesting there should be a leader stop in case > >> of most errors, reduces availability significantly, and doesn't > >> make external coordinator job any easier - it still has to follow to > >> the letter the prescriptions of RAFT. 
> >> > >>> > >>>> > >>>>> | |----------Confirm---------->| | > >>>> > >>>> What happens if peers receive and maybe even write Confirm to their WALs > >>>> but local WAL write is lost after a restart? > >>> > >>> Did you mean WAL write on leader as a local? Then we have a replica with > >>> a bigger LSN for the leader ID. > >> > >>>> WAL is not synced, > >>>> so we can easily lose the tail of the WAL. Tarantool will sync up > >>>> with all replicas on restart, > >>> > >>> But at this point a new leader will be appointed - the old one is > >>> restarted. Then the Confirm message will arrive to the restarted leader > >>> through a regular replication. > >> > >> This assumes that restart is guaranteed to be noticed by the > >> external coordinator and there is an election on every restart. > > > > Sure yes, if it restarted - then connection lost can't be unnoticed by > > anyone, be it coordinator or cluster. > > Here comes another problem. Disconnect and restart have nothing to do with > each other. The coordinator can loose connection without the peer leader > restart. Just because it is network. Anything can happen. Moreover, while > the coordinator does not have a connection, the leader can restart multiple > times. Definitely there should be a higher level functionality to support some sort of membership protocol, such as SWIM or RAFT itself. But introduction of it should not affect the basic priciples we have to agree upon. > > We can't tell the coordinator rely on connectivity as a restart signal. > > >>>> but there will be no "Replication > >>>> OK" messages from them, so it wouldn't know that the transaction > >>>> is committed on them. How is this handled? We may end up with some > >>>> replicas confirming the transaction while the leader will roll it > >>>> back on restart. Do you suggest there is a human intervention on > >>>> restart as well? > >>>> > >>>> > >>>>> | | | | | > >>>>> |<---TXN Ok-----| | [TXN undo log | > >>>>> | | | destroyed] | > >>>>> | | | | | > >>>>> | | | |---Confirm--->| > >>>>> | | | | | > >>>>> ``` > >>>>> > >>>>> The quorum should be collected as a table for a list of transactions > >>>>> waiting for quorum. The latest transaction that collects the quorum is > >>>>> considered as complete, as well as all transactions prior to it, since > >>>>> all transactions should be applied in order. Leader writes a 'confirm' > >>>>> message to the WAL that refers to the transaction's [LEADER_ID, LSN] and > >>>>> the confirm has its own LSN. This confirm message is delivered to all > >>>>> replicas through the existing replication mechanism. > >>>>> > >>>>> Replica should report a TXN application success to the leader via the > >>>>> IPROTO explicitly to allow leader to collect the quorum for the TXN. > >>>>> In case of application failure the replica has to disconnect from the > >>>>> replication the same way as it is done now. The replica also has to > >>>>> report its disconnection to the orchestrator. Further actions require > >>>>> human intervention, since failure means either technical problem (such > >>>>> as not enough space for WAL) that has to be resovled or an inconsistent > >>>>> state that requires rejoin. > >>>> > >>>>> As soon as leader appears in a situation it has not enough replicas > >>>>> to achieve quorum, the cluster should stop accepting any requests - both > >>>>> write and read. > >>>> > >>>> How does *the cluster* know the state of the leader and if it > >>>> doesn't, how it can possibly implement this? 
Did you mean > >>>> the leader should stop accepting transactions here? But how can > >>>> the leader know if it has not enough replicas during a read > >>>> transaction, if it doesn't contact any replica to serve a read? > >>> > >>> I expect to have a disconnection trigger assigned to all relays so that > >>> disconnection will cause the number of replicas decrease. The quorum > >>> size is static, so we can stop at the very moment the number dives below. > >> > >> What happens between the event the leader is partitioned away and > >> a new leader is elected? > >> > >> The leader may be unaware of the events and serve a read just > >> fine. > > > > As it is stated 20 lines above: > >>>>> As soon as leader appears in a situation it has not enough > >>>>> replicas > >>>>> to achieve quorum, the cluster should stop accepting any > >>>>> requests - both > >>>>> write and read. > > > > So it will not serve. > > This breaks compatibility, since now an orphan node is perfectly able > to serve reads. The cluster can't just stop doing everything, if the > quorum is lost. Stop writes - yes, since the quorum is lost anyway. But > reads do not need a quorum. > > If you say reads need a quorum, then they would need to go through WAL, > collect confirmations, and all. The reads should not be inconsistent - so that cluster will keep answering A or B for the same request. And in case we lost quorum we can't say for sure that all instances will answer the same. As we discussed it before, if leader appears in minor part of the cluster it can't issue rollback for all unconfirmed txns, since the majority will re-elect leader who will collect quorum for them. Means, we will appear is a state that cluster split in two. So the minor part should stop. Am I wrong here? > > >> So at least you can't say the leader shouldn't be serving reads > >> without quorum - because the only way to achieve it is to collect > >> a quorum of responses to reads as well. > > > > The leader lost connection to the (N-Q)+1 repllicas out of the N in > > cluster with a quorum of Q == it stops serving anything. So the quorum > > criteria is there: no quorum - no reads. > > Connection count tells nothing. Network connectivity is not a reliable > source of information. Only messages and persistent data are reliable > (to certain extent). Well, persistent data can't help obtain quorum if there's no connection to the replicas who should contribute to quorum. Correct me, if I'm wrong: in case no quorum available we can't garantee that the data is stored on at least <quorum> number of servers. Means - cluster is not operable. > > >>>>> The reason for this is that replication of transactions > >>>>> can achieve quorum on replicas not visible to the leader. On the other > >>>>> hand, leader can't achieve quorum with available minority. Leader has to > >>>>> report the state and wait for human intervention. There's an option to > >>>>> ask leader to rollback to the latest transaction that has quorum: leader > >>>>> issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN > >>>>> is of the first transaction in the leader's undo log. The rollback > >>>>> message replicated to the available cluster will put it in a consistent > >>>>> state. After that configuration of the cluster can be updated to > >>>>> available quorum and leader can be switched back to write mode. 
> >>>> > >>>> As you should be able to conclude from restart scenario, it is > >>>> possible a replica has the record in *confirmed* state but the > >>>> leader has it in pending state. The replica will not be able to > >>>> roll back then. Do you suggest the replica should abort if it > >>>> can't rollback? This may lead to an avalanche of rejoins on leader > >>>> restart, bringing performance to a halt. > >>> > >>> No, I declare replica with biggest LSN as a new shining leader. More > >>> than that, new leader can (so far it will be by default) finalize the > >>> former leader life's work by replicating txns and appropriate confirms. > >> > >> Right, this also assumes the restart is noticed, so it follows the > >> same logic. > > > > How a restart can be unnoticed, if it causes disconnection? > > Disconnection has nothing to do with restart. The coordinator itself may > restart. Or it may loose connection to the leader temporarily. Or the > leader may loose it without any restarts. But how we detect it right now in Tarantool? Is there any machinery? I suppose we can simply rely on the same at least to test the minimal - and 'normally operating' - first approach to the problem. So, thank you for all comments and please, find my updated RFC below. Sergos. --- * **Status**: In progress * **Start date**: 31-03-2020 * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\> * **Issues**: https://github.com/tarantool/tarantool/issues/4842 ## Summary The aim of this RFC is to address the following list of problems formulated at MRG planning meeting: - protocol backward compatibility to enable cluster upgrade w/o downtime - consistency of data on replica and leader - switch from leader to replica without data loss - up to date replicas to run read-only requests - ability to switch async replicas into sync ones and vice versa - guarantee of rollback on leader and sync replicas - simplicity of cluster orchestration What this RFC is not: - high availability (HA) solution with automated failover, roles assignments an so on - master-master configuration support ## Background and motivation There are number of known implementation of consistent data presence in a Tarantool cluster. They can be commonly named as "wait for LSN" technique. The biggest issue with this technique is the absence of rollback guarantees at replica in case of transaction failure on one master or some of the replicas in the cluster. To provide such capabilities a new functionality should be introduced in Tarantool core, with requirements mentioned before - backward compatibility and ease of cluster orchestration. ## Detailed design ### Quorum commit The main idea behind the proposal is to reuse existent machinery as much as possible. It will ensure the well-tested and proven functionality across many instances in MRG and beyond is used. The transaction rollback mechanism is in place and works for WAL write failure. If we substitute the WAL success with a new situation which is named 'quorum' later in this document then no changes to the machinery is needed. The same is true for snapshot machinery that allows to create a copy of the database in memory for the whole period of snapshot file write. Adding quorum here also minimizes changes. 
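To make the substitution concrete, below is a minimal Lua-flavoured sketch of the idea; the real machinery would live in the C transaction engine, and the names (`pending`, `quorum_size`, `on_ack`, `commit_and_wait`) are illustrative assumptions, not an existing API:

```
local fiber = require('fiber')

-- Illustrative only: transactions waiting for a quorum, keyed by LSN.
local pending = {}
-- Assumed quorum size, counting the leader's own WAL write as one ack.
local quorum_size = 2

-- Called once the leader's own WAL write succeeds and on every replica ack.
local function on_ack(lsn)
    local txn = pending[lsn]
    if txn == nil then
        return
    end
    txn.acks = txn.acks + 1
    if txn.acks >= quorum_size then
        -- The 'quorum' situation replaces the former "WAL ok" condition:
        -- only now the undo log may be destroyed and the client answered.
        pending[lsn] = nil
        txn.done:broadcast()
    end
end

-- Commit path: register the transaction, ack the local WAL write, then
-- block the client fiber until the quorum is collected or timeout hits.
local function commit_and_wait(lsn, timeout)
    local txn = { acks = 0, done = fiber.cond() }
    pending[lsn] = txn
    on_ack(lsn)          -- the local WAL write counted as the first ack
    if pending[lsn] == nil then
        return true      -- quorum of 1: already complete
    end
    return txn.done:wait(timeout)
end
```

With the WAL-success event rerouted through such a wait, the rest of the commit and rollback machinery would stay untouched - which is exactly the point of reusing the existing machinery.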
Currently replication represented by the following scheme: ``` Customer Leader WAL(L) Replica WAL(R) |------TXN----->| | | | | | | | | | [TXN undo log | | | | created] | | | | | | | | | |-----TXN----->| | | | | | | | | |<---WAL Ok----| | | | | | | | | [TXN undo log | | | | destroyed] | | | | | | | | |<----TXN Ok----| | | | | |-------Replicate TXN------->| | | | | | | | | | [TXN undo log | | | | created] | | | | | | | | | |-----TXN----->| | | | | | | | | |<---WAL Ok----| | | | | | | | | [TXN undo log | | | | destroyed] | | | | | | ``` To introduce the 'quorum' we have to receive confirmation from replicas to make a decision on whether the quorum is actually present. Leader collects necessary amount of replicas confirmation plus its own WAL success. This state is named 'quorum' and gives leader the right to complete the customers' request. So the picture will change to: ``` Customer Leader WAL(L) Replica WAL(R) |------TXN----->| | | | | | | | | | [TXN undo log | | | | created] | | | | | | | | | |-----TXN----->| | | | | | | | | |-------Replicate TXN------->| | | | | | | | | | [TXN undo log | | |<---WAL Ok----| created] | | | | | | | [Waiting | |-----TXN----->| | of a quorum] | | | | | | |<---WAL Ok----| | | | | | | |<------Replication Ok-------| | | | | | | | [Quorum | | | | achieved] | | | | | | | | | |---Confirm--->| | | | | | | | | |----------Confirm---------->| | | | | | | |<---TXN Ok-----| | |---Confirm--->| | | | | | | [TXN undo log | [TXN undo log | | destroyed] | destroyed] | | | | | | ``` The quorum should be collected as a table for a list of transactions waiting for quorum. The latest transaction that collects the quorum is considered as complete, as well as all transactions prior to it, since all transactions should be applied in order. Leader writes a 'confirm' message to the WAL that refers to the transaction's [LEADER_ID, LSN] and the confirm has its own LSN. This confirm message is delivered to all replicas through the existing replication mechanism. Replica should report a TXN application success to the leader via the IPROTO explicitly to allow leader to collect the quorum for the TXN. In case of application failure the replica has to disconnect from the replication the same way as it is done now. The replica also has to report its disconnection to the orchestrator. Further actions require human intervention, since failure means either technical problem (such as not enough space for WAL) that has to be resolved or an inconsistent state that requires rejoin. As soon as leader appears in a situation it has not enough replicas to achieve quorum, the cluster should stop accepting any requests - both write and read. The reason for this is that replication of transactions can achieve quorum on replicas not visible to the leader. On the other hand, leader can't achieve quorum with available minority. Leader has to report the state and wait for human intervention. There's an option to ask leader to rollback to the latest transaction that has quorum: leader issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN is of the first transaction in the leader's undo log. The rollback message replicated to the available cluster will put it in a consistent state. After that configuration of the cluster can be updated to available quorum and leader can be switched back to write mode. ### Leader role assignment. To assign a leader role to an instance the following should be performed: 1. 
among all available instances pick the one that has the biggest vclock element of the former leader ID; an arbitrary istance can be selected in case it is first time the leader is assigned 2. the leader should assure that number of available instances in the cluster is enough to achieve the quorum and proceed to step 3, otherwise the leader should report the situation of incomplete quorum, as in the last paragraph of previous section 3. the selected instance has to take the responsibility to replicate former leader entries from its WAL, obtainig quorum and commit confirm messages referring to [FORMER_LEADER_ID, LSN] in its WAL, replicating to the cluster, after that it can start adding its own entries into the WAL ### Recovery and failover. Tarantool instance during reading WAL should postpone the undo log deletion until the 'confirm' is read. In case the WAL eof is achieved, the instance should keep undo log for all transactions that are waiting for a confirm entry until the role of the instance is set. If this instance will be assigned a leader role then all transactions that have no corresponding confirm message should be confirmed (see the leader role assignment). In case there's not enough replicas to set up a quorum the cluster can be switched into a read-only mode. Note, this can't be done by default since some of transactions can have confirmed state. It is up to human intervention to force rollback of all transactions that have no confirm and to put the cluster into a consistent state. In case the instance will be assigned a replica role, it may appear in a state that it has conflicting WAL entries, in case it recovered from a leader role and some of transactions didn't replicated to the current leader. This situation should be resolved through rejoin of the instance. Consider an example below. Originally instance with ID1 was assigned a Leader role and the cluster had 2 replicas with quorum set to 2. ``` +---------------------+---------------------+---------------------+ | ID1 | ID2 | ID3 | | Leader | Replica 1 | Replica 2 | +---------------------+---------------------+---------------------+ | ID1 Tx1 | ID1 Tx1 | ID1 Tx1 | +---------------------+---------------------+---------------------+ | ID1 Tx2 | ID1 Tx2 | ID1 Tx2 | +---------------------+---------------------+---------------------+ | ID1 Tx3 | ID1 Tx3 | ID1 Tx3 | +---------------------+---------------------+---------------------+ | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | | +---------------------+---------------------+---------------------+ | ID1 Tx4 | ID1 Tx4 | | +---------------------+---------------------+---------------------+ | ID1 Tx5 | ID1 Tx5 | | +---------------------+---------------------+---------------------+ | ID1 Conf [ID1, Tx2] | | | +---------------------+---------------------+---------------------+ | Tx6 | | | +---------------------+---------------------+---------------------+ | Tx7 | | | +---------------------+---------------------+---------------------+ ``` Suppose at this moment the ID1 instance crashes. Then the ID2 instance should be assigned a leader role since its ID1 LSN is the biggest. Then this new leader will deliver its WAL to all replicas. As soon as quorum for Tx4 and Tx5 will be obtained, it should write the corresponding Confirms to its WAL. Note that Tx are still uses ID1. 
``` +---------------------+---------------------+---------------------+ | ID1 | ID2 | ID3 | | (dead) | Leader | Replica 2 | +---------------------+---------------------+---------------------+ | ID1 Tx1 | ID1 Tx1 | ID1 Tx1 | +---------------------+---------------------+---------------------+ | ID1 Tx2 | ID1 Tx2 | ID1 Tx2 | +---------------------+---------------------+---------------------+ | ID1 Tx3 | ID1 Tx3 | ID1 Tx3 | +---------------------+---------------------+---------------------+ | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | +---------------------+---------------------+---------------------+ | ID1 Tx4 | ID1 Tx4 | ID1 Tx4 | +---------------------+---------------------+---------------------+ | ID1 Tx5 | ID1 Tx5 | ID1 Tx5 | +---------------------+---------------------+---------------------+ | ID1 Conf [ID1, Tx2] | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] | +---------------------+---------------------+---------------------+ | ID1 Tx6 | | | +---------------------+---------------------+---------------------+ | ID1 Tx7 | | | +---------------------+---------------------+---------------------+ ``` After rejoining ID1 will figure out the inconsistency of its WAL: the last WAL entry it has is corresponding to Tx7, while in Leader's log the last entry with ID1 is Tx5. In case the ID1's WAL contains corresponding entry then Replica 1 can stop reading WAL as soon as it hits the vclock[ID1] obtained from the current Leader. It will put the ID1 into a consistent state and it can obtain latest data via replication. The WAL should be rotated after a snapshot creation. The old WAL should be renamed so it will not be reused in the future and kept for postmortem. ``` +---------------------+---------------------+---------------------+ | ID1 | ID2 | ID3 | | Replica 1 | Leader | Replica 2 | +---------------------+---------------------+---------------------+ | ID1 Tx1 | ID1 Tx1 | ID1 Tx1 | +---------------------+---------------------+---------------------+ | ID1 Tx2 | ID1 Tx2 | ID1 Tx2 | +---------------------+---------------------+---------------------+ | ID1 Tx3 | ID1 Tx3 | ID1 Tx3 | +---------------------+---------------------+---------------------+ | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | +---------------------+---------------------+---------------------+ | ID1 Tx4 | ID1 Tx4 | ID1 Tx4 | +---------------------+---------------------+---------------------+ | ID1 Tx5 | ID1 Tx5 | ID1 Tx5 | +---------------------+---------------------+---------------------+ | | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] | +---------------------+---------------------+---------------------+ | | ID2 Tx1 | ID2 Tx1 | +---------------------+---------------------+---------------------+ | | ID2 Tx2 | ID2 Tx2 | +---------------------+---------------------+---------------------+ ``` Although, there could be a situation that ID1's WAL begins with an LSN after the biggest available in the Leader's WAL. Either, for vinyl part of WAL can be referenced in .run files, hence can't be evicted by a simple WAL ignore. In such a case the ID1 needs a complete rejoin. ### Snapshot generation. We also can reuse current machinery of snapshot generation. Upon receiving a request to create a snapshot an instance should request a readview for the current commit operation. Although start of the snapshot generation should be postponed until this commit operation receives its confirmation. 
In case the operation is rolled back, the snapshot generation should be aborted and restarted using the current transaction after the rollback is complete. After the snapshot is created, the WAL should start from the first operation that follows the commit operation the snapshot is generated for. That means the WAL will contain 'confirm' messages that refer to transactions that are not present in the WAL. Apparently, we have to allow this for the case a 'confirm' refers to a transaction with an LSN less than the first entry in the WAL.

In case the master appears unavailable, a replica still has to be able to create a snapshot. The replica can perform a rollback for all transactions that are not confirmed and claim its LSN as the latest confirmed txn. Then it can create a snapshot in a regular way and start with a blank xlog file. All rolled back transactions will reappear through the regular replication in case the master comes back later on.

### Asynchronous replication.

Along with synchronous replicas the cluster can contain asynchronous replicas. That means an async replica doesn't reply to the leader with errors, since it is not contributing to the quorum. Still, async replicas have to follow the new WAL operations, such as keeping the rollback info until a 'confirm' message is received. This is essential for the case a 'rollback' message appears in the WAL: this message assumes the replica is able to perform all the necessary rollback by itself. Cluster information should contain an explicit notification of each replica's operation mode.

### Synchronous replication enabling.

Synchronous operation can be required for a set of spaces in the data scheme. That means only transactions that contain data modifications for these spaces should require a quorum. Such transactions are named synchronous. As soon as the last operation of a synchronous transaction appears in the leader's WAL, it will cause all following transactions - no matter whether they are synchronous or not - to wait for the quorum. In case the quorum is not achieved, the 'rollback' operation will cause a rollback of all transactions after the synchronous one. It will ensure the consistent state of the data both on the leader and the replicas. In case the user doesn't require synchronous operation for any space, then no changes to WAL generation and replication will appear.

The cluster description should contain an explicit attribute for each replica to denote whether it participates in synchronous activities. Also the description should contain a criterion on how many replica responses are needed to achieve the quorum.

## Rationale and alternatives

There is an implementation of synchronous replication as part of gh-980 activities, still it is not in a state to get into the product. More than that, it intentionally breaks backward compatibility, which is a prerequisite for this proposal.

^ permalink raw reply	[flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-14 20:38 ` Sergey Ostanevich @ 2020-05-20 20:59 ` Sergey Ostanevich 2020-05-25 23:41 ` Vladislav Shpilevoy 0 siblings, 1 reply; 53+ messages in thread From: Sergey Ostanevich @ 2020-05-20 20:59 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches Hi! I've updated part of recovery and leader election. The latest version is at the bottom. Thanks, Sergos On 14 мая 23:38, Sergey Ostanevich wrote: > Hi! > > > >> Sergey, I understand that RAFT spec is big and with this spec you > > >> try to split it into manageable parts. The question is how useful > > >> is this particular piece. I'm trying to point out that "the leader > > >> should stop" is not a silver bullet - especially since each such > > >> stop may mean a rejoin of some other node. The purpose of sync > > >> replication is to provide consistency without reducing > > >> availability (i.e. make progress as long as the quorum > > >> of nodes make progress). > > > > > > I'm not sure if we're talking about the same RAFT - mine is "In Search > > > of an Understandable Consensus Algorithm (Extended Version)" from > > > Stanford as of May 2014. And it is 15 pages - including references, > > > conclusions and intro. Seems not that big. > > > > 15 pages of tightly packed theory is a big piece of data. And especially > > big, when it comes to application to a real project, with existing > > infrastructure, and all. Just my IMHO. I remember implementing SWIM - it > > is smaller than RAFT. Much smaller and simpler, and yet it took year to > > implement it, and cover all things described in the paper. > > That I won't object and this was the reason not to take the RAFT as is > and implement it in full for next 2-3 years. That's why we had the very > first part of RFC describing what it tries to address and what's not. > > > > > This is not as simple as it looks, when it comes to edge cases. This is > > why the whole sync replication frustrates me more than anything else > > before, and why I am so reluctant to doing anything with it. > > > > The RFC mostly covers the normal operation, here I agree with Kostja. But > > the normal operation is not that interesting. Failures are much more > > important. > > Definitely and I expect to follow with more functionality on top of it. > I believe it will be easier to do if the start will be as small as > possible change to the existent code base, which I also try to follow. > > > > > > Although, most of it is dedicated to the leader election itself, which > > > we intentionally put aside from this RFC. It is written in the very > > > beginning and I empasized this by explicit mentioning of it. > > > > And still there will be leader election. Even though not ours for now. > > And Tarantool should provide API and instructions so as external > > applications could follow them and do the election. > > > > Usually in RFCs we describe API. With arguments, behaviour, and all. > > That is something I believe should be done after we agree on the whole > idea, such as confirm entry in WAL for sync transactions that appeared > there earlier. Otherwise we can get very deep into the details, spending > time for API definition while the idea itself can appear wrong. > > I believe that was a common ground to start, but we immediately went to > discussion of so many details I tried to keep away before we agree on the > key parts, such as WAL consistency or quorum collection. 
> > > > > >> The current spec, suggesting there should be a leader stop in case > > >> of most errors, reduces availability significantly, and doesn't > > >> make external coordinator job any easier - it still has to follow to > > >> the letter the prescriptions of RAFT. > > >> > > >>> > > >>>> > > >>>>> | |----------Confirm---------->| | > > >>>> > > >>>> What happens if peers receive and maybe even write Confirm to their WALs > > >>>> but local WAL write is lost after a restart? > > >>> > > >>> Did you mean WAL write on leader as a local? Then we have a replica with > > >>> a bigger LSN for the leader ID. > > >> > > >>>> WAL is not synced, > > >>>> so we can easily lose the tail of the WAL. Tarantool will sync up > > >>>> with all replicas on restart, > > >>> > > >>> But at this point a new leader will be appointed - the old one is > > >>> restarted. Then the Confirm message will arrive to the restarted leader > > >>> through a regular replication. > > >> > > >> This assumes that restart is guaranteed to be noticed by the > > >> external coordinator and there is an election on every restart. > > > > > > Sure yes, if it restarted - then connection lost can't be unnoticed by > > > anyone, be it coordinator or cluster. > > > > Here comes another problem. Disconnect and restart have nothing to do with > > each other. The coordinator can loose connection without the peer leader > > restart. Just because it is network. Anything can happen. Moreover, while > > the coordinator does not have a connection, the leader can restart multiple > > times. > > Definitely there should be a higher level functionality to support some > sort of membership protocol, such as SWIM or RAFT itself. But > introduction of it should not affect the basic priciples we have to > agree upon. > > > > > We can't tell the coordinator rely on connectivity as a restart signal. > > > > >>>> but there will be no "Replication > > >>>> OK" messages from them, so it wouldn't know that the transaction > > >>>> is committed on them. How is this handled? We may end up with some > > >>>> replicas confirming the transaction while the leader will roll it > > >>>> back on restart. Do you suggest there is a human intervention on > > >>>> restart as well? > > >>>> > > >>>> > > >>>>> | | | | | > > >>>>> |<---TXN Ok-----| | [TXN undo log | > > >>>>> | | | destroyed] | > > >>>>> | | | | | > > >>>>> | | | |---Confirm--->| > > >>>>> | | | | | > > >>>>> ``` > > >>>>> > > >>>>> The quorum should be collected as a table for a list of transactions > > >>>>> waiting for quorum. The latest transaction that collects the quorum is > > >>>>> considered as complete, as well as all transactions prior to it, since > > >>>>> all transactions should be applied in order. Leader writes a 'confirm' > > >>>>> message to the WAL that refers to the transaction's [LEADER_ID, LSN] and > > >>>>> the confirm has its own LSN. This confirm message is delivered to all > > >>>>> replicas through the existing replication mechanism. > > >>>>> > > >>>>> Replica should report a TXN application success to the leader via the > > >>>>> IPROTO explicitly to allow leader to collect the quorum for the TXN. > > >>>>> In case of application failure the replica has to disconnect from the > > >>>>> replication the same way as it is done now. The replica also has to > > >>>>> report its disconnection to the orchestrator. 
Further actions require > > >>>>> human intervention, since failure means either technical problem (such > > >>>>> as not enough space for WAL) that has to be resovled or an inconsistent > > >>>>> state that requires rejoin. > > >>>> > > >>>>> As soon as leader appears in a situation it has not enough replicas > > >>>>> to achieve quorum, the cluster should stop accepting any requests - both > > >>>>> write and read. > > >>>> > > >>>> How does *the cluster* know the state of the leader and if it > > >>>> doesn't, how it can possibly implement this? Did you mean > > >>>> the leader should stop accepting transactions here? But how can > > >>>> the leader know if it has not enough replicas during a read > > >>>> transaction, if it doesn't contact any replica to serve a read? > > >>> > > >>> I expect to have a disconnection trigger assigned to all relays so that > > >>> disconnection will cause the number of replicas decrease. The quorum > > >>> size is static, so we can stop at the very moment the number dives below. > > >> > > >> What happens between the event the leader is partitioned away and > > >> a new leader is elected? > > >> > > >> The leader may be unaware of the events and serve a read just > > >> fine. > > > > > > As it is stated 20 lines above: > > >>>>> As soon as leader appears in a situation it has not enough > > >>>>> replicas > > >>>>> to achieve quorum, the cluster should stop accepting any > > >>>>> requests - both > > >>>>> write and read. > > > > > > So it will not serve. > > > > This breaks compatibility, since now an orphan node is perfectly able > > to serve reads. The cluster can't just stop doing everything, if the > > quorum is lost. Stop writes - yes, since the quorum is lost anyway. But > > reads do not need a quorum. > > > > If you say reads need a quorum, then they would need to go through WAL, > > collect confirmations, and all. > > The reads should not be inconsistent - so that cluster will keep > answering A or B for the same request. And in case we lost quorum we > can't say for sure that all instances will answer the same. > > As we discussed it before, if leader appears in minor part of the > cluster it can't issue rollback for all unconfirmed txns, since the > majority will re-elect leader who will collect quorum for them. Means, > we will appear is a state that cluster split in two. So the minor part > should stop. Am I wrong here? > > > > > >> So at least you can't say the leader shouldn't be serving reads > > >> without quorum - because the only way to achieve it is to collect > > >> a quorum of responses to reads as well. > > > > > > The leader lost connection to the (N-Q)+1 repllicas out of the N in > > > cluster with a quorum of Q == it stops serving anything. So the quorum > > > criteria is there: no quorum - no reads. > > > > Connection count tells nothing. Network connectivity is not a reliable > > source of information. Only messages and persistent data are reliable > > (to certain extent). > > Well, persistent data can't help obtain quorum if there's no connection > to the replicas who should contribute to quorum. > Correct me, if I'm wrong: in case no quorum available we can't garantee > that the data is stored on at least <quorum> number of servers. Means - > cluster is not operable. > > > > > >>>>> The reason for this is that replication of transactions > > >>>>> can achieve quorum on replicas not visible to the leader. On the other > > >>>>> hand, leader can't achieve quorum with available minority. 
Leader has to > > >>>>> report the state and wait for human intervention. There's an option to > > >>>>> ask leader to rollback to the latest transaction that has quorum: leader > > >>>>> issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN > > >>>>> is of the first transaction in the leader's undo log. The rollback > > >>>>> message replicated to the available cluster will put it in a consistent > > >>>>> state. After that configuration of the cluster can be updated to > > >>>>> available quorum and leader can be switched back to write mode. > > >>>> > > >>>> As you should be able to conclude from restart scenario, it is > > >>>> possible a replica has the record in *confirmed* state but the > > >>>> leader has it in pending state. The replica will not be able to > > >>>> roll back then. Do you suggest the replica should abort if it > > >>>> can't rollback? This may lead to an avalanche of rejoins on leader > > >>>> restart, bringing performance to a halt. > > >>> > > >>> No, I declare replica with biggest LSN as a new shining leader. More > > >>> than that, new leader can (so far it will be by default) finalize the > > >>> former leader life's work by replicating txns and appropriate confirms. > > >> > > >> Right, this also assumes the restart is noticed, so it follows the > > >> same logic. > > > > > > How a restart can be unnoticed, if it causes disconnection? > > > > Disconnection has nothing to do with restart. The coordinator itself may > > restart. Or it may loose connection to the leader temporarily. Or the > > leader may loose it without any restarts. > > But how we detect it right now in Tarantool? Is there any machinery? > I suppose we can simply rely on the same at least to test the minimal - > and 'normally operating' - first approach to the problem. > > > So, thank you for all comments and please, find my updated RFC below. > > Sergos. > > --- * **Status**: In progress * **Start date**: 31-03-2020 * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\> * **Issues**: https://github.com/tarantool/tarantool/issues/4842 ## Summary The aim of this RFC is to address the following list of problems formulated at MRG planning meeting: - protocol backward compatibility to enable cluster upgrade w/o downtime - consistency of data on replica and leader - switch from leader to replica without data loss - up to date replicas to run read-only requests - ability to switch async replicas into sync ones and vice versa - guarantee of rollback on leader and sync replicas - simplicity of cluster orchestration What this RFC is not: - high availability (HA) solution with automated failover, roles assignments an so on - master-master configuration support ## Background and motivation There are number of known implementation of consistent data presence in a Tarantool cluster. They can be commonly named as "wait for LSN" technique. The biggest issue with this technique is the absence of rollback guarantees at replica in case of transaction failure on one master or some of the replicas in the cluster. To provide such capabilities a new functionality should be introduced in Tarantool core, with requirements mentioned before - backward compatibility and ease of cluster orchestration. The cluster operation is expected to be in a full-mesh topology, although the process of automated topology support is beyond this RFC. ## Detailed design ### Quorum commit The main idea behind the proposal is to reuse existent machinery as much as possible. 
It will ensure the well-tested and proven functionality across many instances in MRG and beyond is used. The transaction rollback mechanism is in place and works for WAL write failure. If we substitute the WAL success with a new situation which is named 'quorum' later in this document then no changes to the machinery is needed. The same is true for snapshot machinery that allows to create a copy of the database in memory for the whole period of snapshot file write. Adding quorum here also minimizes changes. Currently replication represented by the following scheme: ``` Customer Leader WAL(L) Replica WAL(R) |------TXN----->| | | | | | | | | | [TXN undo log | | | | created] | | | | | | | | | |-----TXN----->| | | | | | | | | |<---WAL Ok----| | | | | | | | | [TXN undo log | | | | destroyed] | | | | | | | | |<----TXN Ok----| | | | | |-------Replicate TXN------->| | | | | | | | | | [TXN undo log | | | | created] | | | | | | | | | |-----TXN----->| | | | | | | | | |<---WAL Ok----| | | | | | | | | [TXN undo log | | | | destroyed] | | | | | | ``` To introduce the 'quorum' we have to receive confirmation from replicas to make a decision on whether the quorum is actually present. Leader collects necessary amount of replicas confirmation plus its own WAL success. This state is named 'quorum' and gives leader the right to complete the customers' request. So the picture will change to: ``` Customer Leader WAL(L) Replica WAL(R) |------TXN----->| | | | | | | | | | [TXN undo log | | | | created] | | | | | | | | | |-----TXN----->| | | | | | | | | |-------Replicate TXN------->| | | | | | | | | | [TXN undo log | | |<---WAL Ok----| created] | | | | | | | [Waiting | |-----TXN----->| | of a quorum] | | | | | | |<---WAL Ok----| | | | | | | |<------Replication Ok-------| | | | | | | | [Quorum | | | | achieved] | | | | | | | | | |---Confirm--->| | | | | | | | | |----------Confirm---------->| | | | | | | |<---TXN Ok-----| | |---Confirm--->| | | | | | | [TXN undo log | [TXN undo log | | destroyed] | destroyed] | | | | | | ``` The quorum should be collected as a table for a list of transactions waiting for quorum. The latest transaction that collects the quorum is considered as complete, as well as all transactions prior to it, since all transactions should be applied in order. Leader writes a 'confirm' message to the WAL that refers to the transaction's [LEADER_ID, LSN] and the confirm has its own LSN. This confirm message is delivered to all replicas through the existing replication mechanism. Replica should report a TXN application success to the leader via the IPROTO explicitly to allow leader to collect the quorum for the TXN. In case of application failure the replica has to disconnect from the replication the same way as it is done now. The replica also has to report its disconnection to the orchestrator. Further actions require human intervention, since failure means either technical problem (such as not enough space for WAL) that has to be resolved or an inconsistent state that requires rejoin. As soon as leader appears in a situation it has not enough replicas to achieve quorum, the cluster should stop accepting any requests - both write and read. The reason for this is that replication of transactions can achieve quorum on replicas not visible to the leader. On the other hand, leader can't achieve quorum with available minority. Leader has to report the state and wait for human intervention. 
There's an option to ask the leader to roll back to the latest transaction that has a quorum: the leader issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN is of the first transaction in the leader's undo log. The rollback message replicated to the available cluster will put it into a consistent state. After that, the configuration of the cluster can be updated to the available quorum and the leader can be switched back to write mode.

### Leader role assignment.

Be it a user-initiated assignment or an algorithmic one, it should use a common interface to assign the leader role. For now we implement a simplified machinery, still it should be feasible in the future to fit such algorithms as RAFT or the previously proposed box.ctl.promote.

A system space \_voting can be used to replicate the voting among the cluster; this space should be writable even for a read-only instance. This space should contain a CURRENT_LEADER_ID at any time - meaning the current leader - and it can be a zero value at the start. This is needed to compare the appropriate vclock component below. All replicas should be subscribed to changes in the space and react as described below.

promote(ID) - should be called from a replica with its own ID. It writes an entry to the voting space stating that this ID is waiting for votes from the cluster. The entry should also contain the current vclock[CURRENT_LEADER_ID] of the nominee. Upon changes in the space each replica should compare its appropriate vclock component with the submitted one and append its vote to the space: AYE in case the nominee's vclock is greater than or equal to the replica's own, NAY otherwise. As soon as the nominee collects the quorum for being elected, it claims itself the Leader by switching into rw mode, writes the CURRENT_LEADER_ID as the FORMER_LEADER_ID in the \_voting space and puts its own ID as the CURRENT_LEADER_ID. In case a NAY appears in \_voting or a timeout predefined in box.cfg is reached, the nominee should remove its entry from the space.

The leader should assure that the number of available instances in the cluster is enough to achieve the quorum before taking over; otherwise the leader should report the situation of incomplete quorum, as described in the last paragraph of the previous section.

The new Leader has to take the responsibility to replicate the former Leader's entries from its WAL, obtain a quorum and commit confirm messages referring to [FORMER_LEADER_ID, LSN] in its WAL, replicating them to the cluster; after that it can start adding its own entries into the WAL.

demote(ID) - should be called from the Leader instance. The Leader has to switch into ro mode and wait until its undo log is empty. This effectively means all transactions are committed in the cluster and it is safe to pass the leadership. Then it should write the CURRENT_LEADER_ID as the FORMER_LEADER_ID and set the CURRENT_LEADER_ID to 0.

### Recovery and failover.

A Tarantool instance, while reading the WAL, should postpone the undo log deletion until the 'confirm' is read. In case WAL EOF is reached, the instance should keep the undo log for all transactions that are waiting for a confirm entry until the role of the instance is set. If this instance is assigned a leader role, then all transactions that have no corresponding confirm message should be confirmed (see the leader role assignment).

In case there are not enough replicas to set up a quorum, the cluster can be switched into a read-only mode. Note, this can't be done by default since some of the transactions can be in a confirmed state.
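For illustration only, an operator-side sketch of such a switch; it assumes the operator knows the configured quorum size (`QUORUM` below is an assumption, not a server option defined by this RFC):

```
-- Operator-side sketch, not part of the proposed server logic.
local QUORUM = 2  -- assumed: quorum size known to the operator

local function acked_instances()
    local count = 1  -- the leader's own WAL counts towards the quorum
    for _, r in pairs(box.info.replication) do
        -- Count peers the leader's relay is actively feeding.
        if r.downstream ~= nil and r.downstream.status == 'follow' then
            count = count + 1
        end
    end
    return count
end

if acked_instances() < QUORUM then
    -- Not enough replicas reachable to ever collect the quorum:
    -- stop accepting writes until the situation is resolved.
    box.cfg{ read_only = true }
end
```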
It is up to human intervention to force rollback of all transactions that have no confirm and to put the cluster into a consistent state. In case the instance will be assigned a replica role, it may appear in a state that it has conflicting WAL entries, in case it recovered from a leader role and some of transactions didn't replicated to the current leader. This situation should be resolved through rejoin of the instance. Consider an example below. Originally instance with ID1 was assigned a Leader role and the cluster had 2 replicas with quorum set to 2. ``` +---------------------+---------------------+---------------------+ | ID1 | ID2 | ID3 | | Leader | Replica 1 | Replica 2 | +---------------------+---------------------+---------------------+ | ID1 Tx1 | ID1 Tx1 | ID1 Tx1 | +---------------------+---------------------+---------------------+ | ID1 Tx2 | ID1 Tx2 | ID1 Tx2 | +---------------------+---------------------+---------------------+ | ID1 Tx3 | ID1 Tx3 | ID1 Tx3 | +---------------------+---------------------+---------------------+ | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | | +---------------------+---------------------+---------------------+ | ID1 Tx4 | ID1 Tx4 | | +---------------------+---------------------+---------------------+ | ID1 Tx5 | ID1 Tx5 | | +---------------------+---------------------+---------------------+ | ID1 Conf [ID1, Tx2] | | | +---------------------+---------------------+---------------------+ | Tx6 | | | +---------------------+---------------------+---------------------+ | Tx7 | | | +---------------------+---------------------+---------------------+ ``` Suppose at this moment the ID1 instance crashes. Then the ID2 instance should be assigned a leader role since its ID1 LSN is the biggest. Then this new leader will deliver its WAL to all replicas. As soon as quorum for Tx4 and Tx5 will be obtained, it should write the corresponding Confirms to its WAL. Note that Tx are still uses ID1. ``` +---------------------+---------------------+---------------------+ | ID1 | ID2 | ID3 | | (dead) | Leader | Replica 2 | +---------------------+---------------------+---------------------+ | ID1 Tx1 | ID1 Tx1 | ID1 Tx1 | +---------------------+---------------------+---------------------+ | ID1 Tx2 | ID1 Tx2 | ID1 Tx2 | +---------------------+---------------------+---------------------+ | ID1 Tx3 | ID1 Tx3 | ID1 Tx3 | +---------------------+---------------------+---------------------+ | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | +---------------------+---------------------+---------------------+ | ID1 Tx4 | ID1 Tx4 | ID1 Tx4 | +---------------------+---------------------+---------------------+ | ID1 Tx5 | ID1 Tx5 | ID1 Tx5 | +---------------------+---------------------+---------------------+ | ID1 Conf [ID1, Tx2] | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] | +---------------------+---------------------+---------------------+ | ID1 Tx6 | | | +---------------------+---------------------+---------------------+ | ID1 Tx7 | | | +---------------------+---------------------+---------------------+ ``` After rejoining ID1 will figure out the inconsistency of its WAL: the last WAL entry it has is corresponding to Tx7, while in Leader's log the last entry with ID1 is Tx5. Confirm for a Tx can only be issued after appearance of the Tx on the majoirty of replicas, hence there's a good chances that ID1 will have inconsistency in its WAL covered with undo log. 
So, by rolling back all excessive Txs (in the example they are Tx6 and Tx7) the ID1 can put its memtx and vynil in consistent state. At this point a snapshot can be created at ID1 with appropriate WAL rotation. The old WAL should be renamed so it will not be reused in the future and can be kept for postmortem. ``` +---------------------+---------------------+---------------------+ | ID1 | ID2 | ID3 | | Replica 1 | Leader | Replica 2 | +---------------------+---------------------+---------------------+ | ID1 Tx1 | ID1 Tx1 | ID1 Tx1 | +---------------------+---------------------+---------------------+ | ID1 Tx2 | ID1 Tx2 | ID1 Tx2 | +---------------------+---------------------+---------------------+ | ID1 Tx3 | ID1 Tx3 | ID1 Tx3 | +---------------------+---------------------+---------------------+ | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | +---------------------+---------------------+---------------------+ | ID1 Tx4 | ID1 Tx4 | ID1 Tx4 | +---------------------+---------------------+---------------------+ | ID1 Tx5 | ID1 Tx5 | ID1 Tx5 | +---------------------+---------------------+---------------------+ | | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] | +---------------------+---------------------+---------------------+ | | ID2 Tx1 | ID2 Tx1 | +---------------------+---------------------+---------------------+ | | ID2 Tx2 | ID2 Tx2 | +---------------------+---------------------+---------------------+ ``` Although, in case undo log is not enough to cover the WAL inconsistence with the new leader, the ID1 needs a complete rejoin. ### Snapshot generation. We also can reuse current machinery of snapshot generation. Upon receiving a request to create a snapshot an instance should request a readview for the current commit operation. Although start of the snapshot generation should be postponed until this commit operation receives its confirmation. In case operation is rolled back, the snapshot generation should be aborted and restarted using current transaction after rollback is complete. After snapshot is created the WAL should start from the first operation that follows the commit operation snapshot is generated for. That means WAL will contain 'confirm' messages that refer to transactions that are not present in the WAL. Apparently, we have to allow this for the case 'confirm' refers to a transaction with LSN less than the first entry in the WAL. In case master appears unavailable a replica still have to be able to create a snapshot. Replica can perform rollback for all transactions that are not confirmed and claim its LSN as the latest confirmed txn. Then it can create a snapshot in a regular way and start with blank xlog file. All rolled back transactions will appear through the regular replication in case master reappears later on. ### Asynchronous replication. Along with synchronous replicas the cluster can contain asynchronous replicas. That means async replica doesn't reply to the leader with errors since they're not contributing into quorum. Still, async replicas have to follow the new WAL operation, such as keep rollback info until 'quorum' message is received. This is essential for the case of 'rollback' message appearance in the WAL. This message assumes replica is able to perform all necessary rollback by itself. Cluster information should contain explicit notification of each replica operation mode. ### Synchronous replication enabling. Synchronous operation can be required for a set of spaces in the data scheme. 
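One possible shape of such a per-space switch, purely as a sketch - neither the `is_sync` space option nor the quorum setting below existed at the time of this RFC, so treat both names as assumptions:

```
-- Hypothetical configuration sketch for the per-space synchronous flag.
box.cfg{ replication_synchro_quorum = 2 }  -- assumed cluster-wide quorum setting

box.schema.space.create('bank_accounts', {
    if_not_exists = true,
    is_sync = true,  -- transactions touching this space wait for the quorum
})
box.space.bank_accounts:create_index('pk', { if_not_exists = true })
```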
That means only transactions that contain data modifications for these spaces should require the quorum. Such transactions are named synchronous. As soon as the last operation of a synchronous transaction appears in the leader's WAL, it will cause all following transactions - no matter whether they are synchronous or not - to wait for the quorum. In case the quorum is not achieved, the 'rollback' operation will cause a rollback of all transactions after the synchronous one. It will ensure a consistent state of the data both on the leader and the replicas. In case the user doesn't require synchronous operation for any space, then no changes to the WAL generation and replication will appear.

The cluster description should contain an explicit attribute for each replica to denote whether it participates in synchronous activities. Also the description should contain a criterion on how many replica responses are needed to achieve the quorum.

## Rationale and alternatives

There is an implementation of synchronous replication as part of the gh-980 activities, but it is not in a state to get into the product. More than that, it intentionally breaks backward compatibility, which is a prerequisite for this proposal.

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-20 20:59 ` Sergey Ostanevich @ 2020-05-25 23:41 ` Vladislav Shpilevoy 2020-05-27 21:17 ` Sergey Ostanevich 0 siblings, 1 reply; 53+ messages in thread From: Vladislav Shpilevoy @ 2020-05-25 23:41 UTC (permalink / raw) To: Sergey Ostanevich; +Cc: tarantool-patches Hi! Thanks for the changes! >>>>>>>> As soon as leader appears in a situation it has not enough >>>>>>>> replicas >>>>>>>> to achieve quorum, the cluster should stop accepting any >>>>>>>> requests - both >>>>>>>> write and read. >>>> >>>> So it will not serve. >>> >>> This breaks compatibility, since now an orphan node is perfectly able >>> to serve reads. The cluster can't just stop doing everything, if the >>> quorum is lost. Stop writes - yes, since the quorum is lost anyway. But >>> reads do not need a quorum. >>> >>> If you say reads need a quorum, then they would need to go through WAL, >>> collect confirmations, and all. >> >> The reads should not be inconsistent - so that cluster will keep >> answering A or B for the same request. And in case we lost quorum we >> can't say for sure that all instances will answer the same. >> >> As we discussed it before, if leader appears in minor part of the >> cluster it can't issue rollback for all unconfirmed txns, since the >> majority will re-elect leader who will collect quorum for them. Means, >> we will appear is a state that cluster split in two. So the minor part >> should stop. Am I wrong here? Yeah, kinda. As long as you allow reading from replicas, you *always* will have a time slot, when you will be able to read different data for the same key on different replicas. Even with reads going through quorum. Because it is physically impossible to make nodes A and B start answering the same data at the same time moment. To notify them about a confirm you will send network messages, they will have not the same delay, won't be processed in the same moment of time, and some of them probably won't be even delivered. The only correct way to read the same - read from one node only. From the leader. And since this is not our way, it means we can't beat the 'inconsistent' reads problems. And I don't think we should. Because if somebody needs to do 'consistent' reads, they should read from leader only. In other words, the concept of 'consistency' is highly application dependent here. If we provide a way to read from replicas, we give flexibility to choose: read from leader only and see always the same data, or read from all, and have a possibility, that requests may see different data on different replicas sometimes. > ## Detailed design > > ### Quorum commit > > The main idea behind the proposal is to reuse existent machinery as much > as possible. It will ensure the well-tested and proven functionality > across many instances in MRG and beyond is used. The transaction rollback > mechanism is in place and works for WAL write failure. If we substitute > the WAL success with a new situation which is named 'quorum' later in > this document then no changes to the machinery is needed. The same is > true for snapshot machinery that allows to create a copy of the database > in memory for the whole period of snapshot file write. Adding quorum here > also minimizes changes. 
> > Currently replication represented by the following scheme: > ``` > Customer Leader WAL(L) Replica WAL(R) > |------TXN----->| | | | > | | | | | > | [TXN undo log | | | > | created] | | | > | | | | | > | |-----TXN----->| | | > | | | | | > | |<---WAL Ok----| | | > | | | | | > | [TXN undo log | | | > | destroyed] | | | > | | | | | > |<----TXN Ok----| | | | > | |-------Replicate TXN------->| | > | | | | | > | | | [TXN undo log | > | | | created] | > | | | | | > | | | |-----TXN----->| > | | | | | > | | | |<---WAL Ok----| > | | | | | > | | | [TXN undo log | > | | | destroyed] | > | | | | | > ``` > > To introduce the 'quorum' we have to receive confirmation from replicas > to make a decision on whether the quorum is actually present. Leader > collects necessary amount of replicas confirmation plus its own WAL > success. This state is named 'quorum' and gives leader the right to > complete the customers' request. So the picture will change to: > ``` > Customer Leader WAL(L) Replica WAL(R) > |------TXN----->| | | | > | | | | | > | [TXN undo log | | | > | created] | | | > | | | | | > | |-----TXN----->| | | > | | | | | > | |-------Replicate TXN------->| | > | | | | | > | | | [TXN undo log | > | |<---WAL Ok----| created] | > | | | | | > | [Waiting | |-----TXN----->| > | of a quorum] | | | > | | | |<---WAL Ok----| > | | | | | > | |<------Replication Ok-------| | > | | | | | > | [Quorum | | | > | achieved] | | | > | | | | | > | |---Confirm--->| | | > | | | | | > | |----------Confirm---------->| | > | | | | | > |<---TXN Ok-----| | |---Confirm--->| > | | | | | > | [TXN undo log | [TXN undo log | > | destroyed] | destroyed] | > | | | | | > ``` > > The quorum should be collected as a table for a list of transactions > waiting for quorum. The latest transaction that collects the quorum is > considered as complete, as well as all transactions prior to it, since > all transactions should be applied in order. Leader writes a 'confirm' > message to the WAL that refers to the transaction's [LEADER_ID, LSN] and > the confirm has its own LSN. This confirm message is delivered to all > replicas through the existing replication mechanism. > > Replica should report a TXN application success to the leader via the > IPROTO explicitly to allow leader to collect the quorum for the TXN. > In case of application failure the replica has to disconnect from the > replication the same way as it is done now. The replica also has to > report its disconnection to the orchestrator. Further actions require > human intervention, since failure means either technical problem (such > as not enough space for WAL) that has to be resolved or an inconsistent > state that requires rejoin. I don't think a replica should report disconnection. Problem of disconnection is that it leads to loosing the connection. So it may be not able to connect to the orchestrator. Also it would be strange for tarantool to depend on some external service, to which it should report. This looks like the orchestrator's business how will it determine connectivity. Replica has nothing to do with it from its side. > As soon as leader appears in a situation it has not enough replicas > to achieve quorum, the cluster should stop accepting any requests - both > write and read. The moment of not having enough replicas can't be determined properly. You may loose connection to replicas (they could be powered off), but TCP won't see that, and the node will continue working. 
The failure will be discovered only when a 'write' request will try to collect a quorum, or after a timeout will pass on not delivering heartbeats. During this time reads will be served. And there is no way to prevent them except collecting a quorum on that. See my first comment in this email for more details. On the summary: we can't stop accepting read requests. Btw, what to do with reads, which were *in-progress*, when the quorum was lost? Such as long vinyl reads. > The reason for this is that replication of transactions > can achieve quorum on replicas not visible to the leader. On the other > hand, leader can't achieve quorum with available minority. Leader has to > report the state and wait for human intervention. Yeah, but if the leader couldn't achieve a quorum on some transactions, they are not visible (assuming MVCC will work properly). So they can't be read anyway. And if a leader answered an error, it does not mean that the transaction wasn't replicated on the majority, as we discussed at some meeting, I don't already remember when. So here read allowance also works fine - not having some data visible and getting error at a sync transaction does not mean it is not committed. A user should be aware of that. > There's an option to > ask leader to rollback to the latest transaction that has quorum: leader > issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN > is of the first transaction in the leader's undo log. The rollback > message replicated to the available cluster will put it in a consistent > state. After that configuration of the cluster can be updated to > available quorum and leader can be switched back to write mode. > > ### Leader role assignment. > > Be it a user-initiated assignment or an algorithmic one, it should use > a common interface to assign the leader role. By now we implement a > simplified machinery, still it should be feasible in the future to fit > the algorithms, such as RAFT or proposed before box.ctl.promote. > > A system space \_voting can be used to replicate the voting among the > cluster, this space should be writable even for a read-only instance. > This space should contain a CURRENT_LEADER_ID at any time - means the > current leader, can be a zero value at the start. This is needed to > compare the appropriate vclock component below. > > All replicas should be subscribed to changes in the space and react as > described below. > > promote(ID) - should be called from a replica with it's own ID. > Writes an entry in the voting space about this ID is waiting for > votes from cluster. The entry should also contain the current > vclock[CURRENT_LEADER_ID] of the nominee. > > Upon changes in the space each replica should compare its appropriate > vclock component with submitted one and append its vote to the space: > AYE in case nominee's vclock is bigger or equal to the replica's one, > NAY otherwise. > > As soon as nominee collects the quorum for being elected, it claims > himself a Leader by switching in rw mode, writes CURRENT_LEADER_ID as > a FORMER_LEADER_ID in the \_voting space and put its ID as a > CURRENT_LEADER_ID. In case a NAY is appeared in the \_voting or a > timeout predefined in box.cfg is reached, the nominee should remove > it's entry from the space. > > The leader should assure that number of available instances in the > cluster is enough to achieve the quorum and proceed to step 3, otherwise > the leader should report the situation of incomplete quorum, as > described in the last paragraph of previous section. 
> > The new Leader has to take the responsibility to replicate former Leader's > entries from its WAL, obtain quorum and commit confirm messages referring > to [FORMER_LEADER_ID, LSN] in its WAL, replicating to the cluster, after > that it can start adding its own entries into the WAL. > > demote(ID) - should be called from the Leader instance. > The Leader has to switch in ro mode and wait for its' undo log is > empty. This effectively means all transactions are committed in the > cluster and it is safe pass the leadership. Then it should write > CURRENT_LEADER_ID as a FORMER_LEADER_ID and put CURRENT_LEADER_ID > into 0. This looks like box.ctl.promote() algorithm. Although I thought we decided not to implement any kind of auto election here, no? Box.ctl.promote() assumed, that it does all the steps automatically, except choosing on which node to call this function. This is what it was so complicated. It was basically raft. But yeah, as discussed verbally, this is a subject for improvement. The way I see it is that we need to give vclock based algorithm of choosing a new leader; tell how to stop replication from the old leader; allow to read vclock from replicas (basically, let the external service read box.info). Since you said you think we should not provide an API for all sync transactions rollback, it looks like no need in a special new API. But if we still want to allow to rollback all pending transactions of the old leader on a new leader (like Mons wants) then yeah, seems like we would need a new function. For example, box.ctl.sync_rollback() to rollback all pending. And box.ctl.sync_confirm() to confirm all pending. Perhaps we could add more admin-line parameters such as replica_id with which to write 'confirm/rollback' message. > ### Recovery and failover. > > Tarantool instance during reading WAL should postpone the undo log > deletion until the 'confirm' is read. In case the WAL eof is achieved, > the instance should keep undo log for all transactions that are waiting > for a confirm entry until the role of the instance is set. > > If this instance will be assigned a leader role then all transactions > that have no corresponding confirm message should be confirmed (see the > leader role assignment). > > In case there's not enough replicas to set up a quorum the cluster can > be switched into a read-only mode. Note, this can't be done by default > since some of transactions can have confirmed state. It is up to human > intervention to force rollback of all transactions that have no confirm > and to put the cluster into a consistent state. Above you said: >> As soon as leader appears in a situation it has not enough replicas >> to achieve quorum, the cluster should stop accepting any requests - both >> write and read. But here I see, that the cluster "switched into a read-only mode". So there is a contradiction. And I think it should be resolved in favor of 'read-only mode'. I explained why in the previous comments. > In case the instance will be assigned a replica role, it may appear in > a state that it has conflicting WAL entries, in case it recovered from a > leader role and some of transactions didn't replicated to the current > leader. This situation should be resolved through rejoin of the instance. > > Consider an example below. Originally instance with ID1 was assigned a > Leader role and the cluster had 2 replicas with quorum set to 2. 
> > ``` > +---------------------+---------------------+---------------------+ > | ID1 | ID2 | ID3 | > | Leader | Replica 1 | Replica 2 | > +---------------------+---------------------+---------------------+ > | ID1 Tx1 | ID1 Tx1 | ID1 Tx1 | > +---------------------+---------------------+---------------------+ > | ID1 Tx2 | ID1 Tx2 | ID1 Tx2 | > +---------------------+---------------------+---------------------+ > | ID1 Tx3 | ID1 Tx3 | ID1 Tx3 | > +---------------------+---------------------+---------------------+ > | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | | > +---------------------+---------------------+---------------------+ > | ID1 Tx4 | ID1 Tx4 | | > +---------------------+---------------------+---------------------+ > | ID1 Tx5 | ID1 Tx5 | | > +---------------------+---------------------+---------------------+ > | ID1 Conf [ID1, Tx2] | | | > +---------------------+---------------------+---------------------+ > | Tx6 | | | > +---------------------+---------------------+---------------------+ > | Tx7 | | | > +---------------------+---------------------+---------------------+ > ``` > Suppose at this moment the ID1 instance crashes. Then the ID2 instance > should be assigned a leader role since its ID1 LSN is the biggest. > Then this new leader will deliver its WAL to all replicas. > > As soon as quorum for Tx4 and Tx5 will be obtained, it should write the > corresponding Confirms to its WAL. Note that Tx are still uses ID1. > ``` > +---------------------+---------------------+---------------------+ > | ID1 | ID2 | ID3 | > | (dead) | Leader | Replica 2 | > +---------------------+---------------------+---------------------+ > | ID1 Tx1 | ID1 Tx1 | ID1 Tx1 | > +---------------------+---------------------+---------------------+ > | ID1 Tx2 | ID1 Tx2 | ID1 Tx2 | > +---------------------+---------------------+---------------------+ > | ID1 Tx3 | ID1 Tx3 | ID1 Tx3 | > +---------------------+---------------------+---------------------+ > | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | > +---------------------+---------------------+---------------------+ > | ID1 Tx4 | ID1 Tx4 | ID1 Tx4 | > +---------------------+---------------------+---------------------+ > | ID1 Tx5 | ID1 Tx5 | ID1 Tx5 | > +---------------------+---------------------+---------------------+ > | ID1 Conf [ID1, Tx2] | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] | Id1 -> ID1 (typo) > +---------------------+---------------------+---------------------+ > | ID1 Tx6 | | | > +---------------------+---------------------+---------------------+ > | ID1 Tx7 | | | > +---------------------+---------------------+---------------------+ > ``` > After rejoining ID1 will figure out the inconsistency of its WAL: the > last WAL entry it has is corresponding to Tx7, while in Leader's log the > last entry with ID1 is Tx5. Confirm for a Tx can only be issued after > appearance of the Tx on the majoirty of replicas, hence there's a good > chances that ID1 will have inconsistency in its WAL covered with undo > log. So, by rolling back all excessive Txs (in the example they are Tx6 > and Tx7) the ID1 can put its memtx and vynil in consistent state. Yeah, but the problem is that the node1 has vclock[ID1] == 'Conf [ID1, Tx2]'. This row can't be rolled back. So looks like node1 needs a rejoin. > At this point a snapshot can be created at ID1 with appropriate WAL > rotation. The old WAL should be renamed so it will not be reused in the > future and can be kept for postmortem. 
> ``` > +---------------------+---------------------+---------------------+ > | ID1 | ID2 | ID3 | > | Replica 1 | Leader | Replica 2 | > +---------------------+---------------------+---------------------+ > | ID1 Tx1 | ID1 Tx1 | ID1 Tx1 | > +---------------------+---------------------+---------------------+ > | ID1 Tx2 | ID1 Tx2 | ID1 Tx2 | > +---------------------+---------------------+---------------------+ > | ID1 Tx3 | ID1 Tx3 | ID1 Tx3 | > +---------------------+---------------------+---------------------+ > | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | > +---------------------+---------------------+---------------------+ > | ID1 Tx4 | ID1 Tx4 | ID1 Tx4 | > +---------------------+---------------------+---------------------+ > | ID1 Tx5 | ID1 Tx5 | ID1 Tx5 | > +---------------------+---------------------+---------------------+ > | | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] | > +---------------------+---------------------+---------------------+ > | | ID2 Tx1 | ID2 Tx1 | > +---------------------+---------------------+---------------------+ > | | ID2 Tx2 | ID2 Tx2 | > +---------------------+---------------------+---------------------+ > ``` > Although, in case undo log is not enough to cover the WAL inconsistence > with the new leader, the ID1 needs a complete rejoin. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-25 23:41 ` Vladislav Shpilevoy @ 2020-05-27 21:17 ` Sergey Ostanevich 2020-06-09 16:19 ` Sergey Ostanevich 0 siblings, 1 reply; 53+ messages in thread From: Sergey Ostanevich @ 2020-05-27 21:17 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches Hi! Thanks for review! Some comments below. On 26 мая 01:41, Vladislav Shpilevoy wrote: > >> > >> The reads should not be inconsistent - so that cluster will keep > >> answering A or B for the same request. And in case we lost quorum we > >> can't say for sure that all instances will answer the same. > >> > >> As we discussed it before, if leader appears in minor part of the > >> cluster it can't issue rollback for all unconfirmed txns, since the > >> majority will re-elect leader who will collect quorum for them. Means, > >> we will appear is a state that cluster split in two. So the minor part > >> should stop. Am I wrong here? > > Yeah, kinda. As long as you allow reading from replicas, you *always* will > have a time slot, when you will be able to read different data for the > same key on different replicas. Even with reads going through quorum. > > Because it is physically impossible to make nodes A and B start answering > the same data at the same time moment. To notify them about a confirm you will > send network messages, they will have not the same delay, won't be processed > in the same moment of time, and some of them probably won't be even delivered. > > The only correct way to read the same - read from one node only. From the > leader. And since this is not our way, it means we can't beat the 'inconsistent' > reads problems. And I don't think we should. Because if somebody needs to do > 'consistent' reads, they should read from leader only. > > In other words, the concept of 'consistency' is highly application dependent > here. If we provide a way to read from replicas, we give flexibility to choose: > read from leader only and see always the same data, or read from all, and have > a possibility, that requests may see different data on different replicas > sometimes. So, it looks like we will follow the current approach: if quorum can't be achieved, cluster appears in r/o mode. Objections? > > > > Replica should report a TXN application success to the leader via the > > IPROTO explicitly to allow leader to collect the quorum for the TXN. > > In case of application failure the replica has to disconnect from the > > replication the same way as it is done now. The replica also has to > > report its disconnection to the orchestrator. Further actions require > > human intervention, since failure means either technical problem (such > > as not enough space for WAL) that has to be resolved or an inconsistent > > state that requires rejoin. > > I don't think a replica should report disconnection. Problem of > disconnection is that it leads to loosing the connection. So it may be > not able to connect to the orchestrator. Also it would be strange for > tarantool to depend on some external service, to which it should report. > This looks like the orchestrator's business how will it determine > connectivity. Replica has nothing to do with it from its side. External service is something I expect to be useful for the first part of implementation - the quorum part. Definitely, we will move onward to achieve some automation in leader election and failover. I just don't expect this to be part of this RFC. 
Anyways, orchestrator has to ask replica to figure out the connectivity between replica and leader. > > > As soon as leader appears in a situation it has not enough replicas > > to achieve quorum, the cluster should stop accepting any requests - both > > write and read. > > The moment of not having enough replicas can't be determined properly. > You may loose connection to replicas (they could be powered off), but > TCP won't see that, and the node will continue working. The failure will > be discovered only when a 'write' request will try to collect a quorum, > or after a timeout will pass on not delivering heartbeats. During this > time reads will be served. And there is no way to prevent them except > collecting a quorum on that. See my first comment in this email for more > details. > > On the summary: we can't stop accepting read requests. > > Btw, what to do with reads, which were *in-progress*, when the quorum > was lost? Such as long vinyl reads. But the quorum was in place at the start of it? Then according to transaction manager behavior only older version data will be available for read - means data that collected quorum. > > > The reason for this is that replication of transactions > > can achieve quorum on replicas not visible to the leader. On the other > > hand, leader can't achieve quorum with available minority. Leader has to > > report the state and wait for human intervention. > > Yeah, but if the leader couldn't achieve a quorum on some transactions, > they are not visible (assuming MVCC will work properly). So they can't > be read anyway. And if a leader answered an error, it does not mean that > the transaction wasn't replicated on the majority, as we discussed at some > meeting, I don't already remember when. So here read allowance also works > fine - not having some data visible and getting error at a sync transaction > does not mean it is not committed. A user should be aware of that. True, we discussed that we should guarantee only that if we answered 'Ok' then data is present in quorum number of instances. [...] > > demote(ID) - should be called from the Leader instance. > > The Leader has to switch in ro mode and wait for its' undo log is > > empty. This effectively means all transactions are committed in the > > cluster and it is safe pass the leadership. Then it should write > > CURRENT_LEADER_ID as a FORMER_LEADER_ID and put CURRENT_LEADER_ID > > into 0. > > This looks like box.ctl.promote() algorithm. Although I thought we decided > not to implement any kind of auto election here, no? Box.ctl.promote() > assumed, that it does all the steps automatically, except choosing on which > node to call this function. This is what it was so complicated. It was > basically raft. > > But yeah, as discussed verbally, this is a subject for improvement. I personally would like to postpone the algorithm should be postponed for the next stage (Q3-Q4) but now we should not mess up too much to revamp. Hence, we have to elaborate the internals - such as _voting table I mentioned. Even with introduction of terms for each leader - as in RAFT for example - we still can keep it in a replicated space, isn't it? > > The way I see it is that we need to give vclock based algorithm of choosing > a new leader; tell how to stop replication from the old leader; allow to > read vclock from replicas (basically, let the external service read box.info). That's the #1 for me by now: how a read-only replica can quit listening to a demoted leader, which can be not aware of its demotion? 
Still, for efficiency it should be done w/o disconnection. > > Since you said you think we should not provide an API for all sync transactions > rollback, it looks like no need in a special new API. But if we still want > to allow to rollback all pending transactions of the old leader on a new leader > (like Mons wants) then yeah, seems like we would need a new function. For example, > box.ctl.sync_rollback() to rollback all pending. And box.ctl.sync_confirm() to > confirm all pending. Perhaps we could add more admin-line parameters such as > replica_id with which to write 'confirm/rollback' message. I believe it's a good point to keep two approaches and perhaps set one of the two in configuration. This should resolve the issue with 'the rest of the cluster confirms old leader's transactions and because of it leader can't rollback'. > > > ### Recovery and failover. > > > > Tarantool instance during reading WAL should postpone the undo log > > deletion until the 'confirm' is read. In case the WAL eof is achieved, > > the instance should keep undo log for all transactions that are waiting > > for a confirm entry until the role of the instance is set. > > > > If this instance will be assigned a leader role then all transactions > > that have no corresponding confirm message should be confirmed (see the > > leader role assignment). > > > > In case there's not enough replicas to set up a quorum the cluster can > > be switched into a read-only mode. Note, this can't be done by default > > since some of transactions can have confirmed state. It is up to human > > intervention to force rollback of all transactions that have no confirm > > and to put the cluster into a consistent state. > > Above you said: > > >> As soon as leader appears in a situation it has not enough replicas > >> to achieve quorum, the cluster should stop accepting any requests - both > >> write and read. > > But here I see, that the cluster "switched into a read-only mode". So there > is a contradiction. And I think it should be resolved in favor of > 'read-only mode'. I explained why in the previous comments. My bad, I was moving around this problem already and tend to allow r/o. Will update. > > > In case the instance will be assigned a replica role, it may appear in > > a state that it has conflicting WAL entries, in case it recovered from a > > leader role and some of transactions didn't replicated to the current > > leader. This situation should be resolved through rejoin of the instance. > > > > Consider an example below. Originally instance with ID1 was assigned a > > Leader role and the cluster had 2 replicas with quorum set to 2. 
> > > > ``` > > +---------------------+---------------------+---------------------+ > > | ID1 | ID2 | ID3 | > > | Leader | Replica 1 | Replica 2 | > > +---------------------+---------------------+---------------------+ > > | ID1 Tx1 | ID1 Tx1 | ID1 Tx1 | > > +---------------------+---------------------+---------------------+ > > | ID1 Tx2 | ID1 Tx2 | ID1 Tx2 | > > +---------------------+---------------------+---------------------+ > > | ID1 Tx3 | ID1 Tx3 | ID1 Tx3 | > > +---------------------+---------------------+---------------------+ > > | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | | > > +---------------------+---------------------+---------------------+ > > | ID1 Tx4 | ID1 Tx4 | | > > +---------------------+---------------------+---------------------+ > > | ID1 Tx5 | ID1 Tx5 | | > > +---------------------+---------------------+---------------------+ > > | ID1 Conf [ID1, Tx2] | | | > > +---------------------+---------------------+---------------------+ > > | Tx6 | | | > > +---------------------+---------------------+---------------------+ > > | Tx7 | | | > > +---------------------+---------------------+---------------------+ > > ``` > > Suppose at this moment the ID1 instance crashes. Then the ID2 instance > > should be assigned a leader role since its ID1 LSN is the biggest. > > Then this new leader will deliver its WAL to all replicas. > > > > As soon as quorum for Tx4 and Tx5 will be obtained, it should write the > > corresponding Confirms to its WAL. Note that Tx are still uses ID1. > > ``` > > +---------------------+---------------------+---------------------+ > > | ID1 | ID2 | ID3 | > > | (dead) | Leader | Replica 2 | > > +---------------------+---------------------+---------------------+ > > | ID1 Tx1 | ID1 Tx1 | ID1 Tx1 | > > +---------------------+---------------------+---------------------+ > > | ID1 Tx2 | ID1 Tx2 | ID1 Tx2 | > > +---------------------+---------------------+---------------------+ > > | ID1 Tx3 | ID1 Tx3 | ID1 Tx3 | > > +---------------------+---------------------+---------------------+ > > | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | > > +---------------------+---------------------+---------------------+ > > | ID1 Tx4 | ID1 Tx4 | ID1 Tx4 | > > +---------------------+---------------------+---------------------+ > > | ID1 Tx5 | ID1 Tx5 | ID1 Tx5 | > > +---------------------+---------------------+---------------------+ > > | ID1 Conf [ID1, Tx2] | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] | > > Id1 -> ID1 (typo) Thanks! > > > +---------------------+---------------------+---------------------+ > > | ID1 Tx6 | | | > > +---------------------+---------------------+---------------------+ > > | ID1 Tx7 | | | > > +---------------------+---------------------+---------------------+ > > ``` > > After rejoining ID1 will figure out the inconsistency of its WAL: the > > last WAL entry it has is corresponding to Tx7, while in Leader's log the > > last entry with ID1 is Tx5. Confirm for a Tx can only be issued after > > appearance of the Tx on the majoirty of replicas, hence there's a good > > chances that ID1 will have inconsistency in its WAL covered with undo > > log. So, by rolling back all excessive Txs (in the example they are Tx6 > > and Tx7) the ID1 can put its memtx and vynil in consistent state. > > Yeah, but the problem is that the node1 has vclock[ID1] == 'Conf [ID1, Tx2]'. > This row can't be rolled back. So looks like node1 needs a rejoin. 
Confirm message is equivalent to a NOP - @sergepetrenko apparently does implementation exactly this way. So there's no need to roll it back in an engine, rather perform the xlog rotation before it. > > > At this point a snapshot can be created at ID1 with appropriate WAL > > rotation. The old WAL should be renamed so it will not be reused in the > > future and can be kept for postmortem. > > ``` > > +---------------------+---------------------+---------------------+ > > | ID1 | ID2 | ID3 | > > | Replica 1 | Leader | Replica 2 | > > +---------------------+---------------------+---------------------+ > > | ID1 Tx1 | ID1 Tx1 | ID1 Tx1 | > > +---------------------+---------------------+---------------------+ > > | ID1 Tx2 | ID1 Tx2 | ID1 Tx2 | > > +---------------------+---------------------+---------------------+ > > | ID1 Tx3 | ID1 Tx3 | ID1 Tx3 | > > +---------------------+---------------------+---------------------+ > > | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | > > +---------------------+---------------------+---------------------+ > > | ID1 Tx4 | ID1 Tx4 | ID1 Tx4 | > > +---------------------+---------------------+---------------------+ > > | ID1 Tx5 | ID1 Tx5 | ID1 Tx5 | > > +---------------------+---------------------+---------------------+ > > | | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] | > > +---------------------+---------------------+---------------------+ > > | | ID2 Tx1 | ID2 Tx1 | > > +---------------------+---------------------+---------------------+ > > | | ID2 Tx2 | ID2 Tx2 | > > +---------------------+---------------------+---------------------+ > > ``` > > Although, in case undo log is not enough to cover the WAL inconsistence > > with the new leader, the ID1 needs a complete rejoin. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-27 21:17 ` Sergey Ostanevich @ 2020-06-09 16:19 ` Sergey Ostanevich 2020-06-11 15:17 ` Vladislav Shpilevoy 0 siblings, 1 reply; 53+ messages in thread From: Sergey Ostanevich @ 2020-06-09 16:19 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches Hi! Please, take a look at the latest changes, which include timeouts for quorum collection and the heartbeat for ensure the leader is alive. regards, Sergos * **Status**: In progress * **Start date**: 31-03-2020 * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\> * **Issues**: https://github.com/tarantool/tarantool/issues/4842 ## Summary The aim of this RFC is to address the following list of problems formulated at MRG planning meeting: - protocol backward compatibility to enable cluster upgrade w/o downtime - consistency of data on replica and leader - switch from leader to replica without data loss - up to date replicas to run read-only requests - ability to switch async replicas into sync ones and vice versa - guarantee of rollback on leader and sync replicas - simplicity of cluster orchestration What this RFC is not: - high availability (HA) solution with automated failover, roles assignments an so on - master-master configuration support ## Background and motivation There are number of known implementation of consistent data presence in a Tarantool cluster. They can be commonly named as "wait for LSN" technique. The biggest issue with this technique is the absence of rollback guarantees at replica in case of transaction failure on one master or some of the replicas in the cluster. To provide such capabilities a new functionality should be introduced in Tarantool core, with requirements mentioned before - backward compatibility and ease of cluster orchestration. The cluster operation is expected to be in a full-mesh topology, although the process of automated topology support is beyond this RFC. ## Detailed design ### Quorum commit The main idea behind the proposal is to reuse existent machinery as much as possible. It will ensure the well-tested and proven functionality across many instances in MRG and beyond is used. The transaction rollback mechanism is in place and works for WAL write failure. If we substitute the WAL success with a new situation which is named 'quorum' later in this document then no changes to the machinery is needed. The same is true for snapshot machinery that allows to create a copy of the database in memory for the whole period of snapshot file write. Adding quorum here also minimizes changes. Currently replication represented by the following scheme: ``` Customer Leader WAL(L) Replica WAL(R) |------TXN----->| | | | | | | | | | [TXN undo log | | | | created] | | | | | | | | | |-----TXN----->| | | | | | | | | |<---WAL Ok----| | | | | | | | | [TXN undo log | | | | destroyed] | | | | | | | | |<----TXN Ok----| | | | | |-------Replicate TXN------->| | | | | | | | | | [TXN undo log | | | | created] | | | | | | | | | |-----TXN----->| | | | | | | | | |<---WAL Ok----| | | | | | | | | [TXN undo log | | | | destroyed] | | | | | | ``` To introduce the 'quorum' we have to receive confirmation from replicas to make a decision on whether the quorum is actually present. Leader collects necessary amount of replicas confirmation plus its own WAL success. This state is named 'quorum' and gives leader the right to complete the customers' request. 
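
Before the updated picture below, here is a minimal leader-side sketch of how such counting could work, under the stated assumptions. It is illustrative Lua only, not the actual Tarantool implementation; the `waiting` list, the `acks` counter and the `write_confirm_to_wal()` helper are hypothetical names introduced just for this sketch.

```lua
-- Illustrative sketch only, not the actual Tarantool implementation.
-- 'quorum' counts the leader's own WAL success as one confirmation.
local quorum = 2

-- Transactions waiting for confirmations, ordered by LSN.
local waiting = {}

-- Called when the leader's own WAL write succeeds.
local function on_own_wal_ok(lsn)
    table.insert(waiting, {lsn = lsn, acks = 1})
end

-- Called when a replica reports that it applied the TXN with 'lsn'.
local function on_replica_ack(lsn)
    for _, txn in ipairs(waiting) do
        if txn.lsn == lsn then
            txn.acks = txn.acks + 1
            break
        end
    end
    -- The newest TXN that reached the quorum completes itself and
    -- every older TXN, since the WAL order is the same everywhere.
    local confirmed_lsn
    for _, txn in ipairs(waiting) do
        if txn.acks >= quorum then
            confirmed_lsn = txn.lsn
        end
    end
    if confirmed_lsn ~= nil then
        write_confirm_to_wal(confirmed_lsn) -- hypothetical helper
        while #waiting > 0 and waiting[1].lsn <= confirmed_lsn do
            table.remove(waiting, 1)
        end
    end
end
```
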
So the picture will change to: ``` Customer Leader WAL(L) Replica WAL(R) |------TXN----->| | | | | | | | | | [TXN undo log | | | | created] | | | | | | | | | |-----TXN----->| | | | | | | | | |-------Replicate TXN------->| | | | | | | | | | [TXN undo log | | |<---WAL Ok----| created] | | | | | | | [Waiting | |-----TXN----->| | of a quorum] | | | | | | |<---WAL Ok----| | | | | | | |<------Replication Ok-------| | | | | | | | [Quorum | | | | achieved] | | | | | | | | | |---Confirm--->| | | | | | | | | |----------Confirm---------->| | | | | | | |<---TXN Ok-----| | |---Confirm--->| | | | | | | [TXN undo log | [TXN undo log | | destroyed] | destroyed] | | | | | | ``` The quorum should be collected as a table for a list of transactions waiting for quorum. The latest transaction that collects the quorum is considered as complete, as well as all transactions prior to it, since all transactions should be applied in order. Leader writes a 'confirm' message to the WAL that refers to the transaction's [LEADER_ID, LSN] and the confirm has its own LSN. This confirm message is delivered to all replicas through the existing replication mechanism. Replica should report a TXN application success to the leader via the IPROTO explicitly to allow leader to collect the quorum for the TXN. In case of application failure the replica has to disconnect from the replication the same way as it is done now. The replica also has to report its disconnection to the orchestrator. Further actions require human intervention, since failure means either technical problem (such as not enough space for WAL) that has to be resolved or an inconsistent state that requires rejoin. Currently Tarantool provides no protection from dirty read from the memtx during the TXN write into the WAL. So, there is a chance of TXN can fail to be written to the WAL, while some read requests can report success of TXN. In this RFC we make no attempt to resolve the dirty read, so it should be addressed by user code. Although we plan to introduce an MVCC machinery similar to available in vinyl engnie which will resolve the dirty read problem. ### Connection liveness There is a timeout-based mechanism in Tarantool that controls the asynchronous replication, which uses the following config: ``` * replication_connect_timeout = 4 * replication_sync_lag = 10 * replication_sync_timeout = 300 * replication_timeout = 1 ``` For backward compatibility and to differentiate the async replication we should augment the configuration with the following: ``` * synchro_replication_heartbeat = 4 * synchro_replication_quorum_timeout = 4 ``` Leader should send a heartbeat every synchro_replication_heartbeat if there were no messages sent. Replicas should respond to the heartbeat just the same way as they do it now. As soon as Leader has no response for another heartbeat interval, it should consider the replica is lost. As soon as leader appears in a situation it has not enough replicas to achieve quorum, it should stop accepting write requests. There's an option for leader to rollback to the latest transaction that has quorum: leader issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN is of the first transaction in the leader's undo log. The rollback message replicated to the available cluster will put it in a consistent state. After that configuration of the cluster can be updated to a new available quorum and leader can be switched back to write mode. 
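
For illustration, an instance configured under this proposal might look as follows. This is only a sketch: the synchro_* options exist only in this RFC and are not present in any current Tarantool release, and the replication URIs and values are arbitrary examples mirroring the parameters listed above.

```lua
-- Configuration sketch; the synchro_* options are proposed by this RFC
-- and do not exist in current Tarantool, the URIs are examples only.
box.cfg{
    listen = 3301,
    replication = {
        'replicator:password@192.168.0.101:3301',
        'replicator:password@192.168.0.102:3301',
        'replicator:password@192.168.0.103:3301',
    },
    -- existing asynchronous replication options, values as listed above
    replication_connect_timeout = 4,
    replication_sync_lag = 10,
    replication_sync_timeout = 300,
    replication_timeout = 1,
    -- proposed synchronous replication options
    synchro_replication_heartbeat = 4,
    synchro_replication_quorum_timeout = 4,
}
```
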
During the quorum collection it can happen that some of the replicas become unavailable for some reason, so the leader should wait at most for synchro_replication_quorum_timeout, after which it issues a Rollback pointing to the oldest TXN in the waiting list.

### Leader role assignment.

Be it a user-initiated assignment or an algorithmic one, it should use a common interface to assign the leader role. For now we implement a simplified machinery, still it should be feasible in the future to fit algorithms such as RAFT or the previously proposed box.ctl.promote.

A system space \_voting can be used to replicate the voting among the cluster; this space should be writable even for a read-only instance. This space should contain a CURRENT_LEADER_ID at any time - meaning the current leader - which can be a zero value at the start. This is needed to compare the appropriate vclock component below.

All replicas should be subscribed to changes in the space and react as described below.

promote(ID) - should be called from a replica with its own ID. It writes an entry in the voting space stating that this ID is waiting for votes from the cluster. The entry should also contain the current vclock[CURRENT_LEADER_ID] of the nominee.

Upon changes in the space each replica should compare its appropriate vclock component with the submitted one and append its vote to the space: AYE in case the nominee's vclock is bigger than or equal to the replica's one, NAY otherwise.

As soon as the nominee collects the quorum for being elected, it claims itself a Leader by switching into rw mode, writes CURRENT_LEADER_ID as a FORMER_LEADER_ID in the \_voting space and puts its ID as the CURRENT_LEADER_ID. In case a NAY appears in \_voting or a timeout predefined in box.cfg is reached, the nominee should remove its entry from the space.

The leader should assure that the number of available instances in the cluster is enough to achieve the quorum and proceed to step 3; otherwise the leader should report the situation of an incomplete quorum, as described in the last paragraph of the previous section.

The new Leader has to take the responsibility to replicate the former Leader's entries from its WAL, obtain the quorum and commit confirm messages referring to [FORMER_LEADER_ID, LSN] in its WAL, replicating them to the cluster; after that it can start adding its own entries into the WAL.

demote(ID) - should be called from the Leader instance. The Leader has to switch into ro mode and wait until its undo log is empty. This effectively means all transactions are committed in the cluster and it is safe to pass the leadership. Then it should write CURRENT_LEADER_ID as a FORMER_LEADER_ID and set CURRENT_LEADER_ID to 0.

### Recovery and failover.

A Tarantool instance, while reading the WAL, should postpone the undo log deletion until the 'confirm' is read. In case the WAL eof is reached, the instance should keep the undo log for all transactions that are waiting for a confirm entry until the role of the instance is set.

If this instance is assigned a leader role, then all transactions that have no corresponding confirm message should be confirmed (see the leader role assignment).

In case there are not enough replicas to set up a quorum, the cluster can be switched into a read-only mode. Note, this can't be done by default since some of the transactions can have a confirmed state. It is up to human intervention to force a rollback of all transactions that have no confirm and to put the cluster into a consistent state.
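
A minimal sketch of the recovery behaviour described in this section is shown below, assuming hypothetical helpers (`wal_records`, `apply_txn`, `drop_undo_log`, `confirm_all`); it does not reflect the actual recovery code, only the decision it has to make.

```lua
-- Recovery sketch only; every helper used below is hypothetical.
local pending = {} -- TXNs read from the WAL with no 'confirm' seen yet

local function recover_wal(wal)
    for _, rec in wal_records(wal) do
        if rec.type == 'txn' then
            apply_txn(rec)
            pending[rec.lsn] = rec -- keep its undo log around
        elseif rec.type == 'confirm' then
            -- Everything up to rec.confirm_lsn is committed for sure.
            for lsn in pairs(pending) do
                if lsn <= rec.confirm_lsn then
                    drop_undo_log(lsn)
                    pending[lsn] = nil
                end
            end
        end
    end
    -- WAL eof reached: the fate of 'pending' depends on the role that
    -- is assigned to this instance afterwards.
end

local function on_role_assigned(role)
    if role == 'leader' then
        confirm_all(pending) -- see the leader role assignment section
    end
    -- A replica keeps the undo log: confirm or rollback entries will
    -- arrive through replication from the current leader.
end
```
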
In case the instance is assigned a replica role, it may end up with conflicting WAL entries if it recovered from a leader role and some of its transactions were not replicated to the current leader. This situation should be resolved through a rejoin of the instance.

Consider the example below. Originally the instance with ID1 was assigned the Leader role and the cluster had 2 replicas with the quorum set to 2.

```
+---------------------+---------------------+---------------------+
| ID1                 | ID2                 | ID3                 |
| Leader              | Replica 1           | Replica 2           |
+---------------------+---------------------+---------------------+
| ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
+---------------------+---------------------+---------------------+
| ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
+---------------------+---------------------+---------------------+
| ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |                     |
+---------------------+---------------------+---------------------+
| ID1 Tx4             | ID1 Tx4             |                     |
+---------------------+---------------------+---------------------+
| ID1 Tx5             | ID1 Tx5             |                     |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx2] |                     |                     |
+---------------------+---------------------+---------------------+
| Tx6                 |                     |                     |
+---------------------+---------------------+---------------------+
| Tx7                 |                     |                     |
+---------------------+---------------------+---------------------+
```

Suppose at this moment the ID1 instance crashes. Then the ID2 instance should be assigned the leader role, since its ID1 LSN is the biggest. This new leader will then deliver its WAL to all replicas.

As soon as the quorum for Tx4 and Tx5 is obtained, it should write the corresponding Confirms to its WAL. Note that the Txs still use ID1.

```
+---------------------+---------------------+---------------------+
| ID1                 | ID2                 | ID3                 |
| (dead)              | Leader              | Replica 2           |
+---------------------+---------------------+---------------------+
| ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
+---------------------+---------------------+---------------------+
| ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
+---------------------+---------------------+---------------------+
| ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
+---------------------+---------------------+---------------------+
| ID1 Tx4             | ID1 Tx4             | ID1 Tx4             |
+---------------------+---------------------+---------------------+
| ID1 Tx5             | ID1 Tx5             | ID1 Tx5             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx2] | ID2 Conf [ID1, Tx5] | ID2 Conf [ID1, Tx5] |
+---------------------+---------------------+---------------------+
| ID1 Tx6             |                     |                     |
+---------------------+---------------------+---------------------+
| ID1 Tx7             |                     |                     |
+---------------------+---------------------+---------------------+
```

After rejoining, ID1 will figure out the inconsistency of its WAL: the last WAL entry it has corresponds to Tx7, while in the Leader's log the last entry with ID1 is Tx5. A Confirm for a Tx can only be issued after the Tx appears on the majority of replicas, hence there is a good chance that the inconsistency in ID1's WAL is covered by the undo log. So, by rolling back all excessive Txs (in the example they are Tx6 and Tx7) ID1 can put its memtx and vinyl engines into a consistent state. At this point a snapshot can be created at ID1 with an appropriate WAL rotation.
The old WAL should be renamed so it will not be reused in the future and can be kept for postmortem analysis.

```
+---------------------+---------------------+---------------------+
| ID1                 | ID2                 | ID3                 |
| Replica 1           | Leader              | Replica 2           |
+---------------------+---------------------+---------------------+
| ID1 Tx1             | ID1 Tx1             | ID1 Tx1             |
+---------------------+---------------------+---------------------+
| ID1 Tx2             | ID1 Tx2             | ID1 Tx2             |
+---------------------+---------------------+---------------------+
| ID1 Tx3             | ID1 Tx3             | ID1 Tx3             |
+---------------------+---------------------+---------------------+
| ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] |
+---------------------+---------------------+---------------------+
| ID1 Tx4             | ID1 Tx4             | ID1 Tx4             |
+---------------------+---------------------+---------------------+
| ID1 Tx5             | ID1 Tx5             | ID1 Tx5             |
+---------------------+---------------------+---------------------+
|                     | ID2 Conf [ID1, Tx5] | ID2 Conf [ID1, Tx5] |
+---------------------+---------------------+---------------------+
|                     | ID2 Tx1             | ID2 Tx1             |
+---------------------+---------------------+---------------------+
|                     | ID2 Tx2             | ID2 Tx2             |
+---------------------+---------------------+---------------------+
```

However, in case the undo log is not enough to cover the WAL inconsistency with the new leader, ID1 needs a complete rejoin.

### Snapshot generation.

We can also reuse the current machinery of snapshot generation. Upon receiving a request to create a snapshot, an instance should request a read view for the current commit operation. However, the start of the snapshot generation should be postponed until this commit operation receives its confirmation. In case the operation is rolled back, the snapshot generation should be aborted and restarted using the current transaction after the rollback is complete.

After the snapshot is created, the WAL should start from the first operation that follows the commit operation the snapshot is generated for. That means the WAL will contain 'confirm' messages that refer to transactions that are not present in the WAL. Apparently, we have to allow this for the case when a 'confirm' refers to a transaction with an LSN less than the first entry in the WAL.

In case the master appears unavailable, a replica still has to be able to create a snapshot. The replica can perform a rollback for all transactions that are not confirmed and claim its LSN as the latest confirmed txn. Then it can create a snapshot in a regular way and start with a blank xlog file. All rolled back transactions will reappear through the regular replication in case the master comes back later on.

### Asynchronous replication.

Along with synchronous replicas the cluster can contain asynchronous replicas. That means an async replica doesn't reply to the leader with errors, since it is not contributing to the quorum. Still, async replicas have to follow the new WAL operations, such as keeping the rollback info until a 'confirm' message is received. This is essential for the case a 'rollback' message appears in the WAL: this message assumes the replica is able to perform all the necessary rollback by itself. Cluster information should contain an explicit notification of each replica's operation mode.

### Synchronous replication enabling.

Synchronous operation can be required for a set of spaces in the data scheme. That means only transactions that contain data modifications for these spaces should require the quorum. Such transactions are named synchronous.
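
The RFC does not fix the user-visible way of marking a space synchronous. The `is_sync` space option below is purely an assumption used for illustration, not an agreed interface; it only shows what "synchronous operation required for a set of spaces" could look like for a user.

```lua
-- Hypothetical sketch: the 'is_sync' option is not defined by this RFC.
box.schema.space.create('accounts', {is_sync = true})
box.space.accounts:create_index('pk', {parts = {{1, 'unsigned'}}})

-- A transaction touching 'accounts' would wait for the quorum, while a
-- transaction touching only ordinary spaces would not.
box.space.accounts:insert{1, 100}
```
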
As soon as last operation of synchronous transaction appeared in leader's WAL, it will cause all following transactions - no matter if they are synchronous or not - wait for the quorum. In case quorum is not achieved the 'rollback' operation will cause rollback of all transactions after the synchronous one. It will ensure the consistent state of the data both on leader and replicas. In case user doesn't require synchronous operation for any space then no changes to the WAL generation and replication will appear. Cluster description should contain explicit attribute for each replica to denote it participates in synchronous activities. Also the description should contain criterion on how many replicas responses are needed to achieve the quorum. ## Rationale and alternatives There is an implementation of synchronous replication as part of gh-980 activities, still it is not in a state to get into the product. More than that it intentionally breaks backward compatibility which is a prerequisite for this proposal. On 28 мая 00:17, Sergey Ostanevich wrote: > Hi! > > Thanks for review! > > Some comments below. > On 26 мая 01:41, Vladislav Shpilevoy wrote: > > >> > > >> The reads should not be inconsistent - so that cluster will keep > > >> answering A or B for the same request. And in case we lost quorum we > > >> can't say for sure that all instances will answer the same. > > >> > > >> As we discussed it before, if leader appears in minor part of the > > >> cluster it can't issue rollback for all unconfirmed txns, since the > > >> majority will re-elect leader who will collect quorum for them. Means, > > >> we will appear is a state that cluster split in two. So the minor part > > >> should stop. Am I wrong here? > > > > Yeah, kinda. As long as you allow reading from replicas, you *always* will > > have a time slot, when you will be able to read different data for the > > same key on different replicas. Even with reads going through quorum. > > > > Because it is physically impossible to make nodes A and B start answering > > the same data at the same time moment. To notify them about a confirm you will > > send network messages, they will have not the same delay, won't be processed > > in the same moment of time, and some of them probably won't be even delivered. > > > > The only correct way to read the same - read from one node only. From the > > leader. And since this is not our way, it means we can't beat the 'inconsistent' > > reads problems. And I don't think we should. Because if somebody needs to do > > 'consistent' reads, they should read from leader only. > > > > In other words, the concept of 'consistency' is highly application dependent > > here. If we provide a way to read from replicas, we give flexibility to choose: > > read from leader only and see always the same data, or read from all, and have > > a possibility, that requests may see different data on different replicas > > sometimes. > > So, it looks like we will follow the current approach: if quorum can't > be achieved, cluster appears in r/o mode. Objections? > > > > > > > Replica should report a TXN application success to the leader via the > > > IPROTO explicitly to allow leader to collect the quorum for the TXN. > > > In case of application failure the replica has to disconnect from the > > > replication the same way as it is done now. The replica also has to > > > report its disconnection to the orchestrator. 
Further actions require > > > human intervention, since failure means either technical problem (such > > > as not enough space for WAL) that has to be resolved or an inconsistent > > > state that requires rejoin. > > > > I don't think a replica should report disconnection. Problem of > > disconnection is that it leads to loosing the connection. So it may be > > not able to connect to the orchestrator. Also it would be strange for > > tarantool to depend on some external service, to which it should report. > > This looks like the orchestrator's business how will it determine > > connectivity. Replica has nothing to do with it from its side. > > External service is something I expect to be useful for the first part > of implementation - the quorum part. Definitely, we will move onward to > achieve some automation in leader election and failover. I just don't > expect this to be part of this RFC. > > Anyways, orchestrator has to ask replica to figure out the connectivity > between replica and leader. > > > > > > As soon as leader appears in a situation it has not enough replicas > > > to achieve quorum, the cluster should stop accepting any requests - both > > > write and read. > > > > The moment of not having enough replicas can't be determined properly. > > You may loose connection to replicas (they could be powered off), but > > TCP won't see that, and the node will continue working. The failure will > > be discovered only when a 'write' request will try to collect a quorum, > > or after a timeout will pass on not delivering heartbeats. During this > > time reads will be served. And there is no way to prevent them except > > collecting a quorum on that. See my first comment in this email for more > > details. > > > > On the summary: we can't stop accepting read requests. > > > > Btw, what to do with reads, which were *in-progress*, when the quorum > > was lost? Such as long vinyl reads. > > But the quorum was in place at the start of it? Then according to > transaction manager behavior only older version data will be available > for read - means data that collected quorum. > > > > > > The reason for this is that replication of transactions > > > can achieve quorum on replicas not visible to the leader. On the other > > > hand, leader can't achieve quorum with available minority. Leader has to > > > report the state and wait for human intervention. > > > > Yeah, but if the leader couldn't achieve a quorum on some transactions, > > they are not visible (assuming MVCC will work properly). So they can't > > be read anyway. And if a leader answered an error, it does not mean that > > the transaction wasn't replicated on the majority, as we discussed at some > > meeting, I don't already remember when. So here read allowance also works > > fine - not having some data visible and getting error at a sync transaction > > does not mean it is not committed. A user should be aware of that. > > True, we discussed that we should guarantee only that if we answered > 'Ok' then data is present in quorum number of instances. > > [...] > > > > demote(ID) - should be called from the Leader instance. > > > The Leader has to switch in ro mode and wait for its' undo log is > > > empty. This effectively means all transactions are committed in the > > > cluster and it is safe pass the leadership. Then it should write > > > CURRENT_LEADER_ID as a FORMER_LEADER_ID and put CURRENT_LEADER_ID > > > into 0. > > > > This looks like box.ctl.promote() algorithm. 
Although I thought we decided > > not to implement any kind of auto election here, no? Box.ctl.promote() > > assumed, that it does all the steps automatically, except choosing on which > > node to call this function. This is what it was so complicated. It was > > basically raft. > > > > But yeah, as discussed verbally, this is a subject for improvement. > > I personally would like to postpone the algorithm should be postponed > for the next stage (Q3-Q4) but now we should not mess up too much to > revamp. Hence, we have to elaborate the internals - such as _voting > table I mentioned. > > Even with introduction of terms for each leader - as in RAFT for example > - we still can keep it in a replicated space, isn't it? > > > > > The way I see it is that we need to give vclock based algorithm of choosing > > a new leader; tell how to stop replication from the old leader; allow to > > read vclock from replicas (basically, let the external service read box.info). > > That's the #1 for me by now: how a read-only replica can quit listening > to a demoted leader, which can be not aware of its demotion? Still, for > efficiency it should be done w/o disconnection. > > > > > Since you said you think we should not provide an API for all sync transactions > > rollback, it looks like no need in a special new API. But if we still want > > to allow to rollback all pending transactions of the old leader on a new leader > > (like Mons wants) then yeah, seems like we would need a new function. For example, > > box.ctl.sync_rollback() to rollback all pending. And box.ctl.sync_confirm() to > > confirm all pending. Perhaps we could add more admin-line parameters such as > > replica_id with which to write 'confirm/rollback' message. > > I believe it's a good point to keep two approaches and perhaps set one > of the two in configuration. This should resolve the issue with 'the > rest of the cluster confirms old leader's transactions and because of it > leader can't rollback'. > > > > > > ### Recovery and failover. > > > > > > Tarantool instance during reading WAL should postpone the undo log > > > deletion until the 'confirm' is read. In case the WAL eof is achieved, > > > the instance should keep undo log for all transactions that are waiting > > > for a confirm entry until the role of the instance is set. > > > > > > If this instance will be assigned a leader role then all transactions > > > that have no corresponding confirm message should be confirmed (see the > > > leader role assignment). > > > > > > In case there's not enough replicas to set up a quorum the cluster can > > > be switched into a read-only mode. Note, this can't be done by default > > > since some of transactions can have confirmed state. It is up to human > > > intervention to force rollback of all transactions that have no confirm > > > and to put the cluster into a consistent state. > > > > Above you said: > > > > >> As soon as leader appears in a situation it has not enough replicas > > >> to achieve quorum, the cluster should stop accepting any requests - both > > >> write and read. > > > > But here I see, that the cluster "switched into a read-only mode". So there > > is a contradiction. And I think it should be resolved in favor of > > 'read-only mode'. I explained why in the previous comments. > > My bad, I was moving around this problem already and tend to allow r/o. > Will update. 
> > > > > > In case the instance will be assigned a replica role, it may appear in > > > a state that it has conflicting WAL entries, in case it recovered from a > > > leader role and some of transactions didn't replicated to the current > > > leader. This situation should be resolved through rejoin of the instance. > > > > > > Consider an example below. Originally instance with ID1 was assigned a > > > Leader role and the cluster had 2 replicas with quorum set to 2. > > > > > > ``` > > > +---------------------+---------------------+---------------------+ > > > | ID1 | ID2 | ID3 | > > > | Leader | Replica 1 | Replica 2 | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Tx1 | ID1 Tx1 | ID1 Tx1 | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Tx2 | ID1 Tx2 | ID1 Tx2 | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Tx3 | ID1 Tx3 | ID1 Tx3 | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Tx4 | ID1 Tx4 | | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Tx5 | ID1 Tx5 | | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Conf [ID1, Tx2] | | | > > > +---------------------+---------------------+---------------------+ > > > | Tx6 | | | > > > +---------------------+---------------------+---------------------+ > > > | Tx7 | | | > > > +---------------------+---------------------+---------------------+ > > > ``` > > > Suppose at this moment the ID1 instance crashes. Then the ID2 instance > > > should be assigned a leader role since its ID1 LSN is the biggest. > > > Then this new leader will deliver its WAL to all replicas. > > > > > > As soon as quorum for Tx4 and Tx5 will be obtained, it should write the > > > corresponding Confirms to its WAL. Note that Tx are still uses ID1. > > > ``` > > > +---------------------+---------------------+---------------------+ > > > | ID1 | ID2 | ID3 | > > > | (dead) | Leader | Replica 2 | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Tx1 | ID1 Tx1 | ID1 Tx1 | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Tx2 | ID1 Tx2 | ID1 Tx2 | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Tx3 | ID1 Tx3 | ID1 Tx3 | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Tx4 | ID1 Tx4 | ID1 Tx4 | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Tx5 | ID1 Tx5 | ID1 Tx5 | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Conf [ID1, Tx2] | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] | > > > > Id1 -> ID1 (typo) > > Thanks! 
> > > > > > +---------------------+---------------------+---------------------+ > > > | ID1 Tx6 | | | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Tx7 | | | > > > +---------------------+---------------------+---------------------+ > > > ``` > > > After rejoining ID1 will figure out the inconsistency of its WAL: the > > > last WAL entry it has is corresponding to Tx7, while in Leader's log the > > > last entry with ID1 is Tx5. Confirm for a Tx can only be issued after > > > appearance of the Tx on the majoirty of replicas, hence there's a good > > > chances that ID1 will have inconsistency in its WAL covered with undo > > > log. So, by rolling back all excessive Txs (in the example they are Tx6 > > > and Tx7) the ID1 can put its memtx and vynil in consistent state. > > > > Yeah, but the problem is that the node1 has vclock[ID1] == 'Conf [ID1, Tx2]'. > > This row can't be rolled back. So looks like node1 needs a rejoin. > > Confirm message is equivalent to a NOP - @sergepetrenko apparently does > implementation exactly this way. So there's no need to roll it back in > an engine, rather perform the xlog rotation before it. > > > > > > At this point a snapshot can be created at ID1 with appropriate WAL > > > rotation. The old WAL should be renamed so it will not be reused in the > > > future and can be kept for postmortem. > > > ``` > > > +---------------------+---------------------+---------------------+ > > > | ID1 | ID2 | ID3 | > > > | Replica 1 | Leader | Replica 2 | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Tx1 | ID1 Tx1 | ID1 Tx1 | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Tx2 | ID1 Tx2 | ID1 Tx2 | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Tx3 | ID1 Tx3 | ID1 Tx3 | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | ID1 Conf [ID1, Tx1] | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Tx4 | ID1 Tx4 | ID1 Tx4 | > > > +---------------------+---------------------+---------------------+ > > > | ID1 Tx5 | ID1 Tx5 | ID1 Tx5 | > > > +---------------------+---------------------+---------------------+ > > > | | ID2 Conf [Id1, Tx5] | ID2 Conf [Id1, Tx5] | > > > +---------------------+---------------------+---------------------+ > > > | | ID2 Tx1 | ID2 Tx1 | > > > +---------------------+---------------------+---------------------+ > > > | | ID2 Tx2 | ID2 Tx2 | > > > +---------------------+---------------------+---------------------+ > > > ``` > > > Although, in case undo log is not enough to cover the WAL inconsistence > > > with the new leader, the ID1 needs a complete rejoin. ^ permalink raw reply [flat|nested] 53+ messages in thread
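For illustration, the leader-election rule discussed above — pick the replica holding the biggest LSN for the former leader's ID — could be sketched as a script run by an external coordinator. This is only a sketch under assumptions: the URIs are placeholders, and the connectivity, quorum and error handling a real orchestrator needs are left out.

```lua
-- Sketch: choose the failover candidate by comparing the former leader's
-- component of every reachable replica's vclock (box.info.vclock).
-- URIs and IDs below are placeholders, not part of the RFC.
local net_box = require('net.box')

local function pick_new_leader(replica_uris, former_leader_id)
    local best_uri, best_lsn = nil, -1
    for _, uri in ipairs(replica_uris) do
        local conn = net_box.connect(uri)
        if conn:is_connected() then
            -- The last LSN this replica has applied from the dead leader.
            local vclock = conn:eval('return box.info.vclock')
            local lsn = vclock[former_leader_id] or 0
            if lsn > best_lsn then
                best_uri, best_lsn = uri, lsn
            end
        end
        conn:close()
    end
    return best_uri, best_lsn
end

-- pick_new_leader({'replica2:3301', 'replica3:3301'}, 1) would return the
-- URI of ID2 in the example above, since it holds ID1 Tx5.
```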
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-06-09 16:19 ` Sergey Ostanevich @ 2020-06-11 15:17 ` Vladislav Shpilevoy 2020-06-12 20:31 ` Sergey Ostanevich 0 siblings, 1 reply; 53+ messages in thread From: Vladislav Shpilevoy @ 2020-06-11 15:17 UTC (permalink / raw) To: Sergey Ostanevich; +Cc: tarantool-patches Hi! Thanks for the updates! > ### Connection liveness > > There is a timeout-based mechanism in Tarantool that controls the > asynchronous replication, which uses the following config: > ``` > * replication_connect_timeout = 4 > * replication_sync_lag = 10 > * replication_sync_timeout = 300 > * replication_timeout = 1 > ``` > For backward compatibility and to differentiate the async replication > we should augment the configuration with the following: > ``` > * synchro_replication_heartbeat = 4 Heartbeats are already being sent. I don't see any sense in adding a second heartbeat option. > * synchro_replication_quorum_timeout = 4 Since this is a replication option, it should start from replication_ prefix. > ``` > Leader should send a heartbeat every synchro_replication_heartbeat if > there were no messages sent. Replicas should respond to the heartbeat > just the same way as they do it now. As soon as Leader has no response > for another heartbeat interval, it should consider the replica is lost. All of that is already done in the regular heartbeats, not related nor bound to any synchronous activities. Just like failure detection should be. > As soon as leader appears in a situation it has not enough replicas > to achieve quorum, it should stop accepting write requests. There's an > option for leader to rollback to the latest transaction that has quorum: > leader issues a 'rollback' message referring to the [LEADER_ID, LSN] > where LSN is of the first transaction in the leader's undo log. What is that option? > The rollback message replicated to the available cluster will put it in a > consistent state. After that configuration of the cluster can be > updated to a new available quorum and leader can be switched back to > write mode. > > During the quorum collection it can happen that some of replicas become > unavailable due to some reason, so leader should wait at most for > synchro_replication_quorum_timeout after which it issues a Rollback > pointing to the oldest TXN in the waiting list. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-06-11 15:17 ` Vladislav Shpilevoy @ 2020-06-12 20:31 ` Sergey Ostanevich 0 siblings, 0 replies; 53+ messages in thread From: Sergey Ostanevich @ 2020-06-12 20:31 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches Hi! Thanks for review, attaching a diff. Full version is available at the branch https://github.com/tarantool/tarantool/blob/sergos/quorum-based-synchro/ On 11 июн 17:17, Vladislav Shpilevoy wrote: > Hi! Thanks for the updates! > > > ### Connection liveness > > > > There is a timeout-based mechanism in Tarantool that controls the > > asynchronous replication, which uses the following config: > > ``` > > * replication_connect_timeout = 4 > > * replication_sync_lag = 10 > > * replication_sync_timeout = 300 > > * replication_timeout = 1 > > ``` > > For backward compatibility and to differentiate the async replication > > we should augment the configuration with the following: > > ``` > > * synchro_replication_heartbeat = 4 > > Heartbeats are already being sent. I don't see any sense in adding a > second heartbeat option. I had an idea that synchronous replication can co-exist with async one, so they have to have independent tuning. Now I realize that sending two types of heartbeats is too much, so I'll drop this one. > > > * synchro_replication_quorum_timeout = 4 > > Since this is a replication option, it should start from replication_ > prefix. There are number of options already exist that are very similar in naming, such as replication_sync_timeout, replication_sync_lag and even replication_connect_quorum. I expect to resolve the ambiguity with putting in a new prefix, synchro_replication. The drawback is those options reused from async mode would be not-so-clearly linked to the synch one. > > > ``` > > Leader should send a heartbeat every synchro_replication_heartbeat if > > there were no messages sent. Replicas should respond to the heartbeat > > just the same way as they do it now. As soon as Leader has no response > > for another heartbeat interval, it should consider the replica is lost. > > All of that is already done in the regular heartbeats, not related nor > bound to any synchronous activities. Just like failure detection should be. > > > As soon as leader appears in a situation it has not enough replicas > > to achieve quorum, it should stop accepting write requests. There's an > > option for leader to rollback to the latest transaction that has quorum: > > leader issues a 'rollback' message referring to the [LEADER_ID, LSN] > > where LSN is of the first transaction in the leader's undo log. > > What is that option? Good catch, thanks! This option was introduced to get to a consistent state with replicas. Although, if Leader will wait longer than timeout for quorum it will rollback anyways, so I will remove mention of this. > > > The rollback message replicated to the available cluster will put it in a > > consistent state. After that configuration of the cluster can be > > updated to a new available quorum and leader can be switched back to > > write mode. > > > > During the quorum collection it can happen that some of replicas become > > unavailable due to some reason, so leader should wait at most for > > synchro_replication_quorum_timeout after which it issues a Rollback > > pointing to the oldest TXN in the waiting list. 
diff --git a/doc/rfc/quorum-based-synchro.md b/doc/rfc/quorum-based-synchro.md index c7dcf56b5..0a92642fd 100644 --- a/doc/rfc/quorum-based-synchro.md +++ b/doc/rfc/quorum-based-synchro.md @@ -83,9 +83,10 @@ Customer Leader WAL(L) Replica WAL(R) To introduce the 'quorum' we have to receive confirmation from replicas to make a decision on whether the quorum is actually present. Leader -collects necessary amount of replicas confirmation plus its own WAL -success. This state is named 'quorum' and gives leader the right to -complete the customers' request. So the picture will change to: +collects replication_synchro_quorum-1 of replicas confirmation and its +own WAL success. This state is named 'quorum' and gives leader the +right to complete the customers' request. So the picture will change +to: ``` Customer Leader WAL(L) Replica WAL(R) |------TXN----->| | | | @@ -158,26 +159,21 @@ asynchronous replication, which uses the following config: For backward compatibility and to differentiate the async replication we should augment the configuration with the following: ``` -* synchro_replication_heartbeat = 4 -* synchro_replication_quorum_timeout = 4 +* replication_synchro_quorum_timeout = 4 +* replication_synchro_quorum = 4 ``` -Leader should send a heartbeat every synchro_replication_heartbeat if -there were no messages sent. Replicas should respond to the heartbeat -just the same way as they do it now. As soon as Leader has no response -for another heartbeat interval, it should consider the replica is lost. -As soon as leader appears in a situation it has not enough replicas -to achieve quorum, it should stop accepting write requests. There's an -option for leader to rollback to the latest transaction that has quorum: -leader issues a 'rollback' message referring to the [LEADER_ID, LSN] -where LSN is of the first transaction in the leader's undo log. The -rollback message replicated to the available cluster will put it in a -consistent state. After that configuration of the cluster can be -updated to a new available quorum and leader can be switched back to -write mode. +Leader should send a heartbeat every replication_timeout if there were +no messages sent. Replicas should respond to the heartbeat just the +same way as they do it now. As soon as Leader has no response for +another heartbeat interval, it should consider the replica is lost. As +soon as leader appears in a situation it has not enough replicas to +achieve quorum, it should stop accepting write requests. After that +configuration of the cluster can be updated to a new available quorum +and leader can be switched back to write mode. During the quorum collection it can happen that some of replicas become unavailable due to some reason, so leader should wait at most for -synchro_replication_quorum_timeout after which it issues a Rollback +replication_synchro_quorum_timeout after which it issues a Rollback pointing to the oldest TXN in the waiting list. ### Leader role assignment. @@ -274,9 +270,9 @@ Leader role and the cluster had 2 replicas with quorum set to 2. +---------------------+---------------------+---------------------+ | ID1 Conf [ID1, Tx2] | | | +---------------------+---------------------+---------------------+ -| Tx6 | | | +| ID1 Tx | | | +---------------------+---------------------+---------------------+ -| Tx7 | | | +| ID1 Tx | | | +---------------------+---------------------+---------------------+ ``` Suppose at this moment the ID1 instance crashes. 
Then the ID2 instance ^ permalink raw reply [flat|nested] 53+ messages in thread
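To make the renamed options from the diff above concrete, here is a minimal configuration sketch. The replication_synchro_* names are exactly the ones proposed in the diff and are not a shipped API; the listen port and replica URIs are placeholders.

```lua
-- Sketch of the proposed knobs next to the existing replication_* options.
-- replication_synchro_quorum / replication_synchro_quorum_timeout are the
-- names proposed in the diff above, not an existing box.cfg option set.
box.cfg{
    listen = 3301,
    replication = {
        'replica1:3301',
        'replica2:3301',
        'replica3:3301',
    },
    replication_timeout = 1,                 -- existing option, also paces heartbeats
    replication_synchro_quorum = 4,          -- proposed: leader's WAL + 3 replica acks
    replication_synchro_quorum_timeout = 4,  -- proposed: wait before issuing Rollback
}
```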
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-06 16:39 ` Sergey Ostanevich 2020-05-06 18:44 ` Konstantin Osipov @ 2020-05-13 21:36 ` Vladislav Shpilevoy 2020-05-13 23:45 ` Konstantin Osipov 1 sibling, 1 reply; 53+ messages in thread From: Vladislav Shpilevoy @ 2020-05-13 21:36 UTC (permalink / raw) To: Sergey Ostanevich, Konstantin Osipov, tarantool-patches Thanks for the discussion! On 06/05/2020 18:39, Sergey Ostanevich wrote: > Hi! > > Thanks for review! > >>> | | | | | >>> | [Quorum | | | >>> | achieved] | | | >>> | | | | | >>> | [TXN undo log | | | >>> | destroyed] | | | >>> | | | | | >>> | |---Confirm--->| | | >>> | | | | | >> >> What happens if writing Confirm to WAL fails? TXN und log record >> is destroyed already. Will the server panic now on WAL failure, >> even if it is intermittent? > > I would like to have an example of intermittent WAL failure. Can it be > other than problem with disc - be it space/availability/malfunction? > > For all of those it should be resolved outside the DBMS anyways. So, > leader should stop and report its problems to orchestrator/admins. > > I would agree that undo log can be destroyed *after* the Confirm is > landed to WAL - same is for replica. Well, in fact you can't (or can you?). Because it won't help. Once you tried to write 'Confirm', it means you got the quorum. So now in case you will fail, a new leader will write 'Confirm' for you, when will see a quorum too. So the current leader has no right to write 'Rollback' from this moment, from what I understand. Because it still can be confirmed by a new leader later, if you fail before 'Rollback' is replicated to all. However the same problem appears, if you write 'Confirm' *successfully*. Still the leader can fail, and a newer leader will write 'Rollback' if won't collect the quorum again. Don't know what to do with that really. Probably nothing. >> >>> | |----------Confirm---------->| | >> >> What happens if peers receive and maybe even write Confirm to their WALs >> but local WAL write is lost after a restart? > > Did you mean WAL write on leader as a local? Then we have a replica with > a bigger LSN for the leader ID. > >> WAL is not synced, >> so we can easily lose the tail of the WAL. Tarantool will sync up >> with all replicas on restart, > > But at this point a new leader will be appointed - the old one is > restarted. Then the Confirm message will arrive to the restarted leader > through a regular replication. > >> but there will be no "Replication >> OK" messages from them, so it wouldn't know that the transaction >> is committed on them. How is this handled? We may end up with some >> replicas confirming the transaction while the leader will roll it >> back on restart. Do you suggest there is a human intervention on >> restart as well? >> >> >>> | | | | | >>> |<---TXN Ok-----| | [TXN undo log | >>> | | | destroyed] | >>> | | | | | >>> | | | |---Confirm--->| >>> | | | | | >>> ``` >>> >>> The quorum should be collected as a table for a list of transactions >>> waiting for quorum. The latest transaction that collects the quorum is >>> considered as complete, as well as all transactions prior to it, since >>> all transactions should be applied in order. Leader writes a 'confirm' >>> message to the WAL that refers to the transaction's [LEADER_ID, LSN] and >>> the confirm has its own LSN. This confirm message is delivered to all >>> replicas through the existing replication mechanism. 
>>> >>> Replica should report a TXN application success to the leader via the >>> IPROTO explicitly to allow leader to collect the quorum for the TXN. >>> In case of application failure the replica has to disconnect from the >>> replication the same way as it is done now. The replica also has to >>> report its disconnection to the orchestrator. Further actions require >>> human intervention, since failure means either technical problem (such >>> as not enough space for WAL) that has to be resovled or an inconsistent >>> state that requires rejoin. >> >>> As soon as leader appears in a situation it has not enough replicas >>> to achieve quorum, the cluster should stop accepting any requests - both >>> write and read. >> >> How does *the cluster* know the state of the leader and if it >> doesn't, how it can possibly implement this? Did you mean >> the leader should stop accepting transactions here? But how can >> the leader know if it has not enough replicas during a read >> transaction, if it doesn't contact any replica to serve a read? > > I expect to have a disconnection trigger assigned to all relays so that > disconnection will cause the number of replicas decrease. The quorum > size is static, so we can stop at the very moment the number dives below. This is a very dubious statement. In TCP disconnect may be detected much later, than it happened. So to collect a quorum on something you need to literally collect this quorum, with special WAL records, via network, and all. A disconnect trigger does not help at all here. Talking of the whole 'read-quorum' idea, I don't like it. Because this really makes things unbearably harder to implement, the nodes become much slower and less available in terms of any problems. I think reads should be allowed always, and from any node (except during bootstrap, of course). After all, you have transactions for consistency. So as far as replication respects transaction boundaries, every node is in a consistent state. Maybe not all of them are in the same state, but every one is consistent. Honestly, I can't even imagine, how is it possible to implement a completely synchronous simultaneous cluster progression. It is impossible even in theory. There always will be a time period, when some nodes are further than the others. At least because of network delays. So either we allows reads from master only, or we allow reads from everywhere, and in that case nothing will save from a possibility of seeing different data on different nodes. ^ permalink raw reply [flat|nested] 53+ messages in thread
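As a side note to the disconnect-trigger argument above: the closest thing available today is polling box.info.replication, sketched below. It only tells the leader how many peers currently *look* alive, and, as this message points out, TCP may report a dead peer as alive for a long time, so it is an optimistic estimate rather than a quorum guarantee. The quorum value is a placeholder and the downstream.status field is assumed to be available.

```lua
-- Sketch: an optimistic count of replicas the leader believes are following.
-- This does NOT prove a quorum -- only replication acks can do that.
local function followers()
    local n = 0
    for _, r in pairs(box.info.replication) do
        if r.downstream ~= nil and r.downstream.status == 'follow' then
            n = n + 1
        end
    end
    return n
end

local quorum = 3  -- placeholder; the leader's own WAL write counts as +1
if followers() + 1 < quorum then
    -- pessimistic path: stop accepting new synchronous transactions
    -- and report the situation to the orchestrator
end
```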
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-13 21:36 ` Vladislav Shpilevoy @ 2020-05-13 23:45 ` Konstantin Osipov 0 siblings, 0 replies; 53+ messages in thread From: Konstantin Osipov @ 2020-05-13 23:45 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/05/14 00:37]: > > Thanks for review! > > > >>> | | | | | > >>> | [Quorum | | | > >>> | achieved] | | | > >>> | | | | | > >>> | [TXN undo log | | | > >>> | destroyed] | | | > >>> | | | | | > >>> | |---Confirm--->| | | > >>> | | | | | > >> > >> What happens if writing Confirm to WAL fails? TXN und log record > >> is destroyed already. Will the server panic now on WAL failure, > >> even if it is intermittent? > > > > I would like to have an example of intermittent WAL failure. Can it be > > other than problem with disc - be it space/availability/malfunction? > > > > For all of those it should be resolved outside the DBMS anyways. So, > > leader should stop and report its problems to orchestrator/admins. > > > > I would agree that undo log can be destroyed *after* the Confirm is > > landed to WAL - same is for replica. > > Well, in fact you can't (or can you?). Because it won't help. Once you > tried to write 'Confirm', it means you got the quorum. So now in case > you will fail, a new leader will write 'Confirm' for you, when will see > a quorum too. So the current leader has no right to write 'Rollback' > from this moment, from what I understand. Because it still can be > confirmed by a new leader later, if you fail before 'Rollback' is > replicated to all. > > However the same problem appears, if you write 'Confirm' *successfully*. > Still the leader can fail, and a newer leader will write 'Rollback' if > won't collect the quorum again. Don't know what to do with that really. > Probably nothing. Maybe consult with the raft spec? The new leader is guaranteed to see the transaction since it has reached the majority of replicas. So it will definitely write "confirm" for it. The reason I asked the question is I want the case of intermittent failures be described in the spec. For example, is "confirm" a cbus message, then if there is a cascading rollback of the batch it is part of, it can be rolled back. I would like to see all these scenarios covered in the spec. If one of them ends with panic, I would like to understand how the external coordinator is going to resolve the new election. Raft has answers for all of it. > >>> As soon as leader appears in a situation it has not enough replicas > >>> to achieve quorum, the cluster should stop accepting any requests - both > >>> write and read. > >> > >> How does *the cluster* know the state of the leader and if it > >> doesn't, how it can possibly implement this? Did you mean > >> the leader should stop accepting transactions here? But how can > >> the leader know if it has not enough replicas during a read > >> transaction, if it doesn't contact any replica to serve a read? > > > > I expect to have a disconnection trigger assigned to all relays so that > > disconnection will cause the number of replicas decrease. The quorum > > size is static, so we can stop at the very moment the number dives below. > > This is a very dubious statement. In TCP disconnect may be detected much > later, than it happened. So to collect a quorum on something you need to > literally collect this quorum, with special WAL records, via network, and > all. A disconnect trigger does not help at all here. Erhm, thanks. 
> Talking of the whole 'read-quorum' idea, I don't like it. Because this > really makes things unbearably harder to implement, the nodes become > much slower and less available in terms of any problems. > > I think reads should be allowed always, and from any node (except during > bootstrap, of course). After all, you have transactions for consistency. So > as far as replication respects transaction boundaries, every node is in a > consistent state. Maybe not all of them are in the same state, but every one > is consistent. In memtx, you read by default dirty, uncommitted data. It was OK for single-node transactions, since the only chance for it to be rolled back were out of space/disk failure, which were extremely rare, now you really read dirty stuff, because you can easily have it rolled back because of lack of quorum or re-election. So it's a much bigger deal. > Honestly, I can't even imagine, how is it possible to implement a completely > synchronous simultaneous cluster progression. It is impossible even in theory. > There always will be a time period, when some nodes are further than the > others. At least because of network delays. > > So either we allows reads from master only, or we allow reads from everywhere, > and in that case nothing will save from a possibility of seeing different > data on different nodes. This is why there are many consistency models out there (just google consistency models in distributed systems), and the minor details are important. It's indeed hard to implement the strictest model (serial), but it is also often unnecessary, and there is consensus in the relational databases what issues are acceptable and what are not. More specifically, I think for tarantool sync replication we should aim at read committed. The spec should say it in no uncertain terms and explain how it is achieved. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-30 14:50 ` Sergey Ostanevich 2020-05-06 8:52 ` Konstantin Osipov @ 2020-05-06 18:55 ` Konstantin Osipov 2020-05-06 19:10 ` Konstantin Osipov 2020-05-13 21:42 ` Vladislav Shpilevoy 2020-05-07 23:01 ` Konstantin Osipov 2 siblings, 2 replies; 53+ messages in thread From: Konstantin Osipov @ 2020-05-06 18:55 UTC (permalink / raw) To: Sergey Ostanevich; +Cc: tarantool-patches, Vladislav Shpilevoy * Sergey Ostanevich <sergos@tarantool.org> [20/04/30 17:51]: A few more issues: - the spec assumes there is a full mesh. In any other topology electing a leader based on the longest wal can easily deadlock. Yet it provides no protection against non-full-mesh setups. Currently the server can't even detect that this is not a full-mesh setup, so can't check if the precondition for this to work correctly is met. - the spec assumes that quorum is identical to the number of replicas, and the number of replicas is stable across cluster life time. Can I have quorum=2 while the number of replicas is 4? Am I allowed to increase the number of replicas online? What happens when a replica is added, how exactly and starting from which transaction is the leader required to collect a bigger quorum? - the same goes for removing a replica. How is the quorum reduced? -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-06 18:55 ` Konstantin Osipov @ 2020-05-06 19:10 ` Konstantin Osipov 2020-05-12 16:03 ` Sergey Ostanevich 2020-05-13 21:42 ` Vladislav Shpilevoy 1 sibling, 1 reply; 53+ messages in thread From: Konstantin Osipov @ 2020-05-06 19:10 UTC (permalink / raw) To: Sergey Ostanevich, Vladislav Shpilevoy, tarantool-patches * Konstantin Osipov <kostja.osipov@gmail.com> [20/05/06 21:55]: > A few more issues: > > - the spec assumes there is a full mesh. In any other > topology electing a leader based on the longest wal can easily > deadlock. Yet it provides no protection against non-full-mesh > setups. Currently the server can't even detect that this is not > a full-mesh setup, so can't check if the precondition for this > to work correctly is met. Come to think of it, it's a special case of network partitioning. A replica with the longest WAL can be reachable by the external coordinator but partitioned away from the majority, so never able to make progress. -- Konstantin Osipov, Moscow, Russia https://scylladb.com ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-06 19:10 ` Konstantin Osipov @ 2020-05-12 16:03 ` Sergey Ostanevich 0 siblings, 0 replies; 53+ messages in thread From: Sergey Ostanevich @ 2020-05-12 16:03 UTC (permalink / raw) To: Konstantin Osipov, Vladislav Shpilevoy, tarantool-patches On 06 May 22:10, Konstantin Osipov wrote: > * Konstantin Osipov <kostja.osipov@gmail.com> [20/05/06 21:55]: > > A few more issues: > > > > - the spec assumes there is a full mesh. In any other > > topology electing a leader based on the longest wal can easily > > deadlock. Yet it provides no protection against non-full-mesh > > setups. Currently the server can't even detect that this is not > > a full-mesh setup, so can't check if the precondition for this > > to work correctly is met. > > Come to think of it, it's a special case of network partitioning. > A replica with the longest WAL can be reachable by the external > coordinator but partitioned away from the majority, so never able to > make progress. So the answer from this replica on its appointment will be 'I have no quorum'. Hence, the orchestration should pick the replica with the next-longest WAL. What's the problem? > > > -- > Konstantin Osipov, Moscow, Russia > https://scylladb.com ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-06 18:55 ` Konstantin Osipov 2020-05-06 19:10 ` Konstantin Osipov @ 2020-05-13 21:42 ` Vladislav Shpilevoy 2020-05-14 0:05 ` Konstantin Osipov 1 sibling, 1 reply; 53+ messages in thread From: Vladislav Shpilevoy @ 2020-05-13 21:42 UTC (permalink / raw) To: Konstantin Osipov, Sergey Ostanevich, tarantool-patches Thanks for the discussion! On 06/05/2020 20:55, Konstantin Osipov wrote: > * Sergey Ostanevich <sergos@tarantool.org> [20/04/30 17:51]: > > A few more issues: > > - the spec assumes there is a full mesh. In any other > topology electing a leader based on the longest wal can easily > deadlock. Yet it provides no protection against non-full-mesh > setups. Currently the server can't even detect that this is not > a full-mesh setup, so can't check if the precondition for this > to work correctly is met. Yes, this is a very unstable construction. But we failed to come up with a solution right now, which would protect against accidental non-fullmesh. For example, how will it work, when I add a new node? If non-fullmesh is forbidden, the new node just can't be added ever, because this can't be done on all nodes simultaneously. > - the spec assumes that quorum is identical to the > number of replicas, and the number of replicas is stable across > cluster life time. Can I have quorum=2 while the number of > replicas is 4? Am I allowed to increase the number of replicas > online? What happens when a replica is added, > how exactly and starting from which transaction is the leader > required to collect a bigger quorum? Quorum <= number of replicas. It is a parameter, just like replication_connect_quorum. I think you are allowed to add new replicas. When a replica is added, it goes through the normal join process. > - the same goes for removing a replica. How is the quorum reduced? Node is just removed, I guess. If total number of nodes becomes less than quorum, obviously no transactions will be served. However what to do with the existing pending transactions, which already accounted the removed replica in their quorums? Should they be decremented? All what I am talking here are guesses. Which should be clarified in the RFC in the ideal world, of course. Tbh, we discussed the sync replication for may hours in voice, and this is a surprise, that all of them fit into such a small update of the RFC. Even though it didn't fit. Since we obviously still didn't clarify many things. Especially exact API look. ^ permalink raw reply [flat|nested] 53+ messages in thread
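For the "node is just removed" case above, the only mechanism that exists today is deleting the replica's record from the _cluster system space, sketched below. The sketch deliberately does not answer the question raised in this sub-thread: how quorums that already counted the removed replica are recomputed.

```lua
-- Sketch: removing a replica from the cluster today means dropping its
-- _cluster record (tuples are {id, uuid}). Pending synchronous
-- transactions that already counted this replica are not touched --
-- that is exactly the open question here.
local function remove_replica(uuid)
    for _, t in box.space._cluster:pairs() do
        if t[2] == uuid then
            box.space._cluster:delete(t[1])
            return true
        end
    end
    return false
end
```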
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-13 21:42 ` Vladislav Shpilevoy @ 2020-05-14 0:05 ` Konstantin Osipov 0 siblings, 0 replies; 53+ messages in thread From: Konstantin Osipov @ 2020-05-14 0:05 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/05/14 00:47]: > > A few more issues: > > > > - the spec assumes there is a full mesh. In any other > > topology electing a leader based on the longest wal can easily > > deadlock. Yet it provides no protection against non-full-mesh > > setups. Currently the server can't even detect that this is not > > a full-mesh setup, so can't check if the precondition for this > > to work correctly is met. > > Yes, this is a very unstable construction. But we failed to come up > with a solution right now, which would protect against accidental > non-fullmesh. For example, how will it work, when I add a new node? > If non-fullmesh is forbidden, the new node just can't be added ever, > because this can't be done on all nodes simultaneously. Again the answer is present in the raft spec. The node is added in two steps, first steps commits the "add node" event to the durable state of the entire group, the second step (which is also a raft transaction) enacts the new node. This could be achieved in more or less straightforward manner if _cluster is a sync table with replication group = all members of the cluster. But as I said, I can't imagine this is possible with an external coordinator, since it may not be available during boot. Regarding detecting the full mesh, remember the task I created for using swim to discover members and bring non-full-mesh setups to full-mesh automatically? Is the reason for this task to exist clear now? Is it clear now why I asked you (multiple times) to begin working on sync replication by adding built-in swim instances on every replica and using them, instead of the current replication heartbeats, for failure detection? I believe there was a task somewhere for it, too. > > - the spec assumes that quorum is identical to the > > number of replicas, and the number of replicas is stable across > > cluster life time. Can I have quorum=2 while the number of > > replicas is 4? Am I allowed to increase the number of replicas > > online? What happens when a replica is added, > > how exactly and starting from which transaction is the leader > > required to collect a bigger quorum? > > Quorum <= number of replicas. It is a parameter, just like > replication_connect_quorum. I wrote in a comment to the task that it'd be even better if we list node uuids as group members, and assign group to space explicitly, so that it's not just ## of replicas, but specific replicas identified by their uuids. The thing is, it's vague in the spec. The spec has to be explicit about all box.schema API changes, because they will define legacy that will be hard to deal with later. > I think you are allowed to add new replicas. When a replica is added, > it goes through the normal join process. At what point is joins the group and can ACK, i.e. become part of a quorum? That's the question I wanted to be written down explicitly in this document. RAFT has an answer for it. > > - the same goes for removing a replica. How is the quorum reduced? > > Node is just removed, I guess. If total number of nodes becomes less > than quorum, obviously no transactions will be served. Other vendors support 3 different scenarios here: - it can be down for maintenance. 
In our turns, it means it is simply shut down, without changes to _cluster or space settings - it can be removed forever, in that case an admin may want to reduce the quorum size. - it can be replaced. with box.schema.group API all 3 cases can be translated to API calls on the group itself. e.g. it would be possible to say box.schema.group.groupname.remove(uuid) box.schema.group.groupname.replace(old_uuid, new_uuid). We don't need to implement it right away, but we must provision for these operations in the spec, and at least have a clue how they will be handled in the future. > However what to do with the existing pending transactions, which > already accounted the removed replica in their quorums? Should they be > decremented? > > All what I am talking here are guesses. Which should be clarified in the > RFC in the ideal world, of course. > > Tbh, we discussed the sync replication for may hours in voice, and this > is a surprise, that all of them fit into such a small update of the RFC. > Even though it didn't fit. Since we obviously still didn't clarify many > things. Especially exact API look. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 53+ messages in thread
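The box.schema.group API above is only a thought experiment, so the sketch below just gives the proposed calls a concrete shape; none of these functions exist in Tarantool, and the group name, member list, and UUIDs are made up.

```lua
-- Hypothetical API shape from the message above -- not an existing API.
-- Members are listed by UUID and the quorum belongs to the group, not to
-- the cluster as a whole.
box.schema.group.create('sync_group', {
    members = {
        'aaaaaaaa-0000-0000-0000-000000000001',
        'aaaaaaaa-0000-0000-0000-000000000002',
        'aaaaaaaa-0000-0000-0000-000000000003',
    },
    quorum = 2,
})

-- The three maintenance scenarios listed above would map to:
-- 1) down for maintenance: no call at all, the member stays listed;
-- 2) removed forever (the admin may also lower the quorum):
box.schema.group.sync_group.remove('aaaaaaaa-0000-0000-0000-000000000003')
-- 3) replaced by another instance:
box.schema.group.sync_group.replace('aaaaaaaa-0000-0000-0000-000000000002',
                                    'aaaaaaaa-0000-0000-0000-000000000004')
```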
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-04-30 14:50 ` Sergey Ostanevich 2020-05-06 8:52 ` Konstantin Osipov 2020-05-06 18:55 ` Konstantin Osipov @ 2020-05-07 23:01 ` Konstantin Osipov 2020-05-12 16:40 ` Sergey Ostanevich 2 siblings, 1 reply; 53+ messages in thread From: Konstantin Osipov @ 2020-05-07 23:01 UTC (permalink / raw) To: Sergey Ostanevich; +Cc: tarantool-patches, Vladislav Shpilevoy > ### Synchronous replication enabling. > > Synchronous operation can be required for a set of spaces in the data > scheme. That means only transactions that contain data modification for > these spaces should require quorum. Such transactions named synchronous. > As soon as last operation of synchronous transaction appeared in leader's > WAL, it will cause all following transactions - no matter if they are > synchronous or not - wait for the quorum. In case quorum is not achieved > the 'rollback' operation will cause rollback of all transactions after > the synchronous one. It will ensure the consistent state of the data both > on leader and replicas. In case user doesn't require synchronous operation > for any space then no changes to the WAL generation and replication will > appear. 1) It's unclear what happens here if async tx follows a sync tx. Does it wait for the sync tx? This reduces availability for async txs - so it's hardly acceptable. Besides, with group=local spaces, one can quickly run out of memory for undo. Then it should be allowed to proceed and commit. Then mixing sync and async tables in a single transaction shouldn't be allowed. Imagine t1 is sync and t2 is async. tx1 changes t1 and t2, tx2 changes t2. tx1 is not confirmed and must be rolled back. But it can not revert changes of tx2. The spec should clarify that. 2) First candidates to "sync" spaces are system spaces, especially _schema (to fix box.once()) and _cluster (to fix parallel join of multiple replicas). I can't imagine it's possible to make system spaces synchronous with an external coordinator - the coordinator may not be available during box.cfg{}. 3) One can quickly run out of memory for undo. Any sync transaction should be capped with a timeout to avoid OOMs. I don't know how many times I should repeat it. The only good solution for load control is in-memory WAL, which will allow to rollback all transactions as soon as network partitioning is detected. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 53+ messages in thread
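The t1/t2 scenario above is easier to follow spelled out. In the sketch below t1 is assumed to be a synchronous space and t2 an asynchronous one (how a space is marked synchronous is exactly what the RFC leaves to the cluster description), and the premise of this message is taken: async-only transactions commit without waiting for any quorum.

```lua
-- Sketch of the interleaving described above. Spaces t1 (sync) and t2
-- (async) are hypothetical.

-- tx1 mixes a sync and an async space, so it must wait for the quorum:
box.begin()
box.space.t1:replace{1, 'needs quorum'}
box.space.t2:replace{1, 'written by tx1'}
box.commit()

-- tx2 touches only the async space and, under the premise of this
-- message, commits immediately, possibly on top of tx1's t2 row:
box.begin()
box.space.t2:update({1}, {{'=', 2, 'overwritten by tx2'}})
box.commit()

-- If tx1 never collects its quorum, rolling it back cannot cleanly revert
-- the t2 row without also destroying tx2's committed change -- hence the
-- suggestion to forbid mixing sync and async spaces in one transaction.
```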
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-07 23:01 ` Konstantin Osipov @ 2020-05-12 16:40 ` Sergey Ostanevich 2020-05-12 17:47 ` Konstantin Osipov 0 siblings, 1 reply; 53+ messages in thread From: Sergey Ostanevich @ 2020-05-12 16:40 UTC (permalink / raw) To: Konstantin Osipov, Vladislav Shpilevoy, tarantool-patches On 08 мая 02:01, Konstantin Osipov wrote: > > > > ### Synchronous replication enabling. > > > > Synchronous operation can be required for a set of spaces in the data > > scheme. That means only transactions that contain data modification for > > these spaces should require quorum. Such transactions named synchronous. > > As soon as last operation of synchronous transaction appeared in leader's > > WAL, it will cause all following transactions - no matter if they are > > synchronous or not - wait for the quorum. In case quorum is not achieved > > the 'rollback' operation will cause rollback of all transactions after > > the synchronous one. It will ensure the consistent state of the data both > > on leader and replicas. In case user doesn't require synchronous operation > > for any space then no changes to the WAL generation and replication will > > appear. > > 1) It's unclear what happens here if async tx follows a sync tx. > Does it wait for the sync tx? This reduces availability for Definitely yes, unless we keep the 'dirty read' as it is at the moment in memtx. This is the essence of the design, and it is temporary until the MVCC similar to the vinyl machinery appears. I intentionally didn't include this big task into this RFC. It will provide similar capabilities, although it will keep only dependent transactions in the undo log. Also, it looks like it will fit well into the machinery of this RFC. > async txs - so it's hardly acceptable. Besides, with > group=local spaces, one can quickly run out of memory for undo. > > > Then it should be allowed to proceed and commit. > > Then mixing sync and async tables in a single transaction > shouldn't be allowed. > > Imagine t1 is sync and t2 is async. tx1 changes t1 and t2, tx2 > changes t2. tx1 is not confirmed and must be rolled back. But it can > not revert changes of tx2. > > The spec should clarify that. > > 2) First candidates to "sync" spaces are system spaces, especially > _schema (to fix box.once()) and _cluster (to fix parallel join > of multiple replicas). > > I can't imagine it's possible to make system spaces synchronous > with an external coordinator - the coordinator may not be > available during box.cfg{}. May not be - means no coordination, means the server can't start. Again, we're not trying to elaborate the self-driven cluster at this moment, we rely on external coonrdination. > > 3) One can quickly run out of memory for undo. Any sync > transaction should be capped with a timeout to avoid OOMs. I > don't know how many times I should repeat it. The only good > solution for load control is in-memory WAL, which will allow to > rollback all transactions as soon as network partitioning is > detected. How in-memry WAL can help save on _undo_ memory? To rollback whatever amount of transactions one need to store the undo. > > -- > Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-12 16:40 ` Sergey Ostanevich @ 2020-05-12 17:47 ` Konstantin Osipov 2020-05-13 21:34 ` Vladislav Shpilevoy 0 siblings, 1 reply; 53+ messages in thread From: Konstantin Osipov @ 2020-05-12 17:47 UTC (permalink / raw) To: Sergey Ostanevich; +Cc: tarantool-patches, Vladislav Shpilevoy * Sergey Ostanevich <sergos@tarantool.org> [20/05/12 19:43]: > > 1) It's unclear what happens here if async tx follows a sync tx. > > Does it wait for the sync tx? This reduces availability for > > Definitely yes, unless we keep the 'dirty read' as it is at the moment > in memtx. This is the essence of the design, and it is temporary until > the MVCC similar to the vinyl machinery appears. I intentionally didn't > include this big task into this RFC. > > It will provide similar capabilities, although it will keep only > dependent transactions in the undo log. Also, it looks like it will fit > well into the machinery of this RFC. = reduced availability for all who have at least one sync space. If different spaces have different quorum size = quorum size of the biggest group is effectively used for all spaces. Replica-local transactions, e.g. those used by vinyl compaction, are rolled back if there is no quorum. What's the value of this? > > > async txs - so it's hardly acceptable. Besides, with > > group=local spaces, one can quickly run out of memory for undo. > > > > > > Then it should be allowed to proceed and commit. > > > > Then mixing sync and async tables in a single transaction > > shouldn't be allowed. > > > > Imagine t1 is sync and t2 is async. tx1 changes t1 and t2, tx2 > > changes t2. tx1 is not confirmed and must be rolled back. But it can > > not revert changes of tx2. > > > > The spec should clarify that. You conveniently skip this explanation of the problem - meaning you don't intend to address it? > > > > 3) One can quickly run out of memory for undo. Any sync > > transaction should be capped with a timeout to avoid OOMs. I > > don't know how many times I should repeat it. The only good > > solution for load control is in-memory WAL, which will allow to > > rollback all transactions as soon as network partitioning is > > detected. > > How in-memry WAL can help save on _undo_ memory? > To rollback whatever amount of transactions one need to store the undo. I wrote earlier that it works as a natural failure detector and throttling mechanism. If there is no quorum, we can see it immediately by looking at the number of active subscribers of the in-memory WAL, so do not accumulate undo. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-12 17:47 ` Konstantin Osipov @ 2020-05-13 21:34 ` Vladislav Shpilevoy 2020-05-13 23:31 ` Konstantin Osipov 0 siblings, 1 reply; 53+ messages in thread From: Vladislav Shpilevoy @ 2020-05-13 21:34 UTC (permalink / raw) To: Konstantin Osipov, Sergey Ostanevich, tarantool-patches Thanks for the discussion! On 12/05/2020 19:47, Konstantin Osipov wrote: > * Sergey Ostanevich <sergos@tarantool.org> [20/05/12 19:43]: > >>> 1) It's unclear what happens here if async tx follows a sync tx. >>> Does it wait for the sync tx? This reduces availability for >> >> Definitely yes, unless we keep the 'dirty read' as it is at the moment >> in memtx. This is the essence of the design, and it is temporary until >> the MVCC similar to the vinyl machinery appears. I intentionally didn't >> include this big task into this RFC. >> >> It will provide similar capabilities, although it will keep only >> dependent transactions in the undo log. Also, it looks like it will fit >> well into the machinery of this RFC. > > = reduced availability for all who have at least one sync space. > > If different spaces have different quorum size = quorum size of > the biggest group is effectively used for all spaces. > > Replica-local transactions, e.g. those used by vinyl compaction, > are rolled back if there is no quorum. > > What's the value of this? There is an example when it leaves the database in an inconsistent state, when half of a transaction is applied. I don't know why Sergey didn't add it. I propose to him to extend the RFC with these examples. Since you are not the first person, who finds this strange and wrong. So clearly the RFC still does not explain this moment diligently enough. >>> async txs - so it's hardly acceptable. Besides, with >>> group=local spaces, one can quickly run out of memory for undo. >>> >>> >>> 3) One can quickly run out of memory for undo. Any sync >>> transaction should be capped with a timeout to avoid OOMs. I >>> don't know how many times I should repeat it. The only good >>> solution for load control is in-memory WAL, which will allow to >>> rollback all transactions as soon as network partitioning is >>> detected. >> >> How in-memry WAL can help save on _undo_ memory? >> To rollback whatever amount of transactions one need to store the undo. > > I wrote earlier that it works as a natural failure detector and > throttling mechanism. If > there is no quorum, we can see it immediately by looking at the > number of active subscribers of the in-memory WAL, so do not > accumulate undo. Here we go again ... Talking of throttling. Without in-memory WAL no need for throttling. All is 'slow' by design already, as you think. Talking of failure detection - what??? I don't get it. This is something new. With in-memory relay or without you anyway can see if there is a quorum. This is a matter of API of replication and transaction modules, and their interaction with each other, solved by txn_limbo in my branch. But still, I don't see how knowing number of subscribers helps with the quorum. Subscriber presence does not add to quorums by itself. Anyway every transaction needs to be replicated before you can say that its quorum got +1 replica ack. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication 2020-05-13 21:34 ` Vladislav Shpilevoy @ 2020-05-13 23:31 ` Konstantin Osipov 0 siblings, 0 replies; 53+ messages in thread From: Konstantin Osipov @ 2020-05-13 23:31 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/05/14 00:37]: > >>> 3) One can quickly run out of memory for undo. Any sync > >>> transaction should be capped with a timeout to avoid OOMs. I > >>> don't know how many times I should repeat it. The only good > >>> solution for load control is in-memory WAL, which will allow to > >>> rollback all transactions as soon as network partitioning is > >>> detected. > >> > >> How in-memry WAL can help save on _undo_ memory? > >> To rollback whatever amount of transactions one need to store the undo. > > > > I wrote earlier that it works as a natural failure detector and > > throttling mechanism. If > > there is no quorum, we can see it immediately by looking at the > > number of active subscribers of the in-memory WAL, so do not > > accumulate undo. > > Here we go again ... > > Talking of throttling. Without in-memory WAL no need for throttling. All is > 'slow' by design already, as you think. What is the limit for transactions in txn_limbo list? How does this limit work? What about the fibers, which are pinned as long as the transaction is not committed? > > Talking of failure detection - what??? I don't get it. This is something new. > With in-memory relay or without you anyway can see if there is a quorum. How do you "see" it? You write to the WAL and wait for acks. You could add a wait timeout, and assume there is no quorum if there are no acks within the timeout. This is not the best strategy, but there is no other. The spec doesn't say even that, it simply says that somehow lack of quorum is detected, but how it is detected is not clear. With in-memory WAL you can afford to wait longer if you have space in the ring buffer, and you know immediately if you shouldn't wait because you see that the ring buffer is full and the majority of subscribers are behind the start of the buffer. > This is a matter of API of replication and transaction modules, and their > interaction with each other, solved by txn_limbo in my branch. How is it "solved"? > But still, I don't see how knowing number of subscribers helps with the > quorum. Subscriber presence does not add to quorums by itself. Anyway every > transaction needs to be replicated before you can say that its quorum got > +1 replica ack. It helps to see quickly absence of the quorum, not presence of it. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 53+ messages in thread
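The "see absence of a quorum quickly" argument above can be sketched in plain Lua over a made-up ring-buffer descriptor; this is not Tarantool's in-memory relay, just the shape of the check being described.

```lua
-- Sketch: early detection of a hopeless quorum. 'buf' is a hypothetical
-- in-memory WAL ring buffer descriptor, 'positions' maps each subscribed
-- replica to the last position it has consumed.
local function quorum_clearly_absent(buf, positions, quorum)
    if not buf.full then
        return false  -- still room in the buffer, keep waiting
    end
    -- Once the buffer is full, only subscribers that have not fallen
    -- behind its first retained position can still ack from memory.
    local in_buffer = 0
    for _, pos in pairs(positions) do
        if pos >= buf.first_pos then
            in_buffer = in_buffer + 1
        end
    end
    return in_buffer + 1 < quorum  -- +1 is the leader's own WAL write
end

-- Example with made-up numbers: the buffer starts at 100, both replicas lag.
print(quorum_clearly_absent({full = true, first_pos = 100},
                            {replica2 = 95, replica3 = 40}, 3))  -- true
```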