From: Sergey Bronnikov <sergeyb@tarantool.org> To: Sergey Ostanevich <sergos@tarantool.org> Cc: tarantool-patches@dev.tarantool.org Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication Date: Tue, 14 Apr 2020 15:58:48 +0300 [thread overview] Message-ID: <20200414125848.GA1249@pony.bronevichok.ru> (raw) In-Reply-To: <20200403210836.GB18283@tarantool.org> Hi, see 5 comments inline On 00:08 Sat 04 Apr , Sergey Ostanevich wrote: > > * **Status**: In progress > * **Start date**: 31-03-2020 > * **Authors**: Sergey Ostanevich @sergos \<sergos@tarantool.org\> > * **Issues**: 1. Just for convenience, please add https://github.com/tarantool/tarantool/issues/4842 > ## Summary > > The aim of this RFC is to address the following list of problems > formulated at MRG planning meeting: > - protocol backward compatibility to enable cluster upgrade w/o > downtime > - consistency of data on replica and leader > - switch from leader to replica without data loss > - up to date replicas to run read-only requests > - ability to switch async replicas into sync ones 2. Ability to switch async replicas into sync ones and vice-versa? Or not? > - guarantee of rollback on leader and sync replicas > - simplicity of cluster orchestration > > What this RFC is not: > > - high availability (HA) solution with automated failover, roles > assignments an so on > - master-master configuration support > > > ## Background and motivation > > There are number of known implemenatation of consistent data presence in > a cluster. They can be commonly named as "wait for LSN" technique. The > biggest issue with this technique is the abscence of rollback gauarantees 3. typo: gauarantees -> guarantees > at replica in case of transaction failure on one master or some of the > replics in the cluster. 4. typo: replics -> replicas > > To provide such capabilities a new functionality should be introduced in > Tarantool core, with limitation mentioned before - backward compatilibity > and ease of cluster orchestration. 5. but there is nothing mentioned before about these limitations. > ## Detailed design > > ### Quorum commit > The main idea behind the proposal is to reuse existent machinery as much > as possible. It will ensure the well-tested and proven functionality > across many instances in MRG and beyond is used. The transaction rollback > mechanism is in place and works for WAL write failure. If we substitute > the WAL success with a new situation which is named 'quorum' later in > this document then no changes to the machinery is needed. The same is > true for snapshot machinery that allows to create a copy of the database > in memory for the whole period of snapshot file write. Adding quorum here > also minimizes changes. > > Currently replication represented by the following scheme: > ``` > Customer Leader WAL(L) Replica WAL(R) > |------TXN----->| | | | > | | | | | > | [TXN Rollback | | | > | created] | | | > | | | | | > | |-----TXN----->| | | > | | | | | > | |<---WAL Ok----| | | > | | | | | > | [TXN Rollback | | | > | destroyed] | | | > | | | | | > |<----TXN Ok----| | | | > | |-------Replicate TXN------->| | > | | | | | > | | | [TXN Rollback | > | | | created] | > | | | | | > | | | |-----TXN----->| > | | | | | > | | | |<---WAL Ok----| > | | | | | > | | | [TXN Rollback | > | | | destroyed] | > | | | | | > ``` > > > To introduce the 'quorum' we have to receive confirmation from replicas > to make a decision on whether the quorum is actually present. Leader > collects necessary amount of replicas confirmation plus its own WAL > success. This state is named 'quorum' and gives leader the right to > complete the customers' request. So the picture will change to: > ``` > Customer Leader WAL(L) Replica WAL(R) > |------TXN----->| | | | > | | | | | > | [TXN Rollback | | | > | created] | | | > | | | | | > | |-----TXN----->| | | > | | | | | > | |-------Replicate TXN------->| | > | | | | | > | | | [TXN Rollback | > | |<---WAL Ok----| created] | > | | | | | > | [Waiting | |-----TXN----->| > | of a quorum] | | | > | | | |<---WAL Ok----| > | | | | | > | |<------Replication Ok-------| | > | | | | | > | [Quorum | | | > | achieved] | | | > | | | | | > | [TXN Rollback | | | > | destroyed] | | | > | | | | | > | |----Quorum--->| | | > | | | | | > | |-----------Quorum---------->| | > | | | | | > |<---TXN Ok-----| | [TXN Rollback | > | | | destroyed] | > | | | | | > | | | |----Quorum--->| > | | | | | > ``` > > The quorum should be collected as a table for a list of transactions > waiting for quorum. The latest transaction that collects the quorum is > considered as complete, as well as all transactions prior to it, since > all transactions should be applied in order. Leader writes a 'quorum' > message to the WAL and it is delivered to Replicas. > > Replica should report a positive or a negative result of the TXN to the > Leader via the IPROTO explicitly to allow Leader to collect the quorum > or anti-quorum for the TXN. In case negative result for the TXN received > from minor number of Replicas, then Leader has to send an error message > to each Replica, which in turn has to disconnect from the replication > the same way as it is done now in case of conflict. > > In case Leader receives enough error messages to do not achieve the > quorum it should write the 'rollback' message in the WAL. After that > Leader and Replicas will perform the rollback for all TXN that didn't > receive quorum. > > ### Recovery and failover. > > Tarantool instance during reading WAL should postpone the commit until > the quorum is read. In case the WAL eof is achieved, the instance should > keep rollback for all transactions that are waiting for a quorum entry > until the role of the instance is set. In case this instance become a > Replica there are no additional actions needed, sine all info about > quorum/rollback will arrive via replication. In case this instance is > assigned a Leader role, it should write 'rollback' in its WAL and > perform rollback for all transactions waiting for a quorum. > > In case of a Leader failure a Replica with the biggest LSN with former > leader's ID is elected as a new leader. The replica should record > 'rollback' in its WAL which effectively means that all transactions > without quorum should be rolled back. This rollback will be delivered to > all replicas and they will perform rollbacks of all transactions waiting > for quorum. > > ### Snapshot generation. > > We also can reuse current machinery of snapshot generation. Upon > receiving a request to create a snapshot an instance should request a > readview for the current commit operation. Although start of the > snapshot generation should be postponed until this commit operation > receives its quorum. In case operation is rolled back, the snapshot > generation should be aborted and restarted using current transaction > after rollback is complete. > > After snapshot is created the WAL should start from the first operation > that follows the commit operation snapshot is generated for. That means > WAL will contain a quorum message that refers to a transaction that is > not present in the WAL. Apparently, we have to allow this for the case > quorum refers to a transaction with LSN less than the first entry in the > WAL and only once. > > ### Asynchronous replication. > > Along with synchronous Replicas the cluster can contain asynchronous > Replicas. That means async Replica doesn't reply to the Leader with > errors since they're not contributing into quorum. Still, async > Replicas have to follow the new WAL operation, such as keep rollback > info until 'quorum' message is received. This is essential for the case > of 'rollback' message appearance in the WAL. This message assumes > Replica is able to perform all necessary rollback by itself. Cluster > information should contain explicit notification of each Replica > operation mode. > > ### Synchronous replication enabling. > > Synchronous operation can be required for a set of spaces in the data > scheme. That means only transactions that contain data modification for > these spaces should require quorum. Such transactions named synchronous. > As soon as last operation of synchronous transaction appeared in Leader's > WAL, it will cause all following transactions - matter if they are > synchronous or not - wait for the quorum. In case quorum is not achieved > the 'rollback' operation will cause rollback of all transactions after > the synchronous one. It will ensure the consistent state of the data both > on Leader and Replicas. In case user doesn't require synchronous operation > for any space then no changes to the WAL generation and replication will > appear. > > Cluster description should contain explicit attribute for each Replica > to denote it participates in synchronous activities. Also the description > should contain criterion on how many Replicas responses are needed to > achieve the quorum. > > > ## Rationale and alternatives > > There is an implementation of synchronous replication as part of gh-980 > activities, still it is not in a state to get into the product. More > than that it intentionally breaks backward compatibility which is a > prerequisite for this proposal. -- sergeyb@
next prev parent reply other threads:[~2020-04-14 12:58 UTC|newest] Thread overview: 53+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-04-03 21:08 Sergey Ostanevich 2020-04-07 13:02 ` Aleksandr Lyapunov 2020-04-08 9:18 ` Sergey Ostanevich 2020-04-08 14:05 ` Konstantin Osipov 2020-04-08 15:06 ` Sergey Ostanevich 2020-04-14 12:58 ` Sergey Bronnikov [this message] 2020-04-14 14:43 ` Sergey Ostanevich 2020-04-15 11:09 ` sergos 2020-04-15 14:50 ` sergos 2020-04-16 7:13 ` Aleksandr Lyapunov 2020-04-17 10:10 ` Konstantin Osipov 2020-04-17 13:45 ` Sergey Ostanevich 2020-04-20 11:20 ` Serge Petrenko 2020-04-20 23:32 ` Vladislav Shpilevoy 2020-04-21 10:49 ` Sergey Ostanevich 2020-04-21 22:17 ` Vladislav Shpilevoy 2020-04-22 16:50 ` Sergey Ostanevich 2020-04-22 20:28 ` Vladislav Shpilevoy 2020-04-23 6:58 ` Konstantin Osipov 2020-04-23 9:14 ` Konstantin Osipov 2020-04-23 11:27 ` Sergey Ostanevich 2020-04-23 11:43 ` Konstantin Osipov 2020-04-23 15:11 ` Sergey Ostanevich 2020-04-23 20:39 ` Konstantin Osipov 2020-04-23 21:38 ` Vladislav Shpilevoy 2020-04-23 22:28 ` Konstantin Osipov 2020-04-30 14:50 ` Sergey Ostanevich 2020-05-06 8:52 ` Konstantin Osipov 2020-05-06 16:39 ` Sergey Ostanevich 2020-05-06 18:44 ` Konstantin Osipov 2020-05-12 15:55 ` Sergey Ostanevich 2020-05-12 16:42 ` Konstantin Osipov 2020-05-13 21:39 ` Vladislav Shpilevoy 2020-05-13 23:54 ` Konstantin Osipov 2020-05-14 20:38 ` Sergey Ostanevich 2020-05-20 20:59 ` Sergey Ostanevich 2020-05-25 23:41 ` Vladislav Shpilevoy 2020-05-27 21:17 ` Sergey Ostanevich 2020-06-09 16:19 ` Sergey Ostanevich 2020-06-11 15:17 ` Vladislav Shpilevoy 2020-06-12 20:31 ` Sergey Ostanevich 2020-05-13 21:36 ` Vladislav Shpilevoy 2020-05-13 23:45 ` Konstantin Osipov 2020-05-06 18:55 ` Konstantin Osipov 2020-05-06 19:10 ` Konstantin Osipov 2020-05-12 16:03 ` Sergey Ostanevich 2020-05-13 21:42 ` Vladislav Shpilevoy 2020-05-14 0:05 ` Konstantin Osipov 2020-05-07 23:01 ` Konstantin Osipov 2020-05-12 16:40 ` Sergey Ostanevich 2020-05-12 17:47 ` Konstantin Osipov 2020-05-13 21:34 ` Vladislav Shpilevoy 2020-05-13 23:31 ` Konstantin Osipov
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20200414125848.GA1249@pony.bronevichok.ru \ --to=sergeyb@tarantool.org \ --cc=sergos@tarantool.org \ --cc=tarantool-patches@dev.tarantool.org \ --subject='Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox