From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 3 Aug 2020 23:56:28 +0300
From: Konstantin Osipov
Message-ID: <20200803205628.GA387134@atlas>
In-Reply-To: <4E8CD3F1-803C-494A-AC81-B26968280E04@tarantool.org>
References: <4E8CD3F1-803C-494A-AC81-B26968280E04@tarantool.org>
Subject: Re: [Tarantool-patches] [RFC] Raft over qsync
List-Id: Tarantool development patches
To: sergos@tarantool.org
Cc: tarantool-patches@dev.tarantool.org

* sergos@tarantool.org [20/08/03 23:54]:

willemt/raft is a very naive implementation, I'd recommend looking at
etcd/raft instead.

> * **Status**: In progress
> * **Start date**: 27-07-2020
> * **Authors**: Sergey Ostanevich @sergos, Vladislav Shpilevoy @Gerold103,
>   Cyrill Gorcunov @cyrillos
> * **Issues**: https://github.com/tarantool/tarantool/issues/5202
>
> ## Summary
>
> Issue #4842 introduced quorum-based replication (qsync) into the Tarantool
> environment. To augment the synchronous replication with automated leader
> election and failover, we need to choose how to implement one of the
> available algorithms. Our preference is Raft, since it has good
> comprehensibility. We refer to https://raft.github.io/raft.pdf [1] further
> in the document.
>
> The biggest problem is to link together the master-master nature of log
> replication in Tarantool with the strong leader concept in Raft.
>
> ## Background and motivation
>
> Customers requested synchronous replication with automated failover as part
> of Tarantool development. These features also fit well with the Tarantool
> future we envision in MRG.
>
> ## Detailed design
>
> The qsync implementation allows users to run a Tarantool-based solution
> without any changes to their code base. If you have a database set up, it
> will continue running after the upgrade the same way as it did prior to
> version 2.5. You can start employing synchronous replication by introducing
> synchronous spaces in your schema, and after that qsync comes into play.
> There were no changes to the protocol, so you can mix 2.5 instances with
> older ones in both directions - replicate from new to old or vice versa -
> until you introduce the first synchronous space.
>
> The machinery under the hood obliges all instances to follow a new
> transaction flow: if a transaction touches a synchronous space, it requires
> a special command in the WAL - confirm. Since the obligation is an
> incremental requirement, we keep backward compatibility in both directions.
>
> We expect to take a similar approach with the Raft-based failover
> machinery, which means one can use the qsync replication without Raft
> enabled and implement their own failover mechanism. However, if Raft is
> enabled, then all instances in the cluster are obliged to follow the rules
> implied by Raft, such as ignoring log entries from a leader with a stale
> term number.
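
For illustration, the qsync usage described above boils down to roughly the
following - a minimal sketch assuming the 2.5-era options
replication_synchro_quorum / replication_synchro_timeout and the per-space
is_sync flag; the names and values here are illustrative, not a definitive
interface:

```lua
-- Minimal sketch: one instance of a 3-node replica set with qsync enabled.
box.cfg{
    listen = 3301,
    replication_synchro_quorum  = 2, -- majority of a 3-node replica set
    replication_synchro_timeout = 5, -- seconds to wait for the quorum
}

-- Ordinary spaces keep working exactly as they did before the upgrade.
local plain = box.schema.space.create('plain', {if_not_exists = true})
plain:create_index('pk', {if_not_exists = true})

-- qsync comes into play only once the first synchronous space appears:
-- a transaction touching it waits for the quorum and a confirm WAL record.
local bank = box.schema.space.create('bank',
    {is_sync = true, if_not_exists = true})
bank:create_index('pk', {if_not_exists = true})

bank:insert{1, 'alice', 100} -- returns only after the quorum acknowledges
```

On a lone instance the synchronous insert would simply time out after
replication_synchro_timeout, since no quorum can be gathered.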
>
> ### Leader Election
>
> We expect that the leader election algorithm can be reused from the Raft
> implementation [2] with an additional mapping to the Tarantool replication
> mechanism.
>
> The replication mechanism of Tarantool with qsync provides the following
> features required by the Raft algorithm:
> * A new entry is delivered to the majority of the replicas, and only after
>   that it is committed; the fact of the commit is delivered to the client
> * The log reflects the fact that an entry is committed by a special
>   'Commit' entry, as part of the qsync machinery
> * Log consistency comes from the append-only nature of the log and the
>   vclock check during the log append
> * Log inconsistencies are handled during replication (see Log Replication
>   below)
>
> The Raft implementation will hold the leader's term and its vote for the
> current term in its local structures, while issuing WAL entries to reflect
> status changes will ensure the persistence of the Raft state. After a
> failure, the node will read the WAL entries and apply them to its runtime
> state. These entries should go into the WAL under the ID number 0, so that
> they will not be propagated.
>
> To propagate the results of an election, there should be an entry in the
> WAL with dedicated info: raft_term and leader_id. An elected leader should
> issue such an entry upon its election, as its first entry.
>
> ### Log Replication
>
> The qsync RFC explains how we enforce log replication the way it is
> described in clause 5.3 of [1]: a committed entry always has a commit
> message in the xlog. The key difference here is that the log entry index
> comprises two parts: the LSN and the server ID. The follower's log
> consistency will be achieved during a) leader election, when a follower
> will only vote for a candidate whose VCLOCK components are greater than or
> equal to the follower's, and b) the join to a new leader, when a follower
> will have an option to either drop its waiting queue (named limbo in the
> qsync implementation) or perform a full rejoin.
> The latter is painful, but it is the only way to follow the current
> representation of the xlog, which contains no replay info.
>
> There is a requirement in 5.1 of [1]:
>
> > If a server receives a request with a stale term number, it rejects the
> > request.
>
> which requires introducing machinery that will inspect every DML message
> coming from IPROTO and check whether it satisfies the requirement. To
> introduce this machinery, additional info should be put into the replicated
> DML map under IPROTO_RAFT_TERM: the term of the leader, taken from the Raft
> runtime implementation. This info should be added to the DML - and
> checked - only if the cluster is configured to use the Raft machinery.
>
> As a result, an instance that joins a cluster with the Raft protocol
> enabled has to enable the protocol as well; otherwise it should be
> disconnected.
>
> ## Rationale and alternatives
>
> A C implementation of Raft using the binary protocol has an alternative: an
> implementation in Lua, for example [3]. However, the performance of the
> latter can have a significant impact on the log replication enforcement.
>
> ## References
>
> * [1] https://raft.github.io/raft.pdf
> * [2] https://github.com/willemt/raft
> * [3] https://github.com/igorcoding/tarantool-raft

-- 
Konstantin Osipov, Moscow, Russia
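
The vote-granting rule described in the Leader Election section, combined
with the stale-term rejection from 5.1 of [1], can be sketched as follows.
This is an illustrative Lua fragment, not the actual applier code; the
request/state shapes and field names are hypothetical stand-ins:

```lua
-- A follower votes only for a candidate whose vclock covers every
-- component of its own vclock.
local function candidate_is_up_to_date(candidate_vclock, own_vclock)
    for replica_id, own_lsn in pairs(own_vclock) do
        if (candidate_vclock[replica_id] or 0) < own_lsn then
            return false
        end
    end
    return true
end

local function should_grant_vote(request, state)
    if request.raft_term < state.term then
        return false -- stale term: reject the request outright
    end
    if request.raft_term == state.term and state.vote ~= nil then
        return false -- already voted in this term
    end
    return candidate_is_up_to_date(request.vclock, state.vclock)
end

-- Example: a newer term and a covering vclock => the vote is granted.
print(should_grant_vote(
    {raft_term = 3, vclock = {[1] = 10, [2] = 7}},
    {term = 2, vote = nil, vclock = {[1] = 10, [2] = 5}}))
```

A real implementation would also bump state.term and clear state.vote when
it sees a higher term; that bookkeeping is omitted here.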