From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp39.i.mail.ru (smtp39.i.mail.ru [94.100.177.99]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 60FCF41C5DA for ; Fri, 12 Jun 2020 23:31:04 +0300 (MSK) Date: Fri, 12 Jun 2020 23:31:02 +0300 From: Sergey Ostanevich Message-ID: <20200612203102.GI50@tarantool.org> References: <20200506163901.GH112@tarantool.org> <20200506184445.GB24913@atlas> <20200512155508.GJ112@tarantool.org> <78713377-806f-8cf6-efe0-5019f3d3e428@tarantool.org> <20200514203811.GN112@tarantool.org> <20200520205925.GA58@tarantool.org> <887a0ec5-3a01-565a-0c31-f7fab619af8f@tarantool.org> <20200527211735.GA50@tarantool.org> <20200609161929.GF50@tarantool.org> <6678e2a7-9f95-ce3d-1c2f-1eaf20589005@tarantool.org> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <6678e2a7-9f95-ce3d-1c2f-1eaf20589005@tarantool.org> Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Vladislav Shpilevoy Cc: tarantool-patches@dev.tarantool.org Hi! Thanks for review, attaching a diff. Full version is available at the branch https://github.com/tarantool/tarantool/blob/sergos/quorum-based-synchro/ On 11 июн 17:17, Vladislav Shpilevoy wrote: > Hi! Thanks for the updates! > > > ### Connection liveness > > > > There is a timeout-based mechanism in Tarantool that controls the > > asynchronous replication, which uses the following config: > > ``` > > * replication_connect_timeout = 4 > > * replication_sync_lag = 10 > > * replication_sync_timeout = 300 > > * replication_timeout = 1 > > ``` > > For backward compatibility and to differentiate the async replication > > we should augment the configuration with the following: > > ``` > > * synchro_replication_heartbeat = 4 > > Heartbeats are already being sent. I don't see any sense in adding a > second heartbeat option. I had an idea that synchronous replication can co-exist with async one, so they have to have independent tuning. Now I realize that sending two types of heartbeats is too much, so I'll drop this one. > > > * synchro_replication_quorum_timeout = 4 > > Since this is a replication option, it should start from replication_ > prefix. There are number of options already exist that are very similar in naming, such as replication_sync_timeout, replication_sync_lag and even replication_connect_quorum. I expect to resolve the ambiguity with putting in a new prefix, synchro_replication. The drawback is those options reused from async mode would be not-so-clearly linked to the synch one. > > > ``` > > Leader should send a heartbeat every synchro_replication_heartbeat if > > there were no messages sent. Replicas should respond to the heartbeat > > just the same way as they do it now. As soon as Leader has no response > > for another heartbeat interval, it should consider the replica is lost. > > All of that is already done in the regular heartbeats, not related nor > bound to any synchronous activities. Just like failure detection should be. > > > As soon as leader appears in a situation it has not enough replicas > > to achieve quorum, it should stop accepting write requests. There's an > > option for leader to rollback to the latest transaction that has quorum: > > leader issues a 'rollback' message referring to the [LEADER_ID, LSN] > > where LSN is of the first transaction in the leader's undo log. > > What is that option? Good catch, thanks! This option was introduced to get to a consistent state with replicas. Although, if Leader will wait longer than timeout for quorum it will rollback anyways, so I will remove mention of this. > > > The rollback message replicated to the available cluster will put it in a > > consistent state. After that configuration of the cluster can be > > updated to a new available quorum and leader can be switched back to > > write mode. > > > > During the quorum collection it can happen that some of replicas become > > unavailable due to some reason, so leader should wait at most for > > synchro_replication_quorum_timeout after which it issues a Rollback > > pointing to the oldest TXN in the waiting list. diff --git a/doc/rfc/quorum-based-synchro.md b/doc/rfc/quorum-based-synchro.md index c7dcf56b5..0a92642fd 100644 --- a/doc/rfc/quorum-based-synchro.md +++ b/doc/rfc/quorum-based-synchro.md @@ -83,9 +83,10 @@ Customer Leader WAL(L) Replica WAL(R) To introduce the 'quorum' we have to receive confirmation from replicas to make a decision on whether the quorum is actually present. Leader -collects necessary amount of replicas confirmation plus its own WAL -success. This state is named 'quorum' and gives leader the right to -complete the customers' request. So the picture will change to: +collects replication_synchro_quorum-1 of replicas confirmation and its +own WAL success. This state is named 'quorum' and gives leader the +right to complete the customers' request. So the picture will change +to: ``` Customer Leader WAL(L) Replica WAL(R) |------TXN----->| | | | @@ -158,26 +159,21 @@ asynchronous replication, which uses the following config: For backward compatibility and to differentiate the async replication we should augment the configuration with the following: ``` -* synchro_replication_heartbeat = 4 -* synchro_replication_quorum_timeout = 4 +* replication_synchro_quorum_timeout = 4 +* replication_synchro_quorum = 4 ``` -Leader should send a heartbeat every synchro_replication_heartbeat if -there were no messages sent. Replicas should respond to the heartbeat -just the same way as they do it now. As soon as Leader has no response -for another heartbeat interval, it should consider the replica is lost. -As soon as leader appears in a situation it has not enough replicas -to achieve quorum, it should stop accepting write requests. There's an -option for leader to rollback to the latest transaction that has quorum: -leader issues a 'rollback' message referring to the [LEADER_ID, LSN] -where LSN is of the first transaction in the leader's undo log. The -rollback message replicated to the available cluster will put it in a -consistent state. After that configuration of the cluster can be -updated to a new available quorum and leader can be switched back to -write mode. +Leader should send a heartbeat every replication_timeout if there were +no messages sent. Replicas should respond to the heartbeat just the +same way as they do it now. As soon as Leader has no response for +another heartbeat interval, it should consider the replica is lost. As +soon as leader appears in a situation it has not enough replicas to +achieve quorum, it should stop accepting write requests. After that +configuration of the cluster can be updated to a new available quorum +and leader can be switched back to write mode. During the quorum collection it can happen that some of replicas become unavailable due to some reason, so leader should wait at most for -synchro_replication_quorum_timeout after which it issues a Rollback +replication_synchro_quorum_timeout after which it issues a Rollback pointing to the oldest TXN in the waiting list. ### Leader role assignment. @@ -274,9 +270,9 @@ Leader role and the cluster had 2 replicas with quorum set to 2. +---------------------+---------------------+---------------------+ | ID1 Conf [ID1, Tx2] | | | +---------------------+---------------------+---------------------+ -| Tx6 | | | +| ID1 Tx | | | +---------------------+---------------------+---------------------+ -| Tx7 | | | +| ID1 Tx | | | +---------------------+---------------------+---------------------+ ``` Suppose at this moment the ID1 instance crashes. Then the ID2 instance