From: Vladislav Shpilevoy
Date: Thu, 23 Apr 2020 23:38:34 +0200
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
In-Reply-To: <20200403210836.GB18283@tarantool.org>
To: Sergey Ostanevich, tarantool-patches@dev.tarantool.org, Timur Safin, Mons Anderson

Hi!

Here is a short summary of our late-night discussion and the questions it
brought up, while I was trying to design a draft plan of an implementation.
Since the RFC is quite far from the code, I needed a more 'pedestrian' and
detailed plan.

The question is about the 'confirm' message and quorum collection. Here is
the schema presented in the RFC:

> Customer         Leader         WAL(L)       Replica         WAL(R)
>    |------TXN----->|              |             |              |
>    |               |              |             |              |
>    |         [TXN undo log        |             |              |
>    |            created]          |             |              |
>    |               |              |             |              |
>    |               |-----TXN----->|             |              |
>    |               |              |             |              |
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              |
>    |               |              |       [TXN undo log        |
>    |               |<---WAL Ok----|          created]          |
>    |               |              |             |              |
>    |           [Waiting           |             |-----TXN----->|
>    |         of a quorum]         |             |              |
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              |
>    |               |<------Replication Ok-------|              |
>    |               |              |             |              |
>    |            [Quorum           |             |              |
>    |           achieved]          |             |              |
>    |               |              |             |              |
>    |         [TXN undo log        |             |              |
>    |           destroyed]         |             |              |
>    |               |              |             |              |
>    |               |---Confirm--->|             |              |
>    |               |              |             |              |
>    |               |----------Confirm---------->|              |
>    |               |              |             |              |
>    |<---TXN Ok-----|              |       [TXN undo log        |
>    |               |              |         destroyed]         |
>    |               |              |             |              |
>    |               |              |             |---Confirm--->|
>    |               |              |             |              |

It says that once the quorum is collected and 'confirm' is written to the
leader's local WAL, the transaction is considered committed and is reported
to the client as successful. On the other hand, it is said that in case of a
leader change the new leader will roll back all non-confirmed transactions.
That leads to the following bug:

Assume we have 4 instances: i1, i2, i3, i4. The leader is i1. It writes a
transaction with LSN1. LSN1 is sent to the other nodes, they apply it fine
and send acks to the leader. The leader sees that i2-i4 have all applied the
transaction (propagated their LSNs to LSN1). It writes 'confirm' to its local
WAL and reports success to the client; the client's request is over, the
result is already passed on to some remote node, etc. The transaction is
officially synchronously committed.

Then the leader's machine dies - the disk is dead. The confirm was not sent
to any of the other nodes. For example, it started having problems with its
network connection to the replicas shortly before the death. Or it simply
didn't manage to hand the confirm out.

From now on, if any of the other nodes i2-i4 becomes a leader, it will roll
back the officially confirmed transaction, even though it has it, and so do
all the other nodes.

That basically means this sync replication gives exactly the same guarantees
as async replication - 'confirm' on the leader says nothing about the
replicas except that they *are able to apply the transaction*, but they still
may not apply it.

Am I missing something?
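To make the scenario concrete, here is a minimal toy model in Python of the
flow as the RFC draws it. It is purely illustrative - the class and function
names are mine, not Tarantool's - but it shows the gap: the leader answers
the client as soon as the quorum of WAL acks is in and its *local* 'confirm'
is written, so if the machine dies before that record leaves the box, every
surviving node is, by the rules above, obliged to roll the transaction back.

# Toy model of the commit flow from the RFC (not Tarantool code,
# all names are made up for illustration).
class Replica:
    def __init__(self, name):
        self.name = name
        self.wal = []                      # records: ('txn', lsn) or ('confirm', lsn)

    def apply_txn(self, lsn):
        self.wal.append(('txn', lsn))      # undo log created, WAL ok
        return True                        # ack sent back to the leader

    def become_leader(self):
        confirmed = {lsn for kind, lsn in self.wal if kind == 'confirm'}
        # Per the RFC: everything the new leader has no 'confirm' for is rolled back.
        return [lsn for kind, lsn in self.wal
                if kind == 'txn' and lsn not in confirmed]

def leader_commit(leader_wal, replicas, lsn, quorum):
    leader_wal.append(('txn', lsn))
    acks = 1 + sum(r.apply_txn(lsn) for r in replicas)   # the leader counts itself
    assert acks >= quorum
    leader_wal.append(('confirm', lsn))    # 'confirm' exists only in the local WAL so far
    return 'TXN Ok'                        # reported to the client right here

replicas = [Replica('i2'), Replica('i3'), Replica('i4')]
i1_wal = []
print(leader_commit(i1_wal, replicas, lsn=1, quorum=3))   # client sees success
# i1's disk dies before the 'confirm' record is replicated anywhere.
print(replicas[0].become_leader())         # [1] - the "committed" txn is rolled back

The sketch only restates the argument: nothing in the protocol forces the
'confirm' record off the leader's machine before 'TXN Ok' is sent.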
Another issue is with failure detection.

Let's assume that we wait for 'confirm' to be propagated to a quorum of
replicas too. Assume some replicas responded with an error: they first said
they could apply the transaction and saved it into their WALs, and then they
could not apply the confirm. That can happen for two reasons: the replica has
problems with its WAL, or the replica became unreachable from the master.

WAL-problematic replicas can be disconnected forcefully, since they are
clearly unable to work properly anymore. But what to do with disconnected
replicas? 'Confirm' can't wait for them forever - we will run out of fibers
if we have even just hundreds of RPS of sync transactions and wait for, let's
say, a few minutes (rough arithmetic below). On the other hand, we can't roll
them back, because 'confirm' has already been written to the local WAL.

Note for those who are concerned: this has nothing to do with the in-memory
relay. It has the same problems, which are in the protocol, not in the
implementation.
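Just to put a number on the "run out of fibers" point - a back-of-the-envelope
estimate, with the "hundreds of RPS" and "a few minutes" above as illustrative
figures, nothing measured:

# Fibers stuck waiting for 'confirm' acks from an unreachable replica.
sync_rps = 300               # synchronous transactions per second (assumed)
wait_seconds = 3 * 60        # how long we keep waiting for the ack (assumed)
blocked_fibers = sync_rps * wait_seconds
print(blocked_fibers)        # 54000 fibers sitting in the wait at any moment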