Date: Thu, 14 May 2020 02:45:53 +0300
From: Konstantin Osipov
Message-ID: <20200513234553.GB5698@atlas>
In-Reply-To: <10920620-b911-360f-f075-7718a40bf4af@tarantool.org>
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
To: Vladislav Shpilevoy
Cc: tarantool-patches@dev.tarantool.org

* Vladislav Shpilevoy [20/05/14 00:37]:
> > Thanks for review!
> >
> >>> |              |              |              |              |
> >>> |           [Quorum           |              |              |
> >>> |          achieved]          |              |              |
> >>> |              |              |              |              |
> >>> |        [TXN undo log        |              |              |
> >>> |         destroyed]          |              |              |
> >>> |              |              |              |              |
> >>> |              |---Confirm--->|              |              |
> >>> |              |              |              |              |
> >>
> >> What happens if writing Confirm to WAL fails? The TXN undo log
> >> record is destroyed already. Will the server panic now on WAL
> >> failure, even if it is intermittent?
> >
> > I would like to have an example of an intermittent WAL failure. Can
> > it be anything other than a problem with the disk - be it
> > space/availability/malfunction?
> >
> > For all of those it should be resolved outside the DBMS anyway. So
> > the leader should stop and report its problems to the
> > orchestrator/admins.
> >
> > I would agree that the undo log can be destroyed *after* the Confirm
> > has landed in the WAL - the same goes for the replica.
>
> Well, in fact you can't (or can you?). Because it won't help. Once you
> have tried to write 'Confirm', it means you got the quorum. So if you
> fail now, a new leader will write 'Confirm' for you when it sees the
> quorum too. So the current leader has no right to write 'Rollback'
> from this moment, from what I understand, because the transaction can
> still be confirmed by a new leader later, if you fail before
> 'Rollback' is replicated to all.
>
> However, the same problem appears if you write 'Confirm'
> *successfully*. The leader can still fail, and a newer leader will
> write 'Rollback' if it doesn't collect the quorum again. I don't know
> what to do with that, really. Probably nothing.

Maybe consult the Raft spec? The new leader is guaranteed to see the
transaction, since it has reached the majority of replicas, so it will
definitely write 'confirm' for it.

The reason I asked the question is that I want the case of intermittent
failures to be described in the spec. For example, if 'confirm' is a
cbus message, it can be rolled back when there is a cascading rollback
of the batch it is a part of.

I would like to see all these scenarios covered in the spec. If one of
them ends with a panic, I would like to understand how the external
coordinator is going to resolve the new election. Raft has answers for
all of it.
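
To make the ordering concrete, here is a minimal sketch of the commit
path as I understand it should look. Every name in it
(txn_quorum_reached, wal_write_confirm, txn_undo_log_destroy) is
hypothetical, not the actual Tarantool API:

    struct txn;

    /* Hypothetical helpers -- not the real Tarantool functions. */
    int txn_quorum_reached(struct txn *txn);
    int wal_write_confirm(struct txn *txn);
    void txn_undo_log_destroy(struct txn *txn);

    static int
    leader_commit(struct txn *txn)
    {
            /* Block until a majority of replicas ack the transaction. */
            if (txn_quorum_reached(txn) != 0)
                    return -1;
            /*
             * From this moment a new leader may write Confirm on our
             * behalf, so writing Rollback is no longer an option. An
             * intermittent WAL failure here has to be retried or
             * escalated to the coordinator, never silently turned
             * into a rollback.
             */
            if (wal_write_confirm(txn) != 0)
                    return -1;
            /* Destroy the undo log only after Confirm is in the WAL. */
            txn_undo_log_destroy(txn);
            return 0;
    }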
> >>> As soon as the leader appears in a situation where it has not
> >>> enough replicas to achieve quorum, the cluster should stop
> >>> accepting any requests - both write and read.
> >>
> >> How does *the cluster* know the state of the leader, and if it
> >> doesn't, how can it possibly implement this? Did you mean that
> >> the leader should stop accepting transactions here? But how can
> >> the leader know that it does not have enough replicas during a read
> >> transaction, if it doesn't contact any replica to serve a read?
> >
> > I expect to have a disconnection trigger assigned to all relays, so
> > that a disconnection will cause the number of replicas to decrease.
> > The quorum size is static, so we can stop at the very moment the
> > number dives below it.
>
> This is a very dubious statement. In TCP, a disconnect may be detected
> much later than it happened. So to collect a quorum on something you
> need to literally collect this quorum, with special WAL records, via
> the network, and all. A disconnect trigger does not help at all here.

Erhm, thanks.

> Talking of the whole 'read-quorum' idea, I don't like it, because it
> really makes things unbearably harder to implement, and the nodes
> become much slower and less available in the face of any problems.
>
> I think reads should always be allowed, and from any node (except
> during bootstrap, of course). After all, you have transactions for
> consistency. So as long as replication respects transaction
> boundaries, every node is in a consistent state. Maybe not all of them
> are in the same state, but every one is consistent.

In memtx, you read dirty, uncommitted data by default. That was OK for
single-node transactions, since the only chance of a transaction being
rolled back was running out of space or a disk failure, which were
extremely rare. Now you really do read dirty stuff, because a
transaction can easily be rolled back for lack of quorum or because of
a re-election. So it's a much bigger deal.

> Honestly, I can't even imagine how it is possible to implement a
> completely synchronous simultaneous cluster progression. It is
> impossible even in theory. There will always be a time period when
> some nodes are further ahead than the others, at least because of
> network delays.
>
> So either we allow reads from the master only, or we allow reads from
> everywhere, and in that case nothing will save us from the possibility
> of seeing different data on different nodes.

This is why there are many consistency models out there (just google
"consistency models in distributed systems"), and the minor details are
important. It is indeed hard to implement the strictest model
(serializable), but it is also often unnecessary, and there is a
consensus among relational databases about which anomalies are
acceptable and which are not. More specifically, I think Tarantool
synchronous replication should aim at read committed. The spec should
say so in no uncertain terms and explain how it is achieved.
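
To make the target concrete, here is a toy sketch of the difference
between today's dirty reads and read committed, using a confirmed-LSN
watermark. All the types and names below are made up for the
illustration and are not the actual memtx internals:

    #include <stdint.h>
    #include <stddef.h>

    /* One version of a tuple; hypothetical, not a memtx structure. */
    struct tuple {
            int64_t lsn;         /* LSN of the txn that wrote it. */
            struct tuple *prev;  /* Previous, older version, if any. */
    };

    /* Highest LSN for which a quorum was collected and Confirm written. */
    static int64_t confirmed_lsn;

    /* Today's behaviour: return the latest version, confirmed or not. */
    static struct tuple *
    read_dirty(struct tuple *t)
    {
            return t;
    }

    /* Read committed: skip versions that are not yet confirmed. */
    static struct tuple *
    read_committed(struct tuple *t)
    {
            while (t != NULL && t->lsn > confirmed_lsn)
                    t = t->prev;
            return t;
    }

-- 
Konstantin Osipov, Moscow, Russia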