Date: Tue, 12 May 2020 19:42:02 +0300
From: Konstantin Osipov
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
To: Sergey Ostanevich
Cc: tarantool-patches@dev.tarantool.org, Vladislav Shpilevoy
Message-ID: <20200512164202.GA2049@atlas>
In-Reply-To: <20200512155508.GJ112@tarantool.org>

* Sergey Ostanevich [20/05/12 18:56]:
> On 06 May 21:44, Konstantin Osipov wrote:
> > * Sergey Ostanevich [20/05/06 19:41]:
> > > > > |              |              |       |       |
> > > > > |           [Quorum           |       |       |
> > > > > |            achieved]        |       |       |
> > > > > |              |              |       |       |
> > > > > |           [TXN undo log     |       |       |
> > > > > |            destroyed]       |       |       |
> > > > > |              |              |       |       |
> > > > > |              |---Confirm--->|       |       |
> > > > > |              |              |       |       |
> > > > What happens if writing Confirm to WAL fails? TXN undo log record
> > > > is destroyed already. Will the server panic now on WAL failure,
> > > > even if it is intermittent?
> > > I would like to have an example of intermittent WAL failure. Can it be
> > > other than a problem with the disk - be it space/availability/malfunction?
> > For SAN disks it can simply be a networking issue. The same is
> > true for any virtual filesystem in the cloud. For local disks it
> > is most often out of space, but this is not an impossible event.
> The SANDisk is an SSD vendor. I bet you mean NAS - network array
> storage, isn't it? Then I see no difference in WAL write into NAS in
> the current schema - you will catch a timeout, WAL will report failure,
> the replica stops.

SAN stands for storage area network. There is no timeout in the WAL
tx bus and no timeout in WAL I/O. A replica doesn't stop on an
intermittent failure.

Stopping a replica on an intermittent failure reduces the availability
of non-sync writes.

It seems you have some assumptions in mind which are not in the
document - e.g. that some timeouts are added. They are not in the POC
either. I suppose the document is expected to explain quite accurately
what has to be done, e.g. how these new timeouts would work?

> > > For all of those it should be resolved outside the DBMS anyways. So,
> > > the leader should stop and report its problems to orchestrator/admins.
> > Sergey, I understand that the RAFT spec is big and with this spec you
> > try to split it into manageable parts. The question is how useful
> > this particular piece is. I'm trying to point out that "the leader
> > should stop" is not a silver bullet - especially since each such
> > stop may mean a rejoin of some other node. The purpose of sync
> > replication is to provide consistency without reducing
> > availability (i.e. make progress as long as the quorum
> > of nodes make progress).
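To make the window I'm asking about concrete, here is a toy C sketch of
the ordering the diagram above implies: quorum reached, then the undo
log is destroyed, and only then Confirm is written. It is purely
illustrative - none of these names exist in the actual code or in the
POC:

#include <stdbool.h>
#include <stdio.h>

struct txn {
    long lsn;
    bool undo_log_present;
};

/* Pretend the quorum has already been collected for this transaction. */
static bool quorum_reached(const struct txn *txn) { (void)txn; return true; }

/* Simulate an intermittent error while writing the Confirm record. */
static int wal_write_confirm(long lsn) { (void)lsn; return -1; }

static const char *
leader_commit(struct txn *txn)
{
    if (!quorum_reached(txn))
        return "waiting for quorum";
    /* Step 1, per the diagram: the TXN undo log is destroyed. */
    txn->undo_log_present = false;
    /* Step 2: only now the Confirm record goes to the local WAL. */
    if (wal_write_confirm(txn->lsn) != 0) {
        /*
         * The window in question: quorum is reached, the undo log is
         * already gone, yet Confirm is not durable.  The transaction
         * can be neither rolled back (nothing to undo) nor treated as
         * confirmed (no WAL record).
         */
        return "confirm lost after undo log was destroyed";
    }
    return "confirmed";
}

int main(void)
{
    struct txn txn = { .lsn = 42, .undo_log_present = true };
    printf("%s\n", leader_commit(&txn));
    return 0;
}

If wal_write_confirm() fails in this sketch, there is neither an undo
log to roll back nor a Confirm on disk - that is exactly the state I'm
asking about.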
>
> I'm not sure if we're talking about the same RAFT - mine is "In Search
> of an Understandable Consensus Algorithm (Extended Version)" from
> Stanford as of May 2014. And it is 15 pages - including references,
> conclusions and intro. Seems not that big.
>
> Although, most of it is dedicated to the leader election itself, which
> we intentionally put aside from this RFC. It is written in the very
> beginning and I emphasized this by explicitly mentioning it.

I conclude that it is big from the state of this document. It provides
some coverage of the normal operation. Leader election, failure
detection, recovery/restart, replication configuration changes are
either barely mentioned or not covered at all. I find no other reason
to not cover them except to be able to come up with an MVP quicker.
Do you?

> > The current spec, suggesting there should be a leader stop in case
> > of most errors, reduces availability significantly, and doesn't
> > make the external coordinator's job any easier - it still has to
> > follow to the letter the prescriptions of RAFT.
>
> So, the postponing of a commit until quorum collection is the most
> useful part of this RFC, also to some point I'm trying to address the
> WAL inconsistency.
> Although, it can be covered only partly: if a leader's log diverges
> in unconfirmed transactions only, then they can be rolled back easily.
> Technically, it should be enough if the leader changed for a replica
> from the cluster majority at the moment of failure. Otherwise it will
> require pre-parsing of the WAL and it can well happen that the WAL is
> not long enough, hence the ex-leader still needs a complete bootstrap.

I don't understand what pre-parsing is, or how what you write is
relevant to the fact that reduced availability of non-raft writes is
bad.

> > > But at this point a new leader will be appointed - the old one is
> > > restarted. Then the Confirm message will arrive to the restarted leader
> > > through regular replication.
> >
> > This assumes that the restart is guaranteed to be noticed by the
> > external coordinator and there is an election on every restart.
>
> Sure yes, if it restarted - then the connection loss can't go unnoticed
> by anyone, be it coordinator or cluster.

Well, the spec doesn't say anywhere that the external coordinator has
to establish a TCP connection to every participant. Could you please
add a chapter where this is clarified? It seems you have a specific
coordinator in mind?

> > > > > The quorum should be collected as a table for a list of transactions
> > > > > waiting for quorum. The latest transaction that collects the quorum is
> > > > > considered as complete, as well as all transactions prior to it, since
> > > > > all transactions should be applied in order. The leader writes a
> > > > > 'confirm' message to the WAL that refers to the transaction's
> > > > > [LEADER_ID, LSN] and the confirm has its own LSN. This confirm message
> > > > > is delivered to all replicas through the existing replication mechanism.
> > > > >
> > > > > A replica should report a TXN application success to the leader via
> > > > > IPROTO explicitly to allow the leader to collect the quorum for the TXN.
> > > > > In case of application failure the replica has to disconnect from the
> > > > > replication the same way as it is done now. The replica also has to
> > > > > report its disconnection to the orchestrator. Further actions require
> > > > > human intervention, since failure means either a technical problem (such
> > > > > as not enough space for WAL) that has to be resolved or an inconsistent
> > > > > state that requires rejoin.
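As an aside for anyone following the thread: the bookkeeping described
in the quoted paragraph boils down to roughly the following. This is an
invented, simplified model (fixed-size array, no real replication), not
the actual patch:

#include <stdio.h>

#define MAX_PENDING 16

struct pending_txn {
    long lsn;       /* LSN assigned by the leader, in ascending order */
    int ack_count;  /* nodes (leader included) that reported success */
};

struct quorum_table {
    struct pending_txn txns[MAX_PENDING];
    int size;       /* transactions still waiting for the quorum */
    int quorum;     /* required number of acknowledgements */
};

/* A replica reported success for everything up to `lsn`; acks are
 * cumulative because transactions are applied in order. */
static void
quorum_table_ack(struct quorum_table *qt, long lsn)
{
    for (int i = 0; i < qt->size; i++)
        if (qt->txns[i].lsn <= lsn)
            qt->txns[i].ack_count++;
}

/* The latest transaction that collected the quorum is complete, and so
 * is everything before it: return its LSN (the one to be put into the
 * Confirm record as [LEADER_ID, LSN]) and drop the confirmed prefix. */
static long
quorum_table_collect(struct quorum_table *qt)
{
    int confirmed = 0;
    while (confirmed < qt->size &&
           qt->txns[confirmed].ack_count >= qt->quorum)
        confirmed++;
    if (confirmed == 0)
        return -1; /* nothing has reached the quorum yet */
    long confirm_lsn = qt->txns[confirmed - 1].lsn;
    for (int i = confirmed; i < qt->size; i++)
        qt->txns[i - confirmed] = qt->txns[i];
    qt->size -= confirmed;
    return confirm_lsn;
}

int main(void)
{
    struct quorum_table qt = {
        .size = 3, .quorum = 2,
        .txns = { { 10, 1 }, { 11, 1 }, { 12, 1 } }, /* leader's own acks */
    };
    quorum_table_ack(&qt, 11);   /* one replica acked everything up to 11 */
    printf("confirm up to LSN %ld\n", quorum_table_collect(&qt)); /* 11 */
    return 0;
}

The only rule it demonstrates is the one from the quote: once some LSN
collects the quorum, everything up to and including it is covered by a
single Confirm record.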
> > > > >
> > > > > As soon as the leader appears in a situation where it has not enough
> > > > > replicas to achieve quorum, the cluster should stop accepting any
> > > > > requests - both write and read.
> > > > How does *the cluster* know the state of the leader and if it
> > > > doesn't, how can it possibly implement this? Did you mean
> > > > the leader should stop accepting transactions here? But how can
> > > > the leader know if it has not enough replicas during a read
> > > > transaction, if it doesn't contact any replica to serve a read?
> > >
> > > I expect to have a disconnection trigger assigned to all relays so that
> > > a disconnection will cause the number of replicas to decrease. The quorum
> > > size is static, so we can stop at the very moment the number dives below.
> >
> > What happens between the event the leader is partitioned away and
> > a new leader is elected?
> >
> > The leader may be unaware of the events and serve a read just
> > fine.
>
> As it is stated 20 lines above:
> > > > > As soon as the leader appears in a situation where it has not enough
> > > > > replicas to achieve quorum, the cluster should stop accepting any
> > > > > requests - both write and read.
>
> So it will not serve.

Sergey, this is recursion. I'm asking you to clarify exactly this
point.

Do you assume that replicas perform some kind of failure detection?
What kind? Is it *in addition* to the failure detection performed by
the external coordinator?

Any failure detector imaginable would be asynchronous. What happens
between the failure and the time it's detected?

> > So at least you can't say the leader shouldn't be serving reads
> > without quorum - because the only way to achieve it is to collect
> > a quorum of responses to reads as well.
>
> The leader lost connection to (N-Q)+1 replicas out of the N in the
> cluster with a quorum of Q == it stops serving anything. So the quorum
> criterion is there: no quorum - no reads.

OK, so you assume that a TCP connection *is* the failure detector?
Failure detection in TCP is optional, asynchronous, and worst of all,
unreliable. Why do you think it can be used?

> > > > > The reason for this is that replication of transactions can achieve
> > > > > quorum on replicas not visible to the leader. On the other hand, the
> > > > > leader can't achieve quorum with the available minority. The leader
> > > > > has to report the state and wait for human intervention. There's an
> > > > > option to ask the leader to roll back to the latest transaction that
> > > > > has a quorum: the leader issues a 'rollback' message referring to the
> > > > > [LEADER_ID, LSN] where LSN is of the first transaction in the leader's
> > > > > undo log. The rollback message replicated to the available cluster
> > > > > will put it in a consistent state. After that the configuration of the
> > > > > cluster can be updated to the available quorum and the leader can be
> > > > > switched back to write mode.
> > > >
> > > > As you should be able to conclude from the restart scenario, it is
> > > > possible a replica has the record in *confirmed* state but the
> > > > leader has it in pending state. The replica will not be able to
> > > > roll back then. Do you suggest the replica should abort if it
> > > > can't roll back? This may lead to an avalanche of rejoins on leader
> > > > restart, bringing performance to a halt.
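To spell out the conflict I mean, here is a toy model of the
replica-side check - again, invented names, not real code:

#include <stdio.h>

struct replica_state {
    long confirmed_lsn;  /* highest LSN covered by a received Confirm */
    long applied_lsn;    /* highest LSN applied from the leader */
};

/* A Rollback asks to undo every pending txn with lsn >= rollback_lsn. */
static int
replica_apply_rollback(struct replica_state *r, long rollback_lsn)
{
    if (rollback_lsn <= r->confirmed_lsn) {
        /*
         * The record is already confirmed here: its undo data is gone
         * and clients may have read it.  There is nothing left to roll
         * back - the replica can only ignore the Rollback or stop and
         * rejoin, which is the "avalanche" concern above.
         */
        return -1;
    }
    r->applied_lsn = rollback_lsn - 1; /* drop the pending tail */
    return 0;
}

int main(void)
{
    struct replica_state r = { .confirmed_lsn = 42, .applied_lsn = 45 };
    /* A restarted leader still has LSN 40..45 pending and asks to roll
     * back from 40, but 40..42 are already confirmed on this replica. */
    printf("rollback from 40: %s\n",
           replica_apply_rollback(&r, 40) == 0 ? "ok" : "conflict");
    return 0;
}

The question above is what the replica is supposed to do when this
check fails.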
> > >
> > > No, I declare the replica with the biggest LSN as a new shining leader.
> > > More than that, the new leader can (so far it will be by default)
> > > finalize the former leader's life's work by replicating txns and
> > > appropriate confirms.
> >
> > Right, this also assumes the restart is noticed, so it follows the
> > same logic.
>
> How can a restart be unnoticed, if it causes a disconnection?

Honestly, I'm baffled. It's like we speak different languages. I can't
imagine you are unaware of the fallacies of distributed computing, but
I see no other explanation for your question.

-- 
Konstantin Osipov, Moscow, Russia