From: Konstantin Osipov <kostja.osipov@gmail.com>
To: Sergey Ostanevich <sergos@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org,
	Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
Date: Tue, 12 May 2020 19:42:02 +0300
Message-ID: <20200512164202.GA2049@atlas>
In-Reply-To: <20200512155508.GJ112@tarantool.org>

* Sergey Ostanevich <sergos@tarantool.org> [20/05/12 18:56]:
> On 06 May 21:44, Konstantin Osipov wrote:
> > * Sergey Ostanevich <sergos@tarantool.org> [20/05/06 19:41]:
> > > > > |             |              |          |           |
> > > > > |         [Quorum            |          |           |
> > > > > |        achieved]           |          |           |
> > > > > |             |              |          |           |
> > > > > |      [TXN undo log         |          |           |
> > > > > |       destroyed]           |          |           |
> > > > > |             |              |          |           |
> > > > > |             |---Confirm--->|          |           |
> > > > > |             |              |          |           |
> > > >
> > > > What happens if writing Confirm to WAL fails? The TXN undo log
> > > > record is destroyed already. Will the server panic now on WAL
> > > > failure, even if it is intermittent?
> > >
> > > I would like to have an example of an intermittent WAL failure.
> > > Can it be anything other than a problem with the disk - be it
> > > space, availability, or malfunction?
> >
> > For SAN disks it can simply be a networking issue. The same is
> > true for any virtual filesystem in the cloud. For local disks it
> > is most often out of space, but this is not an impossible event.
>
> The SANDisk is an SSD vendor. I bet you mean NAS - network array
> storage, isn't it? Then I see no difference in a WAL write into NAS
> in the current schema - you will catch a timeout, WAL will report a
> failure, the replica stops.

SAN stands for storage area network. There is no timeout in the
WAL/TX bus and no timeout in WAL I/O. A replica doesn't stop on an
intermittent failure, and stopping a replica on an intermittent
failure would reduce the availability of non-sync writes.

It seems you have some assumptions in mind which are not in the
document - e.g. that some timeouts are added. They are not in the POC
either. I suppose the document is expected to explain quite
accurately what has to be done, e.g. how these new timeouts work?

> > > For all of those it should be resolved outside the DBMS anyway.
> > > So the leader should stop and report its problems to the
> > > orchestrator/admins.
> >
> > Sergey, I understand that the RAFT spec is big and that with this
> > spec you try to split it into manageable parts. The question is
> > how useful this particular piece is. I'm trying to point out that
> > "the leader should stop" is not a silver bullet - especially since
> > each such stop may mean a rejoin of some other node. The purpose
> > of sync replication is to provide consistency without reducing
> > availability (i.e. to make progress as long as the quorum of
> > nodes makes progress).
>
> I'm not sure if we're talking about the same RAFT - mine is "In
> Search of an Understandable Consensus Algorithm (Extended Version)"
> from Stanford as of May 2014. And it is 15 pages - including
> references, conclusions and intro. Seems not that big.
>
> Although, most of it is dedicated to the leader election itself,
> which we intentionally put aside from this RFC. It is written in the
> very beginning and I emphasized this by mentioning it explicitly.

I conclude that it is big from the state of this document. It
provides some coverage of normal operation. Leader election, failure
detection, recovery/restart, and replication configuration changes
are either barely mentioned or not covered at all. I find no other
reason not to cover them except to be able to come up with an MVP
quicker. Do you?
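To make the Confirm question above concrete, here is a minimal sketch
in Go of the ordering the diagram prescribes - quorum achieved, undo
log destroyed, only then Confirm written to WAL. All names here are
invented for illustration; this is not the actual Tarantool code:

    package main

    import (
    	"errors"
    	"fmt"
    )

    type Txn struct {
    	LSN     int64
    	undoLog []byte // state needed to roll the transaction back
    }

    // WAL is a stand-in for the write-ahead log; only Confirm matters here.
    type WAL interface {
    	WriteConfirm(leaderID int, lsn int64) error
    }

    // onQuorum follows the diagram: quorum achieved -> undo log destroyed
    // -> Confirm written. The order is the point under discussion.
    func onQuorum(w WAL, leaderID int, txn *Txn) error {
    	// Quorum achieved: the undo log is destroyed first...
    	txn.undoLog = nil

    	// ...and only then is Confirm made durable. If this write fails -
    	// even intermittently, e.g. a SAN networking hiccup - the txn can
    	// no longer be rolled back locally: the question raised above.
    	if err := w.WriteConfirm(leaderID, txn.LSN); err != nil {
    		return fmt.Errorf("Confirm for LSN %d not durable, undo log already gone: %w",
    			txn.LSN, err)
    	}
    	return nil
    }

    type failingWAL struct{}

    func (failingWAL) WriteConfirm(int, int64) error { return errors.New("I/O error") }

    func main() {
    	txn := &Txn{LSN: 42, undoLog: []byte("old tuple image")}
    	fmt.Println(onQuorum(failingWAL{}, 1, txn))
    }

If WriteConfirm fails inside that window, the transaction can neither
be confirmed nor rolled back from local state alone, which is why the
panic-or-not question matters even for intermittent failures.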
> > The current spec, suggesting there should be a leader stop in
> > case of most errors, reduces availability significantly, and
> > doesn't make the external coordinator's job any easier - it still
> > has to follow to the letter the prescriptions of RAFT.
>
> So, the postponing of a commit until quorum collection is the most
> useful part of this RFC; to some extent I'm also trying to address
> the WAL inconsistency. Although, it can be covered only partly: if
> a leader's log diverges in unconfirmed transactions only, then they
> can be rolled back easily. Technically, it should be enough if the
> leader changed for a replica from the cluster majority at the
> moment of failure. Otherwise it will require pre-parsing of the
> WAL, and it can well happen that the WAL is not long enough, hence
> the ex-leader still needs a complete bootstrap.

I don't understand what pre-parsing is, or how what you write is
relevant to the fact that reduced availability of non-RAFT writes is
bad.

> > > But at this point a new leader will be appointed - the old one
> > > is restarted. Then the Confirm message will arrive to the
> > > restarted leader through regular replication.
> >
> > This assumes that a restart is guaranteed to be noticed by the
> > external coordinator and there is an election on every restart.
>
> Sure yes: if it restarted, then the connection loss can't go
> unnoticed by anyone, be it the coordinator or the cluster.

Well, the spec doesn't say anywhere that the external coordinator
has to establish a TCP connection to every participant. Could you
please add a chapter where this is clarified? It seems you have a
specific coordinator in mind?

> > > > > The quorum should be collected as a table for a list of
> > > > > transactions waiting for quorum. The latest transaction that
> > > > > collects the quorum is considered complete, as well as all
> > > > > transactions prior to it, since all transactions should be
> > > > > applied in order. The leader writes a 'confirm' message to
> > > > > the WAL that refers to the transaction's [LEADER_ID, LSN],
> > > > > and the confirm has its own LSN. This confirm message is
> > > > > delivered to all replicas through the existing replication
> > > > > mechanism.
> > > > >
> > > > > A replica should report TXN application success to the
> > > > > leader via IPROTO explicitly to allow the leader to collect
> > > > > the quorum for the TXN. In case of an application failure,
> > > > > the replica has to disconnect from replication the same way
> > > > > as it is done now. The replica also has to report its
> > > > > disconnection to the orchestrator. Further actions require
> > > > > human intervention, since a failure means either a technical
> > > > > problem (such as not enough space for the WAL) that has to
> > > > > be resolved, or an inconsistent state that requires a
> > > > > rejoin.
> > > > >
> > > > > As soon as the leader appears in a situation where it has
> > > > > not enough replicas to achieve quorum, the cluster should
> > > > > stop accepting any requests - both write and read.
> > > >
> > > > How does *the cluster* know the state of the leader, and if it
> > > > doesn't, how can it possibly implement this? Did you mean the
> > > > leader should stop accepting transactions here? But how can
> > > > the leader know that it has not enough replicas during a read
> > > > transaction, if it doesn't contact any replica to serve a
> > > > read?
> > >
> > > I expect to have a disconnection trigger assigned to all relays,
> > > so that a disconnection will cause the number of replicas to
> > > decrease. The quorum size is static, so we can stop at the very
> > > moment the number dives below it.
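The mechanism described above - a table of transactions waiting for
quorum, per-replica acks, and a relay-disconnect trigger checked
against a static quorum - amounts to something like the following
sketch. All identifiers are invented, and it assumes the leader
counts itself toward the quorum; it is an illustration of the RFC's
description, not the implementation:

    package main

    import "fmt"

    type Leader struct {
    	quorum    int           // static quorum size Q, leader included (assumption)
    	connected int           // replicas currently connected via relays
    	acks      map[int64]int // pending LSN -> replica acks collected so far
    	confirmed int64         // every LSN <= confirmed has reached quorum
    }

    // OnAck is called when a replica reports successful application of lsn
    // via IPROTO. Transactions are applied in order, so confirming the
    // latest LSN that reaches quorum confirms all prior pending ones too.
    func (l *Leader) OnAck(lsn int64) {
    	l.acks[lsn]++
    	// +1 accounts for the leader's own WAL write.
    	if l.acks[lsn]+1 >= l.quorum && lsn > l.confirmed {
    		for pending := range l.acks {
    			if pending <= lsn {
    				delete(l.acks, pending) // undo log may be freed here
    			}
    		}
    		l.confirmed = lsn
    		fmt.Printf("write Confirm for LSN %d (covers all prior LSNs)\n", lsn)
    	}
    }

    // OnRelayDisconnect is the trigger described above: the moment the
    // number of live replicas dives below what the static quorum needs,
    // the leader stops serving both reads and writes.
    func (l *Leader) OnRelayDisconnect() {
    	l.connected--
    	if l.connected+1 < l.quorum { // +1: the leader itself
    		fmt.Println("below quorum: stop accepting reads and writes")
    	}
    }

    func main() {
    	// N = 4 replicas plus the leader, Q = 3.
    	l := &Leader{quorum: 3, connected: 4, acks: map[int64]int{41: 1, 42: 1}}
    	l.OnAck(42)           // second ack for 42: quorum; 41 confirmed implicitly
    	l.OnRelayDisconnect() // 3 replicas + leader >= Q, still serving
    	l.OnRelayDisconnect() // 2 + 1 = Q, still serving
    	l.OnRelayDisconnect() // 1 + 1 < Q: stop
    }

Note that OnRelayDisconnect fires only once the relay notices the
disconnect; between the failure and its detection the counter is
stale - which is exactly the gap discussed below.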
> > What happens between the event the leader is partitioned away and
> > a new leader is elected?
> >
> > The leader may be unaware of the events and serve a read just
> > fine.
>
> As it is stated 20 lines above:
>
> > > > > As soon as the leader appears in a situation where it has
> > > > > not enough replicas to achieve quorum, the cluster should
> > > > > stop accepting any requests - both write and read.
>
> So it will not serve.

Sergey, this is recursion. I'm asking you to clarify exactly this
point. Do you assume that replicas perform some kind of failure
detection? What kind? Is it *in addition* to the failure detection
performed by the external coordinator? Any failure detector
imaginable would be asynchronous. What happens between the failure
and the time it's detected?

> > So at least you can't say the leader shouldn't be serving reads
> > without quorum - because the only way to achieve it is to collect
> > a quorum of responses to reads as well.
>
> The leader lost connection to (N-Q)+1 replicas out of the N in the
> cluster with a quorum of Q == it stops serving anything. So the
> quorum criterion is there: no quorum - no reads.

OK, so you assume that a TCP connection *is* the failure detector?
Failure detection in TCP is optional, asynchronous, and, worst of
all, unreliable. Why do you think it can be used?

> > > > > The reason for this is that replication of transactions can
> > > > > achieve quorum on replicas not visible to the leader. On the
> > > > > other hand, the leader can't achieve quorum with the
> > > > > available minority. The leader has to report the state and
> > > > > wait for human intervention. There's an option to ask the
> > > > > leader to roll back to the latest transaction that has
> > > > > quorum: the leader issues a 'rollback' message referring to
> > > > > the [LEADER_ID, LSN] where the LSN is of the first
> > > > > transaction in the leader's undo log. The rollback message,
> > > > > replicated to the available cluster, will put it in a
> > > > > consistent state. After that, the configuration of the
> > > > > cluster can be updated to the available quorum and the
> > > > > leader can be switched back to write mode.
> > > >
> > > > As you should be able to conclude from the restart scenario,
> > > > it is possible that a replica has the record in *confirmed*
> > > > state while the leader has it in pending state. The replica
> > > > will not be able to roll back then. Do you suggest the replica
> > > > should abort if it can't roll back? This may lead to an
> > > > avalanche of rejoins on leader restart, bringing performance
> > > > to a halt.
> > >
> > > No, I declare the replica with the biggest LSN the new shining
> > > leader. More than that, the new leader can (so far it will be by
> > > default) finalize the former leader's life's work by replicating
> > > the txns and the appropriate confirms.
> >
> > Right, this also assumes the restart is noticed, so it follows
> > the same logic.
>
> How can a restart go unnoticed, if it causes a disconnection?

Honestly, I'm baffled. It's like we speak different languages. I
can't imagine you are unaware of the fallacies of distributed
computing, but I see no other explanation for your question.

-- 
Konstantin Osipov, Moscow, Russia