Tarantool development patches archive
From: Konstantin Osipov <kostja.osipov@gmail.com>
To: Sergey Ostanevich <sergos@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org,
	Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
Date: Tue, 12 May 2020 19:42:02 +0300	[thread overview]
Message-ID: <20200512164202.GA2049@atlas> (raw)
In-Reply-To: <20200512155508.GJ112@tarantool.org>

* Sergey Ostanevich <sergos@tarantool.org> [20/05/12 18:56]:
> On May 06 21:44, Konstantin Osipov wrote:
> > * Sergey Ostanevich <sergos@tarantool.org> [20/05/06 19:41]:
> > > > >    |               |              |             |              |
> > > > >    |            [Quorum           |             |              |
> > > > >    |           achieved]          |             |              |
> > > > >    |               |              |             |              |
> > > > >    |         [TXN undo log        |             |              |
> > > > >    |           destroyed]         |             |              |
> > > > >    |               |              |             |              |
> > > > >    |               |---Confirm--->|             |              |
> > > > >    |               |              |             |              |
> > > > 
> > > > What happens if writing Confirm to WAL fails? The TXN undo log
> > > > record is destroyed already. Will the server panic now on WAL
> > > > failure, even if it is intermittent?
> > > 
> > > I would like to have an example of an intermittent WAL failure. Can
> > > it be anything other than a problem with the disk - be it
> > > space/availability/malfunction?
> > 
> > For SAN disks it can simply be a networking issue. The same is
> > true for any virtual filesystem in the cloud. For local disks it
> > is most often out of space, but this is not an impossible event.
> 
> The SanDisk is an SSD vendor. I bet you mean NAS - network-attached
> storage, isn't it? Then I see no difference for a WAL write into NAS
> in the current schema - you will catch a timeout, the WAL will report
> a failure, the replica stops.

SAN stands for storage area network.

There is no timeout in the WAL tx bus and no timeout in WAL I/O.

A replica doesn't stop on an intermittent failure. Stopping a
replica on an intermittent failure reduces availability of
non-sync writes.

It seems you have some assumptions in mind which are not in the
document - e.g. that some timeouts are added. They are not in the
POC either.

I suppose the document is expected to explain quite accurately
what has to be done - e.g. how would these new timeouts work?
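
To make the gap concrete, here is a minimal sketch of the kind of
timeout such a design would need on the commit path. Everything here
is invented for illustration - the quorum_cond variable, the
QUORUM_TIMEOUT knob and the wait_for_quorum() helper exist neither in
the RFC nor in the POC:

    local fiber = require('fiber')

    -- Hypothetical: signaled by the ack-collecting code once a txn
    -- has gathered a quorum of replica confirmations.
    local quorum_cond = fiber.cond()
    local QUORUM_TIMEOUT = 1.0  -- seconds; an invented knob

    local function wait_for_quorum()
        -- cond:wait(timeout) returns false if the timeout expires
        -- before a signal arrives.
        if not quorum_cond:wait(QUORUM_TIMEOUT) then
            -- The open question from above: on timeout, does the
            -- leader roll back, retry, stop, or panic?
            error('quorum not collected in time')
        end
    end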

> > > For all of those it should be resolved outside the DBMS anyway. So,
> > > the leader should stop and report its problems to the
> > > orchestrator/admins.
> > 
> > Sergey, I understand that RAFT spec is big and with this spec you
> > try to split it into manageable parts. The question is how useful
> > is this particular piece. I'm trying to point out that "the leader
> > should stop" is not a silver bullet - especially since each such
> > stop may mean a rejoin of some other node. The purpose of sync
> > replication is to provide consistency without reducing
> > availability (i.e. to make progress as long as a quorum
> > of nodes makes progress).
> 
> I'm not sure if we're talking about the same RAFT - mine is "In Search
> of an Understandable Consensus Algorithm (Extended Version)" from
> Stanford as of May 2014. And it is 15 pages - including references,
> conclusions and intro. Seems not that big.
> 
> Although, most of it is dedicated to the leader election itself,
> which we intentionally put aside from this RFC. It is written at the
> very beginning, and I emphasized this by mentioning it explicitly.

I conclude that it is big from the state of this document. It
provides some coverage of the normal operation.
Leader election, failure detection, recovery/restart,
replication configuration changes are either barely
mentioned or not covered at all.
I find no other reason not to cover them except to be able to come
up with an MVP quicker. Do you?


> > The current spec, suggesting there should be a leader stop in case
> > of most errors, reduces availability significantly, and doesn't
> > make the external coordinator's job any easier - it still has to
> > follow the prescriptions of RAFT to the letter.
> 
> So, the postponing of a commit until quorum collection is the most
> useful part of this RFC; to some extent I'm also trying to address
> the WAL inconsistency.

> Although, it can be covered only partly: if a
> leader's log diverges in unconfirmed transactions only, then they can
> be rolled back easily. Technically, it should be enough if the leader
> is changed to a replica from the cluster majority at the moment of
> failure. Otherwise it will require pre-parsing of the WAL, and it can
> well happen that the WAL is not long enough, hence the ex-leader still
> needs a complete bootstrap.

I don't understand what pre-parsing is, or how what you write is
relevant to the fact that reduced availability of non-RAFT writes is
bad.

> > > But at this point a new leader will be appointed - the old one is
> > > restarted. Then the Confirm message will arrive at the restarted
> > > leader through regular replication.
> > 
> > This assumes that restart is guaranteed to be noticed by the
> > external coordinator and there is an election on every restart.
> 
> Sure yes, if it restarted - then the lost connection can't go
> unnoticed by anyone, be it the coordinator or the cluster.

Well, the spec doesn't say anywhere that the external coordinator
has to establish a TCP connection to every participant. Could you
please add a chapter where this is clarified? It seems you have a
specific coordinator in mind?

> > > > > The quorum should be collected in a table holding the list of
> > > > > transactions waiting for quorum. The latest transaction that
> > > > > collects the quorum is considered complete, as well as all
> > > > > transactions prior to it, since all transactions should be applied
> > > > > in order. The leader writes a 'confirm' message to the WAL that
> > > > > refers to the transaction's [LEADER_ID, LSN], and the confirm has
> > > > > its own LSN. This confirm message is delivered to all replicas
> > > > > through the existing replication mechanism.
> > > > > 
> > > > > A replica should explicitly report TXN application success to the
> > > > > leader via IPROTO to allow the leader to collect the quorum for
> > > > > the TXN. In case of an application failure the replica has to
> > > > > disconnect from replication the same way as it is done now. The
> > > > > replica also has to report its disconnection to the orchestrator.
> > > > > Further actions require human intervention, since a failure means
> > > > > either a technical problem (such as not enough space for the WAL)
> > > > > that has to be resolved or an inconsistent state that requires a
> > > > > rejoin.
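
As an aside, here is a minimal Lua sketch of the leader-side
bookkeeping the two quoted paragraphs describe. The names (pending,
on_replica_ack, write_confirm_to_wal) are illustrative, not the POC's
API:

    -- Maps LSN -> number of copies known to exist (leader's + acks).
    local pending = {}
    local quorum_size = 3  -- the static Q from the RFC

    -- Stub for illustration; the real record would go through the
    -- WAL and get its own LSN.
    local function write_confirm_to_wal(lsn)
        print('CONFIRM up to lsn', lsn)
    end

    -- Called when the leader's own WAL write for a txn completes.
    local function on_txn_written(lsn)
        pending[lsn] = 1  -- the leader's own copy counts
    end

    -- Hypothetical handler for a replica's IPROTO success report.
    local function on_replica_ack(lsn)
        if pending[lsn] == nil then return end
        pending[lsn] = pending[lsn] + 1
        if pending[lsn] >= quorum_size then
            -- The latest txn that reaches quorum confirms all txns
            -- before it, since they are applied in order.
            write_confirm_to_wal(lsn)
            for l in pairs(pending) do
                if l <= lsn then pending[l] = nil end
            end
        end
    end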
> > > > 
> > > > > As soon as the leader finds itself in a situation where it does
> > > > > not have enough replicas to achieve quorum, the cluster should
> > > > > stop accepting any requests - both write and read.
> > > > 
> > > > How does *the cluster* know the state of the leader, and if it
> > > > doesn't, how can it possibly implement this? Did you mean
> > > > the leader should stop accepting transactions here? But how can
> > > > the leader know it has not enough replicas during a read
> > > > transaction, if it doesn't contact any replica to serve a read?
> > > 
> > > I expect to have a disconnection trigger assigned to all relays so
> > > that a disconnection will cause the number of replicas to decrease.
> > > The quorum size is static, so we can stop at the very moment the
> > > number dives below it.
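
A sketch of that trigger, again with illustrative names
(on_relay_disconnect is assumed to be wired to each relay's
disconnect event; only box.cfg{read_only = true} is a real API here):

    local quorum_size = 3        -- static Q, counting the leader itself
    local connected_replicas = 4 -- example starting value

    local function on_relay_disconnect()
        connected_replicas = connected_replicas - 1
        if connected_replicas + 1 < quorum_size then
            -- Below quorum: stop accepting writes. Refusing reads as
            -- well, as the RFC requires, would need an extra mechanism.
            box.cfg{ read_only = true }
        end
    end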
> > 
> > What happens between the moment the leader is partitioned away and
> > the moment a new leader is elected?
> > 
> > The leader may be unaware of the events and serve a read just
> > fine.
> 
> As it is stated 20 lines above: 
> > > > > As soon as the leader finds itself in a situation where it does
> > > > > not have enough replicas to achieve quorum, the cluster should
> > > > > stop accepting any requests - both write and read.
> 
> So it will not serve. 

Sergey, this is recursion. I'm asking you to clarify exactly this
point.
Do you assume that replicas perform some kind of
failure detection? What kind? Is it *in addition* to the failure
detection performed by the external coordinator? 
Any failure detector imaginable would be asynchronous. 
What happens between the failure and the time it's detected?

> > So at least you can't say the leader shouldn't be serving reads
> > without quorum - because the only way to achieve it is to collect
> > a quorum of responses to reads as well.
> 
> The leader lost connection to (N-Q)+1 replicas out of the N in the
> cluster with a quorum of Q == it stops serving anything. So the
> quorum criterion is there: no quorum - no reads.
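
(Worked example: with N = 5 and Q = 3 the leader tolerates
N - Q = 2 lost replica connections; after losing (N - Q) + 1 = 3 of
them, at most 2 replicas remain reachable, so a quorum of 3 is
impossible.)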

OK, so you assume that the TCP connection *is* the failure detector?

Failure detection in TCP is optional, asynchronous, and worst of
all, unreliable. Why do you think it can be used?

> > > > > The reason for this is that replication of transactions
> > > > > can achieve quorum on replicas not visible to the leader. On the
> > > > > other hand, the leader can't achieve quorum with the available
> > > > > minority. The leader has to report the state and wait for human
> > > > > intervention. There's an option to ask the leader to roll back to
> > > > > the latest transaction that has a quorum: the leader issues a
> > > > > 'rollback' message referring to the [LEADER_ID, LSN], where the
> > > > > LSN is that of the first transaction in the leader's undo log.
> > > > > The rollback message, replicated to the available part of the
> > > > > cluster, will put it in a consistent state. After that the
> > > > > configuration of the cluster can be updated to the available
> > > > > quorum and the leader can be switched back to write mode.
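
A sketch of that rollback path; undo_log, write_rollback_to_wal and
undo_txn are invented for illustration:

    -- Pending (unconfirmed) txns in commit order, oldest first.
    local undo_log = { { lsn = 101 }, { lsn = 102 }, { lsn = 103 } }

    local function write_rollback_to_wal(leader_id, lsn)
        print('ROLLBACK from', leader_id, lsn)  -- stub
    end

    local function undo_txn(txn)
        print('undo', txn.lsn)                  -- stub
    end

    -- One 'rollback' record referring to the [LEADER_ID, LSN] of the
    -- oldest pending txn undoes it and everything after it.
    local function rollback_pending(leader_id)
        if #undo_log == 0 then return end
        write_rollback_to_wal(leader_id, undo_log[1].lsn)
        for i = #undo_log, 1, -1 do  -- undo newest first
            undo_txn(undo_log[i])
            undo_log[i] = nil
        end
    end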
> > > > 
> > > > As you should be able to conclude from the restart scenario, it is
> > > > possible a replica has the record in *confirmed* state but the
> > > > leader has it in pending state. The replica will not be able to
> > > > roll back then. Do you suggest the replica should abort if it
> > > > can't rollback? This may lead to an avalanche of rejoins on leader
> > > > restart, bringing performance to a halt.
> > > 
> > > No, I declare the replica with the biggest LSN the new shining
> > > leader. More than that, the new leader can (so far it will be by
> > > default) finalize the former leader's life's work by replicating
> > > its txns and the appropriate confirms.
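
For what it's worth, the "biggest LSN wins" rule is easy to state as
code; the candidate list below is made up, and indexing the vclock by
replica id follows Tarantool's convention:

    local old_leader_id = 1
    local candidates = {                       -- illustrative data
        { uri = 'r1:3301', vclock = { [1] = 105 } },
        { uri = 'r2:3301', vclock = { [1] = 103 } },
    }

    -- Pick the reachable replica that has applied the most of the old
    -- leader's txns; it then finalizes them by replicating the txns
    -- and the corresponding confirms.
    local function pick_new_leader()
        local best
        for _, r in ipairs(candidates) do
            local lsn = r.vclock[old_leader_id] or 0
            if best == nil or lsn > (best.vclock[old_leader_id] or 0) then
                best = r
            end
        end
        return best
    end

    print(pick_new_leader().uri)  -- 'r1:3301'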
> > 
> > Right, this also assumes the restart is noticed, so it follows the
> > same logic.
> 
> How can a restart go unnoticed if it causes a disconnection?

Honestly, I'm baffled. It's like we speak different languages.
I can't imagine you are unaware of the fallacies of distributed
computing, but I see no other explanation for your question.

-- 
Konstantin Osipov, Moscow, Russia


Thread overview: 53+ messages
2020-04-03 21:08 Sergey Ostanevich
2020-04-07 13:02 ` Aleksandr Lyapunov
2020-04-08  9:18   ` Sergey Ostanevich
2020-04-08 14:05     ` Konstantin Osipov
2020-04-08 15:06       ` Sergey Ostanevich
2020-04-14 12:58 ` Sergey Bronnikov
2020-04-14 14:43   ` Sergey Ostanevich
2020-04-15 11:09     ` sergos
2020-04-15 14:50       ` sergos
2020-04-16  7:13         ` Aleksandr Lyapunov
2020-04-17 10:10         ` Konstantin Osipov
2020-04-17 13:45           ` Sergey Ostanevich
2020-04-20 11:20         ` Serge Petrenko
2020-04-20 23:32 ` Vladislav Shpilevoy
2020-04-21 10:49   ` Sergey Ostanevich
2020-04-21 22:17     ` Vladislav Shpilevoy
2020-04-22 16:50       ` Sergey Ostanevich
2020-04-22 20:28         ` Vladislav Shpilevoy
2020-04-23  6:58       ` Konstantin Osipov
2020-04-23  9:14         ` Konstantin Osipov
2020-04-23 11:27           ` Sergey Ostanevich
2020-04-23 11:43             ` Konstantin Osipov
2020-04-23 15:11               ` Sergey Ostanevich
2020-04-23 20:39                 ` Konstantin Osipov
2020-04-23 21:38 ` Vladislav Shpilevoy
2020-04-23 22:28   ` Konstantin Osipov
2020-04-30 14:50   ` Sergey Ostanevich
2020-05-06  8:52     ` Konstantin Osipov
2020-05-06 16:39       ` Sergey Ostanevich
2020-05-06 18:44         ` Konstantin Osipov
2020-05-12 15:55           ` Sergey Ostanevich
2020-05-12 16:42             ` Konstantin Osipov [this message]
2020-05-13 21:39             ` Vladislav Shpilevoy
2020-05-13 23:54               ` Konstantin Osipov
2020-05-14 20:38               ` Sergey Ostanevich
2020-05-20 20:59                 ` Sergey Ostanevich
2020-05-25 23:41                   ` Vladislav Shpilevoy
2020-05-27 21:17                     ` Sergey Ostanevich
2020-06-09 16:19                       ` Sergey Ostanevich
2020-06-11 15:17                         ` Vladislav Shpilevoy
2020-06-12 20:31                           ` Sergey Ostanevich
2020-05-13 21:36         ` Vladislav Shpilevoy
2020-05-13 23:45           ` Konstantin Osipov
2020-05-06 18:55     ` Konstantin Osipov
2020-05-06 19:10       ` Konstantin Osipov
2020-05-12 16:03         ` Sergey Ostanevich
2020-05-13 21:42       ` Vladislav Shpilevoy
2020-05-14  0:05         ` Konstantin Osipov
2020-05-07 23:01     ` Konstantin Osipov
2020-05-12 16:40       ` Sergey Ostanevich
2020-05-12 17:47         ` Konstantin Osipov
2020-05-13 21:34           ` Vladislav Shpilevoy
2020-05-13 23:31             ` Konstantin Osipov
