Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication

From: Sergey Ostanevich <sergos@tarantool.org>
To: Konstantin Osipov <kostja.osipov@gmail.com>,
	Vladislav Shpilevoy <v.shpilevoy@tarantool.org>,
	tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
Date: Wed, 6 May 2020 19:39:01 +0300
Message-ID: <20200506163901.GH112@tarantool.org>
In-Reply-To: <20200506085249.GA2842@atlas>

Hi!

Thanks for review!

> >    |               |              |             |              |
> >    |            [Quorum           |             |              |
> >    |           achieved]          |             |              |
> >    |               |              |             |              |
> >    |         [TXN undo log        |             |              |
> >    |           destroyed]         |             |              |
> >    |               |              |             |              |
> >    |               |---Confirm--->|             |              |
> >    |               |              |             |              |
> 
> What happens if writing Confirm to WAL fails? TXN undo log record
> is destroyed already. Will the server panic now on WAL failure,
> even if it is intermittent?

Could you give an example of an intermittent WAL failure? Can it be
anything other than a disk problem, be it space, availability or
malfunction?

All of those have to be resolved outside the DBMS anyway. So the
leader should stop and report its problems to the orchestrator/admins.

I agree that the undo log can be destroyed only *after* the Confirm is
written to the WAL; the same applies to the replica.
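
To illustrate the ordering I mean, here is a minimal sketch, not the
actual implementation. The `Wal` class, `WalWriteError` and
`confirm_txn` are hypothetical names made up for this example, and the
undo log is assumed to be a plain dict keyed by LSN. The point is only
that a failed Confirm write leaves the undo log untouched.

```python
class WalWriteError(Exception):
    """Raised by the hypothetical WAL when a record cannot be persisted."""


class Wal:
    """In-memory stand-in for a write-ahead log (illustration only)."""

    def __init__(self, fail=False):
        self.records = []
        self.fail = fail

    def write(self, record):
        if self.fail:
            raise WalWriteError("cannot write %r" % (record,))
        self.records.append(record)


def confirm_txn(wal, undo_log, leader_id, lsn):
    """Persist Confirm first; drop undo records only once that succeeded."""
    # If this raises, the undo log below is never touched.
    wal.write(("confirm", leader_id, lsn))
    # Confirm covers this txn and all earlier ones, so drop entries <= lsn.
    for key in [k for k in undo_log if k <= lsn]:
        del undo_log[key]


undo_log = {5: "undo-5", 6: "undo-6", 7: "undo-7"}
wal = Wal()
confirm_txn(wal, undo_log, leader_id=1, lsn=6)
```

With a WAL that fails on write, `confirm_txn` raises before any undo
record is dropped, so a rollback is still possible.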

> 
> >    |               |----------Confirm---------->|              |
> 
> What happens if peers receive and maybe even write Confirm to their WALs
> but local WAL write is lost after a restart?

By local, did you mean the WAL write on the leader? Then we have a
replica with a bigger LSN for the leader ID.

> WAL is not synced, 
> so we can easily lose the tail of the WAL. Tarantool will sync up
> with all replicas on restart,

But at this point a new leader will be appointed, since the old one is
being restarted. The Confirm message will then reach the restarted
leader through regular replication.

> but there will be no "Replication
> OK" messages from them, so it wouldn't know that the transaction
> is committed on them. How is this handled? We may end up with some
> replicas confirming the transaction while the leader will roll it
> back on restart. Do you suggest there is a human intervention on
> restart as well?
> 
> 
> >    |               |              |             |              |
> >    |<---TXN Ok-----|              |       [TXN undo log        |
> >    |               |              |         destroyed]         |
> >    |               |              |             |              |
> >    |               |              |             |---Confirm--->|
> >    |               |              |             |              |
> > ```
> > 
> > The quorum should be collected as a table for a list of transactions
> > waiting for quorum. The latest transaction that collects the quorum is
> > considered as complete, as well as all transactions prior to it, since
> > all transactions should be applied in order. Leader writes a 'confirm'
> > message to the WAL that refers to the transaction's [LEADER_ID, LSN] and
> > the confirm has its own LSN. This confirm message is delivered to all
> > replicas through the existing replication mechanism.
> > 
> > Replica should report a TXN application success to the leader via the
> > IPROTO explicitly to allow leader to collect the quorum for the TXN.
> > In case of application failure the replica has to disconnect from the
> > replication the same way as it is done now. The replica also has to
> > report its disconnection to the orchestrator. Further actions require
> > human intervention, since failure means either technical problem (such
> > as not enough space for WAL) that has to be resolved or an inconsistent
> > state that requires rejoin.
> 
> > As soon as the leader finds itself without enough replicas
> > to achieve quorum, the cluster should stop accepting any requests - both
> > write and read.
> 
> How does *the cluster* know the state of the leader and if it
> doesn't, how it can possibly implement this? Did you mean
> the leader should stop accepting transactions here? But how can
> the leader know if it has not enough replicas during a read
> transaction, if it doesn't contact any replica to serve a read?

I expect a disconnection trigger to be assigned to every relay, so that
each disconnection decreases the number of connected replicas. The
quorum size is static, so we can stop at the very moment the number
drops below it.
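
Roughly, the trigger logic could look like the sketch below. This is
only an illustration: the `Leader` class and `on_relay_disconnect` are
hypothetical names, and I assume here that the quorum counts the leader
itself plus its connected replicas.

```python
class Leader:
    """Toy leader that tracks connected replicas against a static quorum."""

    def __init__(self, quorum, replicas):
        self.quorum = quorum            # static quorum size (assumption:
                                        # the leader itself is counted)
        self.connected = set(replicas)  # replicas with a live relay
        self.accepting = self._has_quorum()

    def _has_quorum(self):
        # +1 accounts for the leader itself.
        return len(self.connected) + 1 >= self.quorum

    def on_relay_disconnect(self, replica_id):
        """Trigger body: called when the relay to replica_id is lost."""
        self.connected.discard(replica_id)
        if not self._has_quorum():
            # Stop serving requests right away, no replica round trip needed.
            self.accepting = False


leader = Leader(quorum=3, replicas=["r1", "r2", "r3"])
leader.on_relay_disconnect("r1")
still_accepting = leader.accepting   # leader + 2 replicas still meet quorum
leader.on_relay_disconnect("r2")     # leader + 1 replica: below quorum
```

The leader learns about the lost quorum from its own relays, so a read
request does not have to contact any replica to find out.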

> 
> > The reason for this is that replication of transactions
> > can achieve quorum on replicas not visible to the leader. On the other
> > hand, leader can't achieve quorum with available minority. Leader has to
> > report the state and wait for human intervention. There's an option to
> > ask leader to rollback to the latest transaction that has quorum: leader
> > issues a 'rollback' message referring to the [LEADER_ID, LSN] where LSN
> > is of the first transaction in the leader's undo log. The rollback
> > message replicated to the available cluster will put it in a consistent
> > state. After that configuration of the cluster can be updated to
> > available quorum and leader can be switched back to write mode.
> 
> As you should be able to conclude from restart scenario, it is
> possible a replica has the record in *confirmed* state but the
> leader has it in pending state. The replica will not be able to
> roll back then. Do you suggest the replica should abort if it
> can't rollback? This may lead to an avalanche of rejoins on leader
> restart, bringing performance to a halt.

No, I declare the replica with the biggest LSN the new shining leader.
Moreover, the new leader can (so far this will be the default) finalize
the former leader's life's work by replicating its txns and the
appropriate confirms.
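
As a sketch of that selection, assuming each replica reports a vclock
as a mapping from instance ID to LSN: `pick_new_leader` and
`missing_lsns` are hypothetical helpers for this example only, and ties
are broken by replica name just to keep the result deterministic.

```python
def pick_new_leader(vclocks, old_leader_id):
    """Return the replica whose LSN for old_leader_id is the biggest."""
    return max(sorted(vclocks),
               key=lambda name: vclocks[name].get(old_leader_id, 0))


def missing_lsns(vclocks, new_leader, old_leader_id):
    """List, per replica, the old leader's LSNs the new leader must send."""
    top = vclocks[new_leader].get(old_leader_id, 0)
    return {
        name: list(range(vc.get(old_leader_id, 0) + 1, top + 1))
        for name, vc in vclocks.items()
        if name != new_leader
    }


# Instance ID 1 is the former leader; values are the LSNs each replica has.
vclocks = {"r1": {1: 10}, "r2": {1: 12}, "r3": {1: 11}}
new_leader = pick_new_leader(vclocks, old_leader_id=1)
to_send = missing_lsns(vclocks, new_leader, old_leader_id=1)
```

Here `r2` becomes the leader and replicates the records (txns and
confirms) that `r1` and `r3` have not seen yet.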

Sergos.
