Tarantool development patches archive

From: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
To: Sergey Ostanevich <sergos@tarantool.org>,
	tarantool-patches@dev.tarantool.org,
	Timur Safin <tsafin@tarantool.org>, Mons Anderson <mons@cpan.org>
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
Date: Thu, 23 Apr 2020 23:38:34 +0200
Message-ID: <c86ef610-f54e-524e-103a-324e7e572d2d@tarantool.org>
In-Reply-To: <20200403210836.GB18283@tarantool.org>

Hi!

Here is a short summary of our late-night discussion and the
questions it brought up while I was trying to design a draft
implementation plan, since the RFC is too far from the code and
I needed a more 'pedestrian' and detailed one.

The question is about the 'confirm' message and quorum collection.
Here is the schema presented in the RFC:

> Customer        Leader          WAL(L)        Replica        WAL(R)
>    |------TXN----->|              |             |              |
>    |               |              |             |              |
>    |         [TXN undo log        |             |              |
>    |            created]          |             |              |
>    |               |              |             |              |
>    |               |-----TXN----->|             |              |
>    |               |              |             |              |
>    |               |-------Replicate TXN------->|              |
>    |               |              |             |              |
>    |               |              |       [TXN undo log        |
>    |               |<---WAL Ok----|          created]          |
>    |               |              |             |              |
>    |           [Waiting           |             |-----TXN----->|
>    |         of a quorum]         |             |              |
>    |               |              |             |<---WAL Ok----|
>    |               |              |             |              |
>    |               |<------Replication Ok-------|              |
>    |               |              |             |              |
>    |            [Quorum           |             |              |
>    |           achieved]          |             |              |
>    |               |              |             |              |
>    |         [TXN undo log        |             |              |
>    |           destroyed]         |             |              |
>    |               |              |             |              |
>    |               |---Confirm--->|             |              |
>    |               |              |             |              |
>    |               |----------Confirm---------->|              |
>    |               |              |             |              |
>    |<---TXN Ok-----|              |       [TXN undo log        |
>    |               |              |         destroyed]         |
>    |               |              |             |              |
>    |               |              |             |---Confirm--->|
>    |               |              |             |              |

It says that once the quorum is collected and 'confirm' is written
to the leader's local WAL, the transaction is considered committed
and is reported to the client as successful.
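
In pseudo-code terms, this is roughly how I read that part of the
schema while drafting the plan (a toy model in Python; all the
names and the representation are invented by me, nothing here is
actual Tarantool code):

    # Toy model of the commit path from the schema above.
    def leader_commit(txn_lsn, replica_acks, quorum, leader_wal):
        leader_wal.append(("txn", txn_lsn))          # TXN -> WAL(L)
        # Replicas applied the txn and acked: 'Replication Ok'.
        if len(replica_acks) + 1 >= quorum:          # leader counts itself
            leader_wal.append(("confirm", txn_lsn))  # confirm -> WAL(L) only
            return "TXN Ok"                          # the client sees success
                                                     # here; the confirm goes
                                                     # to the replicas later

    wal = []
    print(leader_commit(1, ["i2", "i3", "i4"], 3, wal), wal)

The point is that the client's success is decided by the local
confirm write alone; nothing in the schema makes the confirm
itself reach a quorum before 'TXN Ok' is sent.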

On the other hand, it is said that in case of a leader change the
new leader will roll back all unconfirmed transactions. That leads
to the following bug:

Assume we have 4 instances: i1, i2, i3, i4, and the leader is i1.
It writes a transaction with LSN1. LSN1 is sent to the other
nodes, they apply it fine and send acks to the leader. The leader
sees that i2-i4 have all applied the transaction (propagated
their LSNs to LSN1). It writes 'confirm' to its local WAL and
reports success to the client; the client's request is over, the
result is returned back to some remote node, etc. The transaction
is officially synchronously committed.

Then the leader's machine dies - the disk is dead. The 'confirm'
was not sent to any of the other nodes. For example, the leader
started having problems with its network connection to the
replicas shortly before the death, or it simply didn't manage to
hand the confirm out.

From now on, if any of the other nodes i2-i4 becomes the leader,
it will roll back the officially confirmed transaction, even
though it has the transaction, and so do all the other nodes.
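
To make it concrete, here is the same scenario as a tiny runnable
sketch (again an invented representation, not real code): i1 has
the confirm in its WAL, i2-i4 do not, and the 'roll back
everything not confirmed' rule eats the committed transaction on
the new leader:

    # i1 (dead) had: [("txn", 1), ("confirm", 1)]
    replica_wal = [("txn", 1)]        # i2-i4 never received the confirm

    def rolled_back_on_election(wal):
        confirmed = {lsn for kind, lsn in wal if kind == "confirm"}
        return [lsn for kind, lsn in wal
                if kind == "txn" and lsn not in confirmed]

    print(rolled_back_on_election(replica_wal))  # [1] - the officially
                                                 # committed txn is gone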

That basically means this sync replication gives exactly the same
guarantees as async replication: 'confirm' on the leader says
nothing about the replicas except that they *are able to apply
the transaction*, but they still may not actually apply it.

Am I missing something?

Another issue is failure detection. Let's assume that we wait for
'confirm' to be propagated to a quorum of replicas too. Assume
some replicas responded with an error: they first said they could
apply the transaction and saved it into their WALs, and then they
couldn't apply the 'confirm'. That can happen for 2 reasons: the
replica has problems with its WAL, or the replica became
unreachable from the master.
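
Schematically (invented statuses, just to separate the two cases):

    # Replies to the confirm, per replica that already acked the TXN.
    replies = {"i2": "ok", "i3": "wal_error", "i4": None}  # None = silence

    wal_broken  = [r for r, s in replies.items() if s == "wal_error"]
    unreachable = [r for r, s in replies.items() if s is None]
    print(wal_broken, unreachable)   # ['i3'] ['i4']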

Replicas with WAL problems can be disconnected forcefully, since
they are clearly not able to work properly anymore. But what to
do with the disconnected replicas? 'Confirm' can't wait for them
forever - we will run out of fibers if we have even just hundreds
of RPS of sync transactions and wait for, let's say, a few
minutes. On the other hand, we can't roll the transactions back,
because 'confirm' has already been written to the local WAL.
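
Just to put a rough number on 'run out of fibers' (my own
back-of-the-envelope estimate, the figures are made up):

    # Fibers stuck waiting for a disconnected replica to confirm.
    sync_rps     = 300          # sync transactions per second
    wait_seconds = 180          # how long we are willing to wait
    print(sync_rps * wait_seconds)   # 54000 fibers just sitting there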

A note for those who are concerned: this has nothing to do with
the in-memory relay. It has the same problems, because they are
in the protocol, not in the implementation.

