Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication

Tarantool development patches archive
 help / color / mirror / Atom feed

From: Konstantin Osipov <kostja.osipov@gmail.com>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
Date: Thu, 14 May 2020 02:45:53 +0300	[thread overview]
Message-ID: <20200513234553.GB5698@atlas> (raw)
In-Reply-To: <10920620-b911-360f-f075-7718a40bf4af@tarantool.org>

* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/05/14 00:37]:
> > Thanks for review!
> > 
> >>>    |               |              |             |              |
> >>>    |            [Quorum           |             |              |
> >>>    |           achieved]          |             |              |
> >>>    |               |              |             |              |
> >>>    |         [TXN undo log        |             |              |
> >>>    |           destroyed]         |             |              |
> >>>    |               |              |             |              |
> >>>    |               |---Confirm--->|             |              |
> >>>    |               |              |             |              |
> >>
> >> What happens if writing Confirm to WAL fails? TXN und log record
> >> is destroyed already. Will the server panic now on WAL failure,
> >> even if it is intermittent?
> > 
> > I would like to have an example of intermittent WAL failure. Can it be
> > other than problem with disc - be it space/availability/malfunction?
> > 
> > For all of those it should be resolved outside the DBMS anyways. So,
> > leader should stop and report its problems to orchestrator/admins.
> > 
> > I would agree that undo log can be destroyed *after* the Confirm is
> > landed to WAL - same is for replica.
> 
> Well, in fact you can't (or can you?). Because it won't help. Once you
> tried to write 'Confirm', it means you got the quorum. So now in case
> you will fail, a new leader will write 'Confirm' for you, when will see
> a quorum too. So the current leader has no right to write 'Rollback'
> from this moment, from what I understand. Because it still can be
> confirmed by a new leader later, if you fail before 'Rollback' is
> replicated to all.
> 
> However the same problem appears, if you write 'Confirm' *successfully*.
> Still the leader can fail, and a newer leader will write 'Rollback' if
> won't collect the quorum again. Don't know what to do with that really.
> Probably nothing.

Maybe consult with the raft spec? 

The new leader is guaranteed to see the transaction since it has
reached the majority of replicas. So it will definitely write
"confirm" for it. The reason I asked the question is I want the
case of intermittent failures be described in the spec.
For example, is "confirm" a cbus message, then if there is a
cascading rollback of the batch it is part of, it can be rolled
back. I would like to see all these scenarios covered in the spec.
If one of them ends with panic, I would like to understand how the
external coordinator is going to resolve the new election. 
Raft has answers for all of it.



> >>> As soon as leader appears in a situation it has not enough replicas
> >>> to achieve quorum, the cluster should stop accepting any requests - both
> >>> write and read.
> >>
> >> How does *the cluster* know the state of the leader and if it
> >> doesn't, how it can possibly implement this? Did you mean
> >> the leader should stop accepting transactions here? But how can
> >> the leader know if it has not enough replicas during a read
> >> transaction, if it doesn't contact any replica to serve a read?
> > 
> > I expect to have a disconnection trigger assigned to all relays so that
> > disconnection will cause the number of replicas decrease. The quorum
> > size is static, so we can stop at the very moment the number dives below.
> 
> This is a very dubious statement. In TCP disconnect may be detected much
> later, than it happened. So to collect a quorum on something you need to
> literally collect this quorum, with special WAL records, via network, and
> all. A disconnect trigger does not help at all here.

Erhm, thanks. 

> Talking of the whole 'read-quorum' idea, I don't like it. Because this
> really makes things unbearably harder to implement, the nodes become
> much slower and less available in terms of any problems.
> 
> I think reads should be allowed always, and from any node (except during
> bootstrap, of course). After all, you have transactions for consistency. So
> as far as replication respects transaction boundaries, every node is in a
> consistent state. Maybe not all of them are in the same state, but every one
> is consistent.

In memtx, you read by default dirty, uncommitted data. It was OK for single-node
transactions, since the only chance for it to be rolled back were
out of space/disk failure, which were extremely rare, now you
really read dirty stuff, because you can easily have it rolled
back because of lack of quorum or re-election.
So it's a much bigger deal.


> Honestly, I can't even imagine, how is it possible to implement a completely
> synchronous simultaneous cluster progression. It is impossible even in theory.
> There always will be a time period, when some nodes are further than the
> others. At least because of network delays.
> 
> So either we allows reads from master only, or we allow reads from everywhere,
> and in that case nothing will save from a possibility of seeing different
> data on different nodes.

This is why there are many consistency models out there (just
google consistency models in distributed systems), and the minor
details are important. It's indeed hard to implement the strictest model
(serial), but it is also often unnecessary, and there is consensus
in the relational databases what issues are acceptable and what are not. 

More specifically, I think for tarantool sync replication we
should aim at read committed. The spec should say it in no
uncertain terms and explain how it is achieved.

-- 
Konstantin Osipov, Moscow, Russia

next prev parent reply	other threads:[~2020-05-13 23:45 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-03 21:08 Sergey Ostanevich
2020-04-07 13:02 ` Aleksandr Lyapunov
2020-04-08  9:18   ` Sergey Ostanevich
2020-04-08 14:05     ` Konstantin Osipov
2020-04-08 15:06       ` Sergey Ostanevich
2020-04-14 12:58 ` Sergey Bronnikov
2020-04-14 14:43   ` Sergey Ostanevich
2020-04-15 11:09     ` sergos
2020-04-15 14:50       ` sergos
2020-04-16  7:13         ` Aleksandr Lyapunov
2020-04-17 10:10         ` Konstantin Osipov
2020-04-17 13:45           ` Sergey Ostanevich
2020-04-20 11:20         ` Serge Petrenko
2020-04-20 23:32 ` Vladislav Shpilevoy
2020-04-21 10:49   ` Sergey Ostanevich
2020-04-21 22:17     ` Vladislav Shpilevoy
2020-04-22 16:50       ` Sergey Ostanevich
2020-04-22 20:28         ` Vladislav Shpilevoy
2020-04-23  6:58       ` Konstantin Osipov
2020-04-23  9:14         ` Konstantin Osipov
2020-04-23 11:27           ` Sergey Ostanevich
2020-04-23 11:43             ` Konstantin Osipov
2020-04-23 15:11               ` Sergey Ostanevich
2020-04-23 20:39                 ` Konstantin Osipov
2020-04-23 21:38 ` Vladislav Shpilevoy
2020-04-23 22:28   ` Konstantin Osipov
2020-04-30 14:50   ` Sergey Ostanevich
2020-05-06  8:52     ` Konstantin Osipov
2020-05-06 16:39       ` Sergey Ostanevich
2020-05-06 18:44         ` Konstantin Osipov
2020-05-12 15:55           ` Sergey Ostanevich
2020-05-12 16:42             ` Konstantin Osipov
2020-05-13 21:39             ` Vladislav Shpilevoy
2020-05-13 23:54               ` Konstantin Osipov
2020-05-14 20:38               ` Sergey Ostanevich
2020-05-20 20:59                 ` Sergey Ostanevich
2020-05-25 23:41                   ` Vladislav Shpilevoy
2020-05-27 21:17                     ` Sergey Ostanevich
2020-06-09 16:19                       ` Sergey Ostanevich
2020-06-11 15:17                         ` Vladislav Shpilevoy
2020-06-12 20:31                           ` Sergey Ostanevich
2020-05-13 21:36         ` Vladislav Shpilevoy
2020-05-13 23:45           ` Konstantin Osipov [this message]
2020-05-06 18:55     ` Konstantin Osipov
2020-05-06 19:10       ` Konstantin Osipov
2020-05-12 16:03         ` Sergey Ostanevich
2020-05-13 21:42       ` Vladislav Shpilevoy
2020-05-14  0:05         ` Konstantin Osipov
2020-05-07 23:01     ` Konstantin Osipov
2020-05-12 16:40       ` Sergey Ostanevich
2020-05-12 17:47         ` Konstantin Osipov
2020-05-13 21:34           ` Vladislav Shpilevoy
2020-05-13 23:31             ` Konstantin Osipov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200513234553.GB5698@atlas \
    --to=kostja.osipov@gmail.com \
    --cc=tarantool-patches@dev.tarantool.org \
    --cc=v.shpilevoy@tarantool.org \
    --subject='Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox