From: Konstantin Osipov <kostja.osipov@gmail.com>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
Date: Thu, 14 May 2020 03:05:32 +0300
Message-ID: <20200514000532.GD5698@atlas>
In-Reply-To: <f256f58d-7280-9cdb-cef0-88a279b56260@tarantool.org>

* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/05/14 00:47]:
> > A few more issues:
> >
> > - the spec assumes there is a full mesh. In any other topology
> > electing a leader based on the longest wal can easily deadlock.
> > Yet it provides no protection against non-full-mesh setups.
> > Currently the server can't even detect that this is not a
> > full-mesh setup, so it can't check whether the precondition for
> > this to work correctly is met.
>
> Yes, this is a very unstable construction. But we failed to come up
> with a solution right now which would protect against accidental
> non-fullmesh. For example, how will it work when I add a new node?
> If non-fullmesh is forbidden, the new node just can't be added ever,
> because this can't be done on all nodes simultaneously.

Again, the answer is present in the Raft spec. The node is added in
two steps: the first step commits the "add node" event to the durable
state of the entire group, the second step (which is also a Raft
transaction) enacts the new node.

This could be achieved in a more or less straightforward manner if
_cluster were a sync table with replication group = all members of
the cluster. But as I said, I can't imagine this is possible with an
external coordinator, since it may not be available during boot.

Regarding detecting the full mesh: remember the task I created for
using swim to discover members and bring non-full-mesh setups to
full mesh automatically? Is the reason for this task to exist clear
now?

Is it clear now why I asked you (multiple times) to begin working on
sync replication by adding built-in swim instances on every replica
and using them, instead of the current replication heartbeats, for
failure detection? I believe there was a task somewhere for it, too.

> > - the spec assumes that quorum is identical to the number of
> > replicas, and that the number of replicas is stable across the
> > cluster's lifetime. Can I have quorum=2 while the number of
> > replicas is 4? Am I allowed to increase the number of replicas
> > online? What happens when a replica is added: how exactly, and
> > starting from which transaction, is the leader required to
> > collect a bigger quorum?
>
> Quorum <= number of replicas. It is a parameter, just like
> replication_connect_quorum.

I wrote in a comment to the task that it'd be even better if we
listed node uuids as group members and assigned the group to a space
explicitly, so that it's not just a number of replicas but specific
replicas identified by their uuids. The thing is, it's vague in the
spec. The spec has to be explicit about all box.schema API changes,
because they will define legacy that will be hard to deal with later.

> I think you are allowed to add new replicas. When a replica is added,
> it goes through the normal join process.

At what point does it join the group and start to ACK, i.e. become
part of a quorum? That's the question I wanted to see written down
explicitly in this document. Raft has an answer for it.
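To make the two-step addition and the "when does it start to ACK"
question above concrete, here is a rough sketch. It is purely an
illustration: neither box.schema.group nor the add()/enable() methods
exist today, they just follow the API shape proposed in this thread.

    -- Hypothetical sketch of a two-step, Raft-style node addition.
    local new_replica_uuid = 'd2e34a85-1c6b-4d9c-9d06-9a1e5f3d2c01'

    -- Step 1: a synchronous transaction over the CURRENT group durably
    -- records the "add node" event, so every existing member learns
    -- about the newcomer before its ACKs can matter.
    box.schema.group.mygroup.add(new_replica_uuid)

    -- ... the new replica joins and catches up via the normal join
    -- process, without yet being counted in any quorum ...

    -- Step 2: another synchronous transaction enacts the node; only
    -- from this point on does the leader count it towards the quorum
    -- of pending transactions.
    box.schema.group.mygroup.enable(new_replica_uuid)

The point of splitting it in two is the same as in Raft membership
changes: the set of nodes whose ACKs count is itself agreed on through
the replicated log, so old and new members never disagree about what
the quorum is.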
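And for the swim point earlier in this message: Tarantool already
ships a built-in swim Lua module, so per-replica member discovery
could look roughly like the sketch below. Exact option names and
event methods should be checked against the swim module docs, and the
idea of rebuilding box.cfg.replication from it is just the proposal
from this thread, not existing behaviour.

    -- Rough sketch, assuming the built-in swim module.
    local swim = require('swim')

    -- One swim instance per replica, gossiping over UDP.
    local s = swim.new({
        uri = 3301,
        uuid = box.info.uuid,
        heartbeat_rate = 1,
    })

    -- Seed it with at least one known member; the rest of the cluster
    -- is then discovered via gossip.
    s:probe_member('replica1.example.com:3301')

    -- React to membership changes: e.g. keep box.cfg.replication a
    -- full mesh, and treat a dropped member as a failed replica
    -- instead of relying on replication heartbeats.
    s:on_member_event(function(member, event)
        if event:is_new() or event:is_drop() then
            -- rebuild the replication source list from s:pairs() here
        end
    end)

With something like this every instance eventually knows the full
member list, so a non-full-mesh configuration can at least be
detected, if not repaired automatically.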
> > - the same goes for removing a replica. How is the quorum reduced?
>
> Node is just removed, I guess. If the total number of nodes becomes
> less than the quorum, obviously no transactions will be served.

Other vendors support 3 different scenarios here:

- it can be down for maintenance. In our terms, it means it is simply
  shut down, without changes to _cluster or space settings;
- it can be removed forever; in that case an admin may want to reduce
  the quorum size;
- it can be replaced.

With a box.schema.group API all 3 cases can be translated to API
calls on the group itself, e.g. it would be possible to say

  box.schema.group.groupname.remove(uuid)
  box.schema.group.groupname.replace(old_uuid, new_uuid)

We don't need to implement it right away, but we must provision for
these operations in the spec, and at least have a clue how they will
be handled in the future.

> However, what to do with the existing pending transactions, which
> already accounted the removed replica in their quorums? Should they
> be decremented?
>
> All that I am saying here are guesses, which should be clarified in
> the RFC in the ideal world, of course.
>
> Tbh, we discussed the sync replication for many hours in voice, and
> it is a surprise that all of it fits into such a small update of the
> RFC. Even though it didn't fit, since we obviously still didn't
> clarify many things. Especially the exact API look.

-- 
Konstantin Osipov, Moscow, Russia