[Tarantool-patches] [RFC] Quorum-based synchronous replication
Konstantin Osipov
kostja.osipov at gmail.com
Thu May 14 03:05:32 MSK 2020
* Vladislav Shpilevoy <v.shpilevoy at tarantool.org> [20/05/14 00:47]:
> > A few more issues:
> >
> > - the spec assumes there is a full mesh. In any other
> > topology electing a leader based on the longest wal can easily
> > deadlock. Yet it provides no protection against non-full-mesh
> > setups. Currently the server can't even detect that this is not
> > a full-mesh setup, so can't check if the precondition for this
> > to work correctly is met.
>
> Yes, this is a very unstable construction. But we have failed so far
> to come up with a solution which would protect against an accidental
> non-full-mesh. For example, how will it work when I add a new node?
> If non-full-mesh is forbidden, a new node can never be added, because
> this can't be done on all nodes simultaneously.
Again, the answer is present in the Raft spec. The node is added in
two steps: the first step commits the "add node" event to the durable
state of the entire group, and the second step (which is also a Raft
transaction) enacts the new node. This could be achieved in a more
or less straightforward manner if _cluster were a sync table with
replication group = all members of the cluster. But as I said, I
can't imagine this is possible with an external coordinator, since
it may not be available during boot.
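The two-step scheme above can be sketched as follows. This is a
minimal illustrative model, not Tarantool code; the `Group` class and
its method names are hypothetical, and the list standing in for the
WAL elides real quorum-ACK machinery:

```python
# Hypothetical sketch of the Raft-style two-step node addition.
# "commit" stands in for replicating an entry and collecting a quorum
# of ACKs from the *current* configuration.

class Group:
    def __init__(self, members):
        self.members = set(members)  # voting members of the group
        self.log = []                # stands in for the replicated WAL

    def commit(self, entry):
        # In a real system the entry must be durably ACKed by a quorum
        # of the current configuration before this call returns.
        self.log.append(entry)

    def add_node(self, uuid):
        # Step 1: commit the "add node" event to the durable state of
        # the whole group through the ordinary replication path.
        self.commit(("add_node", uuid))
        # Step 2: a second committed entry enacts the new node; only
        # from this point do its ACKs count toward the quorum.
        self.commit(("enact", uuid))
        self.members.add(uuid)

group = Group({"a", "b", "c"})
group.add_node("d")
print(sorted(group.members))  # ['a', 'b', 'c', 'd']
print(group.log)              # [('add_node', 'd'), ('enact', 'd')]
```

The point of the two entries is that the group never switches
configurations atomically on all nodes at once; each step is an
ordinary replicated transaction.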
Regarding detecting the full mesh, remember the task I created for
using swim to discover members and bring non-full-mesh setups to
full-mesh automatically? Is the reason for this task to exist
clear now? Is it clear now why I asked you (multiple times) to
begin working on sync replication by adding built-in swim
instances on every replica and using them, instead of the current
replication heartbeats, for failure detection? I believe there was
a task somewhere for it, too.
> > - the spec assumes that quorum is identical to the
> > number of replicas, and the number of replicas is stable across
> > cluster life time. Can I have quorum=2 while the number of
> > replicas is 4? Am I allowed to increase the number of replicas
> > online? What happens when a replica is added,
> > how exactly and starting from which transaction is the leader
> > required to collect a bigger quorum?
>
> Quorum <= number of replicas. It is a parameter, just like
> replication_connect_quorum.
I wrote in a comment to the task that it'd be even better if we
listed node uuids as group members and assigned the group to a space
explicitly, so that it's not just a number of replicas, but specific
replicas identified by their uuids.
The thing is, all of this is vague in the spec. The spec has to be
explicit about all box.schema API changes, because they will define a
legacy that will be hard to deal with later.
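To make the difference concrete, here is a sketch of a quorum defined
over explicit member uuids rather than a bare replica count. All names
here are illustrative assumptions, not a proposed box.schema API:

```python
# Illustrative only: a quorum check that counts ACKs solely from known
# group members, so an ACK from a node outside the group never counts.

def quorum_reached(members, quorum, acks):
    """True once at least `quorum` of the listed members have ACKed."""
    valid = {uuid for uuid in acks if uuid in members}
    return len(valid) >= quorum

members = {"uuid-1", "uuid-2", "uuid-3", "uuid-4"}

# quorum=2 with 4 replicas is a valid configuration:
print(quorum_reached(members, 2, {"uuid-1", "uuid-3"}))   # True

# an ACK from a replica that is not a group member must not count:
print(quorum_reached(members, 2, {"uuid-1", "uuid-99"}))  # False
```

With a bare count, the second case would wrongly commit; with explicit
membership, the set of replicas that may form a quorum is pinned down.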
> I think you are allowed to add new replicas. When a replica is added,
> it goes through the normal join process.
At what point does it join the group and become able to ACK, i.e.
become part of a quorum? That's the question I wanted to see written
down explicitly in this document. Raft has an answer for it.
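Roughly, the Raft answer is that a joining server is first a
non-voting learner and is promoted to a voter only once it has caught
up with the leader's log. A simplified sketch, with all names and the
lag threshold being assumptions for illustration:

```python
# Simplified model of the learner-then-voter rule: the joiner's ACKs
# start counting toward the quorum only after it catches up.

class Replica:
    def __init__(self, uuid):
        self.uuid = uuid
        self.applied_lsn = 0   # how far this replica has applied the log
        self.voter = False     # learners do not count toward the quorum

def maybe_promote(replica, leader_lsn, lag_threshold=0):
    # Promote once the join/catch-up phase has brought the replica
    # close enough to the end of the leader's log.
    if leader_lsn - replica.applied_lsn <= lag_threshold:
        replica.voter = True
    return replica.voter

r = Replica("uuid-new")
r.applied_lsn = 90
print(maybe_promote(r, leader_lsn=100))  # False: still catching up
r.applied_lsn = 100
print(maybe_promote(r, leader_lsn=100))  # True: now part of the quorum
```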
> > - the same goes for removing a replica. How is the quorum reduced?
>
> The node is just removed, I guess. If the total number of nodes
> becomes less than the quorum, obviously no transactions will be
> served.
Other vendors support 3 different scenarios here:
- a replica can be down for maintenance; in our terms, it means it is
  simply shut down, without changes to _cluster or space settings;
- it can be removed forever, in which case an admin may want to
  reduce the quorum size;
- it can be replaced.
With a box.schema.group API, all 3 cases can be translated into API
calls on the group itself.
e.g. it would be possible to say
box.schema.group.groupname.remove(uuid)
box.schema.group.groupname.replace(old_uuid, new_uuid).
We don't need to implement it right away, but we must provision
for these operations in the spec, and at least have a clue how
they will be handled in the future.
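A sketch of how the three scenarios could map onto group operations.
This is a hypothetical Python model mirroring the suggested
box.schema.group calls, not a real Tarantool API, and it assumes a
majority quorum, whereas the spec makes the quorum configurable
(maintenance shutdown needs no call at all, so it is absent here):

```python
# Hypothetical model: permanent removal shrinks the membership (and so
# may shrink a majority quorum), while replacement keeps the size and
# the quorum unchanged.

class ReplicaGroup:
    def __init__(self, members):
        self.members = set(members)

    @property
    def quorum(self):
        # Assumed majority quorum, recomputed from current membership.
        return len(self.members) // 2 + 1

    def remove(self, uuid):
        # "Removed forever": membership shrinks, quorum may shrink too.
        self.members.discard(uuid)

    def replace(self, old_uuid, new_uuid):
        # "Replaced": membership size, and hence quorum, is unchanged.
        self.members.discard(old_uuid)
        self.members.add(new_uuid)

g = ReplicaGroup({"u1", "u2", "u3", "u4"})
print(g.quorum)        # 3 (majority of 4)
g.remove("u4")
print(g.quorum)        # 2 (majority of 3)
g.replace("u3", "u5")
print(g.quorum)        # 2 (size unchanged)
```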
> However, what to do with the existing pending transactions, which
> have already counted the removed replica toward their quorums? Should
> they be decremented?
>
> All what I am talking here are guesses. Which should be clarified in the
> RFC in the ideal world, of course.
>
> Tbh, we discussed sync replication for many hours in voice, and it
> is a surprise that all of it fits into such a small update of the
> RFC. Though in fact it didn't fit, since we obviously still haven't
> clarified many things, especially the exact API.
--
Konstantin Osipov, Moscow, Russia