From: Konstantin Osipov <kostja.osipov@gmail.com>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
Date: Thu, 14 May 2020 03:05:32 +0300
Message-ID: <20200514000532.GD5698@atlas>
In-Reply-To: <f256f58d-7280-9cdb-cef0-88a279b56260@tarantool.org>

* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/05/14 00:47]:
> > A few more issues:
> >
> > - the spec assumes there is a full mesh. In any other topology
> > electing a leader based on the longest wal can easily deadlock.
> > Yet it provides no protection against non-full-mesh setups.
> > Currently the server can't even detect that this is not a
> > full-mesh setup, so it can't check whether the precondition for
> > this to work correctly is met.
>
> Yes, this is a very unstable construction. But we failed to come up
> with a solution right now which would protect against accidental
> non-fullmesh. For example, how will it work when I add a new node?
> If non-fullmesh is forbidden, the new node just can't be added ever,
> because this can't be done on all nodes simultaneously.

Again, the answer is present in the Raft spec. The node is added in
two steps: the first step commits the "add node" event to the durable
state of the entire group, the second step (which is also a Raft
transaction) enacts the new node.

This could be achieved in a more or less straightforward manner if
_cluster were a sync table with replication group = all members of
the cluster. But as I said, I can't imagine this is possible with an
external coordinator, since it may not be available during boot.

Regarding detecting the full mesh: remember the task I created for
using swim to discover members and bring non-full-mesh setups to
full mesh automatically? Is the reason for this task to exist clear
now?

Is it clear now why I asked you (multiple times) to begin working on
sync replication by adding built-in swim instances on every replica
and using them, instead of the current replication heartbeats, for
failure detection? I believe there was a task somewhere for it, too.

> > - the spec assumes that quorum is identical to the number of
> > replicas, and that the number of replicas is stable across the
> > cluster's lifetime. Can I have quorum=2 while the number of
> > replicas is 4? Am I allowed to increase the number of replicas
> > online? What happens when a replica is added: how exactly, and
> > starting from which transaction, is the leader required to
> > collect a bigger quorum?
>
> Quorum <= number of replicas. It is a parameter, just like
> replication_connect_quorum.

I wrote in a comment to the task that it'd be even better if we
listed node uuids as group members and assigned the group to a space
explicitly, so that it's not just a number of replicas but specific
replicas identified by their uuids. The thing is, it's vague in the
spec. The spec has to be explicit about all box.schema API changes,
because they will define legacy that will be hard to deal with later.

> I think you are allowed to add new replicas. When a replica is added,
> it goes through the normal join process.

At what point does it join the group and start to ACK, i.e. become
part of a quorum? That's the question I wanted to see written down
explicitly in this document. Raft has an answer for it.
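To make the two-step addition and the "when does it start to ACK"
question above concrete, here is a rough sketch. It is purely an
illustration: neither box.schema.group nor the add()/enable() methods
exist today, they just follow the API shape proposed in this thread.

    -- Hypothetical sketch of a two-step, Raft-style node addition.
    local new_replica_uuid = 'd2e34a85-1c6b-4d9c-9d06-9a1e5f3d2c01'

    -- Step 1: a synchronous transaction over the CURRENT group durably
    -- records the "add node" event, so every existing member learns
    -- about the newcomer before its ACKs can matter.
    box.schema.group.mygroup.add(new_replica_uuid)

    -- ... the new replica joins and catches up via the normal join
    -- process, without yet being counted in any quorum ...

    -- Step 2: another synchronous transaction enacts the node; only
    -- from this point on does the leader count it towards the quorum
    -- of pending transactions.
    box.schema.group.mygroup.enable(new_replica_uuid)

The point of splitting it in two is the same as in Raft membership
changes: the set of nodes whose ACKs count is itself agreed on through
the replicated log, so old and new members never disagree about what
the quorum is.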
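And for the swim point earlier in this message: Tarantool already
ships a built-in swim Lua module, so per-replica member discovery
could look roughly like the sketch below. Exact option names and
event methods should be checked against the swim module docs, and the
idea of rebuilding box.cfg.replication from it is just the proposal
from this thread, not existing behaviour.

    -- Rough sketch, assuming the built-in swim module.
    local swim = require('swim')

    -- One swim instance per replica, gossiping over UDP.
    local s = swim.new({
        uri = 3301,
        uuid = box.info.uuid,
        heartbeat_rate = 1,
    })

    -- Seed it with at least one known member; the rest of the cluster
    -- is then discovered via gossip.
    s:probe_member('replica1.example.com:3301')

    -- React to membership changes: e.g. keep box.cfg.replication a
    -- full mesh, and treat a dropped member as a failed replica
    -- instead of relying on replication heartbeats.
    s:on_member_event(function(member, event)
        if event:is_new() or event:is_drop() then
            -- rebuild the replication source list from s:pairs() here
        end
    end)

With something like this every instance eventually knows the full
member list, so a non-full-mesh configuration can at least be
detected, if not repaired automatically.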
> > - the same goes for removing a replica. How is the quorum reduced?
>
> Node is just removed, I guess. If the total number of nodes becomes
> less than the quorum, obviously no transactions will be served.

Other vendors support 3 different scenarios here:

- it can be down for maintenance. In our terms, it means it is simply
  shut down, without changes to _cluster or space settings;
- it can be removed forever; in that case an admin may want to reduce
  the quorum size;
- it can be replaced.

With a box.schema.group API all 3 cases can be translated to API
calls on the group itself, e.g. it would be possible to say

  box.schema.group.groupname.remove(uuid)
  box.schema.group.groupname.replace(old_uuid, new_uuid)

We don't need to implement it right away, but we must provision for
these operations in the spec, and at least have a clue how they will
be handled in the future.

> However, what to do with the existing pending transactions, which
> already accounted the removed replica in their quorums? Should they
> be decremented?
>
> All that I am saying here are guesses, which should be clarified in
> the RFC in the ideal world, of course.
>
> Tbh, we discussed the sync replication for many hours in voice, and
> it is a surprise that all of it fits into such a small update of the
> RFC. Even though it didn't fit, since we obviously still didn't
> clarify many things. Especially the exact API look.

-- 
Konstantin Osipov, Moscow, Russia