From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vladislav Shpilevoy
Subject: [PATCH 1/8] rfc: describe box.ctl.promote protocol
Date: Wed, 8 Aug 2018 01:03:44 +0300
Message-Id: <19cfba156e919307ad24bb0604c94f3fad272bce.1533679264.git.v.shpilevoy@tarantool.org>
To: tarantool-patches@freelists.org
Cc: vdavydov.dev@gmail.com

Part of #3055
---
 doc/rfc/3055-box_ctl_promote.md       | 237 ++++++++++++++++++++++++++++++++++
 doc/rfc/3055-box_ctl_promote_img1.svg |   2 +
 2 files changed, 239 insertions(+)
 create mode 100644 doc/rfc/3055-box_ctl_promote.md
 create mode 100644 doc/rfc/3055-box_ctl_promote_img1.svg

diff --git a/doc/rfc/3055-box_ctl_promote.md b/doc/rfc/3055-box_ctl_promote.md
new file mode 100644
index 000000000..3f8e854e6
--- /dev/null
+++ b/doc/rfc/3055-box_ctl_promote.md
@@ -0,0 +1,237 @@

# Replicaset master promotion

* **Status**: In progress
* **Start date**: 02-03-2018
* **Authors**: Vladislav Shpilevoy @Gerold103,
Konstantin Osipov @kostja
* **Issues**: [#3055](https://github.com/tarantool/tarantool/issues/3055),
[#2625](https://github.com/tarantool/tarantool/issues/2625)

## Summary

Replicaset master promotion is the procedure of atomically making one slave
the new master, and the old master a slave, in a full-mesh master-slave
replicaset. A master is a replica in read-write mode; a slave is a replica in
read-only mode.

Master promotion has the following API:
```Lua
--
-- When called on a slave, promotes it to master and demotes the
-- old master to slave. When called on a master, returns an
-- error.
-- @param opts Optional settings:
--        * timeout - the time within which the promotion must
--          finish;
--        * quorum - before the old master is demoted, its data
--          must be synced to at least this many slaves,
--          including the one being promoted.
--
-- @retval true Promotion is started.
-- @retval nil, error Promotion can not be started.
--
box.ctl.promote(opts)

--
-- Status of the latest finished or currently running promotion
-- round.
-- @retval Empty table. Promote() has not been called since this
--         instance started, or it was called on another instance
--         that has not yet sent the promotion info to the
--         current one.
-- @retval status A table of the format:
--         {
--             round_id = <numeric round ID>,
--             round_uuid = <round UUID>,
--             initiator_uuid = <UUID of the initiator>,
--             timeout = <timeout from 'begin'>,
--             quorum = <quorum from 'begin'>,
--             role = <role of this instance in the round>,
--             phase = <current phase of the round>,
--             comment = <error or phase description>,
--             old_master_uuid = <UUID of the old master>,
--         }
--
box.ctl.promote_info()

--
-- Remove info about all promotions from the entire cluster.
--
box.ctl.promote_reset()
```

## Background and motivation

The promote procedure greatly simplifies life for developers: they do not have
to perform all of the promotion steps manually, which in the general case is
not a trivial task, as the algorithm description in the next section shows.

The basic algorithm, disregarding failures and their handling, consists of the
following steps:
1. The old master stops accepting DDL/DML - only DQL;
2. Wait until all of the master's data has been received by the needed number
of slaves, including the new master candidate;
3. Make the old master a slave;
4. Make the slave the new master;
5. Notify all other slaves that the master has changed.

All of the steps are persisted in the WAL, which guarantees that a promotion
participant does not forget about the promotion even after a restart.
Persistence, together with the mandatory quorum of 50% + 1 instances,
eliminates any possibility of the cluster ending up with two masters after a
promotion.

## Detailed design

Each cluster member has a special system space, `_promotion`, which
distributes promotion steps over the cluster via the replication channels.
Each record in the space is a promotion message sent by one of the instances.
```Lua
format = {}
-- ID of the promotion round. Each round has a unique identifier
-- consisting of two parts: ID and UUID. The ID is used to order
-- rounds by their start time. Each new round gets an ID greater
-- than all known previous ones. Timestamps can not be used,
-- since clocks are not perfectly synced over the network.
format[1] = {'id', 'unsigned'}

-- UUID of the promotion round. The UUID is generated by the
-- promotion initiator and protects against the error case when
-- promotions are started on different nodes at the same time
-- with the same round IDs. Their UUIDs differ, because the
-- initiators differ.
format[2] = {'round_uuid', 'string'}

-- The promotion round step. It is a constantly growing number,
-- maintained by each promotion participant and used to persist
-- the order of sent messages. Each instance arranges its
-- messages with step numbers. Steps also persist the relative
-- order of messages from different sources.
format[3] = {'step', 'unsigned'}

-- UUID of the sender instance.
format[4] = {'source_uuid', 'string'}

-- Timestamp of the message dispatch by the sender's clock. A
-- debug-only attribute, but persisted.
format[5] = {'ts', 'unsigned'}

-- The type describes what the sender wants to get or send. The
-- value depends on the type.
format[6] = {'type', 'string'}

-- Depending on the message type, different values are stored.
format[7] = {'value', 'any', is_nullable = true}
--
-- The type-value pairs are described below.
--
-- 'begin'   - sent by a promotion initiator to start a round.
--             The value contains the promotion metadata: quorum
--             and timeout.
--
-- 'status'  - sent by all promotion participants. It has several
--             goals: to cope with the cases when the cluster has
--             no master or has multiple ones, and to persist the
--             read-only cfg flag for recovery.
--
-- 'sync'    - sent by the old master to sync with the slaves.
--             The value is nil.
--
-- 'success' - sent by a slave in response to 'sync', and by the
--             old master when all syncs are collected.
--
-- 'error'   - an error, which can be sent by any cluster member.
--             Examples: a failed sync, an already existing
--             promotion, a timeout. The value is the error
--             description.
--
s = box.schema.create_space('_promotion', {format = format})
```
To participate in a promotion, a cluster member simply writes into the
`_promotion` space and waits until the record is replicated. The garbage
collector clears finished promotions - those with error or success status -
from this space. Only the latest promotion is kept, so that a role can be
restored after recovery.

The protocol is described below. The image shows the state machine:
![alt text](https://raw.githubusercontent.com/tarantool/tarantool/2e591965dfb4603ac1b197621c9c8eb5e8eb8d9f/doc/rfc/3055-box_ctl_promote_img1.svg?sanitize=true)

In the simplest case the instance being promoted is already a master: the
promotion immediately finishes with an error, and nothing is persisted. Now
assume promote() is called on a slave. First, the initiator broadcasts a
`begin` request with the promotion settings: quorum and timeout.

Each cluster member, on receiving `begin`, checks whether it already knows
about another active promotion. If it does, it responds with `error` to the
newer promotion request. Otherwise it broadcasts a `status` message.

If the cluster has no master, the promotion initiator detects this by
collecting the statuses from all of the cluster members. In that case the
initiator syncs with the slaves on behalf of the absent old master and
becomes a master itself.

If the cluster has a master, but it is not available, the initiator terminates
the round on timeout. Now consider the case when a master exists and is
active.

The old master, having received the `begin` request, enters read-only mode and
broadcasts a `sync` request. A slave that receives `sync` finishes its
participation in the round by responding with `success`. The old master
collects a quorum of `success`es, including the promotion initiator's.
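The old master's quorum collection can be sketched as a small model. This is an illustration only, not Tarantool code: the function name, the `(source_uuid, type)` message tuples, and the example UUIDs are all invented for the sketch; only the message type `'success'` and the "quorum must include the initiator" rule come from the protocol above.

```python
# Simplified model of the old master's quorum collection during a
# promotion round. Message tuples are (source_uuid, message_type);
# the layout is illustrative, not the real _promotion record format.

def collect_quorum(messages, quorum, initiator_uuid):
    """Return True once 'success' answers from enough distinct
    slaves, including the promotion initiator, are collected."""
    responders = set()
    for source_uuid, msg_type in messages:
        if msg_type == 'success':
            responders.add(source_uuid)
    # The initiator's success is mandatory: the old master must not
    # demote itself before the future master has all of its data.
    return initiator_uuid in responders and len(responders) >= quorum

# Example: a 4-instance replicaset, quorum = 50% + 1 = 3.
msgs = [
    ('replica-2', 'success'),
    ('replica-3', 'success'),
    ('initiator-1', 'success'),
]
assert collect_quorum(msgs, quorum=3, initiator_uuid='initiator-1')
assert not collect_quorum(msgs[:2], quorum=3, initiator_uuid='initiator-1')
```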
Once the old master has collected the responses, it writes its own `success`.
The initiator, having received `success` from the master, enters read-write
mode and becomes the new master.

### Recovery

Recovery is quite simple and merely replays all promotion messages from the
`_promotion` space. But it also does a tricky thing under the hood: it
recovers the `box.cfg.read_only` flag. Consider how this is possible.

During a promotion round there are three cases which persist `read_only` one
way or another:
1. When an instance sends a `status` message, its `not read_only` value is
stored in the message's value as `is_master`.
2. When an instance sends a `begin` message, it is an initiator, and only a
non-master can start a promotion. So if an instance has sent `begin`, it has
`read_only = true`.
3. Due to message reordering between different replication sources it is
possible that a non-initiator instance receives a `sync` message before it
manages to send `status`. Then the instance is a watcher and has
`read_only = true`.

On recovery, `read_only` has to be restored to exactly the value that was
persisted, because the other instances are already aware of this value. It
can not be changed manually until the promotion history is cleaned up.

## Rationale and alternatives

The protocol has several disputable details.

One could notice that the old master, on `begin`, sends two messages, `status`
and `sync`, with no intermediate message, and that `begin` serves both to
announce the new promotion round and to trigger the old master's sync. This
slightly complicates the calculation of step numbers and the deduction of a
round's result from the history records, but it reduces the number of messages
and message types and optimizes the most common case: regular master promotion
in a master-slave full-mesh cluster.

An alternative is to add a new phase between `status` collection and the old
master's `sync`.
It could make the promotion protocol somewhat simpler to implement and to
understand, but, on the other hand, it would obviously be longer.

Another option is to add a new message, `commit`, written by the initiator
when it receives `success` from the old master. The `commit` message would
make it clear when a round has finished successfully: the initiator has a
`commit` record. As it is now, it is possible that the old master sent
`success`, but while the message was in flight the initiator terminated the
round on timeout. In such a case the cluster becomes read-only until the next
promotion, and on recovery it is hard to understand that the round failed
even though the old master had sent `success`. The drawback of this proposal
is +1 message type and +1 message.

Apart from the branches above, there is one major possible improvement of the
whole protocol: allow promotion when the old master is down, but still exists.
The main challenge of such an indulgence is how to learn whether the old
master is really down, or the promotion initiator merely has no network link
to it. Detecting the old master's failure requires a quorum of other
instances. And here another problem arises when the replicaset consists of
only two instances: what should a promotion initiator do when its neighbour
is not available - is it down, or is it a network problem?
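As a closing illustration, the three `read_only` recovery rules from the Recovery section can be modelled on a replayed `_promotion` history. This is a hedged sketch, not the real recovery code: the record layout is reduced to `(source_uuid, type, value)` tuples and the function name is invented; only the rules themselves come from the RFC.

```python
# Sketch of read_only recovery from a simplified _promotion history.
# History records are (source_uuid, message_type, value) tuples.

def recover_read_only(history, self_uuid):
    """Replay the history and derive the read_only flag of the
    instance self_uuid, following the Recovery section rules:
    1. Our own 'status' persists `not read_only` as is_master.
    2. Our own 'begin' means we are an initiator, and only a
       non-master can initiate: read_only = True.
    3. A 'sync' seen before our own 'status' was sent makes us a
       watcher: read_only = True."""
    read_only = None
    sent_status = False
    for source_uuid, msg_type, value in history:
        if source_uuid == self_uuid:
            if msg_type == 'status':
                sent_status = True
                read_only = not value['is_master']  # rule 1
            elif msg_type == 'begin':
                read_only = True                    # rule 2
        elif msg_type == 'sync' and not sent_status:
            read_only = True                        # rule 3 (watcher)
    return read_only

# Our own 'begin' forces read_only = True (rule 2):
hist = [('a', 'begin', {'quorum': 2, 'timeout': 10})]
assert recover_read_only(hist, 'a') is True
# A master's own 'status' restores read_only = False (rule 1):
hist = [('b', 'status', {'is_master': True})]
assert recover_read_only(hist, 'b') is False
```

The point of the model is that `read_only` falls out of the replayed messages alone, with no extra bookkeeping outside the `_promotion` space.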