From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vladislav Shpilevoy
Subject: [PATCH 1/8] rfc: describe box.ctl.promote protocol
Date: Wed, 8 Aug 2018 01:03:44 +0300
Message-Id: <19cfba156e919307ad24bb0604c94f3fad272bce.1533679264.git.v.shpilevoy@tarantool.org>
To: tarantool-patches@freelists.org
Cc: vdavydov.dev@gmail.com

Part of #3055
---
 doc/rfc/3055-box_ctl_promote.md       | 237 ++++++++++++++++++++++++++++++++++
 doc/rfc/3055-box_ctl_promote_img1.svg |   2 +
 2 files changed, 239 insertions(+)
 create mode 100644 doc/rfc/3055-box_ctl_promote.md
 create mode 100644 doc/rfc/3055-box_ctl_promote_img1.svg

diff --git a/doc/rfc/3055-box_ctl_promote.md b/doc/rfc/3055-box_ctl_promote.md
new file mode 100644
index 000000000..3f8e854e6
--- /dev/null
+++ b/doc/rfc/3055-box_ctl_promote.md
@@ -0,0 +1,237 @@

# Replicaset master promotion

* **Status**: In progress
* **Start date**: 02-03-2018
* **Authors**: Vladislav Shpilevoy @Gerold103,
Konstantin Osipov @kostja
* **Issues**: [#3055](https://github.com/tarantool/tarantool/issues/3055),
[#2625](https://github.com/tarantool/tarantool/issues/2625)

## Summary

Replicaset master promotion is the procedure of atomically making one slave
the new master, and the old master a slave, in a full-mesh master-slave
replicaset. A master is a replica in read-write mode; a slave is a replica in
read-only mode.

Master promotion has the following API:
```Lua
--
-- When called on a slave, promotes it to master and demotes the
-- old master to slave. When called on a master, returns an
-- error.
-- @param opts Optional settings:
--        * timeout - the time within which the promotion must
--          finish;
--        * quorum - before the old master is demoted, its data
--          must be synced to at least this many slaves,
--          including the one being promoted.
--
-- @retval true Promotion is started.
-- @retval nil, error Promotion can not be started.
--
box.ctl.promote(opts)

--
-- Status of the latest finished or currently running promotion
-- round.
-- @retval Empty table. Promote() has not been called since this
--         instance started, or it was called on another instance
--         that has not yet sent the promotion info to the
--         current one.
-- @retval status A table of the format:
--         {
--             round_id = <numeric round ID>,
--             round_uuid = <round UUID>,
--             initiator_uuid = <UUID of the initiator>,
--             timeout = <timeout from 'begin'>,
--             quorum = <quorum from 'begin'>,
--             role = <role of this instance in the round>,
--             phase = <current phase of the round>,
--             comment = <error or phase description>,
--             old_master_uuid = <UUID of the old master>,
--         }
--
box.ctl.promote_info()

--
-- Remove info about all promotions from the entire cluster.
--
box.ctl.promote_reset()
```

## Background and motivation

The promote procedure greatly simplifies life for developers: they do not have
to perform all of the promotion steps manually, which in the general case is
not a trivial task, as the algorithm description in the next section shows.

The basic algorithm, disregarding failures and their handling, consists of the
following steps:
1. The old master stops accepting DDL/DML - only DQL;
2. Wait until all of the master's data has been received by the needed number
of slaves, including the new master candidate;
3. Make the old master a slave;
4. Make the slave the new master;
5. Notify all other slaves that the master has changed.

All of the steps are persisted in the WAL, which guarantees that a promotion
participant does not forget about the promotion even after a restart.
Persistence, together with the mandatory quorum of 50% + 1 instances,
eliminates any possibility of the cluster ending up with two masters after a
promotion.

## Detailed design

Each cluster member has a special system space, `_promotion`, which
distributes promotion steps over the cluster via the replication channels.
Each record in the space is a promotion message sent by one of the instances.
```Lua
format = {}
-- ID of the promotion round. Each round has a unique identifier
-- consisting of two parts: ID and UUID. The ID is used to order
-- rounds by their start time. Each new round gets an ID greater
-- than all known previous ones. Timestamps can not be used,
-- since clocks are not perfectly synced over the network.
format[1] = {'id', 'unsigned'}

-- UUID of the promotion round. The UUID is generated by the
-- promotion initiator and protects against the error case when
-- promotions are started on different nodes at the same time
-- with the same round IDs. Their UUIDs differ, because the
-- initiators differ.
format[2] = {'round_uuid', 'string'}

-- The promotion round step. It is a constantly growing number,
-- maintained by each promotion participant and used to persist
-- the order of sent messages. Each instance arranges its
-- messages with step numbers. Steps also persist the relative
-- order of messages from different sources.
format[3] = {'step', 'unsigned'}

-- UUID of the sender instance.
format[4] = {'source_uuid', 'string'}

-- Timestamp of the message dispatch by the sender's clock. A
-- debug-only attribute, but persisted.
format[5] = {'ts', 'unsigned'}

-- The type describes what the sender wants to get or send. The
-- value depends on the type.
format[6] = {'type', 'string'}

-- Depending on the message type, different values are stored.
format[7] = {'value', 'any', is_nullable = true}
--
-- The type-value pairs are described below.
--
-- 'begin'   - sent by a promotion initiator to start a round.
--             The value contains the promotion metadata: quorum
--             and timeout.
--
-- 'status'  - sent by all promotion participants. It has several
--             goals: to cope with the cases when the cluster has
--             no master or has multiple ones, and to persist the
--             read-only cfg flag for recovery.
--
-- 'sync'    - sent by the old master to sync with the slaves.
--             The value is nil.
--
-- 'success' - sent by a slave in response to 'sync', and by the
--             old master when all syncs are collected.
--
-- 'error'   - an error, which can be sent by any cluster member.
--             Examples: a failed sync, an already existing
--             promotion, a timeout. The value is the error
--             description.
--
s = box.schema.create_space('_promotion', {format = format})
```
To participate in a promotion, a cluster member simply writes into the
`_promotion` space and waits until the record is replicated. The garbage
collector clears finished promotions - those with error or success status -
from this space. Only the latest promotion is kept, so that a role can be
restored after recovery.

The protocol is described below. The image shows the state machine:
![alt text](https://raw.githubusercontent.com/tarantool/tarantool/2e591965dfb4603ac1b197621c9c8eb5e8eb8d9f/doc/rfc/3055-box_ctl_promote_img1.svg?sanitize=true)

In the simplest case the instance being promoted is already a master: the
promotion immediately finishes with an error, and nothing is persisted. Now
assume promote() is called on a slave. First, the initiator broadcasts a
`begin` request with the promotion settings: quorum and timeout.

Each cluster member, on receiving `begin`, checks whether it already knows
about another active promotion. If it does, it responds with `error` to the
newer promotion request. Otherwise it broadcasts a `status` message.

If the cluster has no master, the promotion initiator detects this by
collecting the statuses from all of the cluster members. In that case the
initiator syncs with the slaves on behalf of the absent old master and
becomes a master itself.

If the cluster has a master, but it is not available, the initiator terminates
the round on timeout. Now consider the case when a master exists and is
active.

The old master, having received the `begin` request, enters read-only mode and
broadcasts a `sync` request. A slave that receives `sync` finishes its
participation in the round by responding with `success`. The old master
collects a quorum of `success`es, including the promotion initiator's.
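The old master's quorum collection can be sketched as a small model. This is an illustration only, not Tarantool code: the function name, the `(source_uuid, type)` message tuples, and the example UUIDs are all invented for the sketch; only the message type `'success'` and the "quorum must include the initiator" rule come from the protocol above.

```python
# Simplified model of the old master's quorum collection during a
# promotion round. Message tuples are (source_uuid, message_type);
# the layout is illustrative, not the real _promotion record format.

def collect_quorum(messages, quorum, initiator_uuid):
    """Return True once 'success' answers from enough distinct
    slaves, including the promotion initiator, are collected."""
    responders = set()
    for source_uuid, msg_type in messages:
        if msg_type == 'success':
            responders.add(source_uuid)
    # The initiator's success is mandatory: the old master must not
    # demote itself before the future master has all of its data.
    return initiator_uuid in responders and len(responders) >= quorum

# Example: a 4-instance replicaset, quorum = 50% + 1 = 3.
msgs = [
    ('replica-2', 'success'),
    ('replica-3', 'success'),
    ('initiator-1', 'success'),
]
assert collect_quorum(msgs, quorum=3, initiator_uuid='initiator-1')
assert not collect_quorum(msgs[:2], quorum=3, initiator_uuid='initiator-1')
```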
Once the old master has collected the responses, it writes its own `success`.
The initiator, having received `success` from the master, enters read-write
mode and becomes the new master.

### Recovery

Recovery is quite simple and merely replays all promotion messages from the
`_promotion` space. But it also does a tricky thing under the hood: it
recovers the `box.cfg.read_only` flag. Consider how this is possible.

During a promotion round there are three cases which persist `read_only` one
way or another:
1. When an instance sends a `status` message, its `not read_only` value is
stored in the message's value as `is_master`.
2. When an instance sends a `begin` message, it is an initiator, and only a
non-master can start a promotion. So if an instance has sent `begin`, it has
`read_only = true`.
3. Due to message reordering between different replication sources it is
possible that a non-initiator instance receives a `sync` message before it
manages to send `status`. Then the instance is a watcher and has
`read_only = true`.

On recovery, `read_only` has to be restored to exactly the value that was
persisted, because the other instances are already aware of this value. It
can not be changed manually until the promotion history is cleaned up.

## Rationale and alternatives

The protocol has several disputable details.

One could notice that the old master, on `begin`, sends two messages, `status`
and `sync`, with no intermediate message, and that `begin` serves both to
announce the new promotion round and to trigger the old master's sync. This
slightly complicates the calculation of step numbers and the deduction of a
round's result from the history records, but it reduces the number of messages
and message types and optimizes the most common case: regular master promotion
in a master-slave full-mesh cluster.

An alternative is to add a new phase between `status` collection and the old
master's `sync`.
It could make the promotion protocol somewhat simpler to implement and to
understand, but, on the other hand, it would obviously be longer.

Another option is to add a new message, `commit`, written by the initiator
when it receives `success` from the old master. The `commit` message would
make it clear when a round has finished successfully: the initiator has a
`commit` record. As it is now, it is possible that the old master sent
`success`, but while the message was in flight the initiator terminated the
round on timeout. In such a case the cluster becomes read-only until the next
promotion, and on recovery it is hard to understand that the round failed
even though the old master had sent `success`. The drawback of this proposal
is +1 message type and +1 message.

Apart from the branches above, there is one major possible improvement of the
whole protocol: allow promotion when the old master is down, but still exists.
The main challenge of such an indulgence is how to learn whether the old
master is really down, or the promotion initiator merely has no network link
to it. Detecting the old master's failure requires a quorum of other
instances. And here another problem arises when the replicaset consists of
only two instances: what should a promotion initiator do when its neighbour
is not available - is it down, or is it a network problem?
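As a closing illustration, the three `read_only` recovery rules from the Recovery section can be modelled on a replayed `_promotion` history. This is a hedged sketch, not the real recovery code: the record layout is reduced to `(source_uuid, type, value)` tuples and the function name is invented; only the rules themselves come from the RFC.

```python
# Sketch of read_only recovery from a simplified _promotion history.
# History records are (source_uuid, message_type, value) tuples.

def recover_read_only(history, self_uuid):
    """Replay the history and derive the read_only flag of the
    instance self_uuid, following the Recovery section rules:
    1. Our own 'status' persists `not read_only` as is_master.
    2. Our own 'begin' means we are an initiator, and only a
       non-master can initiate: read_only = True.
    3. A 'sync' seen before our own 'status' was sent makes us a
       watcher: read_only = True."""
    read_only = None
    sent_status = False
    for source_uuid, msg_type, value in history:
        if source_uuid == self_uuid:
            if msg_type == 'status':
                sent_status = True
                read_only = not value['is_master']  # rule 1
            elif msg_type == 'begin':
                read_only = True                    # rule 2
        elif msg_type == 'sync' and not sent_status:
            read_only = True                        # rule 3 (watcher)
    return read_only

# Our own 'begin' forces read_only = True (rule 2):
hist = [('a', 'begin', {'quorum': 2, 'timeout': 10})]
assert recover_read_only(hist, 'a') is True
# A master's own 'status' restores read_only = False (rule 1):
hist = [('b', 'status', {'is_master': True})]
assert recover_read_only(hist, 'b') is False
```

The point of the model is that `read_only` falls out of the replayed messages alone, with no extra bookkeeping outside the `_promotion` space.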