[PATCH 7/8] box: introduce box.ctl.promote

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Wed Aug 8 01:03:50 MSK 2018


Replicaset master promotion is a procedure of atomic making one
slave be a new master, and an old master be a slave in a fullmesh
master-slave replicaset.

The promotion follows the protocol described in details in the
corresponding RFC. Shortly, the protocol collects a quorum of
instances who approves the promotion, syncs data with the old
master and demotes it.

The protocol is intended to work with a single master cluster and
with at least 50% + 1 quorum mandatory including an old master.
It is tolerant to messages reordering from different sources, to
errors like multiple masters, timeouts, restarts of any promotion
participant. Also the promote protocol supports promotion in a
completely read-only cluster. It is useful, for example, when
after one of rare cases of a promotion fail the cluster is left
in a read-only state with no masters. Then the promotion can just
be called again to fix it. Such read-only promotion has only one
restriction - all of the instances have to be safe and sound.

Once a promotion is executed, it makes box.cfg.read_only
attribute be immutable. It is because actually the promotion
protocol persists this attribute as a part of one of messages and
sends it to other instances. So a user can not both use the
promotion and manually change box.cfg.read_only.

The promotion has several API methods:

* box.ctl.promote({timeout = ..., quorum = ...}).
  This function is meant to be called on a slave to demote the
  old master if exists and promote the current instance.

* box.ctl.promote_info().
  This function shows info about the latest promotion (finished
  or running now - does not matter, just the latest).

* box.ctl.promote_reset().
  This function clears the promotion history so a user would be
  able to re-assign master/slave roles in a cluster manually.

Closes #3055

@TarantoolBot document
Title: Document box.ctl.promote()
Subj. For details of the patch see the commit message. For
details of the protocol and API see the RFC:
doc/rfc/3055-box_ctl_promote.md
---
 src/box/CMakeLists.txt       |    1 +
 src/box/alter.cc             |   61 ++-
 src/box/box.cc               |   17 +
 src/box/box.h                |   27 ++
 src/box/errcode.h            |    3 +
 src/box/lua/cfg.cc           |    9 +-
 src/box/lua/ctl.c            |   82 ++++
 src/box/promote.c            | 1075 ++++++++++++++++++++++++++++++++++++++++++
 src/box/promote.h            |  170 +++++++
 src/cfg.c                    |   11 +
 src/cfg.h                    |    3 +
 src/main.cc                  |    1 +
 test/box/misc.result         |    9 +-
 test/promote/basic.result    |  472 +++++++++++++++++++
 test/promote/basic.test.lua  |  160 +++++++
 test/promote/box.lua         |    8 +
 test/promote/box1.lua        |  112 +++++
 test/promote/box2.lua        |    1 +
 test/promote/box3.lua        |    1 +
 test/promote/box4.lua        |    1 +
 test/promote/errinj.result   |  222 +++++++++
 test/promote/errinj.test.lua |   87 ++++
 test/promote/suite.ini       |    6 +
 23 files changed, 2530 insertions(+), 9 deletions(-)
 create mode 100644 src/box/promote.c
 create mode 100644 src/box/promote.h
 create mode 100644 test/promote/basic.result
 create mode 100644 test/promote/basic.test.lua
 create mode 100644 test/promote/box.lua
 create mode 100644 test/promote/box1.lua
 create mode 120000 test/promote/box2.lua
 create mode 120000 test/promote/box3.lua
 create mode 120000 test/promote/box4.lua
 create mode 100644 test/promote/errinj.result
 create mode 100644 test/promote/errinj.test.lua
 create mode 100644 test/promote/suite.ini

diff --git a/src/box/CMakeLists.txt b/src/box/CMakeLists.txt
index ad544270b..1a1e7025c 100644
--- a/src/box/CMakeLists.txt
+++ b/src/box/CMakeLists.txt
@@ -112,6 +112,7 @@ add_library(box STATIC
     journal.c
     wal.c
     call.c
+    promote.c
     ${lua_sources}
     lua/init.c
     lua/call.c
diff --git a/src/box/alter.cc b/src/box/alter.cc
index 7a7325038..6df31e75a 100644
--- a/src/box/alter.cc
+++ b/src/box/alter.cc
@@ -52,6 +52,7 @@
 #include "identifier.h"
 #include "version.h"
 #include "sequence.h"
+#include "promote.h"
 
 /**
  * chap-sha1 of empty string, i.e.
@@ -2924,11 +2925,69 @@ on_replace_dd_cluster(struct trigger *trigger, void *event)
 	txn_on_commit(txn, on_commit);
 }
 
+/**
+ * Process promotion messages on commit only. Prepared but not
+ * committed messages can not be processed since they could
+ * rollback, but promotion requires each processed message is
+ * persisted and is able to recovery on restart.
+ */
 static void
-on_replace_dd_promotion(struct trigger *trigger, void *event)
+on_commit_process_promote_msg(struct trigger *trigger, void *event)
+{
+	(void) event;
+	promote_process((struct promote_msg *) trigger->data);
+}
+
+/**
+ * Check that the promotion space is empty and reset for this
+ * case the state. Manual reset here is used by replicas when on
+ * one of them box.ctl.promote_reset() is called. Then on the
+ * source replica the promotion state is dropped but on other
+ * replicas this action should be done under the hood. This is the
+ * only possible place to do it.
+ */
+static void
+on_commit_check_promotion_reset(struct trigger *trigger, void *event)
 {
 	(void) trigger;
 	(void) event;
+	if (index_count(space_index(space_by_id(BOX_PROMOTION_ID), 0), ITER_ALL,
+			NULL, 0) == 0)
+		box_ctl_promote_reset();
+}
+
+static void
+on_replace_dd_promotion(struct trigger *trigger, void *event)
+{
+	struct txn *txn = (struct txn *) event;
+	struct txn_stmt *stmt = txn_current_stmt(txn);
+	if (stmt->new_tuple == NULL && stmt->old_tuple != NULL) {
+		trigger = txn_alter_trigger_new(on_commit_check_promotion_reset,
+						NULL);
+		txn_on_commit(txn, trigger);
+		return;
+	}
+	assert(stmt->new_tuple != NULL);
+	if (stmt->old_tuple != NULL) {
+		tnt_raise(ClientError, ER_UNSUPPORTED, "Promotion",
+			  "history edit");
+	}
+	/*
+	 * Forbid multistatement only for non-DELETE since the
+	 * later is used for promotion reset in batches - the
+	 * whole round per one transaction is dropped.
+	 */
+	txn_check_singlestatement_xc(txn, "Space _promotion");
+	struct promote_msg *msg =
+		region_alloc_object_xc(&fiber()->gc, struct promote_msg);
+	/*
+	 * Decode the message before the commit to do message's
+	 * sanity check.
+	 */
+	if (promote_msg_decode(tuple_data(stmt->new_tuple), msg) != 0)
+		diag_raise();
+	trigger = txn_alter_trigger_new(on_commit_process_promote_msg, msg);
+	txn_on_commit(txn, trigger);
 }
 
 /* }}} cluster configuration */
diff --git a/src/box/box.cc b/src/box/box.cc
index d8fbc6252..8bbd0d424 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -73,6 +73,7 @@
 #include "call.h"
 #include "func.h"
 #include "sequence.h"
+#include "promote.h"
 
 static char status[64] = "unknown";
 
@@ -216,6 +217,12 @@ box_set_ro(bool ro)
 	fiber_cond_broadcast(&ro_cond);
 }
 
+void
+box_expose_ro()
+{
+	cfg_rawsetb("read_only", is_ro);
+}
+
 bool
 box_is_writable(void)
 {
@@ -970,6 +977,15 @@ box_index_id_by_name(uint32_t space_id, const char *name, uint32_t len)
 }
 /** \endcond public */
 
+int
+box_process_sys_dml(struct request *request)
+{
+	struct space *space = space_cache_find(request->space_id);
+	assert(space != NULL);
+	assert(space_is_system(space));
+	return process_dml(request, space, NULL);
+}
+
 int
 box_process_dml(struct request *request, box_tuple_t **result)
 {
@@ -1981,6 +1997,7 @@ box_cfg_xc(void)
 	port_init();
 	iproto_init();
 	wal_thread_start();
+	box_ctl_promote_init();
 
 	title("loading");
 
diff --git a/src/box/box.h b/src/box/box.h
index 29618c9f8..526e73608 100644
--- a/src/box/box.h
+++ b/src/box/box.h
@@ -86,6 +86,25 @@ box_atfork(void);
 void
 box_set_ro(bool ro);
 
+/**
+ * Expose current read-only flag into Lua config as
+ * box.cfg.read_only. Used when the value is changed internally,
+ * for example, by box.ctl.promote.
+ */
+void
+box_expose_ro();
+
+/**
+ * Check that read-only value can be changed via box.cfg. It can
+ * be immutable when a promotion is used, so that a user should
+ * either manipulate the flag manually or trust to
+ * box.ctl.promote.
+ * @retval 0 Success.
+ * @retval -1 Error. Diag is set.
+ */
+int
+box_check_ro_is_mutable();
+
 bool
 box_is_writable(void);
 
@@ -405,6 +424,14 @@ box_sequence_reset(uint32_t seq_id);
 int
 box_process_dml(struct request *request, box_tuple_t **result);
 
+/**
+ * Process DML operation on a system space without any RO checks.
+ * Can be used internally only. @Sa box_process_dml for the
+ * parameter and the returned value.
+ */
+int
+box_process_sys_dml(struct request *request);
+
 int
 boxk(int type, uint32_t space_id, const char *format, ...);
 
diff --git a/src/box/errcode.h b/src/box/errcode.h
index 3d5f66af8..4c56ad645 100644
--- a/src/box/errcode.h
+++ b/src/box/errcode.h
@@ -208,6 +208,9 @@ struct errcode_record {
 	/*153 */_(ER_NULLABLE_MISMATCH,		"Field %d is %s in space format, but %s in index parts") \
 	/*154 */_(ER_TRANSACTION_YIELD,		"Transaction has been aborted by a fiber yield") \
 	/*155 */_(ER_NO_SUCH_GROUP,		"Replication group '%s' does not exist") \
+	/*156 */_(ER_PROMOTE,			"Error during promotion with round UUID '%s': %s") \
+	/*157 */_(ER_WRONG_PROMOTION_RECORD,	"Wrong record in _promotion (field %u): %s") \
+	/*158 */_(ER_PROMOTE_EXISTS,		"Promotion is in progress") \
 
 /*
  * !IMPORTANT! Please follow instructions at start of the file
diff --git a/src/box/lua/cfg.cc b/src/box/lua/cfg.cc
index 0f6b8a5a3..877db0254 100644
--- a/src/box/lua/cfg.cc
+++ b/src/box/lua/cfg.cc
@@ -167,11 +167,10 @@ lbox_cfg_set_checkpoint_count(struct lua_State *L)
 static int
 lbox_cfg_set_read_only(struct lua_State *L)
 {
-	try {
-		box_set_ro(cfg_geti("read_only") != 0);
-	} catch (Exception *) {
-		luaT_error(L);
-	}
+	bool new_value = cfg_geti("read_only") != 0;
+	if (box_check_ro_is_mutable() != 0 && new_value != box_is_ro())
+		return luaT_error(L);
+	box_set_ro(new_value);
 	return 0;
 }
 
diff --git a/src/box/lua/ctl.c b/src/box/lua/ctl.c
index 9a105ed5c..08c2354bb 100644
--- a/src/box/lua/ctl.c
+++ b/src/box/lua/ctl.c
@@ -29,6 +29,7 @@
  * SUCH DAMAGE.
  */
 #include "box/lua/ctl.h"
+#include "box/lua/info.h"
 
 #include <tarantool_ev.h>
 
@@ -38,7 +39,10 @@
 
 #include "lua/utils.h"
 
+#include "box/info.h"
 #include "box/box.h"
+#include "box/promote.h"
+#include "box/error.h"
 
 static int
 lbox_ctl_wait_ro(struct lua_State *L)
@@ -64,9 +68,87 @@ lbox_ctl_wait_rw(struct lua_State *L)
 	return 0;
 }
 
+/**
+ * Lua binding for box_ctl_promote. Takes non-mandatory options:
+ * timeout and quorum.
+ * @param L Lua stack.
+ * @retval Number of values pushed onto the stack. 2 means nil and
+ *         error object. 1 means ok and the value is true.
+ */
+static int
+lbox_ctl_promote(struct lua_State *L)
+{
+	int quorum = -1;
+	double timeout = TIMEOUT_INFINITY;
+	int top = lua_gettop(L);
+	if (top > 1) {
+usage_error:
+		return luaL_error(L, "Usage: box.ctl.promote([{timeout = "\
+				  "<double>, quorum = <unsigned>}])");
+	} else if (top == 1) {
+		lua_getfield(L, 1, "quorum");
+		int ok;
+		if (! lua_isnil(L, -1)) {
+			quorum = lua_tointegerx(L, -1, &ok);
+			if (ok == 0)
+				goto usage_error;
+		}
+		lua_getfield(L, 1, "timeout");
+		if (! lua_isnil(L, -1)) {
+			timeout = lua_tonumberx(L, -1, &ok);
+			if (ok == 0)
+				goto usage_error;
+		}
+	}
+	if (box_ctl_promote(timeout, quorum) != 0) {
+		lua_pushnil(L);
+		luaT_pusherror(L, box_error_last());
+		return 2;
+	} else {
+		lua_pushboolean(L, true);
+		return 1;
+	}
+}
+
+/**
+ * Lua binding for box_ctl_promote_reset. Has no arguments.
+ * @param L Lua stack.
+ * @retval Number of values pushed onto the stack. 2 means nil and
+ *         error object. 1 means ok and the value is true.
+ */
+static int
+lbox_ctl_promote_reset(struct lua_State *L)
+{
+	if (box_ctl_promote_reset() != 0) {
+		lua_pushnil(L);
+		luaT_pusherror(L, box_error_last());
+		return 2;
+	}
+	lua_pushboolean(L, true);
+	return 1;
+}
+
+/**
+ * Lua binding for box_ctl_promote_info. Has no arguments.
+ * @param L Lua stack.
+ * @retval Number of values pushed onto the stack. Always is 1 -
+ *         a Lua table with info parameters.
+ */
+static int
+lbox_ctl_promote_info(struct lua_State *L)
+{
+	struct info_handler info;
+	luaT_info_handler_create(&info, L);
+	box_ctl_promote_info(&info);
+	return 1;
+}
+
 static const struct luaL_Reg lbox_ctl_lib[] = {
 	{"wait_ro", lbox_ctl_wait_ro},
 	{"wait_rw", lbox_ctl_wait_rw},
+	{"promote", lbox_ctl_promote},
+	{"promote_reset", lbox_ctl_promote_reset},
+	{"promote_info", lbox_ctl_promote_info},
 	{NULL, NULL}
 };
 
diff --git a/src/box/promote.c b/src/box/promote.c
new file mode 100644
index 000000000..dcc39b5bd
--- /dev/null
+++ b/src/box/promote.c
@@ -0,0 +1,1075 @@
+/*
+ * Copyright 2010-2018, Tarantool AUTHORS, please see AUTHORS file.
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above
+ *    copyright notice, this list of conditions and the
+ *    following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above
+ *    copyright notice, this list of conditions and the following
+ *    disclaimer in the documentation and/or other materials
+ *    provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY <COPYRIGHT HOLDER> ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+ * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * <COPYRIGHT HOLDER> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
+ * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#include "box.h"
+#include "replication.h"
+#include "promote.h"
+#include "error.h"
+#include "msgpuck.h"
+#include "xrow.h"
+#include "space.h"
+#include "schema.h"
+#include "schema_def.h"
+#include "txn.h"
+#include "tuple.h"
+#include "iproto_constants.h"
+#include "opt_def.h"
+#include "info.h"
+
+static const char *promote_msg_type_strs[] = {
+	"begin",
+	"status",
+	"sync",
+	"success",
+	"error",
+};
+
+/** True, if @a msg is created by the current instance. */
+static inline bool
+promote_msg_is_mine(const struct promote_msg *msg)
+{
+	return tt_uuid_is_equal(&msg->source_uuid, &INSTANCE_UUID);
+}
+
+enum promote_role {
+	PROMOTE_ROLE_UNDEFINED = 0,
+	PROMOTE_ROLE_INITIATOR,
+	PROMOTE_ROLE_OLD_MASTER,
+	PROMOTE_ROLE_WATCHER
+};
+
+static const char *promote_role_strs[] = {
+	"undefined",
+	"initiator",
+	"old master",
+	"watcher",
+};
+
+enum promote_phase {
+	PROMOTE_PHASE_NON_ACTIVE = 0,
+	PROMOTE_PHASE_ERROR,
+	PROMOTE_PHASE_SUCCESS,
+	PROMOTE_PHASE_IN_PROGRESS,
+};
+
+static const char *promote_phase_strs[] = {
+	"non-active",
+	"error",
+	"success",
+	"in progress",
+};
+
+/**
+ * The current promotion state. If the promotion is finished, then
+ * the latest one is stored here as a cache for
+ * box.ctl.promote_info().
+ */
+static struct promote_state {
+	/**
+	 * Each round has an unique identifier of two parts: ID
+	 * and UUID. ID is used to order rounds by the time of
+	 * their start. Each new round has an ID > than all the
+	 * known previous ones. Timestamps can not be used since
+	 * clocks are not perfectly sinced over network.
+	 */
+	int round_id;
+	/**
+	 * UUID is generated by a promotion initiator and allows
+	 * to protect from an error when promotions are started on
+	 * different nodes at the same time with the same round
+	 * IDs. UUIDs are different in them because of different
+	 * initiators.
+	 */
+	struct tt_uuid round_uuid;
+	/** UUID of an old master if known, nil UUID otherwise. */
+	struct tt_uuid old_master_uuid;
+	/** UUID of an initiator if known, nil UUID otherwise. */
+	struct tt_uuid initiator_uuid;
+	/** Diagnostics storing the current round error. */
+	struct diag diag;
+	/**
+	 * Condition emited each time the promotion state is
+	 * changed.
+	 */
+	struct fiber_cond on_change;
+	/**
+	 * Role of the current instance in the current promotion
+	 * round.
+	 */
+	enum promote_role role;
+	/**
+	 * Current round promotion phase. If the round is
+	 * finsihed, the result (error/success) is stored here as
+	 * well.
+	 */
+	enum promote_phase phase;
+	/**
+	 * Description of the latest thing done during the current
+	 * promotion round. It is not persisted anywhere and
+	 * exists merely to improve user experience. It is shown
+	 * in box.ctl.promote_info().
+	 */
+	char comment[DIAG_ERRMSG_MAX + 1];
+	/**
+	 * The current promotion round quorum. Becomes valid only
+	 * when an initiator becomes known. Quorum is number of
+	 * replicas that should approve the promotion and sync
+	 * with the old master before its demotion. The quorum
+	 * includes the old master and the initiator.
+	 */
+	int quorum;
+	/**
+	 * Number of already collected syncs with the old master.
+	 * Valid on the old master and on the initiator if it acts
+	 * on behalf of the later.
+	 */
+	int sync_count;
+	/**
+	 * The current promotion round timeout. Once it is
+	 * exceeded, the round is terminated with persisting that
+	 * fact. Becomes valid only when an initiator becomes
+	 * known.
+	 */
+	double timeout;
+	/** The promotion timer fiber. */
+	struct fiber *timer;
+	/**
+	 * Number of watchers participating in the current
+	 * promotion round. If this value + the initiator equals
+	 * the cluster size, then the cluster is read-only. In
+	 * such a case the promotion is allowed even though an old
+	 * master does not exist. The initiator acts on behalf of
+	 * the later then.
+	 */
+	int watcher_count;
+	/**
+	 * The current promotion step. It is constantly growing
+	 * number for each promotion participant and is used to
+	 * persist order of sent messages. Each instance arranges
+	 * its messages with step numbers. Also steps are used to
+	 * persist relative order of messages from different
+	 * sources.
+	 */
+	int step;
+	/**
+	 * True if this instance at least once succeeded to commit
+	 * its status and set role. Status message contains
+	 * is_master flag that actually is persisted read_only cfg
+	 * option. So other instances now are aware of this status
+	 * and it can not be changed by a user via box.cfg.
+	 */
+	bool is_role_committed;
+} promote_state;
+
+/**
+ * Getters for different attributes and properties of the
+ * promotion state.
+ */
+
+static inline bool
+promote_is_active(void)
+{
+	return promote_state.phase == PROMOTE_PHASE_IN_PROGRESS;
+}
+
+static inline bool
+promote_is_master_known(void)
+{
+	return !tt_uuid_is_equal(&promote_state.old_master_uuid, &uuid_nil);
+}
+
+static inline bool
+promote_is_initiator_known(void)
+{
+	return !tt_uuid_is_equal(&promote_state.initiator_uuid, &uuid_nil);
+}
+
+static inline bool
+promote_is_finished(void)
+{
+	return !promote_is_active() && promote_state.timer == NULL;
+}
+
+static inline bool
+promote_is_cluster_readonly(void)
+{
+	return promote_state.watcher_count + 1 == replicaset.applier.total;
+}
+
+static inline bool
+promote_is_this_round_msg(const struct promote_msg *msg)
+{
+	return promote_is_active() &&
+	       tt_uuid_is_equal(&msg->round_uuid, &promote_state.round_uuid);
+}
+
+/**
+ * Comment a promotion event. The comment text is available to be
+ * seen from box.ctl.promote_info(), and is logged.
+ */
+#define promote_comment(...) do { \
+	snprintf(promote_state.comment, sizeof(promote_state.comment), \
+		 __VA_ARGS__); \
+	say_info(promote_state.comment); \
+} while(0)
+
+/**
+ * Serialize the promotion message into a string.
+ * @param msg Message to serialize.
+ * @retval String with the serialized message.
+ */
+static inline const char *
+promote_msg_str(const struct promote_msg *msg)
+{
+	int offset = 0;
+	char *buf = tt_static_buf();
+	int len = TT_STATIC_BUF_LEN;
+
+	offset += snprintf(buf, len, "{id: %d, round: '", msg->round_id);
+	tt_uuid_to_string(&msg->round_uuid, buf + offset);
+	offset += UUID_STR_LEN;
+	offset += snprintf(buf + offset, len - offset, "', step: %d, source: '",
+			   msg->step);
+	tt_uuid_to_string(&msg->source_uuid, buf + offset);
+	offset += UUID_STR_LEN;
+	offset += snprintf(buf + offset, len - offset, "', ts: %f, type: '%s'",
+			   msg->ts, promote_msg_type_strs[msg->type]);
+	switch (msg->type) {
+	case PROMOTE_MSG_BEGIN:
+		offset += snprintf(buf + offset, len - offset, ", quorum: %d, "\
+				   "timeout: %f}", msg->begin.quorum,
+				   msg->begin.timeout);
+		break;
+	case PROMOTE_MSG_STATUS:
+		offset += snprintf(buf + offset, len - offset, ", is_master: "\
+				   "%d}", (int) msg->status.is_master);
+		break;
+	case PROMOTE_MSG_ERROR:
+		offset += snprintf(buf + offset, len - offset, ", code: %d, "\
+				   "message: '%s'}", msg->error.code,
+				   msg->error.message);
+		break;
+	default:
+		offset += snprintf(buf + offset, len - offset, "}");
+		break;
+	}
+	return buf;
+}
+
+/**
+ * Encode the promotion message into MessagePack tuple ready to
+ * be inserted into _promotion space.
+ * @param msg Promotion message to encode.
+ * @param[out] size_out Size of the result.
+ *
+ * @retval NULL Error.
+ * @retval not NULL MessagePack encoded message.
+ */
+static const char *
+promote_msg_encode(const struct promote_msg *msg, uint32_t *size_out)
+{
+	size_t size = 1024;
+	char *data = region_alloc(&fiber()->gc, size);
+	if (data == NULL) {
+		diag_set(OutOfMemory, size, "region_alloc", "data");
+		return NULL;
+	}
+	char *begin = data;
+	data = mp_encode_array(data, 7);
+	data = mp_encode_uint(data, msg->round_id);
+	data = mp_encode_str(data, tt_uuid_str(&msg->round_uuid),
+			     UUID_STR_LEN);
+	data = mp_encode_uint(data, msg->step);
+	data = mp_encode_str(data, tt_uuid_str(&msg->source_uuid),
+			     UUID_STR_LEN);
+	data = mp_encode_double(data, msg->ts);
+	const char *type_str = promote_msg_type_strs[msg->type];
+	data = mp_encode_str(data, type_str, strlen(type_str));
+	switch(msg->type) {
+	case PROMOTE_MSG_BEGIN:
+		data = mp_encode_map(data, 2);
+		data = mp_encode_str(data, "quorum", strlen("quorum"));
+		data = mp_encode_uint(data, msg->begin.quorum);
+		data = mp_encode_str(data, "timeout", strlen("timeout"));
+		data = mp_encode_double(data, msg->begin.timeout);
+		break;
+	case PROMOTE_MSG_STATUS:
+		data = mp_encode_map(data, 1);
+		data = mp_encode_str(data, "is_master", strlen("is_master"));
+		data = mp_encode_bool(data, msg->status.is_master);
+		break;
+	case PROMOTE_MSG_ERROR:
+		data = mp_encode_map(data, 2);
+		data = mp_encode_str(data, "code", strlen("code"));
+		data = mp_encode_uint(data, msg->error.code);
+		data = mp_encode_str(data, "message", strlen("message"));
+		data = mp_encode_str(data, msg->error.message,
+				     strlen(msg->error.message));
+		break;
+	default:
+		data = mp_encode_nil(data);
+		break;
+	};
+	*size_out = data - begin;
+	assert(*size_out <= size);
+	return begin;
+}
+
+const struct opt_def promote_msg_begin_format[] = {
+	OPT_DEF("quorum", OPT_UINT32, struct promote_msg, begin.quorum),
+	OPT_DEF("timeout", OPT_FLOAT, struct promote_msg, begin.timeout),
+	OPT_END,
+};
+
+const struct opt_def promote_msg_status_format[] = {
+	OPT_DEF("is_master", OPT_BOOL, struct promote_msg, status.is_master),
+	OPT_END,
+};
+
+const struct opt_def promote_msg_error_format[] = {
+	OPT_DEF("code", OPT_UINT32, struct promote_msg, error.code),
+	OPT_DEF("message", OPT_STRPTR, struct promote_msg, error.message),
+	OPT_END,
+};
+
+int
+promote_msg_decode(const char *data, struct promote_msg *msg)
+{
+	uint32_t size = mp_decode_array(&data);
+	assert(size == 7 || size == 6);
+	uint32_t len;
+	struct region *region = &fiber()->gc;
+	msg->round_id = (int) mp_decode_uint(&data);
+	const char *str = mp_decode_str(&data, &len);
+	if (tt_uuid_from_strl(str, len, &msg->round_uuid) != 0) {
+		diag_set(ClientError, ER_WRONG_PROMOTION_RECORD,
+			 BOX_PROMOTION_FIELD_ROUND_UUID, "invalid UUID");
+		return -1;
+	}
+	msg->step = (int) mp_decode_uint(&data);
+	str = mp_decode_str(&data, &len);
+	if (tt_uuid_from_strl(str, len, &msg->source_uuid) != 0) {
+		diag_set(ClientError, ER_WRONG_PROMOTION_RECORD,
+			 BOX_PROMOTION_FIELD_SOURCE_UUID, "invalid UUID");
+		return -1;
+	}
+	if (mp_read_double(&data, &msg->ts) != 0 || msg->ts < 0) {
+		diag_set(ClientError, ER_WRONG_PROMOTION_RECORD,
+			 BOX_PROMOTION_FIELD_TS, "wrong ts");
+		return -1;
+	}
+	str = mp_decode_str(&data, &len);
+	msg->type = STRN2ENUM(promote_msg_type, str, len);
+	if (msg->type == promote_msg_type_MAX) {
+		diag_set(ClientError, ER_WRONG_PROMOTION_RECORD,
+			 BOX_PROMOTION_FIELD_TYPE, "wrong type");
+		return -1;
+	}
+
+	switch(msg->type) {
+	case PROMOTE_MSG_BEGIN:
+		if (opts_decode(msg, promote_msg_begin_format, &data,
+				ER_WRONG_PROMOTION_RECORD,
+				BOX_PROMOTION_FIELD_VALUE, region, 2) != 0)
+			return -1;
+		break;
+	case PROMOTE_MSG_STATUS:
+		if (opts_decode(msg, promote_msg_status_format, &data,
+				ER_WRONG_PROMOTION_RECORD,
+				BOX_PROMOTION_FIELD_VALUE, region, 1) != 0)
+			return -1;
+		break;
+	case PROMOTE_MSG_ERROR:
+		if (opts_decode(msg, promote_msg_error_format, &data,
+				ER_WRONG_PROMOTION_RECORD,
+				BOX_PROMOTION_FIELD_VALUE, region, 2) != 0)
+			return -1;
+		break;
+	default:
+		if (mp_typeof(*data) != MP_NIL) {
+			diag_set(ClientError, ER_WRONG_PROMOTION_RECORD,
+				 BOX_PROMOTION_FIELD_VALUE,
+				 tt_sprintf("'%s' has to have value nil",
+					    promote_msg_type_strs[msg->type]));
+			return -1;
+		}
+		mp_decode_nil(&data);
+		break;
+	};
+	return 0;
+}
+
+/**
+ * Send the promotion message via its writing into _promotion
+ * space.
+ * @param ap Variable length argument list. Contains a single
+ *        element - pointer to a promotion message to send.
+ *
+ * @retval -1 Error.
+ * @retval 0 Success.
+ */
+static int
+promote_send_f(va_list ap)
+{
+	const struct promote_msg *msg = va_arg(ap, const struct promote_msg *);
+	struct request request;
+	memset(&request, 0, sizeof(request));
+	request.type = IPROTO_INSERT;
+	request.space_id = BOX_PROMOTION_ID;
+	uint32_t size;
+	request.tuple = promote_msg_encode(msg, &size);
+	if (request.tuple == NULL)
+		return -1;
+	request.tuple_end = request.tuple + size;
+	return box_process_sys_dml(&request);
+}
+
+/**
+ * Wrapper for promote_send_f to send the message in a separate
+ * fiber. It is needed to be able to write records into _promotion
+ * space from on_commit trigger where core promotion logic is
+ * concentrated and a transaction exists already (though it is
+ * committed).
+ */
+static inline int
+promote_send(const struct promote_msg *msg)
+{
+	/*
+	 * Do nothing on recovery. If a message was sent on the
+	 * previous work session, it would be recovered among next
+	 * rows.
+	 */
+	if (! box_is_configured())
+		return 0;
+	struct fiber *sender = fiber_new("promote sender", promote_send_f);
+	if (sender == NULL)
+		return -1;
+	say_info("send promotion message: %s", promote_msg_str(msg));
+	fiber_set_joinable(sender, true);
+	fiber_start(sender, msg);
+	int rc = fiber_join(sender);
+	if (rc != 0) {
+		say_info("promotion message has not sent: %s",
+			 box_error_message(box_error_last()));
+	}
+	return rc;
+}
+
+/**
+ * Create the promotion message.
+ * @param[out] msg Message to create.
+ * @param type Type to set to @a msg.
+ */
+static inline void
+promote_msg_create(struct promote_msg *msg, enum promote_msg_type type)
+{
+	msg->round_id = promote_state.round_id;
+	msg->round_uuid = promote_state.round_uuid;
+	msg->source_uuid = INSTANCE_UUID;
+	msg->ts = fiber_time();
+	msg->type = type;
+	msg->step = ++promote_state.step;
+}
+
+/**
+ * Send a 'begin' promotion message. For this a new round is
+ * initialized and round_id is incremented.
+ */
+static inline int
+promote_send_begin(int quorum, double timeout)
+{
+	struct promote_msg msg;
+	promote_msg_create(&msg, PROMOTE_MSG_BEGIN);
+	tt_uuid_create(&msg.round_uuid);
+	msg.begin.quorum = quorum;
+	msg.begin.timeout = timeout;
+	msg.round_id++;
+	msg.step = 1;
+	return promote_send(&msg);
+}
+
+/**
+ * Send a 'status' promotion message. It contains a role of this
+ * instance. The message is sent as a response to 'begin' message.
+ */
+static inline int
+promote_send_status(void)
+{
+	struct promote_msg msg;
+	promote_msg_create(&msg, PROMOTE_MSG_STATUS);
+	msg.status.is_master = ! box_is_ro();
+	return promote_send(&msg);
+}
+
+/**
+ * Send a 'sync' promotion message. It is sent by this instance if
+ * it is an old master to be demoted. Sync brings this instance
+ * into read-only mode, while watchers and the initiator responds
+ * to this message with 'success'.
+ */
+static inline int
+promote_send_sync(void)
+{
+	struct promote_msg msg;
+	promote_msg_create(&msg, PROMOTE_MSG_SYNC);
+	return promote_send(&msg);
+}
+
+/**
+ * Send a 'success' promotion message. It is sent by a promotion
+ * watcher and an initiator as a response to 'sync' and by an old
+ * master when the sync is successfull. The later means the whole
+ * promotion round success.
+ */
+static inline int
+promote_send_success(void)
+{
+	struct promote_msg msg;
+	promote_msg_create(&msg, PROMOTE_MSG_SUCCESS);
+	return promote_send(&msg);
+}
+
+/**
+ * Send an 'error' promotion message. It is sent by any instace
+ * on different errors like timeout, multiple masters discovery,
+ * local errors (OOM, WAL error etc). This message is sent in
+ * scope of the current round and on commit terminates the local
+ * promotion state.
+ */
+static inline int
+promote_send_error(void)
+{
+	struct promote_msg msg;
+	promote_msg_create(&msg, PROMOTE_MSG_ERROR);
+	struct error *e = box_error_last();
+	msg.error.code = box_error_code(e);
+	msg.error.message = box_error_message(e);
+	return promote_send(&msg);
+}
+
+/**
+ * Send an 'error' promotion message out of scope of the current
+ * round. For example, as a response to unexpected message from
+ * another round while there are the current round active.
+ */
+static inline int
+promote_send_out_of_bound_error(int round_id, const struct tt_uuid *round_uuid,
+				int step)
+{
+	struct promote_msg msg;
+	promote_msg_create(&msg, PROMOTE_MSG_ERROR);
+	msg.round_id = round_id;
+	msg.round_uuid = *round_uuid;
+	struct error *e = box_error_last();
+	msg.error.code = box_error_code(e);
+	msg.error.message = box_error_message(e);
+	msg.step = step;
+	return promote_send(&msg);
+}
+
+int
+box_check_ro_is_mutable()
+{
+	if (! promote_state.is_role_committed)
+		return 0;
+	diag_set(ClientError, ER_CFG, "read_only", "can not change the option "\
+		 "when box.ctl.promote() was used");
+	return -1;
+}
+
+int
+box_ctl_promote(double timeout, int quorum)
+{
+	if (quorum < 0)
+		quorum = replicaset.applier.total;
+	if (! box_is_ro()) {
+		diag_set(ClientError, ER_PROMOTE, "non-initialized",
+			 "the initiator is already master");
+		return -1;
+	}
+	if (! promote_is_finished()) {
+		diag_set(ClientError, ER_PROMOTE_EXISTS);
+		return -1;
+	}
+	if (quorum <= replicaset.applier.total / 2) {
+		diag_set(ClientError, ER_PROMOTE, "non-initialized",
+			 tt_sprintf("too small quorum, expected > %d, "\
+				    "but got %d", replicaset.applier.total / 2,
+				    quorum));
+		return -1;
+	}
+	if (promote_send_begin(quorum, timeout) != 0)
+		return -1;
+
+	while (promote_state.phase != PROMOTE_PHASE_SUCCESS) {
+		fiber_cond_wait(&promote_state.on_change);
+		if (promote_state.phase == PROMOTE_PHASE_ERROR) {
+			assert(! diag_is_empty(&promote_state.diag));
+			diag_move(&promote_state.diag, diag_get());
+			return -1;
+		}
+	}
+	return 0;
+}
+
+/**
+ * Delete the promotion round with the specified id.
+ * @param id Round ID to delete by.
+ * @param[out] next_id ID of a next round.
+ * @param pk Primary index of the _promotion space.
+ *
+ * @retval 0 Success.
+ * @retval -1 Error.
+ */
+static inline int
+promote_clean_round(uint32_t id, uint32_t *next_id, struct index *pk)
+{
+	if (! promote_is_finished()) {
+		diag_set(ClientError, ER_PROMOTE_EXISTS);
+		return -1;
+	}
+	char key[16];
+	mp_encode_uint(key, id);
+	if (index_count(pk, ITER_ALL, NULL, 0) == 0)
+		return 0;
+	struct request request;
+	memset(&request, 0, sizeof(request));
+	request.type = IPROTO_DELETE;
+	request.space_id = BOX_PROMOTION_ID;
+	struct iterator *it = index_create_iterator(pk, ITER_GE, key, 1);
+	if (it == NULL)
+		return -1;
+	if (box_txn_begin() != 0) {
+		iterator_delete(it);
+		return -1;
+	}
+	struct tuple *t;
+	int rc;
+	while ((rc = iterator_next(it, &t)) == 0 && t != NULL) {
+		uint32_t key_size;
+		tuple_field_u32(t, BOX_PROMOTION_FIELD_ID, next_id);
+		if (*next_id != id)
+			break;
+		request.key = tuple_extract_key(t, pk->def->key_def, &key_size);
+		if (request.key == NULL)
+			goto rollback;
+		request.key_end = request.key + key_size;
+		if (box_process_sys_dml(&request) != 0)
+			goto rollback;
+	}
+	if (rc != 0 || box_txn_commit() != 0)
+		goto rollback;
+	iterator_delete(it);
+	return 0;
+rollback:
+	box_txn_rollback();
+	iterator_delete(it);
+	return -1;
+}
+
+int
+box_ctl_promote_reset(void)
+{
+	uint32_t id, next_id = 0;
+	struct index *pk = space_index(space_by_id(BOX_PROMOTION_ID), 0);
+	do {
+		id = next_id;
+		if (promote_clean_round(id, &next_id, pk) != 0)
+			return -1;
+	} while (id != next_id);
+	promote_state.phase = PROMOTE_PHASE_NON_ACTIVE;
+	promote_state.is_role_committed = false;
+	return 0;
+}
+
+/**
+ * Promotion timer worker function. It waits on the promotion
+ * state change condition variable at most timeout seconds and
+ * if the current round is not finished in time, the timeout error
+ * is committed.
+ */
+static int
+promote_timer_f(va_list ap)
+{
+	(void) ap;
+	assert(promote_state.timeout >= 0);
+	fiber_set_cancellable(true);
+	double timeout = promote_state.timeout;
+	double start = fiber_clock();
+	while (fiber_cond_wait_timeout(&promote_state.on_change,
+				       timeout) == 0) {
+		if (!promote_is_active() || fiber_is_cancelled())
+			goto stop;
+		timeout -= fiber_clock() - start;
+		start = fiber_clock();
+	}
+	if (!promote_is_active() || fiber_is_cancelled())
+		goto stop;
+	diag_set(ClientError, ER_TIMEOUT);
+	promote_state.step++;
+	promote_send_error();
+stop:
+	say_info("promotion timer is stopped");
+	assert(fiber() == promote_state.timer);
+	promote_state.timer = NULL;
+	return 0;
+}
+
+/**
+ * Start a promotion timer to terminate the current round on
+ * timeout.
+ */
+static int
+promote_start_timer(void)
+{
+	assert(promote_state.timer == NULL);
+	promote_state.timer = fiber_new("promote timer", promote_timer_f);
+	if (promote_state.timer == NULL)
+		return -1;
+	say_info("start promotion timer for %f seconds", promote_state.timeout);
+	fiber_start(promote_state.timer);
+	return 0;
+}
+
+void
+box_ctl_promote_info(struct info_handler *info)
+{
+	struct promote_state *s = &promote_state;
+	info_begin(info);
+	if (s->phase == PROMOTE_PHASE_NON_ACTIVE) {
+		info_end(info);
+		return;
+	}
+	info_append_int(info, "round_id", s->round_id);
+	info_append_str(info, "round_uuid", tt_uuid_str(&s->round_uuid));
+	if (promote_is_initiator_known()) {
+		info_append_str(info, "initiator_uuid",
+				tt_uuid_str(&s->initiator_uuid));
+		info_append_int(info, "quorum", s->quorum);
+		info_append_double(info, "timeout", s->timeout);
+	}
+	info_append_str(info, "role", promote_role_strs[s->role]);
+	info_append_str(info, "phase", promote_phase_strs[s->phase]);
+	info_append_str(info, "comment", s->comment);
+	if (promote_is_master_known()) {
+		info_append_str(info, "old_master_uuid",
+				tt_uuid_str(&s->old_master_uuid));
+	}
+	info_end(info);
+}
+
+void
+promote_process(const struct promote_msg *msg)
+{
+	if (box_is_configured()) {
+		say_info("promotion message has %s: %s",
+			 promote_msg_is_mine(msg) ? "commited" : "received",
+			 promote_msg_str(msg));
+	} else {
+		say_info("promotion message has recovered: %s",
+			 promote_msg_str(msg));
+	}
+	if (! promote_is_active()) {
+		if (msg->round_id <= promote_state.round_id) {
+			say_info("Ignored outdated round id %u, expected > %u",
+				 msg->round_id, promote_state.round_id);
+			return;
+		}
+		/*
+		 * During recovery there are no yields so do them
+		 * manually when needed to stop the timer. Avoid
+		 * starting a timer is not possible since only a
+		 * part of the round could be persisted, so after
+		 * the recovery is finished it is necessary to
+		 * commit an error on timeout, or finish the round
+		 * with success.
+		 */
+		if (promote_state.timer != NULL) {
+			assert(! box_is_configured());
+			fiber_cancel(promote_state.timer);
+			while (promote_state.timer != NULL)
+				fiber_sleep(0);
+		}
+		promote_state.step = 1;
+		promote_state.round_id = msg->round_id;
+		promote_state.round_uuid = msg->round_uuid;
+		promote_state.old_master_uuid = uuid_nil;
+		promote_state.initiator_uuid = uuid_nil;
+		diag_clear(&promote_state.diag);
+		promote_state.phase = PROMOTE_PHASE_IN_PROGRESS;
+		/*
+		 * Until 'status' message is commited, the role is
+		 * undefined. It is not possible to use
+		 * box_is_ro() right here since it can be
+		 * recovery. And by recovery of its own 'status'
+		 * messages the instance restores its read_only
+		 * flag and the role.
+		 */
+		promote_state.role = PROMOTE_ROLE_UNDEFINED;
+		promote_state.sync_count = 0;
+		promote_state.watcher_count = 0;
+		/*
+		 * Begin and quorum can not be set right now,
+		 * because the first message may be non-begin and
+		 * thus does not contain any round initial info.
+		 * It is called messages reordeing and it possible
+		 * when, for example, one instance downloads the
+		 * same round messages from two different
+		 * instances. Some of them can be received
+		 * earlier, but commited later breaking the order.
+		 * So it is not allowed to trust the order.
+		 */
+	} else if (!promote_is_this_round_msg(msg)) {
+		/*
+		 * Do not respond error on error, or else an
+		 * infinite error messages exchange will be
+		 * started.
+		 */
+		if (msg->type == PROMOTE_MSG_ERROR)
+			return;
+		diag_set(ClientError, ER_PROMOTE, tt_uuid_str(&msg->round_uuid),
+			 "unexpected message");
+		promote_send_out_of_bound_error(msg->round_id, &msg->round_uuid,
+						msg->step + 1);
+		return;
+	} else {
+		promote_state.step = MAX(msg->step, promote_state.step);
+	}
+	/*
+	 * The main processing switch. Here each instance of each
+	 * type responds to each type of message.
+	 */
+	switch (msg->type) {
+	case PROMOTE_MSG_BEGIN:
+		promote_state.initiator_uuid = msg->source_uuid;
+		promote_state.quorum = msg->begin.quorum;
+		promote_state.timeout = msg->begin.timeout;
+		if (promote_start_timer() != 0) {
+			promote_send_error();
+			break;
+		}
+		if (! promote_msg_is_mine(msg)) {
+			promote_send_status();
+		} else {
+			promote_state.role = PROMOTE_ROLE_INITIATOR;
+			promote_state.is_role_committed = true;
+			/*
+			 * If an instance sent 'begin' then it was
+			 * not a master at the moment of sending.
+			 * Recovery this status.
+			 */
+			box_set_ro(true);
+			box_expose_ro();
+			promote_comment("promotion is started, my promotion "\
+					"role is %s",
+					promote_role_strs[promote_state.role]);
+		}
+		break;
+
+	case PROMOTE_MSG_STATUS:
+		if (promote_state.role == PROMOTE_ROLE_UNDEFINED &&
+		    promote_msg_is_mine(msg)) {
+			/*
+			 * An instance can restore its role ONLY
+			 * by its own status messages and only on
+			 * commit. Even it've just sent the status
+			 * one moment earlier. Also the 'status'
+			 * message is used to recovery read_only.
+			 */
+			if (! box_is_ro())
+				promote_state.role = PROMOTE_ROLE_OLD_MASTER;
+			else
+				promote_state.role = PROMOTE_ROLE_WATCHER;
+			promote_state.is_role_committed = true;
+			box_set_ro(! msg->status.is_master);
+			box_expose_ro();
+			promote_comment("promotion is started, my promotion "\
+					"role is %s",
+					promote_role_strs[promote_state.role]);
+		}
+		if (msg->status.is_master) {
+			if (! promote_is_master_known()) {
+				promote_state.old_master_uuid =
+					msg->source_uuid;
+				if (promote_state.role !=
+				    PROMOTE_ROLE_OLD_MASTER)
+					break;
+				if (promote_msg_is_mine(msg)) {
+					/* Synced with self. */
+					promote_state.sync_count++;
+					promote_send_sync();
+					break;
+				}
+			}
+			const char *r, *m1, *m2;
+			r = tt_uuid_str(&msg->round_uuid);
+			m1 = tt_uuid_str(&msg->source_uuid);
+			m2 = tt_uuid_str(&promote_state.old_master_uuid);
+			/*
+			 * Sort master UUIDs to stabilize the
+			 * error message. Mostly for tests.
+			 */
+			if (strcmp(m1, m2) > 0)
+				SWAP(m1, m2);
+			diag_set(ClientError, ER_PROMOTE, r,
+				 tt_sprintf("two masters exist: '%s' and '%s'",
+					    m1, m2));
+			promote_send_error();
+			break;
+		}
+		++promote_state.watcher_count;
+		if (promote_state.role != PROMOTE_ROLE_INITIATOR ||
+		    !promote_is_cluster_readonly())
+			break;
+		/*
+		 * The cluster is readonly and 100% available.
+		 * Then the promotion is safe allowed. But the
+		 * initiator plays for an old master.
+		 */
+		promote_comment("the cluster is completely readonly, the "\
+				"initiator acts on behalf of an old master ");
+		/* Synced with self. */
+		promote_state.sync_count++;
+		promote_send_sync();
+		break;
+
+	case PROMOTE_MSG_SYNC:
+		if (promote_msg_is_mine(msg)) {
+			if (promote_state.role == PROMOTE_ROLE_OLD_MASTER) {
+				promote_comment("old master entered readonly "\
+						"mode to sync with slaves");
+				box_set_ro(true);
+				box_expose_ro();
+			} else {
+				assert(promote_state.role ==
+				       PROMOTE_ROLE_INITIATOR);
+				assert(promote_is_cluster_readonly());
+			}
+		} else {
+			if (promote_state.role == PROMOTE_ROLE_UNDEFINED) {
+				promote_state.is_role_committed = true;
+				promote_state.role = PROMOTE_ROLE_WATCHER;
+				promote_comment("promotion is started, 'sync' "\
+						"is received before my status "\
+						"was committed so I am not a "\
+						"master and not an initiator, "\
+						"but watcher");
+				box_set_ro(true);
+				box_expose_ro();
+			}
+			promote_send_success();
+		}
+		break;
+
+	case PROMOTE_MSG_SUCCESS:
+		switch (promote_state.role) {
+		case PROMOTE_ROLE_OLD_MASTER:
+			/*
+			 * The old master sends 'success' to
+			 * notify the initiator about the round
+			 * successfull finish.
+			 */
+			if (promote_msg_is_mine(msg)) {
+				promote_state.phase = PROMOTE_PHASE_SUCCESS;
+				promote_comment("the old master is demoted "\
+						"completely");
+			} else if (++promote_state.sync_count ==
+				   promote_state.quorum) {
+				/*
+				 * On commit the code above is
+				 * called. But do nothing until
+				 * the commit.
+				 */
+				promote_send_success();
+			}
+			break;
+		case PROMOTE_ROLE_INITIATOR:
+			/*
+			 * The round is finished successfully in
+			 * two cases: the old master've sent
+			 * 'success' or the cluster is read-only
+			 * and each replica've sent 'success'.
+			 */
+			if (tt_uuid_is_equal(&msg->source_uuid,
+					     &promote_state.old_master_uuid) ||
+			    (promote_is_cluster_readonly() &&
+			     ++promote_state.sync_count ==
+			     promote_state.quorum)) {
+				promote_comment("the new master is promoted");
+				promote_state.phase = PROMOTE_PHASE_SUCCESS;
+				box_set_ro(false);
+				box_expose_ro();
+			}
+			break;
+		case PROMOTE_ROLE_WATCHER:
+			if (promote_msg_is_mine(msg)) {
+				promote_state.phase = PROMOTE_PHASE_SUCCESS;
+				promote_comment("the watcher has voted and "\
+						"left the round");
+			}
+			break;
+		default:
+			assert(promote_state.role == PROMOTE_ROLE_UNDEFINED);
+			assert(! promote_msg_is_mine(msg));
+			break;
+		}
+		break;
+
+	case PROMOTE_MSG_ERROR:
+		if (promote_state.role == PROMOTE_ROLE_OLD_MASTER &&
+		    promote_state.phase == PROMOTE_PHASE_IN_PROGRESS &&
+		    box_is_ro()) {
+			promote_comment("the old master is back in read-write "\
+					"mode due to the error: %s",
+					 msg->error.message);
+			box_set_ro(false);
+			box_expose_ro();
+		} else {
+			promote_comment("the round failed due to the error: %s",
+					msg->error.message);
+		}
+		promote_state.phase = PROMOTE_PHASE_ERROR;
+		box_error_raise(msg->error.code, "%s", msg->error.message);
+		diag_move(diag_get(), &promote_state.diag);
+		break;
+	default:
+		break;
+	}
+	promote_state.round_id = MAX(promote_state.round_id, msg->round_id);
+	fiber_cond_broadcast(&promote_state.on_change);
+}
+
+int
+box_ctl_promote_init(void)
+{
+	memset(&promote_state, 0, sizeof(promote_state));
+	fiber_cond_create(&promote_state.on_change);
+	return 0;
+}
diff --git a/src/box/promote.h b/src/box/promote.h
new file mode 100644
index 000000000..e66140ade
--- /dev/null
+++ b/src/box/promote.h
@@ -0,0 +1,170 @@
+#ifndef INCLUDES_TARANTOOL_BOX_PROMOTE_H
+#define INCLUDES_TARANTOOL_BOX_PROMOTE_H
+/*
+ * Copyright 2010-2018, Tarantool AUTHORS, please see AUTHORS file.
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above
+ *    copyright notice, this list of conditions and the
+ *    following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above
+ *    copyright notice, this list of conditions and the following
+ *    disclaimer in the documentation and/or other materials
+ *    provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY <COPYRIGHT HOLDER> ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+ * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * <COPYRIGHT HOLDER> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
+ * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#include "tt_uuid.h"
+#include "diag.h"
+#include "fiber_cond.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif /* __cplusplus */
+
+struct info_handler;
+
+enum promote_msg_type {
+	PROMOTE_MSG_BEGIN = 0,
+	PROMOTE_MSG_STATUS,
+	PROMOTE_MSG_SYNC,
+	PROMOTE_MSG_SUCCESS,
+	PROMOTE_MSG_ERROR,
+	promote_msg_type_MAX,
+};
+
+/**
+ * Promotion message. The unit of communication between an
+ * initiator, an old master and watchers.
+ */
+struct promote_msg {
+	/**
+	 * Round ID. Together with round UUID composes an unique
+	 * round identifier. For details see promotion_state.
+	 */
+	int round_id;
+	/** Promotion round UUID, generated by the initiator. */
+	struct tt_uuid round_uuid;
+	/** UUID of the message sender. */
+	struct tt_uuid source_uuid;
+	/**
+	 * Timestamp of the message send time by the sender clock.
+	 * Just debug attribute, that is persisted.
+	 */
+	double ts;
+	/** Promotion message type. */
+	enum promote_msg_type type;
+	/** Step of the round on which the message was sent. */
+	int step;
+	/**
+	 * Depending on the message type, different attributes
+	 * are available in the message.
+	 */
+	union {
+		struct {
+			/**
+			 * 'Begin' promotion message carries
+			 * quorum and timeout of the new round
+			 * among other common things above.
+			 */
+			int quorum;
+			double timeout;
+		} begin;
+		struct {
+			/**
+			 * 'Status' message carries the sender
+			 * role.
+			 */
+			bool is_master;
+		} status;
+		struct {
+			/**
+			 * 'Error' message carries the error code
+			 * and message to be set in diag.
+			 */
+			int code;
+			const char *message;
+		} error;
+	};
+};
+
+/**
+ * Decode the MessagePack encoded promotion message into @a msg.
+ * @param data MessagePack data to decode. Tuple from _promotion.
+ * @param[out] msg Object to fill up.
+ *
+ * @retval -1 Error during decoding.
+ * @retval 0 Success.
+ */
+int
+promote_msg_decode(const char *data, struct promote_msg *msg);
+
+/**
+ * Process the promotion message, update the promotion state. The
+ * processing is executed on commit of @a msg.
+ * @param msg Message to process.
+ */
+void
+promote_process(const struct promote_msg *msg);
+
+/**
+ * Promote the current instance to be a master in the fullmesh
+ * master-master cluster. The old master, if exists, is demoted.
+ * Once a promotion attempt is done anywhere, manual change of
+ * read_only flag is disabled.
+ * @param timeout Timeout during which the promotion should be
+ *        finished.
+ * @param quorum The promotion quorum of instances who should
+ *        approve the promotion and sync with the old master
+ *        before demotion. The quorum should be at least half of
+ *        the cluster size + 1 and include the old master. If an
+ *        old master does not exist, then the quorum is ignored
+ *        and the promotion waits for 100% of the cluster
+ *        members.
+ *
+ * @retval -1 Error.
+ * @retval 0 Success.
+ */
+int
+box_ctl_promote(double timeout, int quorum);
+
+/**
+ * Show status of the current active promotion round or the last
+ * finished one.
+ * @param info Info handler to collect the info into.
+ */
+void
+box_ctl_promote_info(struct info_handler *info);
+
+/**
+ * Remove all the promotion rounds from the history. That allows
+ * to change read_only manually again.
+ */
+int
+box_ctl_promote_reset(void);
+
+/** Initialize the promotion subsystem. */
+int
+box_ctl_promote_init(void);
+
+#ifdef __cplusplus
+}
+#endif /* __cplusplus */
+
+#endif /* INCLUDES_TARANTOOL_BOX_PROMOTE_H */
diff --git a/src/cfg.c b/src/cfg.c
index 7c7d6e793..4d02a315e 100644
--- a/src/cfg.c
+++ b/src/cfg.c
@@ -153,3 +153,14 @@ cfg_getarr_elem(const char *name, int i)
 	lua_pop(tarantool_L, 2);
 	return val;
 }
+
+void
+cfg_rawsetb(const char *name, bool b)
+{
+	lua_getfield(tarantool_L, LUA_GLOBALSINDEX, "box");
+	lua_getfield(tarantool_L, -1, "cfg");
+	lua_pushstring(tarantool_L, name);
+	lua_pushboolean(tarantool_L, b);
+	lua_rawset(tarantool_L, -3);
+	lua_pop(tarantool_L, 2);
+}
diff --git a/src/cfg.h b/src/cfg.h
index 8499388b8..a7e400fe5 100644
--- a/src/cfg.h
+++ b/src/cfg.h
@@ -61,6 +61,9 @@ cfg_getarr_size(const char *name);
 const char *
 cfg_getarr_elem(const char *name, int i);
 
+void
+cfg_rawsetb(const char *name, bool b);
+
 #if defined(__cplusplus)
 } /* extern "C" */
 #endif /* defined(__cplusplus) */
diff --git a/src/main.cc b/src/main.cc
index a36a2b0d0..e6ab28b92 100644
--- a/src/main.cc
+++ b/src/main.cc
@@ -514,6 +514,7 @@ load_cfg()
 	 */
 	say_crit("%s %s", tarantool_package(), tarantool_version());
 	say_crit("log level %i", cfg_geti("log_level"));
+	box_set_ro(cfg_geti("read_only") != 0);
 
 	if (pid_file_handle != NULL) {
 		if (pidfile_write(pid_file_handle) == -1)
diff --git a/test/box/misc.result b/test/box/misc.result
index 4895a78a2..14183829c 100644
--- a/test/box/misc.result
+++ b/test/box/misc.result
@@ -397,8 +397,11 @@ t;
   - 'box.error.FUNCTION_EXISTS : 52'
   - 'box.error.UPDATE_ARG_TYPE : 26'
   - 'box.error.CROSS_ENGINE_TRANSACTION : 81'
-  - 'box.error.FORMAT_MISMATCH_INDEX_PART : 27'
   - 'box.error.injection : table: <address>
+  - 'box.error.INVALID_XLOG_TYPE : 125'
+  - 'box.error.PROTOCOL : 104'
+  - 'box.error.FORMAT_MISMATCH_INDEX_PART : 27'
+  - 'box.error.PROMOTE : 156'
   - 'box.error.FUNCTION_TX_ACTIVE : 30'
   - 'box.error.ITERATOR_TYPE : 72'
   - 'box.error.TRANSACTION_YIELD : 154'
@@ -472,8 +475,8 @@ t;
   - 'box.error.UNSUPPORTED_PRIV : 98'
   - 'box.error.WRONG_SCHEMA_VERSION : 109'
   - 'box.error.ROLLBACK_IN_SUB_STMT : 123'
-  - 'box.error.PROTOCOL : 104'
-  - 'box.error.INVALID_XLOG_TYPE : 125'
+  - 'box.error.WRONG_PROMOTION_RECORD : 157'
+  - 'box.error.PROMOTE_EXISTS : 158'
   - 'box.error.INDEX_PART_TYPE_MISMATCH : 24'
   - 'box.error.UNSUPPORTED_INDEX_FEATURE : 112'
 ...
diff --git a/test/promote/basic.result b/test/promote/basic.result
new file mode 100644
index 000000000..f70659963
--- /dev/null
+++ b/test/promote/basic.result
@@ -0,0 +1,472 @@
+test_run = require('test_run').new()
+---
+...
+test_run:create_cluster(CLUSTER, 'promote')
+---
+...
+test_run:wait_fullmesh(CLUSTER)
+---
+...
+--
+-- Check the promote actually allows to switch the master.
+--
+_ = test_run:switch('box1')
+---
+...
+-- Box1 is a master.
+box.cfg.read_only
+---
+- false
+...
+_ = test_run:switch('box2')
+---
+...
+-- Box2 is a slave.
+box.cfg.read_only
+---
+- true
+...
+-- And can not do DDL/DML.
+box.schema.create_space('test') -- Fail.
+---
+- error: Can't modify data because this instance is in read-only mode.
+...
+box.ctl.promote()
+---
+- true
+...
+promote_info()
+---
+- quorum: 4
+  timeout: 3153600000
+  initiator_uuid: box2
+  old_master_uuid: box1
+  role: initiator
+  round_id: 1
+  comment: the new master is promoted
+  phase: success
+  round_uuid: round_1
+...
+-- Now the slave has become a master.
+box.cfg.read_only
+---
+- false
+...
+-- And can do DDL/DML.
+s = box.schema.create_space('test')
+---
+...
+s:drop()
+---
+...
+_ = test_run:switch('box1')
+---
+...
+-- In turn, the old master is a slave now.
+box.cfg.read_only
+---
+- true
+...
+promote_info()
+---
+- quorum: 4
+  timeout: 3153600000
+  initiator_uuid: box2
+  old_master_uuid: box1
+  role: old master
+  round_id: 1
+  comment: the old master is demoted completely
+  phase: success
+  round_uuid: round_1
+...
+-- For him any DDL/DML is forbidden.
+box.schema.create_space('test2')
+---
+- error: Can't modify data because this instance is in read-only mode.
+...
+-- Check a watcher state.
+_ = test_run:switch('box3')
+---
+...
+promote_info()
+---
+- quorum: 4
+  timeout: 3153600000
+  initiator_uuid: box2
+  old_master_uuid: box1
+  role: watcher
+  round_id: 1
+  comment: the watcher has voted and left the round
+  phase: success
+  round_uuid: round_1
+...
+--
+-- Clear the basic successfull test and try different errors.
+--
+_ = test_run:switch('box2')
+---
+...
+box.ctl.promote_reset()
+---
+- true
+...
+promotion_history()
+---
+- []
+...
+prom = box.space._promotion
+---
+...
+-- Invalid UUIDs.
+prom:insert{1, 'invalid', 1, box.info.uuid, 1, 't'}
+---
+- error: 'Wrong record in _promotion (field 1): invalid UUID'
+...
+prom:insert{1, box.info.uuid, 1, 'invalid', 1, 't'}
+---
+- error: 'Wrong record in _promotion (field 3): invalid UUID'
+...
+-- Invalid ts.
+prom:insert{1, box.info.uuid, 1, box.info.uuid, -1, 't'}
+---
+- error: 'Wrong record in _promotion (field 5): wrong ts'
+...
+-- Invalid type.
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'invalid'}
+---
+- error: 'Wrong record in _promotion (field 6): wrong type'
+...
+-- Invalid type-specific options.
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'begin', {quorum = 1}}
+---
+- error: 'Wrong record in _promotion (field 7): expected 2 keys but got 1'
+...
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'begin', {quorum = 'invalid', timeout = 1}}
+---
+- error: 'Wrong record in _promotion (field 7): ''quorum'' must be unsigned'
+...
+map = setmetatable({}, {__serialize = 'map'})
+---
+...
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'status', map}
+---
+- error: 'Wrong record in _promotion (field 7): expected 1 keys but got 0'
+...
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'status', {is_master = 'invalid'}}
+---
+- error: 'Wrong record in _promotion (field 7): ''is_master'' must be boolean'
+...
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'error', map}
+---
+- error: 'Wrong record in _promotion (field 7): expected 2 keys but got 0'
+...
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'error', {code = 'code', message = 'msg'}}
+---
+- error: 'Wrong record in _promotion (field 7): ''code'' must be unsigned'
+...
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'sync', map}
+---
+- error: 'Wrong record in _promotion (field 7): ''sync'' has to have value nil'
+...
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'success', map}
+---
+- error: 'Wrong record in _promotion (field 7): ''success'' has to have value nil'
+...
+--
+-- Test simple invalid scenarios.
+--
+-- Already master.
+box.ctl.promote()
+---
+- null
+- 'Error during promotion with round UUID ''non-initialized'': the initiator is already
+  master'
+...
+_ = test_run:switch('box1')
+---
+...
+-- Small quorum.
+box.ctl.promote({quorum = 2})
+---
+- null
+- 'Error during promotion with round UUID ''non-initialized'': too small quorum, expected
+  > 2, but got 2'
+...
+-- Two masters.
+box.cfg{read_only = false}
+---
+...
+_ = test_run:switch('box3')
+---
+...
+promote_check_error()
+---
+- null
+- 'Error during promotion with round UUID ''round_2'': two masters exist: ''box1''
+  and ''box2'''
+...
+promotion_history_find_masters()
+---
+- - {'step': 2, 'value': {'is_master': true}, 'id': 2, 'type': 'status', 'source_uuid': 'box1',
+    'round_uuid': 'round_2'}
+  - {'step': 2, 'value': {'is_master': true}, 'id': 2, 'type': 'status', 'source_uuid': 'box2',
+    'round_uuid': 'round_2'}
+...
+box.cfg.read_only
+---
+- true
+...
+_ = test_run:switch('box1')
+---
+...
+box.cfg.read_only
+---
+- false
+...
+_ = test_run:switch('box2')
+---
+...
+box.cfg.read_only
+---
+- false
+...
+_ = test_run:switch('box4')
+---
+...
+box.cfg.read_only
+---
+- true
+...
+-- Box.cfg.read_only became immutable when promote had been
+-- called.
+box.cfg{read_only = false}
+---
+- error: 'Incorrect value for option ''read_only'': can not change the option when
+    box.ctl.promote() was used'
+...
+--
+-- Test recovery after failed promotion.
+--
+_ = test_run:cmd('restart server box2')
+---
+...
+_ = test_run:cmd('restart server box3')
+---
+...
+_ = test_run:switch('box2')
+---
+...
+info = promote_info()
+---
+...
+info.old_master_uuid == 'box1' or info.old_master_uuid == 'box2'
+---
+- true
+...
+info.old_master_uuid = nil
+---
+...
+info.comment = info.comment:match('two masters exist')
+---
+...
+info
+---
+- quorum: 4
+  timeout: 3153600000
+  initiator_uuid: box3
+  role: old master
+  round_id: 2
+  comment: two masters exist
+  phase: error
+  round_uuid: round_2
+...
+_ = test_run:switch('box3')
+---
+...
+info = promote_info()
+---
+...
+info.old_master_uuid == 'box1' or info.old_master_uuid == 'box2'
+---
+- true
+...
+info.old_master_uuid = nil
+---
+...
+info
+---
+- quorum: 4
+  timeout: 3153600000
+  initiator_uuid: box3
+  role: initiator
+  round_id: 2
+  comment: 'the round failed due to the error: Error during promotion with round UUID
+    ''round_2'': two masters exist: ''box1'' and ''box2'''
+  phase: error
+  round_uuid: round_2
+...
+--
+-- Test timeout.
+--
+_ = test_run:switch('box1')
+---
+...
+box.ctl.promote_reset()
+---
+- true
+...
+box.cfg{read_only = true}
+---
+...
+-- Now box2 is a single master.
+_ = test_run:switch('box3')
+---
+...
+promote_check_error({timeout = 0.00001})
+---
+- null
+- Timeout exceeded
+...
+promote_info()
+---
+- quorum: 4
+  initiator_uuid: box3
+  phase: error
+  role: initiator
+  round_id: 3
+  comment: 'the round failed due to the error: Timeout exceeded'
+  timeout: 1e-05
+  round_uuid: round_3
+...
+--
+-- Test the case when the cluster is not read-only, but a single
+-- master is not available now. In such a case the promote()
+-- should fail regardless of quorum.
+--
+_ = test_run:cmd('stop server box2')
+---
+...
+box.ctl.promote_reset()
+---
+- true
+...
+-- Quorum is 3 to test that the quorum must contain an old master.
+promote_check_error({timeout = 0.5, quorum = 3})
+---
+- null
+- Timeout exceeded
+...
+promote_info()
+---
+- quorum: 3
+  initiator_uuid: box3
+  phase: error
+  role: initiator
+  round_id: 4
+  comment: 'the round failed due to the error: Timeout exceeded'
+  timeout: 0.5
+  round_uuid: round_4
+...
+_ = test_run:switch('box1')
+---
+...
+_ = test_run:cmd('stop server box3')
+---
+...
+_ = test_run:cmd('start server box2')
+---
+...
+_ = test_run:switch('box2')
+---
+...
+info = promote_info({'round_id', 'comment', 'phase', 'round_uuid'})
+---
+...
+info.comment = info.comment:match('Timeout exceeded')
+---
+...
+info
+---
+- round_id: 4
+  comment: Timeout exceeded
+  phase: error
+  round_uuid: round_4
+...
+_ = test_run:cmd('start server box3')
+---
+...
+_ = test_run:switch('box3')
+---
+...
+promote_info({'round_id', 'comment', 'phase', 'round_uuid', 'role'})
+---
+- phase: error
+  role: initiator
+  round_id: 4
+  comment: 'the round failed due to the error: Timeout exceeded'
+  round_uuid: round_4
+...
+--
+-- Test promotion in a completely read-only cluster.
+--
+_ = test_run:switch('box2')
+---
+...
+box.ctl.promote_reset()
+---
+- true
+...
+box.cfg{read_only = true}
+---
+...
+box.ctl.promote()
+---
+- true
+...
+promote_info()
+---
+- quorum: 4
+  initiator_uuid: box2
+  phase: success
+  role: initiator
+  round_id: 5
+  comment: the new master is promoted
+  timeout: 3153600000
+  round_uuid: round_5
+...
+--
+-- Test promotion reset of several rounds.
+--
+_ = test_run:switch('box3')
+---
+...
+box.ctl.promote()
+---
+- true
+...
+promote_info()
+---
+- quorum: 4
+  timeout: 3153600000
+  initiator_uuid: box3
+  old_master_uuid: box2
+  role: initiator
+  round_id: 6
+  comment: the new master is promoted
+  phase: success
+  round_uuid: round_6
+...
+box.ctl.promote_reset()
+---
+- true
+...
+promotion_history()
+---
+- []
+...
+_ = test_run:switch('default')
+---
+...
+test_run:drop_cluster(CLUSTER)
+---
+...
diff --git a/test/promote/basic.test.lua b/test/promote/basic.test.lua
new file mode 100644
index 000000000..4138745b5
--- /dev/null
+++ b/test/promote/basic.test.lua
@@ -0,0 +1,160 @@
+test_run = require('test_run').new()
+test_run:create_cluster(CLUSTER, 'promote')
+test_run:wait_fullmesh(CLUSTER)
+--
+-- Check the promote actually allows to switch the master.
+--
+_ = test_run:switch('box1')
+-- Box1 is a master.
+box.cfg.read_only
+
+_ = test_run:switch('box2')
+-- Box2 is a slave.
+box.cfg.read_only
+-- And can not do DDL/DML.
+box.schema.create_space('test') -- Fail.
+
+box.ctl.promote()
+promote_info()
+-- Now the slave has become a master.
+box.cfg.read_only
+-- And can do DDL/DML.
+s = box.schema.create_space('test')
+s:drop()
+
+_ = test_run:switch('box1')
+-- In turn, the old master is a slave now.
+box.cfg.read_only
+promote_info()
+-- For him any DDL/DML is forbidden.
+box.schema.create_space('test2')
+
+-- Check a watcher state.
+_ = test_run:switch('box3')
+promote_info()
+
+--
+-- Clear the basic successfull test and try different errors.
+--
+_ = test_run:switch('box2')
+box.ctl.promote_reset()
+promotion_history()
+
+prom = box.space._promotion
+
+-- Invalid UUIDs.
+prom:insert{1, 'invalid', 1, box.info.uuid, 1, 't'}
+prom:insert{1, box.info.uuid, 1, 'invalid', 1, 't'}
+-- Invalid ts.
+prom:insert{1, box.info.uuid, 1, box.info.uuid, -1, 't'}
+-- Invalid type.
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'invalid'}
+-- Invalid type-specific options.
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'begin', {quorum = 1}}
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'begin', {quorum = 'invalid', timeout = 1}}
+
+map = setmetatable({}, {__serialize = 'map'})
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'status', map}
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'status', {is_master = 'invalid'}}
+
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'error', map}
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'error', {code = 'code', message = 'msg'}}
+
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'sync', map}
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'success', map}
+
+--
+-- Test simple invalid scenarios.
+--
+
+-- Already master.
+box.ctl.promote()
+_ = test_run:switch('box1')
+-- Small quorum.
+box.ctl.promote({quorum = 2})
+-- Two masters.
+box.cfg{read_only = false}
+_ = test_run:switch('box3')
+promote_check_error()
+promotion_history_find_masters()
+box.cfg.read_only
+_ = test_run:switch('box1')
+box.cfg.read_only
+_ = test_run:switch('box2')
+box.cfg.read_only
+_ = test_run:switch('box4')
+box.cfg.read_only
+-- Box.cfg.read_only became immutable when promote had been
+-- called.
+box.cfg{read_only = false}
+
+--
+-- Test recovery after failed promotion.
+--
+_ = test_run:cmd('restart server box2')
+_ = test_run:cmd('restart server box3')
+_ = test_run:switch('box2')
+info = promote_info()
+info.old_master_uuid == 'box1' or info.old_master_uuid == 'box2'
+info.old_master_uuid = nil
+info.comment = info.comment:match('two masters exist')
+info
+_ = test_run:switch('box3')
+info = promote_info()
+info.old_master_uuid == 'box1' or info.old_master_uuid == 'box2'
+info.old_master_uuid = nil
+info
+
+--
+-- Test timeout.
+--
+_ = test_run:switch('box1')
+box.ctl.promote_reset()
+box.cfg{read_only = true}
+-- Now box2 is a single master.
+_ = test_run:switch('box3')
+promote_check_error({timeout = 0.00001})
+promote_info()
+
+--
+-- Test the case when the cluster is not read-only, but a single
+-- master is not available now. In such a case the promote()
+-- should fail regardless of quorum.
+--
+_ = test_run:cmd('stop server box2')
+box.ctl.promote_reset()
+-- Quorum is 3 to test that the quorum must contain an old master.
+promote_check_error({timeout = 0.5, quorum = 3})
+promote_info()
+_ = test_run:switch('box1')
+_ = test_run:cmd('stop server box3')
+_ = test_run:cmd('start server box2')
+_ = test_run:switch('box2')
+info = promote_info({'round_id', 'comment', 'phase', 'round_uuid'})
+info.comment = info.comment:match('Timeout exceeded')
+info
+
+_ = test_run:cmd('start server box3')
+_ = test_run:switch('box3')
+promote_info({'round_id', 'comment', 'phase', 'round_uuid', 'role'})
+
+--
+-- Test promotion in a completely read-only cluster.
+--
+_ = test_run:switch('box2')
+box.ctl.promote_reset()
+box.cfg{read_only = true}
+box.ctl.promote()
+promote_info()
+
+--
+-- Test promotion reset of several rounds.
+--
+_ = test_run:switch('box3')
+box.ctl.promote()
+promote_info()
+box.ctl.promote_reset()
+promotion_history()
+
+_ = test_run:switch('default')
+test_run:drop_cluster(CLUSTER)
diff --git a/test/promote/box.lua b/test/promote/box.lua
new file mode 100644
index 000000000..97d952cae
--- /dev/null
+++ b/test/promote/box.lua
@@ -0,0 +1,8 @@
+#!/usr/bin/env tarantool
+os = require('os')
+
+box.cfg{ listen = os.getenv("LISTEN") }
+
+CLUSTER = { 'box1', 'box2', 'box3', 'box4' }
+
+require('console').listen(os.getenv('ADMIN'))
diff --git a/test/promote/box1.lua b/test/promote/box1.lua
new file mode 100644
index 000000000..eca667f96
--- /dev/null
+++ b/test/promote/box1.lua
@@ -0,0 +1,112 @@
+#!/usr/bin/env tarantool
+
+local INSTANCE_ID = string.match(arg[0], "%d")
+local SOCKET_DIR = require('fio').cwd()
+local read_only = INSTANCE_ID ~= '1'
+local function instance_uri(instance_id)
+    return SOCKET_DIR..'/promote'..instance_id..'.sock';
+end
+local uuid_prefix = '4d71c17c-8c50-11e8-9eb6-529269fb145'
+local uuid_to_name = {}
+for i = 1, 4 do
+    local uuid = uuid_prefix..tostring(i)
+    uuid_to_name[uuid] = 'box'..tostring(i)
+end
+require('console').listen(os.getenv('ADMIN'))
+
+fiber = require('fiber')
+errinj = box.error.injection
+
+box.cfg({
+    listen = instance_uri(INSTANCE_ID),
+    replication = {instance_uri(1), instance_uri(2),
+                   instance_uri(3), instance_uri(4)},
+    read_only = read_only,
+    replication_connect_timeout = 0.1,
+    replication_timeout = 0.1,
+    instance_uuid = uuid_prefix..tostring(INSTANCE_ID),
+})
+
+local round_uuid_to_id = {}
+
+function uuid_free_str(str)
+    for uuid, id in pairs(round_uuid_to_id) do
+        local template = string.gsub(uuid, '%-', '%%-')
+        str = string.gsub(str, template, 'round_'..tostring(id))
+    end
+    for uuid, name in pairs(uuid_to_name) do
+        local template = string.gsub(uuid, '%-', '%%-')
+        str = string.gsub(str, template, name)
+    end
+    return str
+end
+
+function promotion_history()
+    local ret = {}
+    local prev_round_uuid
+    for i, t in box.space._promotion:pairs() do
+        t = setmetatable(t:tomap({names_only = true}), {__serialize = 'map'})
+        round_uuid_to_id[t.round_uuid] = t.id
+        t.round_uuid = 'round_'..tostring(t.id)
+        t.source_uuid = uuid_to_name[t.source_uuid]
+        t.ts = nil
+        if t.value == box.NULL then
+            t.value = nil
+        end
+        if t.type == 'error' then
+            t.value.message = uuid_free_str(t.value.message)
+        end
+        table.insert(ret, t)
+    end
+    return ret
+end
+
+-- For recovery rescan round_uuids.
+promotion_history()
+
+function promote_check_error(...)
+    local ok, err = box.ctl.promote(...)
+    if not ok then
+        promotion_history()
+        err = uuid_free_str(err:unpack().message)
+    end
+    return ok, err
+end
+
+function promotion_history_find_masters()
+    local res = {}
+    for _, record in pairs(promotion_history()) do
+        if record.type == 'status' and record.value.is_master then
+            table.insert(res, record)
+        end
+    end
+    return res
+end
+
+function promote_info(fields)
+    local info = box.ctl.promote_info()
+    if fields then
+        local tmp = {}
+        for _, k in pairs(fields) do
+            tmp[k] = info[k]
+        end
+        info = tmp
+    end
+    if info.old_master_uuid then
+        info.old_master_uuid = uuid_free_str(info.old_master_uuid)
+    end
+    if info.round_uuid then
+        info.round_uuid = 'round_'..tostring(info.round_id)
+    end
+    if info.initiator_uuid then
+        info.initiator_uuid = uuid_free_str(info.initiator_uuid)
+    end
+    if info.comment then
+        info.comment = uuid_free_str(info.comment)
+    end
+    return info
+end
+
+box.once("bootstrap", function()
+    box.schema.user.grant('guest', 'read,write,execute', 'universe')
+end)
diff --git a/test/promote/box2.lua b/test/promote/box2.lua
new file mode 120000
index 000000000..77f1e2aab
--- /dev/null
+++ b/test/promote/box2.lua
@@ -0,0 +1 @@
+box1.lua
\ No newline at end of file
diff --git a/test/promote/box3.lua b/test/promote/box3.lua
new file mode 120000
index 000000000..77f1e2aab
--- /dev/null
+++ b/test/promote/box3.lua
@@ -0,0 +1 @@
+box1.lua
\ No newline at end of file
diff --git a/test/promote/box4.lua b/test/promote/box4.lua
new file mode 120000
index 000000000..77f1e2aab
--- /dev/null
+++ b/test/promote/box4.lua
@@ -0,0 +1 @@
+box1.lua
\ No newline at end of file
diff --git a/test/promote/errinj.result b/test/promote/errinj.result
new file mode 100644
index 000000000..fe837239e
--- /dev/null
+++ b/test/promote/errinj.result
@@ -0,0 +1,222 @@
+test_run = require('test_run').new()
+---
+...
+test_run:create_cluster(CLUSTER, 'promote')
+---
+...
+test_run:wait_fullmesh(CLUSTER)
+---
+...
+--
+-- Test the case when two different promotions are started at the
+-- same time. Here the initiators are box2 and box3 while box1 is
+-- an old master and box4 is a watcher.
+--
+_ = test_run:switch('box1')
+---
+...
+errinj.set("ERRINJ_WAL_DELAY", true)
+---
+- ok
+...
+_ = test_run:switch('box2')
+---
+...
+errinj.set("ERRINJ_WAL_DELAY", true)
+---
+- ok
+...
+_ = test_run:switch('box3')
+---
+...
+errinj.set("ERRINJ_WAL_DELAY", true)
+---
+- ok
+...
+_ = test_run:switch('box2')
+---
+...
+err = nil
+---
+...
+ok = nil
+---
+...
+_ = fiber.create(function() ok, err = promote_check_error() end)
+---
+...
+_ = test_run:switch('box3')
+---
+...
+err = nil
+---
+...
+ok = nil
+---
+...
+f = fiber.create(function() ok, err = promote_check_error() end)
+---
+...
+while f:status() ~= 'suspended' do fiber.sleep(0.01) end
+---
+...
+errinj.set("ERRINJ_WAL_DELAY", false)
+---
+- ok
+...
+_ = test_run:switch('box2')
+---
+...
+errinj.set("ERRINJ_WAL_DELAY", false)
+---
+- ok
+...
+while not err do fiber.sleep(0.01) end
+---
+...
+ok, err
+---
+- null
+- 'Error during promotion with round UUID ''round_1'': unexpected message'
+...
+_ = test_run:switch('box1')
+---
+...
+errinj.set("ERRINJ_WAL_DELAY", false)
+---
+- ok
+...
+while promote_info().phase ~= 'error' do fiber.sleep(0.01) end
+---
+...
+info = promote_info()
+---
+...
+info.comment = info.comment:match('unexpected message')
+---
+...
+info
+---
+- quorum: 4
+  timeout: 3153600000
+  initiator_uuid: box3
+  old_master_uuid: box1
+  role: old master
+  round_id: 1
+  comment: unexpected message
+  phase: error
+  round_uuid: round_1
+...
+_ = test_run:switch('box3')
+---
+...
+while not err do fiber.sleep(0.01) end
+---
+...
+ok, err
+---
+- null
+- 'Error during promotion with round UUID ''round_1'': unexpected message'
+...
+--
+-- Test that after all a new promotion works.
+--
+box.ctl.promote()
+---
+- true
+...
+promote_info()
+---
+- quorum: 4
+  timeout: 3153600000
+  initiator_uuid: box3
+  old_master_uuid: box1
+  role: initiator
+  round_id: 2
+  comment: the new master is promoted
+  phase: success
+  round_uuid: round_2
+...
+--
+-- Test the case when during a promotion round an initiator is
+-- restarted after sending 'begin' and the round had been failed
+-- on timeout. On recovery the initiator has to detect by 'begin'
+-- that it was read only and make 'read_only' option be immutable
+-- for a user despite the fact that 'status' is never sent by this
+-- instance.
+--
+-- The test plan: disable watchers, start a promotion round, turn
+-- the initiator off, wait until the round is failed due to
+-- timeout, turn the initiator on. It should catch its own
+-- begin + error and went to read only mode, even if box.cfg was
+-- called with read_only = false.
+--
+_ = test_run:cmd('stop server box2')
+---
+...
+_ = test_run:cmd('stop server box4')
+---
+...
+-- Box1 is an initiator, box3 is an old master.
+_ = test_run:switch('box1')
+---
+...
+-- Do reset and snapshot to do not replay the previous round on
+-- restart.
+box.ctl.promote_reset()
+---
+- true
+...
+box.snapshot()
+---
+- ok
+...
+_ = fiber.create(function() box.ctl.promote({timeout = 0.1}) end)
+---
+...
+_ = test_run:switch('box3')
+---
+...
+while box.space._promotion:count() == 0 do fiber.sleep(0.01) end
+---
+...
+_ = test_run:cmd('stop server box1')
+---
+...
+while box.ctl.promote_info().phase ~= 'error' do fiber.sleep(0.01) end
+---
+...
+_ = test_run:cmd('start server box1')
+---
+...
+_ = test_run:switch('box1')
+---
+...
+promote_info()
+---
+- quorum: 4
+  timeout: 0.1
+  initiator_uuid: box1
+  old_master_uuid: box3
+  role: initiator
+  round_id: 3
+  comment: 'the round failed due to the error: Timeout exceeded'
+  phase: error
+  round_uuid: round_3
+...
+box.cfg.read_only
+---
+- true
+...
+_ = test_run:cmd('start server box2')
+---
+...
+_ = test_run:cmd('start server box4')
+---
+...
+_ = test_run:switch('default')
+---
+...
+test_run:drop_cluster(CLUSTER)
+---
+...
diff --git a/test/promote/errinj.test.lua b/test/promote/errinj.test.lua
new file mode 100644
index 000000000..63cb5e59e
--- /dev/null
+++ b/test/promote/errinj.test.lua
@@ -0,0 +1,87 @@
+test_run = require('test_run').new()
+test_run:create_cluster(CLUSTER, 'promote')
+test_run:wait_fullmesh(CLUSTER)
+--
+-- Test the case when two different promotions are started at the
+-- same time. Here the initiators are box2 and box3 while box1 is
+-- an old master and box4 is a watcher.
+--
+_ = test_run:switch('box1')
+errinj.set("ERRINJ_WAL_DELAY", true)
+
+_ = test_run:switch('box2')
+errinj.set("ERRINJ_WAL_DELAY", true)
+
+_ = test_run:switch('box3')
+errinj.set("ERRINJ_WAL_DELAY", true)
+
+_ = test_run:switch('box2')
+err = nil
+ok = nil
+_ = fiber.create(function() ok, err = promote_check_error() end)
+
+_ = test_run:switch('box3')
+err = nil
+ok = nil
+f = fiber.create(function() ok, err = promote_check_error() end)
+while f:status() ~= 'suspended' do fiber.sleep(0.01) end
+errinj.set("ERRINJ_WAL_DELAY", false)
+
+_ = test_run:switch('box2')
+errinj.set("ERRINJ_WAL_DELAY", false)
+while not err do fiber.sleep(0.01) end
+ok, err
+
+_ = test_run:switch('box1')
+errinj.set("ERRINJ_WAL_DELAY", false)
+while promote_info().phase ~= 'error' do fiber.sleep(0.01) end
+info = promote_info()
+info.comment = info.comment:match('unexpected message')
+info
+
+_ = test_run:switch('box3')
+while not err do fiber.sleep(0.01) end
+ok, err
+
+--
+-- Test that after all a new promotion works.
+--
+box.ctl.promote()
+promote_info()
+
+--
+-- Test the case when during a promotion round an initiator is
+-- restarted after sending 'begin' and the round had been failed
+-- on timeout. On recovery the initiator has to detect by 'begin'
+-- that it was read only and make 'read_only' option be immutable
+-- for a user despite the fact that 'status' is never sent by this
+-- instance.
+--
+-- The test plan: disable watchers, start a promotion round, turn
+-- the initiator off, wait until the round is failed due to
+-- timeout, turn the initiator on. It should catch its own
+-- begin + error and went to read only mode, even if box.cfg was
+-- called with read_only = false.
+--
+_ = test_run:cmd('stop server box2')
+_ = test_run:cmd('stop server box4')
+-- Box1 is an initiator, box3 is an old master.
+_ = test_run:switch('box1')
+-- Do reset and snapshot to do not replay the previous round on
+-- restart.
+box.ctl.promote_reset()
+box.snapshot()
+_ = fiber.create(function() box.ctl.promote({timeout = 0.1}) end)
+_ = test_run:switch('box3')
+while box.space._promotion:count() == 0 do fiber.sleep(0.01) end
+_ = test_run:cmd('stop server box1')
+while box.ctl.promote_info().phase ~= 'error' do fiber.sleep(0.01) end
+_ = test_run:cmd('start server box1')
+_ = test_run:switch('box1')
+promote_info()
+box.cfg.read_only
+_ = test_run:cmd('start server box2')
+_ = test_run:cmd('start server box4')
+
+_ = test_run:switch('default')
+test_run:drop_cluster(CLUSTER)
diff --git a/test/promote/suite.ini b/test/promote/suite.ini
new file mode 100644
index 000000000..9c94cb465
--- /dev/null
+++ b/test/promote/suite.ini
@@ -0,0 +1,6 @@
+[default]
+core = tarantool
+description = Promotion tests
+script = box.lua
+release_disabled = errinj.test.lua
+is_parallel = True
-- 
2.15.2 (Apple Git-101.1)




More information about the Tarantool-patches mailing list