Tarantool development patches archive
* [PATCH 0/8] box.ctl.promote
@ 2018-08-07 22:03 Vladislav Shpilevoy
  2018-08-07 22:03 ` [PATCH 1/8] rfc: describe box.ctl.promote protocol Vladislav Shpilevoy
                   ` (7 more replies)
  0 siblings, 8 replies; 13+ messages in thread
From: Vladislav Shpilevoy @ 2018-08-07 22:03 UTC
  To: tarantool-patches; +Cc: vdavydov.dev

Replicaset master promotion is a procedure that atomically makes
one slave the new master and the old master a slave in a full-mesh
master-slave replicaset.

The promotion follows the protocol described in detail in the
corresponding RFC. In short, the protocol collects a quorum of
instances that approve the promotion, syncs data with the old
master and demotes it.
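
The RFC in patch 1 requires a mandatory quorum of 50% + 1
instances. A minimal sketch of that bound follows. The helper
names are hypothetical and are not taken from the patchset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helpers, for illustration only. They are not part
 * of src/box/promote.c.
 */

/** Smallest quorum that is a strict majority: 50% + 1. */
static inline uint32_t
promote_min_quorum(uint32_t cluster_size)
{
	assert(cluster_size > 0);
	return cluster_size / 2 + 1;
}

/** Check a user-requested quorum against the cluster size. */
static inline bool
promote_quorum_is_valid(uint32_t quorum, uint32_t cluster_size)
{
	return quorum >= promote_min_quorum(cluster_size) &&
	       quorum <= cluster_size;
}
```

Two disjoint groups of promote_min_quorum(n) instances would need
more than n instances in total, so two rounds can not both collect
a quorum in the same cluster.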

The patchset consists of several preparation commits, the most
important of which describes the promotion protocol in detail. The
last two patches are the promotion itself and its garbage
collection implementation.

Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-3055-box-ctl-promote
Issue: https://github.com/tarantool/tarantool/issues/3055

Vladislav Shpilevoy (8):
  rfc: describe box.ctl.promote protocol
  box: rename process_rw to process_dml
  Add 'exact_field_count' parameter to options decoder
  box: remove orphan check from box_is_ro()
  Fix gcov on Mac
  box: introduce _promotion space
  box: introduce box.ctl.promote
  box: introduce promotion GC

 cmake/profile.cmake                   |   12 +-
 doc/rfc/3055-box_ctl_promote.md       |  237 +++++++
 doc/rfc/3055-box_ctl_promote_img1.svg |    2 +
 src/box/CMakeLists.txt                |    1 +
 src/box/alter.cc                      |   80 ++-
 src/box/alter.h                       |    1 +
 src/box/bootstrap.snap                |  Bin 1540 -> 1635 bytes
 src/box/box.cc                        |   47 +-
 src/box/box.h                         |   44 +-
 src/box/errcode.h                     |    3 +
 src/box/iproto.cc                     |    2 +-
 src/box/key_def.c                     |    2 +-
 src/box/lua/cfg.cc                    |    9 +-
 src/box/lua/ctl.c                     |   82 +++
 src/box/lua/info.c                    |    2 +-
 src/box/lua/space.cc                  |    2 +
 src/box/lua/upgrade.lua               |   19 +
 src/box/opt_def.c                     |   13 +-
 src/box/opt_def.h                     |   16 +-
 src/box/promote.c                     | 1086 +++++++++++++++++++++++++++++++++
 src/box/promote.h                     |  170 ++++++
 src/box/schema.cc                     |   15 +
 src/box/schema_def.h                  |   14 +
 src/cfg.c                             |   11 +
 src/cfg.h                             |    3 +
 src/main.cc                           |    1 +
 test/app-tap/tarantoolctl.test.lua    |    4 +-
 test/box-py/bootstrap.result          |    8 +-
 test/box/access_misc.result           |    4 +
 test/box/access_sysview.result        |    6 +-
 test/box/alter.result                 |    6 +-
 test/box/misc.result                  |    9 +-
 test/promote/basic.result             |  495 +++++++++++++++
 test/promote/basic.test.lua           |  171 ++++++
 test/promote/box.lua                  |    8 +
 test/promote/box1.lua                 |  112 ++++
 test/promote/box2.lua                 |    1 +
 test/promote/box3.lua                 |    1 +
 test/promote/box4.lua                 |    1 +
 test/promote/errinj.result            |  222 +++++++
 test/promote/errinj.test.lua          |   87 +++
 test/promote/suite.ini                |    6 +
 test/wal_off/alter.result             |    2 +-
 test/xlog/upgrade.result              |    8 +-
 44 files changed, 2970 insertions(+), 55 deletions(-)
 create mode 100644 doc/rfc/3055-box_ctl_promote.md
 create mode 100644 doc/rfc/3055-box_ctl_promote_img1.svg
 create mode 100644 src/box/promote.c
 create mode 100644 src/box/promote.h
 create mode 100644 test/promote/basic.result
 create mode 100644 test/promote/basic.test.lua
 create mode 100644 test/promote/box.lua
 create mode 100644 test/promote/box1.lua
 create mode 120000 test/promote/box2.lua
 create mode 120000 test/promote/box3.lua
 create mode 120000 test/promote/box4.lua
 create mode 100644 test/promote/errinj.result
 create mode 100644 test/promote/errinj.test.lua
 create mode 100644 test/promote/suite.ini

-- 
2.15.2 (Apple Git-101.1)


* [PATCH 1/8] rfc: describe box.ctl.promote protocol
  2018-08-07 22:03 [PATCH 0/8] box.ctl.promote Vladislav Shpilevoy
@ 2018-08-07 22:03 ` Vladislav Shpilevoy
  2018-08-07 22:03 ` [PATCH 2/8] box: rename process_rw to process_dml Vladislav Shpilevoy
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Vladislav Shpilevoy @ 2018-08-07 22:03 UTC
  To: tarantool-patches; +Cc: vdavydov.dev

Part of #3055
---
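
Note: the RFC below defines five message types for the `_promotion`
space: 'begin', 'status', 'sync', 'success' and 'error'. The sketch
below shows one way to decode the type field and reject anything
else. The enum and function names are hypothetical and are not
taken from src/box/promote.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names, for illustration only. */
enum promote_msg_type {
	PROMOTE_MSG_BEGIN = 0,
	PROMOTE_MSG_STATUS,
	PROMOTE_MSG_SYNC,
	PROMOTE_MSG_SUCCESS,
	PROMOTE_MSG_ERROR,
	promote_msg_type_MAX,
};

/** Type names exactly as the RFC spells them. */
static const char *promote_msg_type_strs[promote_msg_type_MAX] = {
	"begin", "status", "sync", "success", "error",
};

/**
 * Decode the 'type' field of a _promotion tuple.
 * @retval promote_msg_type_MAX The name is not a known type.
 */
static enum promote_msg_type
promote_msg_type_by_name(const char *name)
{
	assert(name != NULL);
	for (int i = 0; i < promote_msg_type_MAX; i++) {
		if (strcmp(name, promote_msg_type_strs[i]) == 0)
			return (enum promote_msg_type) i;
	}
	return promote_msg_type_MAX;
}
```

An unknown name such as 'commit' (discussed in the RFC only as an
alternative) decodes to promote_msg_type_MAX, so a caller can
reject the tuple.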
 doc/rfc/3055-box_ctl_promote.md       | 237 ++++++++++++++++++++++++++++++++++
 doc/rfc/3055-box_ctl_promote_img1.svg |   2 +
 2 files changed, 239 insertions(+)
 create mode 100644 doc/rfc/3055-box_ctl_promote.md
 create mode 100644 doc/rfc/3055-box_ctl_promote_img1.svg

diff --git a/doc/rfc/3055-box_ctl_promote.md b/doc/rfc/3055-box_ctl_promote.md
new file mode 100644
index 000000000..3f8e854e6
--- /dev/null
+++ b/doc/rfc/3055-box_ctl_promote.md
@@ -0,0 +1,237 @@
+# Replicaset master promotion
+
+* **Status**: In progress
+* **Start date**: 02-03-2018
+* **Authors**: Vladislav Shpilevoy @Gerold103 \<v.shpilevoy@tarantool.org\>,
+Konstantin Osipov @kostja \<kostja@tarantool.org\>
+* **Issues**: [#3055](https://github.com/tarantool/tarantool/issues/3055),
+[#2625](https://github.com/tarantool/tarantool/issues/2625)
+
+## Summary
+
+Replicaset master promotion is a procedure that atomically makes one slave the
+new master and the old master a slave in a full-mesh master-slave replicaset.
+A master is a replica in read-write mode; a slave is one in read-only mode.
+
+Master promotion has the following API:
+```Lua
+--
+-- When called on a slave, promotes it to master and demotes the
+-- old master to slave. When called on a master, returns an error.
+-- @param opts Optional settings:
+--        * timeout - the time within which the promotion must
+--          finish;
+--        * quorum - before the old master is demoted, its data
+--          must be synced with at least quorum slaves,
+--          including the one being promoted.
+--
+-- @retval true Promotion is started.
+-- @retval nil, error Can not start promotion.
+--
+box.ctl.promote(opts)
+
+--
+-- Status of the latest finished or the currently working
+-- promotion round.
+-- @retval Empty table. Promote() has not been called since the
+--         instance started, or it was started on another
+--         instance that has not yet sent promotion info to the
+--         current instance.
+-- @retval status A table with the format:
+--    {
+--         round_id = <Promotion ID>,
+--         round_uuid = <Promotion round UUID>,
+--         initiator_uuid = <UUID of the promotion initiator>,
+--         timeout = <Timeout of the promotion round>,
+--         quorum = <Requested quorum>,
+--         role = <The instance role in the round: old master,
+--                 watcher, initiator, undefined>,
+--         phase = <The round phase: success, error, in progress>,
+--         comment = <A human readable comment about the current
+--                    promotion status>,
+--         old_master_uuid = <UUID of the old master>,
+--    }
+--
+box.ctl.promote_info()
+
+--
+-- Remove info about all promotions from the entire cluster.
+--
+box.ctl.promote_reset()
+```
+
+## Background and motivation
+
+The promote procedure greatly simplifies developers' lives because they no
+longer have to perform all of the promotion steps manually, which is generally
+not a trivial task, as the algorithm description below shows.
+
+The common algorithm, disregarding failures and their processing, consists of
+the following steps:
+1. On the old master, stop accepting DDL/DML and serve only DQL;
+2. Wait until all master data is received by the required number of slaves,
+including the new master candidate;
+3. Make the old master a slave;
+4. Make the slave the new master;
+5. Notify all other slaves that the master has changed.
+
+All of the steps are persisted in the WAL, which guarantees that a promotion
+participant does not forget about the promotion even after it is restarted.
+Persistence, together with the mandatory quorum of 50% + 1 instances,
+eliminates any possibility of the cluster having two masters after a
+promotion.
+
+## Detailed design
+
+Each cluster member has a special system space, `_promotion`, that distributes
+promotion steps over the cluster via replication channels. Each record in the
+space is a promotion message sent by one of the instances.
+```Lua
+format = {}
+-- ID of the promotion round. Each round has a unique identifier
+-- of two parts: ID and UUID. ID is used to order rounds by the
+-- time of their start. Each new round has an ID greater than all
+-- the known previous ones. Timestamps can not be used since
+-- clocks are not perfectly synced over the network.
+format[1] = {'id', 'unsigned'}
+
+-- UUID of the promotion round. The UUID is generated by the
+-- promotion initiator and protects against the error of
+-- promotions being started on different nodes at the same time
+-- with the same round ID. Their UUIDs differ because their
+-- initiators differ.
+format[2] = {'round_uuid', 'string'}
+
+-- The promotion round step. It is a constantly growing number
+-- for each promotion participant and is used to persist the
+-- order of sent messages. Each instance orders its messages by
+-- step number. Steps are also used to persist the relative
+-- order of messages from different sources.
+format[3] = {'step', 'unsigned'}
+
+-- UUID of the sender instance.
+format[4] = {'source_uuid', 'string'}
+
+-- Timestamp of the message dispatch by the sender's clock.
+-- A debug-only attribute that is persisted.
+format[5] = {'ts', 'unsigned'}
+
+-- Type is what the sender wants to get or send. Value depends on
+-- type.
+format[6] = {'type', 'string'}
+
+-- Depending on the message type, different values are stored.
+format[7] = {'value', 'any', is_nullable = true}
+--
+--            Here the type-value pairs are described.
+--
+-- 'begin'   - the message sent by a promotion initiator to start
+--             a round. Value contains promotion metadata: quorum
+--             and timeout.
+--
+-- 'status'  - the message sent by all promotion participants. It
+--             has several goals: to cope with the cases when the
+--             cluster has no master or has several masters, and
+--             to persist the read-only cfg flag for recovery.
+--
+-- 'sync'    - the message sent by an old master to sync with the
+--             slaves. Value is nil.
+--
+-- 'success' - the message sent by a slave in response to 'sync'
+--             and by an old master when all syncs are collected.
+--
+-- 'error'   - an error that can be sent by any cluster member.
+--             For example, it can be a failed sync, a detected
+--             existing promotion, or a timeout. Value is the
+--             error description.
+--
+s = box.schema.create_space('_promotion', {format = format})
+```
+To participate in a promotion, a cluster member just writes into the
+`_promotion` space and waits until the record is replicated. A garbage
+collector clears finished promotions, with error or success status, from this
+space. Only the latest promotion is kept to allow role restoration on recovery.
+
+The protocol is described below. The image shows its state machine:
+![alt text](https://raw.githubusercontent.com/tarantool/tarantool/2e591965dfb4603ac1b197621c9c8eb5e8eb8d9f/doc/rfc/3328-wire_protocol_img1.svg?sanitize=true)
+
+In the simplest case the instance being promoted is already a master: the
+promotion finishes immediately with an error, and nothing is persisted. Now
+assume promote() is called on a slave. First, the initiator broadcasts a
+`begin` request with the promotion metadata: quorum and timeout.
+
+Each cluster member that receives the `begin` checks whether it already knows
+about another active promotion. If it does, it responds with `error` to the
+newer promotion request. Otherwise it broadcasts a `status` message.
+
+If the cluster has no master, the promotion initiator detects this by collecting
+statuses from all of the cluster members. In that case the initiator, on behalf
+of an old master, syncs with the slaves and becomes the master.
+
+If the cluster has a master, but it is not available, then the initiator
+terminates the round after timeout. Consider the case when a master exists and
+is active.
+
+An old master that gets the `begin` request enters read-only mode and broadcasts
+a `sync` request. A slave that gets `sync` finishes its participation in the
+round by responding `success`. The old master collects a quorum of `success`es,
+including the promotion initiator's. Once it has collected them, it writes its
+own `success`. The initiator, on getting `success` from the master, enters
+read-write mode and becomes the new master.
+
+### Recovery
+
+Recovery is quite simple and merely replays all promotion messages from the
+`_promotion` space. But it also does a tricky thing under the hood: it recovers
+the `box.cfg.read_only` flag. Consider how this is possible.
+
+During a promotion round, three cases persist `read_only` one way or
+another.
+1. When an instance sends a `status` message, its `not read_only` is stored in
+the message's value as `is_master`.
+2. When an instance sends a `begin` message, it is an initiator, and only a
+non-master can start a promotion. So if an instance has sent `begin`,
+it has `read_only = true`.
+3. Because messages from different replication sources can be reordered, a
+non-initiator instance may receive a `sync` message before it manages to
+send `status`. Then the instance is a watcher and has `read_only = true`.
+
+On recovery the `read_only` value has to be restored to exactly the value
+that was persisted, because other instances are already aware of it. And it
+can not be changed manually until the promotion history is cleaned up.
+
+## Rationale and alternatives
+
+The protocol has several disputable details.
+
+One could notice that an old master on `begin` sends two messages, `status`
+and `sync`, with no intermediate message, and that `begin` serves both to
+notify about the new promotion round and to trigger an old master sync. This
+slightly complicates calculating step numbers and figuring out a round result
+from the history records. But it reduces the number of messages and message
+types and optimizes the most common case: regular master promotion in a
+master-slave full-mesh cluster.
+
+An alternative is to add a new phase between collecting `status`es and the
+`sync` of the old master. It could make the promotion protocol a bit simpler to
+implement and understand, but it would obviously take longer.
+
+
+Another option is to add a new message `commit`. This could be a message written
+by an initiator when it receives `success` from the old master. The `commit`
+message makes it clear when a round has finished successfully: when the
+initiator has a `commit` record. Currently an old master may send `success`
+while the initiator terminates the round on timeout before receiving it. Then
+the cluster becomes read-only until the next promotion, and on recovery it is
+hard to see that the round failed although the old master sent `success`. A
+drawback of this proposal is one more message type and one more message.
+
+
+Apart from the branches above, there is a major available improvement of the
+whole protocol: allow promotion when an old master exists but is down. The main
+challenge of such a relaxation is how to learn whether the old master is really
+down or the promotion initiator just has no network link to it. To detect the
+old master failure, a quorum of other instances is necessary. And here another
+problem arises when a replicaset consists of only two instances: what should a
+promotion initiator do when its neighbour is not available? Is it down, or
+is it a network problem?


* [PATCH 2/8] box: rename process_rw to process_dml
  2018-08-07 22:03 [PATCH 0/8] box.ctl.promote Vladislav Shpilevoy
  2018-08-07 22:03 ` [PATCH 1/8] rfc: describe box.ctl.promote protocol Vladislav Shpilevoy
@ 2018-08-07 22:03 ` Vladislav Shpilevoy
  2018-08-13  8:20   ` Vladimir Davydov
  2018-08-07 22:03 ` [PATCH 3/8] Add 'exact_field_count' parameter to options decoder Vladislav Shpilevoy
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 13+ messages in thread
From: Vladislav Shpilevoy @ 2018-08-07 22:03 UTC
  To: tarantool-patches; +Cc: vdavydov.dev

This fixes the mess of rw/dml/1 notions. In iproto_msg we have
dml, which is executed via process1, which calls process_rw,
which calls space_execute_dml. Let's just rename all these things
to dml.
---
 src/box/box.cc    | 20 ++++++++++----------
 src/box/box.h     | 14 ++++++++------
 src/box/iproto.cc |  2 +-
 3 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index ee12d5738..6eb358442 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -161,7 +161,7 @@ box_check_memtx_min_tuple_size(ssize_t memtx_min_tuple_size)
 }
 
 static int
-process_rw(struct request *request, struct space *space, struct tuple **result)
+process_dml(struct request *request, struct space *space, struct tuple **result)
 {
 	assert(iproto_type_is_dml(request->type));
 	rmean_collect(rmean_box, request->type, 1);
@@ -301,7 +301,7 @@ apply_row(struct xstream *stream, struct xrow_header *row)
 		return;
 	}
 	struct space *space = space_cache_find_xc(request.space_id);
-	if (process_rw(&request, space, NULL) != 0) {
+	if (process_dml(&request, space, NULL) != 0) {
 		say_error("error applying row: %s", request_str(&request));
 		diag_raise();
 	}
@@ -901,7 +901,7 @@ boxk(int type, uint32_t space_id, const char *format, ...)
 	struct space *space = space_cache_find(space_id);
 	if (space == NULL)
 		return -1;
-	return process_rw(&request, space, NULL);
+	return process_dml(&request, space, NULL);
 }
 
 int
@@ -965,7 +965,7 @@ box_index_id_by_name(uint32_t space_id, const char *name, uint32_t len)
 /** \endcond public */
 
 int
-box_process1(struct request *request, box_tuple_t **result)
+box_process_dml(struct request *request, box_tuple_t **result)
 {
 	/* Allow to write to temporary spaces in read-only mode. */
 	struct space *space = space_cache_find(request->space_id);
@@ -975,7 +975,7 @@ box_process1(struct request *request, box_tuple_t **result)
 	    space_group_id(space) != GROUP_LOCAL &&
 	    box_check_writable() != 0)
 		return -1;
-	return process_rw(request, space, result);
+	return process_dml(request, space, result);
 }
 
 int
@@ -1064,7 +1064,7 @@ box_insert(uint32_t space_id, const char *tuple, const char *tuple_end,
 	request.space_id = space_id;
 	request.tuple = tuple;
 	request.tuple_end = tuple_end;
-	return box_process1(&request, result);
+	return box_process_dml(&request, result);
 }
 
 int
@@ -1078,7 +1078,7 @@ box_replace(uint32_t space_id, const char *tuple, const char *tuple_end,
 	request.space_id = space_id;
 	request.tuple = tuple;
 	request.tuple_end = tuple_end;
-	return box_process1(&request, result);
+	return box_process_dml(&request, result);
 }
 
 int
@@ -1093,7 +1093,7 @@ box_delete(uint32_t space_id, uint32_t index_id, const char *key,
 	request.index_id = index_id;
 	request.key = key;
 	request.key_end = key_end;
-	return box_process1(&request, result);
+	return box_process_dml(&request, result);
 }
 
 int
@@ -1114,7 +1114,7 @@ box_update(uint32_t space_id, uint32_t index_id, const char *key,
 	/** Legacy: in case of update, ops are passed in in request tuple */
 	request.tuple = ops;
 	request.tuple_end = ops_end;
-	return box_process1(&request, result);
+	return box_process_dml(&request, result);
 }
 
 int
@@ -1134,7 +1134,7 @@ box_upsert(uint32_t space_id, uint32_t index_id, const char *tuple,
 	request.tuple = tuple;
 	request.tuple_end = tuple_end;
 	request.index_base = index_base;
-	return box_process1(&request, result);
+	return box_process_dml(&request, result);
 }
 
 /**
diff --git a/src/box/box.h b/src/box/box.h
index e2e06d977..9e13378d9 100644
--- a/src/box/box.h
+++ b/src/box/box.h
@@ -391,14 +391,16 @@ box_sequence_reset(uint32_t seq_id);
 /** \endcond public */
 
 /**
- * The main entry point to the
- * Box: callbacks into the request processor.
- * These are function pointers since they can
- * change when entering/leaving read-only mode
- * (master->slave propagation).
+ * The main entry point to DML operations:
+ * INSERT/REPLACE/DELETE/UPDATE/UPSERT.
+ * @param request Request to process.
+ * @param[out] result Result tuple, can be NULL.
+ *
+ * @retval 0 Success.
+ * @retval -1 Error.
  */
 int
-box_process1(struct request *request, box_tuple_t **result);
+box_process_dml(struct request *request, box_tuple_t **result);
 
 int
 boxk(int type, uint32_t space_id, const char *format, ...);
diff --git a/src/box/iproto.cc b/src/box/iproto.cc
index bb7d2b868..f8b419c26 100644
--- a/src/box/iproto.cc
+++ b/src/box/iproto.cc
@@ -1368,7 +1368,7 @@ tx_process1(struct cmsg *m)
 	struct obuf_svp svp;
 	struct obuf *out;
 	tx_inject_delay();
-	if (box_process1(&msg->dml, &tuple) != 0)
+	if (box_process_dml(&msg->dml, &tuple) != 0)
 		goto error;
 	out = msg->connection->tx.p_obuf;
 	if (iproto_prepare_select(out, &svp) != 0)
-- 
2.15.2 (Apple Git-101.1)


* [PATCH 3/8] Add 'exact_field_count' parameter to options decoder
  2018-08-07 22:03 [PATCH 0/8] box.ctl.promote Vladislav Shpilevoy
  2018-08-07 22:03 ` [PATCH 1/8] rfc: describe box.ctl.promote protocol Vladislav Shpilevoy
  2018-08-07 22:03 ` [PATCH 2/8] box: rename process_rw to process_dml Vladislav Shpilevoy
@ 2018-08-07 22:03 ` Vladislav Shpilevoy
  2018-08-13  8:30   ` Vladimir Davydov
  2018-08-07 22:03 ` [PATCH 4/8] box: remove orphan check from box_is_ro() Vladislav Shpilevoy
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 13+ messages in thread
From: Vladislav Shpilevoy @ 2018-08-07 22:03 UTC
  To: tarantool-patches; +Cc: vdavydov.dev

Needed for promotion. Promotion uses the system space
_promotion, into which a user can write tuples directly
without using the API (and we can not do anything about it),
so _promotion should strictly validate each field
of each tuple since it affects the cluster state.

For this, a new parameter of the options decoder is introduced
that checks for an exact field count.

Needed for #3055
---
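
Note: the semantics of the new parameter can be sketched in
isolation. The predicate below mirrors the condition added to
opts_decode() in this patch. The function names are hypothetical,
and the msgpack decoding is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/**
 * Hypothetical stand-alone form of the check added to
 * opts_decode(): zero disables the check, any other value must
 * match the decoded map size exactly.
 */
static bool
opts_field_count_is_ok(uint32_t map_size, uint32_t exact_field_count)
{
	return exact_field_count == 0 || map_size == exact_field_count;
}

/** Usage example covering the three interesting cases. */
static void
opts_field_count_example(void)
{
	/* Existing callers pass 0: any map size is accepted. */
	assert(opts_field_count_is_ok(7, 0));
	/* A strict caller gets a match only on the exact size. */
	assert(opts_field_count_is_ok(3, 3));
	assert(!opts_field_count_is_ok(2, 3));
}
```

Because zero means "do not check", the parameter can not be used
to require an empty map.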
 src/box/alter.cc  | 10 +++++-----
 src/box/key_def.c |  2 +-
 src/box/opt_def.c | 13 ++++++++-----
 src/box/opt_def.h | 16 +++++++++++++++-
 4 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/src/box/alter.cc b/src/box/alter.cc
index 3007a131d..d13ecb783 100644
--- a/src/box/alter.cc
+++ b/src/box/alter.cc
@@ -239,7 +239,7 @@ index_opts_decode(struct index_opts *opts, const char *map)
 {
 	index_opts_create(opts);
 	if (opts_decode(opts, index_opts_reg, &map, ER_WRONG_INDEX_OPTIONS,
-			BOX_INDEX_FIELD_OPTS, NULL) != 0)
+			BOX_INDEX_FIELD_OPTS, NULL, 0) != 0)
 		diag_raise();
 	if (opts->distance == rtree_index_distance_type_MAX) {
 		tnt_raise(ClientError, ER_WRONG_INDEX_OPTIONS,
@@ -403,8 +403,8 @@ space_opts_decode(struct space_opts *opts, const char *data)
 				flags++;
 		}
 	} else if (opts_decode(opts, space_opts_reg, &data,
-			       ER_WRONG_SPACE_OPTIONS,
-			       BOX_SPACE_FIELD_OPTS, NULL) != 0) {
+			       ER_WRONG_SPACE_OPTIONS, BOX_SPACE_FIELD_OPTS,
+			       NULL, 0) != 0) {
 		diag_raise();
 	}
 }
@@ -2382,8 +2382,8 @@ coll_id_def_new_from_tuple(const struct tuple *tuple, struct coll_id_def *def)
 
 	assert(base->type == COLL_TYPE_ICU);
 	if (opts_decode(&base->icu, coll_icu_opts_reg, &options,
-			ER_WRONG_COLLATION_OPTIONS,
-			BOX_COLLATION_FIELD_OPTIONS, NULL) != 0)
+			ER_WRONG_COLLATION_OPTIONS, BOX_COLLATION_FIELD_OPTIONS,
+			NULL, 0) != 0)
 		diag_raise();
 
 	if (base->icu.french_collation == coll_icu_on_off_MAX) {
diff --git a/src/box/key_def.c b/src/box/key_def.c
index ee09dc99d..bc0c1ba35 100644
--- a/src/box/key_def.c
+++ b/src/box/key_def.c
@@ -454,7 +454,7 @@ key_def_decode_parts(struct key_part_def *parts, uint32_t part_count,
 		*part = key_part_def_default;
 		if (opts_decode(part, part_def_reg, data,
 				ER_WRONG_INDEX_OPTIONS, i + TUPLE_INDEX_BASE,
-				NULL) != 0)
+				NULL, 0) != 0)
 			return -1;
 		if (part->type == field_type_MAX) {
 			diag_set(ClientError, ER_WRONG_INDEX_OPTIONS,
diff --git a/src/box/opt_def.c b/src/box/opt_def.c
index cd93c23b8..6710f8187 100644
--- a/src/box/opt_def.c
+++ b/src/box/opt_def.c
@@ -176,13 +176,10 @@ opts_parse_key(void *opts, const struct opt_def *reg, const char *key,
 	return 0;
 }
 
-/**
- * Populate key options from their msgpack-encoded representation
- * (msgpack map).
- */
 int
 opts_decode(void *opts, const struct opt_def *reg, const char **map,
-	    uint32_t errcode, uint32_t field_no, struct region *region)
+	    uint32_t errcode, uint32_t field_no, struct region *region,
+	    uint32_t exact_field_count)
 {
 	assert(mp_typeof(**map) == MP_MAP);
 
@@ -191,6 +188,12 @@ opts_decode(void *opts, const struct opt_def *reg, const char **map,
 	 * DDL is not performance-critical, so this is not a problem.
 	 */
 	uint32_t map_size = mp_decode_map(map);
+	if (map_size != exact_field_count && exact_field_count != 0) {
+		diag_set(ClientError, errcode, field_no,
+			 tt_sprintf("expected %u keys but got %u",
+				    exact_field_count, map_size));
+		return -1;
+	}
 	for (uint32_t i = 0; i < map_size; i++) {
 		if (mp_typeof(**map) != MP_STR) {
 			diag_set(ClientError, errcode, field_no,
diff --git a/src/box/opt_def.h b/src/box/opt_def.h
index 633832af9..4cfebf62a 100644
--- a/src/box/opt_def.h
+++ b/src/box/opt_def.h
@@ -83,10 +83,24 @@ struct region;
 /**
  * Populate key options from their msgpack-encoded representation
  * (msgpack map).
+ * @param[out] opts Where to decode options to.
+ * @param reg Field definitions array.
+ * @param map MessagePack to decode.
+ * @param errcode Error code to set on any error. The error code
+ *        has to accept field number and description as
+ *        parameters.
+ * @param field_no Field number to set for @a errcode.
+ * @param region Region for dynamic allocations such as strings.
+ * @param exact_field_count If non-zero, then @a map should
+ *        contain exactly this count of fields.
+ *
+ * @retval 0 Success.
+ * @retval -1 Error.
  */
 int
 opts_decode(void *opts, const struct opt_def *reg, const char **map,
-	    uint32_t errcode, uint32_t field_no, struct region *region);
+	    uint32_t errcode, uint32_t field_no, struct region *region,
+	    uint32_t exact_field_count);
 
 /**
  * Decode one option and store it into @a opts struct as a field.
-- 
2.15.2 (Apple Git-101.1)


* [PATCH 4/8] box: remove orphan check from box_is_ro()
  2018-08-07 22:03 [PATCH 0/8] box.ctl.promote Vladislav Shpilevoy
                   ` (2 preceding siblings ...)
  2018-08-07 22:03 ` [PATCH 3/8] Add 'exact_field_count' parameter to options decoder Vladislav Shpilevoy
@ 2018-08-07 22:03 ` Vladislav Shpilevoy
  2018-08-13  8:34   ` Vladimir Davydov
  2018-08-07 22:03 ` [PATCH 5/8] Fix gcov on Mac Vladislav Shpilevoy
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 13+ messages in thread
From: Vladislav Shpilevoy @ 2018-08-07 22:03 UTC
  To: tarantool-patches; +Cc: vdavydov.dev

box_is_ro currently checks for both 'read_only' and 'orphan' modes,
but promotion needs only 'read_only', and currently there is no
method to get the current 'read_only' value. After replacing
box_is_ro with box_is_writable at its call sites, it is possible
to reimplement box_is_ro as a getter for the 'read_only' option.
---
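
Note: the difference between the two predicates after this patch
shows up only for an orphan instance. Below is a minimal
stand-alone sketch. The two flags are local stand-ins for the ones
used in src/box/box.cc, and box_ro_example() is a hypothetical
helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the flags used in src/box/box.cc. */
static bool is_ro = true;
static bool is_orphan = false;

/** Same body as in this patch. */
static bool
box_is_writable(void)
{
	return !is_ro && !is_orphan;
}

/** After this patch: a plain getter for 'read_only'. */
static bool
box_is_ro(void)
{
	return is_ro;
}

/** Hypothetical walk through the interesting states. */
static void
box_ro_example(void)
{
	/* An orphan with read_only = false is not writable. */
	is_ro = false;
	is_orphan = true;
	assert(!box_is_ro());
	assert(!box_is_writable());

	/* Only a non-orphan with read_only = false is writable. */
	is_orphan = false;
	assert(!box_is_ro());
	assert(box_is_writable());
}
```

This is why box_wait_ro() and box.info.ro switch to
!box_is_writable() in the diff below: they keep their old
behaviour, while box_is_ro() now reports only the option value.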
 src/box/box.cc     | 10 ++++++++--
 src/box/box.h      |  3 +++
 src/box/lua/info.c |  2 +-
 3 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 6eb358442..d8fbc6252 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -216,17 +216,23 @@ box_set_ro(bool ro)
 	fiber_cond_broadcast(&ro_cond);
 }
 
+bool
+box_is_writable(void)
+{
+	return !is_ro && !is_orphan;
+}
+
 bool
 box_is_ro(void)
 {
-	return is_ro || is_orphan;
+	return is_ro;
 }
 
 int
 box_wait_ro(bool ro, double timeout)
 {
 	double deadline = ev_monotonic_now(loop()) + timeout;
-	while (box_is_ro() != ro) {
+	while (!box_is_writable() != ro) {
 		if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0)
 			return -1;
 		if (fiber_is_cancelled()) {
diff --git a/src/box/box.h b/src/box/box.h
index 9e13378d9..29618c9f8 100644
--- a/src/box/box.h
+++ b/src/box/box.h
@@ -86,6 +86,9 @@ box_atfork(void);
 void
 box_set_ro(bool ro);
 
+bool
+box_is_writable(void);
+
 bool
 box_is_ro(void);
 
diff --git a/src/box/lua/info.c b/src/box/lua/info.c
index d6697df9c..42729bea3 100644
--- a/src/box/lua/info.c
+++ b/src/box/lua/info.c
@@ -242,7 +242,7 @@ lbox_info_signature(struct lua_State *L)
 static int
 lbox_info_ro(struct lua_State *L)
 {
-	lua_pushboolean(L, box_is_ro());
+	lua_pushboolean(L, ! box_is_writable());
 	return 1;
 }
 
-- 
2.15.2 (Apple Git-101.1)


* [PATCH 5/8] Fix gcov on Mac
  2018-08-07 22:03 [PATCH 0/8] box.ctl.promote Vladislav Shpilevoy
                   ` (3 preceding siblings ...)
  2018-08-07 22:03 ` [PATCH 4/8] box: remove orphan check from box_is_ro() Vladislav Shpilevoy
@ 2018-08-07 22:03 ` Vladislav Shpilevoy
  2018-08-07 22:03 ` [PATCH 6/8] box: introduce _promotion space Vladislav Shpilevoy
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Vladislav Shpilevoy @ 2018-08-07 22:03 UTC
  To: tarantool-patches; +Cc: vdavydov.dev

---
 cmake/profile.cmake | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/cmake/profile.cmake b/cmake/profile.cmake
index 278399155..866c8a787 100644
--- a/cmake/profile.cmake
+++ b/cmake/profile.cmake
@@ -1,12 +1,16 @@
-check_library_exists (gcov __gcov_flush  ""  HAVE_GCOV)
+check_library_exists(gcov __gcov_flush  ""  HAVE_GCOV)
 
 set(ENABLE_GCOV_DEFAULT OFF)
 option(ENABLE_GCOV "Enable integration with gcov, a code coverage program" ${ENABLE_GCOV_DEFAULT})
 
 if (ENABLE_GCOV)
     if (NOT HAVE_GCOV)
-    message (FATAL_ERROR
-         "ENABLE_GCOV option requested but gcov library is not found")
+        if (CMAKE_COMPILER_IS_CLANG)
+            message(WARNING "GCOV is available on clang from 3.0.0")
+            set(HAVE_GCOV 1)
+        else()
+            message(FATAL_ERROR "ENABLE_GCOV option requested but gcov library is not found")
+        endif()
     endif()
 
     add_compile_flags("C;CXX"
@@ -18,8 +22,6 @@ if (ENABLE_GCOV)
     set (CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -ftest-coverage")
     set (CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -fprofile-arcs")
     set (CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -ftest-coverage")
-
-   # add_library(gcov SHARED IMPORTED)
 endif()
 
 if (NOT CMAKE_BUILD_TYPE STREQUAL "Debug")
-- 
2.15.2 (Apple Git-101.1)


* [PATCH 6/8] box: introduce _promotion space
  2018-08-07 22:03 [PATCH 0/8] box.ctl.promote Vladislav Shpilevoy
                   ` (4 preceding siblings ...)
  2018-08-07 22:03 ` [PATCH 5/8] Fix gcov on Mac Vladislav Shpilevoy
@ 2018-08-07 22:03 ` Vladislav Shpilevoy
  2018-08-07 22:03 ` [PATCH 7/8] box: introduce box.ctl.promote Vladislav Shpilevoy
  2018-08-07 22:03 ` [PATCH 8/8] box: introduce promotion GC Vladislav Shpilevoy
  7 siblings, 0 replies; 13+ messages in thread
From: Vladislav Shpilevoy @ 2018-08-07 22:03 UTC (permalink / raw)
  To: tarantool-patches; +Cc: vdavydov.dev

The _promotion space keeps info about finished and ongoing promotions.

Needed for #3055
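
The primary index declared by this patch covers the first four
fields: id, round_uuid, step, source_uuid. Below is a minimal C
sketch of the resulting record order. It is not code from the
patch: struct promotion_key and promotion_key_cmp() are
hypothetical names, and strcmp() only approximates how the tree
index compares string fields.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory copy of the four indexed fields. */
struct promotion_key {
	uint64_t id;
	const char *round_uuid;
	uint64_t step;
	const char *source_uuid;
};

/*
 * Compare two keys part by part in index order. Returns a value
 * less than, equal to or greater than zero, like strcmp().
 */
static int
promotion_key_cmp(const struct promotion_key *a,
		  const struct promotion_key *b)
{
	if (a->id != b->id)
		return a->id < b->id ? -1 : 1;
	int rc = strcmp(a->round_uuid, b->round_uuid);
	if (rc != 0)
		return rc;
	if (a->step != b->step)
		return a->step < b->step ? -1 : 1;
	return strcmp(a->source_uuid, b->source_uuid);
}
```

Under these assumptions records are grouped by round first, and
inside a round they are ordered by step and then by sender.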
---
 src/box/alter.cc                   |  11 +++++++++++
 src/box/alter.h                    |   1 +
 src/box/bootstrap.snap             | Bin 1540 -> 1635 bytes
 src/box/lua/space.cc               |   2 ++
 src/box/lua/upgrade.lua            |  19 +++++++++++++++++++
 src/box/schema.cc                  |  15 +++++++++++++++
 src/box/schema_def.h               |  14 ++++++++++++++
 test/app-tap/tarantoolctl.test.lua |   4 ++--
 test/box-py/bootstrap.result       |   8 +++++++-
 test/box/access_misc.result        |   4 ++++
 test/box/access_sysview.result     |   6 +++---
 test/box/alter.result              |   6 ++++--
 test/wal_off/alter.result          |   2 +-
 test/xlog/upgrade.result           |   8 +++++++-
 14 files changed, 90 insertions(+), 10 deletions(-)

diff --git a/src/box/alter.cc b/src/box/alter.cc
index d13ecb783..7a7325038 100644
--- a/src/box/alter.cc
+++ b/src/box/alter.cc
@@ -2924,6 +2924,13 @@ on_replace_dd_cluster(struct trigger *trigger, void *event)
 	txn_on_commit(txn, on_commit);
 }
 
+static void
+on_replace_dd_promotion(struct trigger *trigger, void *event)
+{
+	(void) trigger;
+	(void) event;
+}
+
 /* }}} cluster configuration */
 
 /* {{{ sequence */
@@ -3240,6 +3247,10 @@ struct trigger alter_space_on_replace_index = {
 	RLIST_LINK_INITIALIZER, on_replace_dd_index, NULL, NULL
 };
 
+struct trigger alter_space_on_replace_promotion = {
+	RLIST_LINK_INITIALIZER, on_replace_dd_promotion, NULL, NULL
+};
+
 struct trigger on_replace_truncate = {
 	RLIST_LINK_INITIALIZER, on_replace_dd_truncate, NULL, NULL
 };
diff --git a/src/box/alter.h b/src/box/alter.h
index fb5f65a68..c62ca3c95 100644
--- a/src/box/alter.h
+++ b/src/box/alter.h
@@ -34,6 +34,7 @@
 
 extern struct trigger alter_space_on_replace_space;
 extern struct trigger alter_space_on_replace_index;
+extern struct trigger alter_space_on_replace_promotion;
 extern struct trigger on_replace_truncate;
 extern struct trigger on_replace_schema;
 extern struct trigger on_replace_user;
diff --git a/src/box/bootstrap.snap b/src/box/bootstrap.snap
index b610828c9c9ae9a22acdd8c150c16c6838b7a273..ece20feaa47441dcf4b56fe90d91dfcf29e30341 100644
GIT binary patch
delta 1630
zcmV-k2BG<c4C4%t7=JM>GBGhNXE!rrFfcG;Wn~IUZgX^DZewLSAU82MF)=qVI4xmh
zH#IFZFg9c@I5;;pEjeLjHe)qrIb$<4GYVEiY;R+0Iv{&7Iv_B83JTS_3%bn(<N(eC
zi%dMF0000004TLD{QywiE&vKZ01ipeR&4?RUp&AUi}{oaK!0rxAO&i`VclW@4-Aq0
z;(_c(#Xv2kQlw;#OCt2zX2qO_dWxy1$Y873Hftj?!D<IxPlxf$F&9iJrN{ul0M`I@
zj^b?%Z7zl}@OmLA6JK0^?g`89^9t|Wi9USfM<|DwUl0tv&WNrE_#Z>xx;eJWSo1A=
z@x^05KtZMtZ-0HOysc+>yPHuJibO;!1+aCFFMV5Bi!&Bq<=1=@3Bg`xQ~}sJ#~?;-
zoM<sb{|h~w`-}={E7sOIRs%GLc<Xn~f^3~*5=HuO=W|;!YU>=gr`hh`Y~OZiPBJpj
zolrA1G^h7e-iiC|qHH!93{v#--9c|gYASS2C@Mc8dVi+Kn50C>c><u$pSI5NR}W$r
z)?!h|z4yTr9{q<rU%i^Suok<Lat&bmcSkX>b&f3FFHHaLT}d~UATM{7tCLZf1-8zy
z^WQ;U?y9b&FA9XT1p1049uo5TFXK)|T@Vy;bw<>5z}7i-@VZKp->*wTDHyPIj`=<q
zhCHD6dViL8y^`z8cj<F8ZVl;QJ<oV1KHsq-`8$x`T@h_5(DT@}Uwh{2W|W1%)%upV
zr9+i$bwO3ha}T}l10So7GG67E5qa;5yOYt<02|s13QJZpsK*H+x?^G|M+YZHaZ2Ej
zNR3E|NQEF}l>;%kl&y2L<PeTS-JonxH5<(~vwz9JpkOv2m=7;8FDF~)2=wAX>p*H8
zTwq9-7t;ol22%#szAOqDS}wF$Xwg_)SSqlXYAsQqBv2DtYAH>iRD!K@Tw$3ckdyB~
zaPBf8|C-093PlpqTuM$J<hdZ&bGPx)$kXascH}+X!+q3!xrf?_NZ1Os*cs6t5x(bF
ze}Dc}w=t5VAxh{D1Y74w^6TZUc~?Yr9@sj^PIg2TXT^oJm~s#^G&5dGRpZ(GO`U9R
zIPUb>Y&55uo9~{u?FNP7P&XK)P*TI$Q9%dIfn#x0og1kglns~~8V%D7$|7ThB9%1;
z%^~Jq?kb{6VHyV`DJcmGePQbyOY^sW7k|#3$c3$Q1W+f1X`GIJvG0;rXyIW{e0@YM
ziw8p;^DxNMMH#IyZCjCX>$@eTDr}wOh?R;ZzBZWSLWS3X29NJ~#o(LhqUT*K7ZcQF
z?UQ&}UXc*}yT|zczAEX2X<G_g=a_Tj#Sjfu8OP|~JtoE{<XJ)~EP*$u;aFr$iGR!h
z0Du4}009O^A;*j!5`e%sh{G@pVHgHv2rvi^7y*FbKuDkg)PezY^JNUWOs&>n0i?;M
z#j;oy%VGqdoxqvgQD!$QywxRHVY@5)uFS!t7~+P^g(3NjfsstSGrom30n03ligsi%
zyvsgq=nPHcsW{;Mt@5X`k8WZW$bY8?cn6<K0`Pngz_}ebJQ-D`ArT~!Vpgf_Bjv%?
zT2<yDaftGowQ!}P7YH%YljLq$!O;9T8O=dATJll%Klpe!5fX&iU`UXL(!v&fFuppJ
z;081po=9}J2u87h6stI+p9_ej1w$~mqB=(iu4oB)^3b*#?iSIPjKm7?1b;M%72pYI
z2>~;-<<Z{64T;y`W^44iF%kuRlHXc-18o0e_ocv{FLRfc!&nHKBgBw<(9zd8C2Gs)
zphR<g>_{-&*BadbQ=>!W-r}AL02DBR1fq3fHjabdGh{}oKXzmoUu+xJ$^E0Ui>~8*
zjFJltmP^z|%u8da4FDy=WPe9eAabr6Lgb12N!1Uug-~S*9UDWM+xjd1kO_~^a+Gj$
zOQcZLQEK3SwAR6cQtcYv%w@;4Jx0WItz$58NvQG#kVM8AVH+%*@qFh@3=BvN1MhHw
zkRhja>3VXD;0kee6#yl{q9pRLvH&544Lw0JBy`S*(-a6EmQiQS`d7)lK4~rkQ^`vp
z$y|0$8;lr)qJ2as*^|sbmHv?M7{8-efkMqtBt9riQj5?ELv8p08(LSR_V24sLOQq_
chxKoG9gB!)SeETbhkO;hC&{;&1Jw|%?E`A&?f?J)

delta 1534
zcmV<a1p)fw41^4j7=JJ=G%zh^V>dT4F*z~{Np5p=VQyn(Iv_b=V_`LAF*YqTV`gS8
zG-hTvEnzrjFfCy?W;Qf4VP!dGGGq!?Lu_wjYdRo%F*+bOeF_TIx(m9^1&9F7A|rpb
zr2qf`001bpFZ}>e{VM<lJ=RLl7I6XqUp$Ch%_7lYJ_Lj_@PFoXP~+we(A76EO7~<&
z)F3;RA|;uhw(`4WgTRw>I&<{eU?~Ai$VOI~v629dx2JaIm@kx4N&&zCxB$xlBK}6v
zM#}nreKVfmYggvF>;bcooE<&pjzw|&ydgCd;B)Xi4C|Up=iRyONm;Mu=<J|49rmyj
z!~XSg?^aVP(0>n31mId~$~K1OP2OJJrGIzc!y|v*5Q}mxHFpG|K|%#jeQ^A;E^l>*
z3*Bp}NefgxX1+gm^KmUTgDMc1Jsb?P)oZC4!mL!?OgGcvU!7uNP_0u4V%4c@$$Tde
z<JJuBKo=+bJ1@oWHsew%rA$(CR4!bhCq$aiEJ3j}Cx2ttc)M<~yh-n@9Ru2aUZxgY
zOU+j=F>OCDy|bcHe0;hLW1g*=6kJQqvOo9n=`!@rT1v3PA=sEaGtUmrgF1V*+HwF$
z%o|ck2Ck)MDdN~U>lpNk)Rh9)Qd8p>$FU`TvVVQ}JNw3+pEy`&7|(v!cb#wKcU}Xc
zztr*E^M9!-1a3aZxQyRo-mQ)X;JxvQVS5a1E-Z#;*X8(mp!hlb(b=Wn_TxWHa?e&p
z0<NVdiD5jvlVdHR&M*6U`SOA@wvN<F<4PlXstt_|t42g+g<yeLeOP%|b(q;~x1wEl
zEj90~M>ms_NyVg8C=MtXRLX=hZO}2jlujiL*MCw|r<8%iFyl-urCG}bk_A!)lFN(Z
z*9rxd2`ZdQ%ZOz}8m(9>QmHeYBvm9q6rtA9wbTrW=uAh+mpZa8Z9M*WuhEnWw8OPP
z0Ken?`duE+M1Fc0sWuO;rRJ>TPnWxUKDEZdwbU$UUQcC(VtJFR2G>&aT`%TX-lUJ$
zzkkQk8h`!+Ak42Gb5O~I9X_GL4xhR(RGq@`<8YEmeGLBa%Ed{YaV#4wH7X6(wXrob
zRx(mCOfa|?Bo}qm;#z9jPz5?Ii>k-mKV61Bw$iw6MNm)>lUj2vHAVFK{#e#!kBV!l
ziDGG`ab1f2WXaFUg5t{}`TM8d6EBXt_J3uuRg$)9X&lOg&M*+4wU)S+njl=^i<EJZ
zX2ppwLyQ}LBXiMF<cgb*y<$jG_{C2oetkYX_VcpV`IoM<QW}RMaV<4zXt3a^V}T&3
z(OhIq3CsWhfB+}}AqGb&W<?JYfWSD4qc93!7zRWLI0_9I0RX{)ke~>l77Rj$ynhqj
zxeG%W18Ui|uoin^FOJH&6BE><l!R4e54%@5i^3tJ2=RjuhoRy}$qP!PWxP;CvBe2+
zB`AQ$m7GU4)NS@llAFhZ>1-;h#Us!zr2@yPWOP}$xk=>TS*Ej6Q8ZgBx&}sXh}Wym
z;Yvs=Xm6s&oTIkS5VXc28)t^2+kc4qR4r1#;*gdpU~^0s$pG8pm?G)gVGer7p5*j@
zrQSOro_cGm&fR}<@b8h^=iXW!%4reE=RqJFW<g443Z&4$8`^`Azzc12c=p(Su(w)T
z1)jmCmtK6UOn@l~=HxoFllhK1;>sqt`1BcS0HC@As3ba#z^Ko9Pi8avYk$Oa!!S#&
zSx3_!A8xyrpfT#3Y0zk_Hi}Y>p%$pI61F+5M3N&%2xKJAC;$CDD}<6?=$#5uoTZ5Z
zCNj}EgB|fKv=$L@ag>MEJ_76DnN*W1Z`O`MZ9ptwx|K1yxMV1V1(Ltx%(4y7obk4E
zW?=^8GIQ=kv4A0`wS7N1*>`Z!opukPx!_g@c{8|RJi<`!QS!z#<&+X#nnV8V6uo^%
zFL^|hfwSZ-+|LAZ&d5i?Zt>-$tMH_1UnT!Yb{@SussOv@WfJ50JyfLCQd9Bw_N-dh
k2=V>Mm|$S01}gnuSl5<E4okR>OyKJnJxL~`9Mur5?WUI5(f|Me

diff --git a/src/box/lua/space.cc b/src/box/lua/space.cc
index 580e0ea2c..c1a1efb7d 100644
--- a/src/box/lua/space.cc
+++ b/src/box/lua/space.cc
@@ -554,6 +554,8 @@ box_lua_space_init(struct lua_State *L)
 	lua_setfield(L, -2, "VSEQUENCE_ID");
 	lua_pushnumber(L, BOX_SPACE_SEQUENCE_ID);
 	lua_setfield(L, -2, "SPACE_SEQUENCE_ID");
+	lua_pushnumber(L, BOX_PROMOTION_ID);
+	lua_setfield(L, -2, "PROMOTION_ID");
 	lua_pushnumber(L, BOX_SYSTEM_ID_MIN);
 	lua_setfield(L, -2, "SYSTEM_ID_MIN");
 	lua_pushnumber(L, BOX_SYSTEM_ID_MAX);
diff --git a/src/box/lua/upgrade.lua b/src/box/lua/upgrade.lua
index 0293f6ef8..39ab0df7d 100644
--- a/src/box/lua/upgrade.lua
+++ b/src/box/lua/upgrade.lua
@@ -964,6 +964,24 @@ local function upgrade_to_1_10_0()
     create_vsequence_space()
 end
 
+local function upgrade_to_1_10_2()
+    log.info('create space _promotion')
+    local format = {
+        {name = 'id', type = 'unsigned'},
+        {name = 'round_uuid', type = 'string'},
+        {name = 'step', type = 'unsigned'},
+        {name = 'source_uuid', type = 'string'},
+        {name = 'ts', type = 'number'},
+        {name = 'type', type = 'string'},
+        {name = 'value', type = 'map', is_nullable = true}
+    }
+    box.space._space:insert({box.space._promotion.id, ADMIN, '_promotion',
+                             'memtx', 0, setmap({}), format})
+    log.info('create index primary on _promotion')
+    box.space._index:insert({box.space._promotion.id, 0, 'primary', 'tree',
+                             {unique = true}, {{0, 'unsigned'}, {1, 'string'},
+                             {2, 'unsigned'}, {3, 'string'}}})
+end
 
 local function get_version()
     local version = box.space._schema:get{'version'}
@@ -991,6 +1009,7 @@ local function upgrade(options)
         {version = mkversion(1, 7, 6), func = upgrade_to_1_7_6, auto = false},
         {version = mkversion(1, 7, 7), func = upgrade_to_1_7_7, auto = true},
         {version = mkversion(1, 10, 0), func = upgrade_to_1_10_0, auto = true},
+        {version = mkversion(1, 10, 2), func = upgrade_to_1_10_2, auto = true},
     }
 
     for _, handler in ipairs(handlers) do
diff --git a/src/box/schema.cc b/src/box/schema.cc
index 433f52c08..adacb2569 100644
--- a/src/box/schema.cc
+++ b/src/box/schema.cc
@@ -338,8 +338,23 @@ schema_init()
 	 */
 	sc_space_new(BOX_CLUSTER_ID, "_cluster", key_def, &on_replace_cluster,
 		     NULL);
+	key_def_delete(key_def);
 
+	key_def = key_def_new(4);
+	if (key_def == NULL)
+		diag_raise();
+	key_def_set_part(key_def, 0, 0, FIELD_TYPE_UNSIGNED, false, NULL,
+			 COLL_NONE);
+	key_def_set_part(key_def, 1, 1, FIELD_TYPE_STRING, false, NULL,
+			 COLL_NONE);
+	key_def_set_part(key_def, 2, 2, FIELD_TYPE_UNSIGNED, false, NULL,
+			 COLL_NONE);
+	key_def_set_part(key_def, 3, 3, FIELD_TYPE_STRING, false, NULL,
+			 COLL_NONE);
+	sc_space_new(BOX_PROMOTION_ID, "_promotion", key_def,
+		     &alter_space_on_replace_promotion, NULL);
 	key_def_delete(key_def);
+
 	key_def = key_def_new(2); /* part count */
 	if (key_def == NULL)
 		diag_raise();
diff --git a/src/box/schema_def.h b/src/box/schema_def.h
index 2edb8d37f..079afd45a 100644
--- a/src/box/schema_def.h
+++ b/src/box/schema_def.h
@@ -102,6 +102,8 @@ enum {
 	BOX_TRUNCATE_ID = 330,
 	/** Space id of _space_sequence. */
 	BOX_SPACE_SEQUENCE_ID = 340,
+	/** Space id of _promotion. */
+	BOX_PROMOTION_ID = 348,
 	/** End of the reserved range of system spaces. */
 	BOX_SYSTEM_ID_MAX = 511,
 	BOX_ID_NIL = 2147483647
@@ -212,6 +214,18 @@ enum {
 	BOX_SPACE_SEQUENCE_FIELD_IS_GENERATED = 2,
 };
 
+/** _promotion fields. */
+enum {
+	BOX_PROMOTION_FIELD_ID = 0,
+	BOX_PROMOTION_FIELD_ROUND_UUID = 1,
+	BOX_PROMOTION_FIELD_PHASE = 2,
+	BOX_PROMOTION_FIELD_SOURCE_UUID = 3,
+	BOX_PROMOTION_FIELD_STEP = 4,
+	BOX_PROMOTION_FIELD_TS = 5,
+	BOX_PROMOTION_FIELD_TYPE = 6,
+	BOX_PROMOTION_FIELD_VALUE = 7,
+};
+
 /*
  * Different objects which can be subject to access
  * control.
diff --git a/test/app-tap/tarantoolctl.test.lua b/test/app-tap/tarantoolctl.test.lua
index 6946c8312..599519543 100755
--- a/test/app-tap/tarantoolctl.test.lua
+++ b/test/app-tap/tarantoolctl.test.lua
@@ -338,8 +338,8 @@ do
             check_ctlcat_xlog(test_i, dir, "--from=3 --to=6 --format=json --show-system --replica 1", "\n", 3)
             check_ctlcat_xlog(test_i, dir, "--from=3 --to=6 --format=json --show-system --replica 1 --replica 2", "\n", 3)
             check_ctlcat_xlog(test_i, dir, "--from=3 --to=6 --format=json --show-system --replica 2", "\n", 0)
-            check_ctlcat_snap(test_i, dir, "--space=280", "---\n", 18)
-            check_ctlcat_snap(test_i, dir, "--space=288", "---\n", 43)
+            check_ctlcat_snap(test_i, dir, "--space=280", "---\n", 19)
+            check_ctlcat_snap(test_i, dir, "--space=288", "---\n", 44)
         end)
     end)
 
diff --git a/test/box-py/bootstrap.result b/test/box-py/bootstrap.result
index 16c2027cf..a78c23945 100644
--- a/test/box-py/bootstrap.result
+++ b/test/box-py/bootstrap.result
@@ -5,7 +5,7 @@ box.space._schema:select{}
 ---
 - - ['cluster', '<cluster uuid>']
   - ['max_id', 511]
-  - ['version', 1, 10, 0]
+  - ['version', 1, 10, 2]
 ...
 box.space._cluster:select{}
 ---
@@ -68,6 +68,10 @@ box.space._space:select{}
         'type': 'unsigned'}]]
   - [340, 1, '_space_sequence', 'memtx', 0, {}, [{'name': 'id', 'type': 'unsigned'},
       {'name': 'sequence_id', 'type': 'unsigned'}, {'name': 'is_generated', 'type': 'boolean'}]]
+  - [348, 1, '_promotion', 'memtx', 0, {}, [{'name': 'id', 'type': 'unsigned'}, {
+        'name': 'round_uuid', 'type': 'string'}, {'name': 'step', 'type': 'unsigned'},
+      {'name': 'source_uuid', 'type': 'string'}, {'name': 'ts', 'type': 'number'},
+      {'name': 'type', 'type': 'string'}, {'type': 'map', 'name': 'value', 'is_nullable': true}]]
 ...
 box.space._index:select{}
 ---
@@ -116,6 +120,8 @@ box.space._index:select{}
   - [330, 0, 'primary', 'tree', {'unique': true}, [[0, 'unsigned']]]
   - [340, 0, 'primary', 'tree', {'unique': true}, [[0, 'unsigned']]]
   - [340, 1, 'sequence', 'tree', {'unique': false}, [[1, 'unsigned']]]
+  - [348, 0, 'primary', 'tree', {'unique': true}, [[0, 'unsigned'], [1, 'string'],
+      [2, 'unsigned'], [3, 'string']]]
 ...
 box.space._user:select{}
 ---
diff --git a/test/box/access_misc.result b/test/box/access_misc.result
index 2d87fa2d5..40b8a8118 100644
--- a/test/box/access_misc.result
+++ b/test/box/access_misc.result
@@ -807,6 +807,10 @@ box.space._space:select()
         'type': 'unsigned'}]]
   - [340, 1, '_space_sequence', 'memtx', 0, {}, [{'name': 'id', 'type': 'unsigned'},
       {'name': 'sequence_id', 'type': 'unsigned'}, {'name': 'is_generated', 'type': 'boolean'}]]
+  - [348, 1, '_promotion', 'memtx', 0, {}, [{'name': 'id', 'type': 'unsigned'}, {
+        'name': 'round_uuid', 'type': 'string'}, {'name': 'step', 'type': 'unsigned'},
+      {'name': 'source_uuid', 'type': 'string'}, {'name': 'ts', 'type': 'number'},
+      {'name': 'type', 'type': 'string'}, {'type': 'map', 'name': 'value', 'is_nullable': true}]]
 ...
 box.space._func:select()
 ---
diff --git a/test/box/access_sysview.result b/test/box/access_sysview.result
index 20efd2bbc..8a0079407 100644
--- a/test/box/access_sysview.result
+++ b/test/box/access_sysview.result
@@ -230,11 +230,11 @@ box.session.su('guest')
 ...
 #box.space._vspace:select{}
 ---
-- 19
+- 20
 ...
 #box.space._vindex:select{}
 ---
-- 44
+- 45
 ...
 #box.space._vuser:select{}
 ---
@@ -262,7 +262,7 @@ box.session.su('guest')
 ...
 #box.space._vindex:select{}
 ---
-- 44
+- 45
 ...
 #box.space._vuser:select{}
 ---
diff --git a/test/box/alter.result b/test/box/alter.result
index eb7014d8b..72f451938 100644
--- a/test/box/alter.result
+++ b/test/box/alter.result
@@ -107,7 +107,7 @@ space = box.space[t[1]]
 ...
 space.id
 ---
-- 341
+- 349
 ...
 space.field_count
 ---
@@ -152,7 +152,7 @@ space_deleted
 ...
 space:replace{0}
 ---
-- error: Space '341' does not exist
+- error: Space '349' does not exist
 ...
 _index:insert{_space.id, 0, 'primary', 'tree', 1, 1, 0, 'unsigned'}
 ---
@@ -226,6 +226,8 @@ _index:select{}
   - [330, 0, 'primary', 'tree', {'unique': true}, [[0, 'unsigned']]]
   - [340, 0, 'primary', 'tree', {'unique': true}, [[0, 'unsigned']]]
   - [340, 1, 'sequence', 'tree', {'unique': false}, [[1, 'unsigned']]]
+  - [348, 0, 'primary', 'tree', {'unique': true}, [[0, 'unsigned'], [1, 'string'],
+      [2, 'unsigned'], [3, 'string']]]
 ...
 -- modify indexes of a system space
 _index:delete{_index.id, 0}
diff --git a/test/wal_off/alter.result b/test/wal_off/alter.result
index afac1e55d..76fae0511 100644
--- a/test/wal_off/alter.result
+++ b/test/wal_off/alter.result
@@ -28,7 +28,7 @@ end;
 ...
 #spaces;
 ---
-- 65515
+- 65514
 ...
 -- cleanup
 for k, v in pairs(spaces) do
diff --git a/test/xlog/upgrade.result b/test/xlog/upgrade.result
index f02996bba..f9409c7dc 100644
--- a/test/xlog/upgrade.result
+++ b/test/xlog/upgrade.result
@@ -36,7 +36,7 @@ box.space._schema:select()
 ---
 - - ['cluster', '<server_uuid>']
   - ['max_id', 513]
-  - ['version', 1, 10, 0]
+  - ['version', 1, 10, 2]
 ...
 box.space._space:select()
 ---
@@ -95,6 +95,10 @@ box.space._space:select()
         'type': 'unsigned'}]]
   - [340, 1, '_space_sequence', 'memtx', 0, {}, [{'name': 'id', 'type': 'unsigned'},
       {'name': 'sequence_id', 'type': 'unsigned'}, {'name': 'is_generated', 'type': 'boolean'}]]
+  - [348, 1, '_promotion', 'memtx', 0, {}, [{'name': 'id', 'type': 'unsigned'}, {
+        'name': 'round_uuid', 'type': 'string'}, {'name': 'step', 'type': 'unsigned'},
+      {'name': 'source_uuid', 'type': 'string'}, {'name': 'ts', 'type': 'number'},
+      {'name': 'type', 'type': 'string'}, {'type': 'map', 'name': 'value', 'is_nullable': true}]]
   - [512, 1, 'distro', 'memtx', 0, {}, [{'name': 'os', 'type': 'str'}, {'name': 'dist',
         'type': 'str'}, {'name': 'version', 'type': 'num'}, {'name': 'time', 'type': 'num'}]]
   - [513, 1, 'temporary', 'memtx', 0, {'temporary': true}, []]
@@ -146,6 +150,8 @@ box.space._index:select()
   - [330, 0, 'primary', 'tree', {'unique': true}, [[0, 'unsigned']]]
   - [340, 0, 'primary', 'tree', {'unique': true}, [[0, 'unsigned']]]
   - [340, 1, 'sequence', 'tree', {'unique': false}, [[1, 'unsigned']]]
+  - [348, 0, 'primary', 'tree', {'unique': true}, [[0, 'unsigned'], [1, 'string'],
+      [2, 'unsigned'], [3, 'string']]]
   - [512, 0, 'primary', 'hash', {'unique': true}, [[0, 'string'], [1, 'string'], [
         2, 'unsigned']]]
   - [512, 1, 'codename', 'hash', {'unique': true}, [[1, 'string']]]
-- 
2.15.2 (Apple Git-101.1)


* [PATCH 7/8] box: introduce box.ctl.promote
  2018-08-07 22:03 [PATCH 0/8] box.ctl.promote Vladislav Shpilevoy
                   ` (5 preceding siblings ...)
  2018-08-07 22:03 ` [PATCH 6/8] box: introduce _promotion space Vladislav Shpilevoy
@ 2018-08-07 22:03 ` Vladislav Shpilevoy
  2018-08-13  8:58   ` Vladimir Davydov
  2018-08-07 22:03 ` [PATCH 8/8] box: introduce promotion GC Vladislav Shpilevoy
  7 siblings, 1 reply; 13+ messages in thread
From: Vladislav Shpilevoy @ 2018-08-07 22:03 UTC (permalink / raw)
  To: tarantool-patches; +Cc: vdavydov.dev

Replicaset master promotion is a procedure that atomically makes
one slave the new master and the old master a slave in a full-mesh
master-slave replicaset.

The promotion follows the protocol described in detail in the
corresponding RFC. In short, the protocol collects a quorum of
instances that approve the promotion, syncs data with the old
master and demotes it.

The protocol is intended to work with a single-master cluster and
requires a quorum of at least 50% + 1 instances, which must
include the old master. It tolerates reordering of messages from
different sources and errors such as multiple masters, timeouts
and restarts of any promotion participant. The protocol also
supports promotion in a completely read-only cluster. This is
useful, for example, when a rare promotion failure leaves the
cluster in a read-only state with no masters. The promotion can
then simply be called again to fix it. Such a read-only promotion
has only one restriction: all of the instances have to be alive
and healthy.
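
To make the quorum rule concrete, here is a minimal C sketch. It
is not code from this patch: promote_min_quorum() and
promote_quorum_is_valid() are hypothetical helpers, and reading
"50% + 1" as replica_count / 2 + 1 with integer division is an
assumption.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: the smallest quorum that is still a strict
 * majority, assuming "50% + 1" means replica_count / 2 + 1 with
 * integer division.
 */
static int
promote_min_quorum(int replica_count)
{
	return replica_count / 2 + 1;
}

/*
 * Hypothetical helper: a requested quorum is acceptable if it is
 * a strict majority and does not exceed the replicaset size.
 */
static bool
promote_quorum_is_valid(int quorum, int replica_count)
{
	return quorum >= promote_min_quorum(replica_count) &&
	       quorum <= replica_count;
}
```

With this reading a replicaset of 4 instances needs at least 3
approvals, so two halves of a split cluster can never both
collect a quorum.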

Once a promotion is executed, the box.cfg.read_only attribute
becomes immutable. This is because the promotion protocol persists
this attribute as a part of one of its messages and sends it to
the other instances. So a user can not both use the promotion and
change box.cfg.read_only manually.
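
The rule can be illustrated with a simplified C model of the
check that this patch adds to lbox_cfg_set_read_only(). The
struct ro_state and ro_state_set() below are hypothetical
stand-ins, not the real box state. As in the patch, re-applying
the current value is still allowed, only an actual change is
rejected.

```c
#include <stdbool.h>

/* Simplified stand-in for the instance state. */
struct ro_state {
	/* Current read-only flag. */
	bool is_ro;
	/* True once a promotion has been used on the instance. */
	bool promotion_used;
};

/*
 * Try to set the read-only flag manually. Returns 0 and applies
 * the value, or returns -1 and leaves the state untouched when a
 * promotion owns the flag and the value would actually change.
 */
static int
ro_state_set(struct ro_state *state, bool new_value)
{
	if (state->promotion_used && new_value != state->is_ro)
		return -1;
	state->is_ro = new_value;
	return 0;
}
```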

The promotion has several API methods:

* box.ctl.promote({timeout = ..., quorum = ...}).
  This function is meant to be called on a slave to demote the
  old master, if one exists, and promote the current instance.

* box.ctl.promote_info().
  This function shows info about the latest promotion, whether
  it is finished or still running.

* box.ctl.promote_reset().
  This function clears the promotion history so that a user can
  re-assign master/slave roles in a cluster manually.

Closes #3055

@TarantoolBot document
Title: Document box.ctl.promote()
Subj. For details of the patch see the commit message. For
details of the protocol and API see the RFC:
doc/rfc/3055-box_ctl_promote.md
---
 src/box/CMakeLists.txt       |    1 +
 src/box/alter.cc             |   61 ++-
 src/box/box.cc               |   17 +
 src/box/box.h                |   27 ++
 src/box/errcode.h            |    3 +
 src/box/lua/cfg.cc           |    9 +-
 src/box/lua/ctl.c            |   82 ++++
 src/box/promote.c            | 1075 ++++++++++++++++++++++++++++++++++++++++++
 src/box/promote.h            |  170 +++++++
 src/cfg.c                    |   11 +
 src/cfg.h                    |    3 +
 src/main.cc                  |    1 +
 test/box/misc.result         |    9 +-
 test/promote/basic.result    |  472 +++++++++++++++++++
 test/promote/basic.test.lua  |  160 +++++++
 test/promote/box.lua         |    8 +
 test/promote/box1.lua        |  112 +++++
 test/promote/box2.lua        |    1 +
 test/promote/box3.lua        |    1 +
 test/promote/box4.lua        |    1 +
 test/promote/errinj.result   |  222 +++++++++
 test/promote/errinj.test.lua |   87 ++++
 test/promote/suite.ini       |    6 +
 23 files changed, 2530 insertions(+), 9 deletions(-)
 create mode 100644 src/box/promote.c
 create mode 100644 src/box/promote.h
 create mode 100644 test/promote/basic.result
 create mode 100644 test/promote/basic.test.lua
 create mode 100644 test/promote/box.lua
 create mode 100644 test/promote/box1.lua
 create mode 120000 test/promote/box2.lua
 create mode 120000 test/promote/box3.lua
 create mode 120000 test/promote/box4.lua
 create mode 100644 test/promote/errinj.result
 create mode 100644 test/promote/errinj.test.lua
 create mode 100644 test/promote/suite.ini

diff --git a/src/box/CMakeLists.txt b/src/box/CMakeLists.txt
index ad544270b..1a1e7025c 100644
--- a/src/box/CMakeLists.txt
+++ b/src/box/CMakeLists.txt
@@ -112,6 +112,7 @@ add_library(box STATIC
     journal.c
     wal.c
     call.c
+    promote.c
     ${lua_sources}
     lua/init.c
     lua/call.c
diff --git a/src/box/alter.cc b/src/box/alter.cc
index 7a7325038..6df31e75a 100644
--- a/src/box/alter.cc
+++ b/src/box/alter.cc
@@ -52,6 +52,7 @@
 #include "identifier.h"
 #include "version.h"
 #include "sequence.h"
+#include "promote.h"
 
 /**
  * chap-sha1 of empty string, i.e.
@@ -2924,11 +2925,69 @@ on_replace_dd_cluster(struct trigger *trigger, void *event)
 	txn_on_commit(txn, on_commit);
 }
 
+/**
+ * Process promotion messages on commit only. Prepared but not
+ * committed messages can not be processed since they could be
+ * rolled back, but promotion requires that each processed message
+ * is persisted and can be recovered on restart.
+ */
 static void
-on_replace_dd_promotion(struct trigger *trigger, void *event)
+on_commit_process_promote_msg(struct trigger *trigger, void *event)
+{
+	(void) event;
+	promote_process((struct promote_msg *) trigger->data);
+}
+
+/**
+ * Check whether the promotion space is empty and, if so, reset
+ * the state. Manual reset here is used by replicas when
+ * box.ctl.promote_reset() is called on one of them. On the
+ * source replica the promotion state is dropped explicitly, but
+ * on the other replicas it has to be done under the hood. This
+ * is the only possible place to do it.
+ */
+static void
+on_commit_check_promotion_reset(struct trigger *trigger, void *event)
 {
 	(void) trigger;
 	(void) event;
+	if (index_count(space_index(space_by_id(BOX_PROMOTION_ID), 0), ITER_ALL,
+			NULL, 0) == 0)
+		box_ctl_promote_reset();
+}
+
+static void
+on_replace_dd_promotion(struct trigger *trigger, void *event)
+{
+	struct txn *txn = (struct txn *) event;
+	struct txn_stmt *stmt = txn_current_stmt(txn);
+	if (stmt->new_tuple == NULL && stmt->old_tuple != NULL) {
+		trigger = txn_alter_trigger_new(on_commit_check_promotion_reset,
+						NULL);
+		txn_on_commit(txn, trigger);
+		return;
+	}
+	assert(stmt->new_tuple != NULL);
+	if (stmt->old_tuple != NULL) {
+		tnt_raise(ClientError, ER_UNSUPPORTED, "Promotion",
+			  "history edit");
+	}
+	/*
+	 * Forbid multistatement only for non-DELETE since the
+	 * latter is used for promotion reset in batches - the
+	 * whole round is dropped in one transaction.
+	 */
+	txn_check_singlestatement_xc(txn, "Space _promotion");
+	struct promote_msg *msg =
+		region_alloc_object_xc(&fiber()->gc, struct promote_msg);
+	/*
+	 * Decode the message before the commit to do message's
+	 * sanity check.
+	 */
+	if (promote_msg_decode(tuple_data(stmt->new_tuple), msg) != 0)
+		diag_raise();
+	trigger = txn_alter_trigger_new(on_commit_process_promote_msg, msg);
+	txn_on_commit(txn, trigger);
 }
 
 /* }}} cluster configuration */
diff --git a/src/box/box.cc b/src/box/box.cc
index d8fbc6252..8bbd0d424 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -73,6 +73,7 @@
 #include "call.h"
 #include "func.h"
 #include "sequence.h"
+#include "promote.h"
 
 static char status[64] = "unknown";
 
@@ -216,6 +217,12 @@ box_set_ro(bool ro)
 	fiber_cond_broadcast(&ro_cond);
 }
 
+void
+box_expose_ro()
+{
+	cfg_rawsetb("read_only", is_ro);
+}
+
 bool
 box_is_writable(void)
 {
@@ -970,6 +977,15 @@ box_index_id_by_name(uint32_t space_id, const char *name, uint32_t len)
 }
 /** \endcond public */
 
+int
+box_process_sys_dml(struct request *request)
+{
+	struct space *space = space_cache_find(request->space_id);
+	assert(space != NULL);
+	assert(space_is_system(space));
+	return process_dml(request, space, NULL);
+}
+
 int
 box_process_dml(struct request *request, box_tuple_t **result)
 {
@@ -1981,6 +1997,7 @@ box_cfg_xc(void)
 	port_init();
 	iproto_init();
 	wal_thread_start();
+	box_ctl_promote_init();
 
 	title("loading");
 
diff --git a/src/box/box.h b/src/box/box.h
index 29618c9f8..526e73608 100644
--- a/src/box/box.h
+++ b/src/box/box.h
@@ -86,6 +86,25 @@ box_atfork(void);
 void
 box_set_ro(bool ro);
 
+/**
+ * Expose the current read-only flag to the Lua config as
+ * box.cfg.read_only. Used when the value is changed internally,
+ * for example, by box.ctl.promote.
+ */
+void
+box_expose_ro(void);
+
+/**
+ * Check that the read-only value can be changed via box.cfg. It
+ * becomes immutable once a promotion is used, so a user should
+ * either manipulate the flag manually or rely on
+ * box.ctl.promote.
+ * @retval 0 Success.
+ * @retval -1 Error. Diag is set.
+ */
+int
+box_check_ro_is_mutable(void);
+
 bool
 box_is_writable(void);
 
@@ -405,6 +424,14 @@ box_sequence_reset(uint32_t seq_id);
 int
 box_process_dml(struct request *request, box_tuple_t **result);
 
+/**
+ * Process a DML operation on a system space without any RO
+ * checks. For internal use only. @sa box_process_dml for the
+ * parameter and the returned value.
+ */
+int
+box_process_sys_dml(struct request *request);
+
 int
 boxk(int type, uint32_t space_id, const char *format, ...);
 
diff --git a/src/box/errcode.h b/src/box/errcode.h
index 3d5f66af8..4c56ad645 100644
--- a/src/box/errcode.h
+++ b/src/box/errcode.h
@@ -208,6 +208,9 @@ struct errcode_record {
 	/*153 */_(ER_NULLABLE_MISMATCH,		"Field %d is %s in space format, but %s in index parts") \
 	/*154 */_(ER_TRANSACTION_YIELD,		"Transaction has been aborted by a fiber yield") \
 	/*155 */_(ER_NO_SUCH_GROUP,		"Replication group '%s' does not exist") \
+	/*156 */_(ER_PROMOTE,			"Error during promotion with round UUID '%s': %s") \
+	/*157 */_(ER_WRONG_PROMOTION_RECORD,	"Wrong record in _promotion (field %u): %s") \
+	/*158 */_(ER_PROMOTE_EXISTS,		"Promotion is in progress") \
 
 /*
  * !IMPORTANT! Please follow instructions at start of the file
diff --git a/src/box/lua/cfg.cc b/src/box/lua/cfg.cc
index 0f6b8a5a3..877db0254 100644
--- a/src/box/lua/cfg.cc
+++ b/src/box/lua/cfg.cc
@@ -167,11 +167,10 @@ lbox_cfg_set_checkpoint_count(struct lua_State *L)
 static int
 lbox_cfg_set_read_only(struct lua_State *L)
 {
-	try {
-		box_set_ro(cfg_geti("read_only") != 0);
-	} catch (Exception *) {
-		luaT_error(L);
-	}
+	bool new_value = cfg_geti("read_only") != 0;
+	if (box_check_ro_is_mutable() != 0 && new_value != box_is_ro())
+		return luaT_error(L);
+	box_set_ro(new_value);
 	return 0;
 }
 
diff --git a/src/box/lua/ctl.c b/src/box/lua/ctl.c
index 9a105ed5c..08c2354bb 100644
--- a/src/box/lua/ctl.c
+++ b/src/box/lua/ctl.c
@@ -29,6 +29,7 @@
  * SUCH DAMAGE.
  */
 #include "box/lua/ctl.h"
+#include "box/lua/info.h"
 
 #include <tarantool_ev.h>
 
@@ -38,7 +39,10 @@
 
 #include "lua/utils.h"
 
+#include "box/info.h"
 #include "box/box.h"
+#include "box/promote.h"
+#include "box/error.h"
 
 static int
 lbox_ctl_wait_ro(struct lua_State *L)
@@ -64,9 +68,87 @@ lbox_ctl_wait_rw(struct lua_State *L)
 	return 0;
 }
 
+/**
+ * Lua binding for box_ctl_promote. Takes non-mandatory options:
+ * timeout and quorum.
+ * @param L Lua stack.
+ * @retval Number of values pushed onto the stack. 2 means nil and
+ *         error object. 1 means ok and the value is true.
+ */
+static int
+lbox_ctl_promote(struct lua_State *L)
+{
+	int quorum = -1;
+	double timeout = TIMEOUT_INFINITY;
+	int top = lua_gettop(L);
+	if (top > 1) {
+usage_error:
+		return luaL_error(L, "Usage: box.ctl.promote([{timeout = "\
+				  "<double>, quorum = <unsigned>}])");
+	} else if (top == 1) {
+		lua_getfield(L, 1, "quorum");
+		int ok;
+		if (! lua_isnil(L, -1)) {
+			quorum = lua_tointegerx(L, -1, &ok);
+			if (ok == 0)
+				goto usage_error;
+		}
+		lua_getfield(L, 1, "timeout");
+		if (! lua_isnil(L, -1)) {
+			timeout = lua_tonumberx(L, -1, &ok);
+			if (ok == 0)
+				goto usage_error;
+		}
+	}
+	if (box_ctl_promote(timeout, quorum) != 0) {
+		lua_pushnil(L);
+		luaT_pusherror(L, box_error_last());
+		return 2;
+	} else {
+		lua_pushboolean(L, true);
+		return 1;
+	}
+}
+
+/**
+ * Lua binding for box_ctl_promote_reset. Has no arguments.
+ * @param L Lua stack.
+ * @retval Number of values pushed onto the stack. 2 means nil and
+ *         error object. 1 means ok and the value is true.
+ */
+static int
+lbox_ctl_promote_reset(struct lua_State *L)
+{
+	if (box_ctl_promote_reset() != 0) {
+		lua_pushnil(L);
+		luaT_pusherror(L, box_error_last());
+		return 2;
+	}
+	lua_pushboolean(L, true);
+	return 1;
+}
+
+/**
+ * Lua binding for box_ctl_promote_info. Has no arguments.
+ * @param L Lua stack.
+ * @retval Number of values pushed onto the stack. Always 1 -
+ *         a Lua table with info parameters.
+ */
+static int
+lbox_ctl_promote_info(struct lua_State *L)
+{
+	struct info_handler info;
+	luaT_info_handler_create(&info, L);
+	box_ctl_promote_info(&info);
+	return 1;
+}
+
 static const struct luaL_Reg lbox_ctl_lib[] = {
 	{"wait_ro", lbox_ctl_wait_ro},
 	{"wait_rw", lbox_ctl_wait_rw},
+	{"promote", lbox_ctl_promote},
+	{"promote_reset", lbox_ctl_promote_reset},
+	{"promote_info", lbox_ctl_promote_info},
 	{NULL, NULL}
 };
 
diff --git a/src/box/promote.c b/src/box/promote.c
new file mode 100644
index 000000000..dcc39b5bd
--- /dev/null
+++ b/src/box/promote.c
@@ -0,0 +1,1075 @@
+/*
+ * Copyright 2010-2018, Tarantool AUTHORS, please see AUTHORS file.
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above
+ *    copyright notice, this list of conditions and the
+ *    following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above
+ *    copyright notice, this list of conditions and the following
+ *    disclaimer in the documentation and/or other materials
+ *    provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY <COPYRIGHT HOLDER> ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+ * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * <COPYRIGHT HOLDER> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
+ * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#include "box.h"
+#include "replication.h"
+#include "promote.h"
+#include "error.h"
+#include "msgpuck.h"
+#include "xrow.h"
+#include "space.h"
+#include "schema.h"
+#include "schema_def.h"
+#include "txn.h"
+#include "tuple.h"
+#include "iproto_constants.h"
+#include "opt_def.h"
+#include "info.h"
+
+static const char *promote_msg_type_strs[] = {
+	"begin",
+	"status",
+	"sync",
+	"success",
+	"error",
+};
+
+/** True, if @a msg is created by the current instance. */
+static inline bool
+promote_msg_is_mine(const struct promote_msg *msg)
+{
+	return tt_uuid_is_equal(&msg->source_uuid, &INSTANCE_UUID);
+}
+
+enum promote_role {
+	PROMOTE_ROLE_UNDEFINED = 0,
+	PROMOTE_ROLE_INITIATOR,
+	PROMOTE_ROLE_OLD_MASTER,
+	PROMOTE_ROLE_WATCHER
+};
+
+static const char *promote_role_strs[] = {
+	"undefined",
+	"initiator",
+	"old master",
+	"watcher",
+};
+
+enum promote_phase {
+	PROMOTE_PHASE_NON_ACTIVE = 0,
+	PROMOTE_PHASE_ERROR,
+	PROMOTE_PHASE_SUCCESS,
+	PROMOTE_PHASE_IN_PROGRESS,
+};
+
+static const char *promote_phase_strs[] = {
+	"non-active",
+	"error",
+	"success",
+	"in progress",
+};
+
+/**
+ * The current promotion state. If the promotion is finished, then
+ * the latest one is stored here as a cache for
+ * box.ctl.promote_info().
+ */
+static struct promote_state {
+	/**
+	 * Each round has a unique identifier of two parts: ID
+	 * and UUID. ID is used to order rounds by the time of
+	 * their start. Each new round has an ID greater than all
+	 * the known previous ones. Timestamps can not be used
+	 * since clocks are not perfectly synced over network.
+	 */
+	int round_id;
+	/**
+	 * UUID is generated by a promotion initiator and
+	 * protects from an error when promotions are started on
+	 * different nodes at the same time with the same round
+	 * IDs. The UUIDs of such rounds differ because their
+	 * initiators are different.
+	 */
+	struct tt_uuid round_uuid;
+	/** UUID of an old master if known, nil UUID otherwise. */
+	struct tt_uuid old_master_uuid;
+	/** UUID of an initiator if known, nil UUID otherwise. */
+	struct tt_uuid initiator_uuid;
+	/** Diagnostics storing the current round error. */
+	struct diag diag;
+	/**
+	 * Condition emitted each time the promotion state is
+	 * changed.
+	 */
+	struct fiber_cond on_change;
+	/**
+	 * Role of the current instance in the current promotion
+	 * round.
+	 */
+	enum promote_role role;
+	/**
+	 * Current round promotion phase. If the round is
+	 * finished, the result (error/success) is stored here as
+	 * well.
+	 */
+	enum promote_phase phase;
+	/**
+	 * Description of the latest thing done during the current
+	 * promotion round. It is not persisted anywhere and
+	 * exists merely to improve user experience. It is shown
+	 * in box.ctl.promote_info().
+	 */
+	char comment[DIAG_ERRMSG_MAX + 1];
+	/**
+	 * The current promotion round quorum. Becomes valid only
+	 * when an initiator becomes known. Quorum is the number
+	 * of replicas that should approve the promotion and sync
+	 * with the old master before its demotion. The quorum
+	 * includes the old master and the initiator.
+	 */
+	int quorum;
+	/**
+	 * Number of already collected syncs with the old master.
+	 * Valid on the old master and on the initiator if it acts
+	 * on behalf of the latter.
+	 */
+	int sync_count;
+	/**
+	 * The current promotion round timeout. Once it is
+	 * exceeded, the round is terminated and that fact is
+	 * persisted. Becomes valid only when an initiator becomes
+	 * known.
+	 */
+	double timeout;
+	/** The promotion timer fiber. */
+	struct fiber *timer;
+	/**
+	 * Number of watchers participating in the current
+	 * promotion round. If this value + the initiator equals
+	 * the cluster size, then the cluster is read-only. In
+	 * such a case the promotion is allowed even though an old
+	 * master does not exist. The initiator acts on behalf of
+	 * the latter then.
+	 */
+	int watcher_count;
+	/**
+	 * The current promotion step. It is a constantly growing
+	 * number for each promotion participant and is used to
+	 * persist order of sent messages. Each instance arranges
+	 * its messages with step numbers. Also steps are used to
+	 * persist relative order of messages from different
+	 * sources.
+	 */
+	int step;
+	/**
+	 * True if this instance has at least once committed its
+	 * status and set its role. A status message contains the
+	 * is_master flag, which is the persisted read_only cfg
+	 * option. So other instances are now aware of this status
+	 * and it can not be changed by a user via box.cfg.
+	 */
+	bool is_role_committed;
+} promote_state;
+
+/**
+ * Getters for different attributes and properties of the
+ * promotion state.
+ */
+
+static inline bool
+promote_is_active(void)
+{
+	return promote_state.phase == PROMOTE_PHASE_IN_PROGRESS;
+}
+
+static inline bool
+promote_is_master_known(void)
+{
+	return !tt_uuid_is_equal(&promote_state.old_master_uuid, &uuid_nil);
+}
+
+static inline bool
+promote_is_initiator_known(void)
+{
+	return !tt_uuid_is_equal(&promote_state.initiator_uuid, &uuid_nil);
+}
+
+static inline bool
+promote_is_finished(void)
+{
+	return !promote_is_active() && promote_state.timer == NULL;
+}
+
+static inline bool
+promote_is_cluster_readonly(void)
+{
+	return promote_state.watcher_count + 1 == replicaset.applier.total;
+}
+
+static inline bool
+promote_is_this_round_msg(const struct promote_msg *msg)
+{
+	return promote_is_active() &&
+	       tt_uuid_is_equal(&msg->round_uuid, &promote_state.round_uuid);
+}
+
+/**
+ * Comment a promotion event. The comment text is available to be
+ * seen from box.ctl.promote_info(), and is logged.
+ */
+#define promote_comment(...) do { \
+	snprintf(promote_state.comment, sizeof(promote_state.comment), \
+		 __VA_ARGS__); \
+	say_info("%s", promote_state.comment); \
+} while(0)
+
+/**
+ * Serialize the promotion message into a string.
+ * @param msg Message to serialize.
+ * @retval String with the serialized message.
+ */
+static inline const char *
+promote_msg_str(const struct promote_msg *msg)
+{
+	int offset = 0;
+	char *buf = tt_static_buf();
+	int len = TT_STATIC_BUF_LEN;
+
+	offset += snprintf(buf, len, "{id: %d, round: '", msg->round_id);
+	tt_uuid_to_string(&msg->round_uuid, buf + offset);
+	offset += UUID_STR_LEN;
+	offset += snprintf(buf + offset, len - offset, "', step: %d, source: '",
+			   msg->step);
+	tt_uuid_to_string(&msg->source_uuid, buf + offset);
+	offset += UUID_STR_LEN;
+	offset += snprintf(buf + offset, len - offset, "', ts: %f, type: '%s'",
+			   msg->ts, promote_msg_type_strs[msg->type]);
+	switch (msg->type) {
+	case PROMOTE_MSG_BEGIN:
+		offset += snprintf(buf + offset, len - offset, ", quorum: %d, "\
+				   "timeout: %f}", msg->begin.quorum,
+				   msg->begin.timeout);
+		break;
+	case PROMOTE_MSG_STATUS:
+		offset += snprintf(buf + offset, len - offset, ", is_master: "\
+				   "%d}", (int) msg->status.is_master);
+		break;
+	case PROMOTE_MSG_ERROR:
+		offset += snprintf(buf + offset, len - offset, ", code: %d, "\
+				   "message: '%s'}", msg->error.code,
+				   msg->error.message);
+		break;
+	default:
+		offset += snprintf(buf + offset, len - offset, "}");
+		break;
+	}
+	return buf;
+}
+
+/**
+ * Encode the promotion message into MessagePack tuple ready to
+ * be inserted into _promotion space.
+ * @param msg Promotion message to encode.
+ * @param[out] size_out Size of the result.
+ *
+ * @retval NULL Error.
+ * @retval not NULL MessagePack encoded message.
+ */
+static const char *
+promote_msg_encode(const struct promote_msg *msg, uint32_t *size_out)
+{
+	size_t size = 1024;
+	char *data = region_alloc(&fiber()->gc, size);
+	if (data == NULL) {
+		diag_set(OutOfMemory, size, "region_alloc", "data");
+		return NULL;
+	}
+	char *begin = data;
+	data = mp_encode_array(data, 7);
+	data = mp_encode_uint(data, msg->round_id);
+	data = mp_encode_str(data, tt_uuid_str(&msg->round_uuid),
+			     UUID_STR_LEN);
+	data = mp_encode_uint(data, msg->step);
+	data = mp_encode_str(data, tt_uuid_str(&msg->source_uuid),
+			     UUID_STR_LEN);
+	data = mp_encode_double(data, msg->ts);
+	const char *type_str = promote_msg_type_strs[msg->type];
+	data = mp_encode_str(data, type_str, strlen(type_str));
+	switch(msg->type) {
+	case PROMOTE_MSG_BEGIN:
+		data = mp_encode_map(data, 2);
+		data = mp_encode_str(data, "quorum", strlen("quorum"));
+		data = mp_encode_uint(data, msg->begin.quorum);
+		data = mp_encode_str(data, "timeout", strlen("timeout"));
+		data = mp_encode_double(data, msg->begin.timeout);
+		break;
+	case PROMOTE_MSG_STATUS:
+		data = mp_encode_map(data, 1);
+		data = mp_encode_str(data, "is_master", strlen("is_master"));
+		data = mp_encode_bool(data, msg->status.is_master);
+		break;
+	case PROMOTE_MSG_ERROR:
+		data = mp_encode_map(data, 2);
+		data = mp_encode_str(data, "code", strlen("code"));
+		data = mp_encode_uint(data, msg->error.code);
+		data = mp_encode_str(data, "message", strlen("message"));
+		data = mp_encode_str(data, msg->error.message,
+				     strlen(msg->error.message));
+		break;
+	default:
+		data = mp_encode_nil(data);
+		break;
+	}
+	*size_out = data - begin;
+	assert(*size_out <= size);
+	return begin;
+}
+
+const struct opt_def promote_msg_begin_format[] = {
+	OPT_DEF("quorum", OPT_UINT32, struct promote_msg, begin.quorum),
+	OPT_DEF("timeout", OPT_FLOAT, struct promote_msg, begin.timeout),
+	OPT_END,
+};
+
+const struct opt_def promote_msg_status_format[] = {
+	OPT_DEF("is_master", OPT_BOOL, struct promote_msg, status.is_master),
+	OPT_END,
+};
+
+const struct opt_def promote_msg_error_format[] = {
+	OPT_DEF("code", OPT_UINT32, struct promote_msg, error.code),
+	OPT_DEF("message", OPT_STRPTR, struct promote_msg, error.message),
+	OPT_END,
+};
+
+int
+promote_msg_decode(const char *data, struct promote_msg *msg)
+{
+	uint32_t size = mp_decode_array(&data);
+	assert(size == 7 || size == 6);
+	uint32_t len;
+	struct region *region = &fiber()->gc;
+	msg->round_id = (int) mp_decode_uint(&data);
+	const char *str = mp_decode_str(&data, &len);
+	if (tt_uuid_from_strl(str, len, &msg->round_uuid) != 0) {
+		diag_set(ClientError, ER_WRONG_PROMOTION_RECORD,
+			 BOX_PROMOTION_FIELD_ROUND_UUID, "invalid UUID");
+		return -1;
+	}
+	msg->step = (int) mp_decode_uint(&data);
+	str = mp_decode_str(&data, &len);
+	if (tt_uuid_from_strl(str, len, &msg->source_uuid) != 0) {
+		diag_set(ClientError, ER_WRONG_PROMOTION_RECORD,
+			 BOX_PROMOTION_FIELD_SOURCE_UUID, "invalid UUID");
+		return -1;
+	}
+	if (mp_read_double(&data, &msg->ts) != 0 || msg->ts < 0) {
+		diag_set(ClientError, ER_WRONG_PROMOTION_RECORD,
+			 BOX_PROMOTION_FIELD_TS, "wrong ts");
+		return -1;
+	}
+	str = mp_decode_str(&data, &len);
+	msg->type = STRN2ENUM(promote_msg_type, str, len);
+	if (msg->type == promote_msg_type_MAX) {
+		diag_set(ClientError, ER_WRONG_PROMOTION_RECORD,
+			 BOX_PROMOTION_FIELD_TYPE, "wrong type");
+		return -1;
+	}
+
+	switch(msg->type) {
+	case PROMOTE_MSG_BEGIN:
+		if (opts_decode(msg, promote_msg_begin_format, &data,
+				ER_WRONG_PROMOTION_RECORD,
+				BOX_PROMOTION_FIELD_VALUE, region, 2) != 0)
+			return -1;
+		break;
+	case PROMOTE_MSG_STATUS:
+		if (opts_decode(msg, promote_msg_status_format, &data,
+				ER_WRONG_PROMOTION_RECORD,
+				BOX_PROMOTION_FIELD_VALUE, region, 1) != 0)
+			return -1;
+		break;
+	case PROMOTE_MSG_ERROR:
+		if (opts_decode(msg, promote_msg_error_format, &data,
+				ER_WRONG_PROMOTION_RECORD,
+				BOX_PROMOTION_FIELD_VALUE, region, 2) != 0)
+			return -1;
+		break;
+	default:
+		if (mp_typeof(*data) != MP_NIL) {
+			diag_set(ClientError, ER_WRONG_PROMOTION_RECORD,
+				 BOX_PROMOTION_FIELD_VALUE,
+				 tt_sprintf("'%s' has to have value nil",
+					    promote_msg_type_strs[msg->type]));
+			return -1;
+		}
+		mp_decode_nil(&data);
+		break;
+	}
+	return 0;
+}
+
+/**
+ * Send the promotion message by writing it into the _promotion
+ * space.
+ * @param ap Variable length argument list. Contains a single
+ *        element - pointer to a promotion message to send.
+ *
+ * @retval -1 Error.
+ * @retval 0 Success.
+ */
+static int
+promote_send_f(va_list ap)
+{
+	const struct promote_msg *msg = va_arg(ap, const struct promote_msg *);
+	struct request request;
+	memset(&request, 0, sizeof(request));
+	request.type = IPROTO_INSERT;
+	request.space_id = BOX_PROMOTION_ID;
+	uint32_t size;
+	request.tuple = promote_msg_encode(msg, &size);
+	if (request.tuple == NULL)
+		return -1;
+	request.tuple_end = request.tuple + size;
+	return box_process_sys_dml(&request);
+}
+
+/**
+ * Wrapper for promote_send_f to send the message in a separate
+ * fiber. It is needed to be able to write records into _promotion
+ * space from on_commit trigger where core promotion logic is
+ * concentrated and a transaction exists already (though it is
+ * committed).
+ */
+static inline int
+promote_send(const struct promote_msg *msg)
+{
+	/*
+	 * Do nothing on recovery. If a message was sent during
+	 * the previous work session, it will be recovered among
+	 * the next rows.
+	 */
+	if (! box_is_configured())
+		return 0;
+	struct fiber *sender = fiber_new("promote sender", promote_send_f);
+	if (sender == NULL)
+		return -1;
+	say_info("send promotion message: %s", promote_msg_str(msg));
+	fiber_set_joinable(sender, true);
+	fiber_start(sender, msg);
+	int rc = fiber_join(sender);
+	if (rc != 0) {
+		say_info("promotion message has not been sent: %s",
+			 box_error_message(box_error_last()));
+	}
+	return rc;
+}
+
+/**
+ * Create the promotion message.
+ * @param[out] msg Message to create.
+ * @param type Type to set to @a msg.
+ */
+static inline void
+promote_msg_create(struct promote_msg *msg, enum promote_msg_type type)
+{
+	msg->round_id = promote_state.round_id;
+	msg->round_uuid = promote_state.round_uuid;
+	msg->source_uuid = INSTANCE_UUID;
+	msg->ts = fiber_time();
+	msg->type = type;
+	msg->step = ++promote_state.step;
+}
+
+/**
+ * Send a 'begin' promotion message. For this a new round is
+ * initialized and round_id is incremented.
+ */
+static inline int
+promote_send_begin(int quorum, double timeout)
+{
+	struct promote_msg msg;
+	promote_msg_create(&msg, PROMOTE_MSG_BEGIN);
+	tt_uuid_create(&msg.round_uuid);
+	msg.begin.quorum = quorum;
+	msg.begin.timeout = timeout;
+	msg.round_id++;
+	msg.step = 1;
+	return promote_send(&msg);
+}
+
+/**
+ * Send a 'status' promotion message. It contains a role of this
+ * instance. The message is sent as a response to 'begin' message.
+ */
+static inline int
+promote_send_status(void)
+{
+	struct promote_msg msg;
+	promote_msg_create(&msg, PROMOTE_MSG_STATUS);
+	msg.status.is_master = ! box_is_ro();
+	return promote_send(&msg);
+}
+
+/**
+ * Send a 'sync' promotion message. It is sent by this instance if
+ * it is an old master to be demoted. Sync brings this instance
+ * into read-only mode, while watchers and the initiator respond
+ * to this message with 'success'.
+ */
+static inline int
+promote_send_sync(void)
+{
+	struct promote_msg msg;
+	promote_msg_create(&msg, PROMOTE_MSG_SYNC);
+	return promote_send(&msg);
+}
+
+/**
+ * Send a 'success' promotion message. It is sent by a promotion
+ * watcher and an initiator as a response to 'sync' and by an old
+ * master when the sync is successful. The latter means the
+ * success of the whole promotion round.
+ */
+static inline int
+promote_send_success(void)
+{
+	struct promote_msg msg;
+	promote_msg_create(&msg, PROMOTE_MSG_SUCCESS);
+	return promote_send(&msg);
+}
+
+/**
+ * Send an 'error' promotion message. It is sent by any instance
+ * on different errors like timeout, multiple masters discovery,
+ * local errors (OOM, WAL error etc). This message is sent in
+ * scope of the current round and on commit terminates the local
+ * promotion state.
+ */
+static inline int
+promote_send_error(void)
+{
+	struct promote_msg msg;
+	promote_msg_create(&msg, PROMOTE_MSG_ERROR);
+	struct error *e = box_error_last();
+	msg.error.code = box_error_code(e);
+	msg.error.message = box_error_message(e);
+	return promote_send(&msg);
+}
+
+/**
+ * Send an 'error' promotion message out of scope of the current
+ * round. For example, as a response to an unexpected message from
+ * another round while the current round is active.
+ */
+static inline int
+promote_send_out_of_bound_error(int round_id, const struct tt_uuid *round_uuid,
+				int step)
+{
+	struct promote_msg msg;
+	promote_msg_create(&msg, PROMOTE_MSG_ERROR);
+	msg.round_id = round_id;
+	msg.round_uuid = *round_uuid;
+	struct error *e = box_error_last();
+	msg.error.code = box_error_code(e);
+	msg.error.message = box_error_message(e);
+	msg.step = step;
+	return promote_send(&msg);
+}
+
+int
+box_check_ro_is_mutable(void)
+{
+	if (! promote_state.is_role_committed)
+		return 0;
+	diag_set(ClientError, ER_CFG, "read_only", "can not change the option "\
+		 "when box.ctl.promote() was used");
+	return -1;
+}
+
+int
+box_ctl_promote(double timeout, int quorum)
+{
+	if (quorum < 0)
+		quorum = replicaset.applier.total;
+	if (! box_is_ro()) {
+		diag_set(ClientError, ER_PROMOTE, "non-initialized",
+			 "the initiator is already master");
+		return -1;
+	}
+	if (! promote_is_finished()) {
+		diag_set(ClientError, ER_PROMOTE_EXISTS);
+		return -1;
+	}
+	if (quorum <= replicaset.applier.total / 2) {
+		diag_set(ClientError, ER_PROMOTE, "non-initialized",
+			 tt_sprintf("too small quorum, expected > %d, "\
+				    "but got %d", replicaset.applier.total / 2,
+				    quorum));
+		return -1;
+	}
+	if (promote_send_begin(quorum, timeout) != 0)
+		return -1;
+
+	while (promote_state.phase != PROMOTE_PHASE_SUCCESS) {
+		fiber_cond_wait(&promote_state.on_change);
+		if (promote_state.phase == PROMOTE_PHASE_ERROR) {
+			assert(! diag_is_empty(&promote_state.diag));
+			diag_move(&promote_state.diag, diag_get());
+			return -1;
+		}
+	}
+	return 0;
+}
+
+/**
+ * Delete the promotion round with the specified id.
+ * @param id Round ID to delete by.
+ * @param[out] next_id ID of a next round.
+ * @param pk Primary index of the _promotion space.
+ *
+ * @retval 0 Success.
+ * @retval -1 Error.
+ */
+static inline int
+promote_clean_round(uint32_t id, uint32_t *next_id, struct index *pk)
+{
+	if (! promote_is_finished()) {
+		diag_set(ClientError, ER_PROMOTE_EXISTS);
+		return -1;
+	}
+	char key[16];
+	mp_encode_uint(key, id);
+	if (index_count(pk, ITER_ALL, NULL, 0) == 0)
+		return 0;
+	struct request request;
+	memset(&request, 0, sizeof(request));
+	request.type = IPROTO_DELETE;
+	request.space_id = BOX_PROMOTION_ID;
+	struct iterator *it = index_create_iterator(pk, ITER_GE, key, 1);
+	if (it == NULL)
+		return -1;
+	if (box_txn_begin() != 0) {
+		iterator_delete(it);
+		return -1;
+	}
+	struct tuple *t;
+	int rc;
+	while ((rc = iterator_next(it, &t)) == 0 && t != NULL) {
+		uint32_t key_size;
+		tuple_field_u32(t, BOX_PROMOTION_FIELD_ID, next_id);
+		if (*next_id != id)
+			break;
+		request.key = tuple_extract_key(t, pk->def->key_def, &key_size);
+		if (request.key == NULL)
+			goto rollback;
+		request.key_end = request.key + key_size;
+		if (box_process_sys_dml(&request) != 0)
+			goto rollback;
+	}
+	if (rc != 0 || box_txn_commit() != 0)
+		goto rollback;
+	iterator_delete(it);
+	return 0;
+rollback:
+	box_txn_rollback();
+	iterator_delete(it);
+	return -1;
+}
+
+int
+box_ctl_promote_reset(void)
+{
+	uint32_t id, next_id = 0;
+	struct index *pk = space_index(space_by_id(BOX_PROMOTION_ID), 0);
+	do {
+		id = next_id;
+		if (promote_clean_round(id, &next_id, pk) != 0)
+			return -1;
+	} while (id != next_id);
+	promote_state.phase = PROMOTE_PHASE_NON_ACTIVE;
+	promote_state.is_role_committed = false;
+	return 0;
+}
+
+/**
+ * Promotion timer worker function. It waits on the promotion
+ * state change condition variable at most timeout seconds and
+ * if the current round is not finished in time, the timeout error
+ * is committed.
+ */
+static int
+promote_timer_f(va_list ap)
+{
+	(void) ap;
+	assert(promote_state.timeout >= 0);
+	fiber_set_cancellable(true);
+	double timeout = promote_state.timeout;
+	double start = fiber_clock();
+	while (fiber_cond_wait_timeout(&promote_state.on_change,
+				       timeout) == 0) {
+		if (!promote_is_active() || fiber_is_cancelled())
+			goto stop;
+		timeout -= fiber_clock() - start;
+		start = fiber_clock();
+	}
+	if (!promote_is_active() || fiber_is_cancelled())
+		goto stop;
+	diag_set(ClientError, ER_TIMEOUT);
+	promote_state.step++;
+	promote_send_error();
+stop:
+	say_info("promotion timer is stopped");
+	assert(fiber() == promote_state.timer);
+	promote_state.timer = NULL;
+	return 0;
+}
+
+/**
+ * Start a promotion timer to terminate the current round on
+ * timeout.
+ */
+static int
+promote_start_timer(void)
+{
+	assert(promote_state.timer == NULL);
+	promote_state.timer = fiber_new("promote timer", promote_timer_f);
+	if (promote_state.timer == NULL)
+		return -1;
+	say_info("start promotion timer for %f seconds", promote_state.timeout);
+	fiber_start(promote_state.timer);
+	return 0;
+}
+
+void
+box_ctl_promote_info(struct info_handler *info)
+{
+	struct promote_state *s = &promote_state;
+	info_begin(info);
+	if (s->phase == PROMOTE_PHASE_NON_ACTIVE) {
+		info_end(info);
+		return;
+	}
+	info_append_int(info, "round_id", s->round_id);
+	info_append_str(info, "round_uuid", tt_uuid_str(&s->round_uuid));
+	if (promote_is_initiator_known()) {
+		info_append_str(info, "initiator_uuid",
+				tt_uuid_str(&s->initiator_uuid));
+		info_append_int(info, "quorum", s->quorum);
+		info_append_double(info, "timeout", s->timeout);
+	}
+	info_append_str(info, "role", promote_role_strs[s->role]);
+	info_append_str(info, "phase", promote_phase_strs[s->phase]);
+	info_append_str(info, "comment", s->comment);
+	if (promote_is_master_known()) {
+		info_append_str(info, "old_master_uuid",
+				tt_uuid_str(&s->old_master_uuid));
+	}
+	info_end(info);
+}
+
+void
+promote_process(const struct promote_msg *msg)
+{
+	if (box_is_configured()) {
+		say_info("promotion message is %s: %s",
+			 promote_msg_is_mine(msg) ? "committed" : "received",
+			 promote_msg_str(msg));
+	} else {
+		say_info("promotion message is recovered: %s",
+			 promote_msg_str(msg));
+	}
+	if (! promote_is_active()) {
+		if (msg->round_id <= promote_state.round_id) {
+			say_info("Ignored outdated round id %d, expected > %d",
+				 msg->round_id, promote_state.round_id);
+			return;
+		}
+		/*
+		 * During recovery there are no yields, so do them
+		 * manually when needed to stop the timer. It is
+		 * not possible to avoid starting a timer, since
+		 * only a part of the round could be persisted, so
+		 * after the recovery is finished it is necessary
+		 * to commit an error on timeout, or finish the
+		 * round with success.
+		 */
+		if (promote_state.timer != NULL) {
+			assert(! box_is_configured());
+			fiber_cancel(promote_state.timer);
+			while (promote_state.timer != NULL)
+				fiber_sleep(0);
+		}
+		promote_state.step = 1;
+		promote_state.round_id = msg->round_id;
+		promote_state.round_uuid = msg->round_uuid;
+		promote_state.old_master_uuid = uuid_nil;
+		promote_state.initiator_uuid = uuid_nil;
+		diag_clear(&promote_state.diag);
+		promote_state.phase = PROMOTE_PHASE_IN_PROGRESS;
+		/*
+		 * Until 'status' message is committed, the role
+		 * is undefined. It is not possible to use
+		 * box_is_ro() right here since recovery may be in
+		 * progress. And by recovering its own 'status'
+		 * messages the instance restores its read_only
+		 * flag and the role.
+		 */
+		promote_state.role = PROMOTE_ROLE_UNDEFINED;
+		promote_state.sync_count = 0;
+		promote_state.watcher_count = 0;
+		/*
+		 * Initiator, quorum and timeout can not be set
+		 * right now, because the first message may be
+		 * non-begin and thus does not contain any round
+		 * initial info. It is called message reordering
+		 * and it is possible when, for example, one
+		 * instance downloads the same round messages from
+		 * two different instances. Some of them can be
+		 * received earlier, but committed later breaking
+		 * the order. So the order can not be trusted.
+		 */
+	} else if (!promote_is_this_round_msg(msg)) {
+		/*
+		 * Do not respond to an error with an error, or
+		 * else an infinite exchange of error messages
+		 * will be started.
+		 */
+		if (msg->type == PROMOTE_MSG_ERROR)
+			return;
+		diag_set(ClientError, ER_PROMOTE, tt_uuid_str(&msg->round_uuid),
+			 "unexpected message");
+		promote_send_out_of_bound_error(msg->round_id, &msg->round_uuid,
+						msg->step + 1);
+		return;
+	} else {
+		promote_state.step = MAX(msg->step, promote_state.step);
+	}
+	/*
+	 * The main processing switch. Here each instance of each
+	 * type responds to each type of message.
+	 */
+	switch (msg->type) {
+	case PROMOTE_MSG_BEGIN:
+		promote_state.initiator_uuid = msg->source_uuid;
+		promote_state.quorum = msg->begin.quorum;
+		promote_state.timeout = msg->begin.timeout;
+		if (promote_start_timer() != 0) {
+			promote_send_error();
+			break;
+		}
+		if (! promote_msg_is_mine(msg)) {
+			promote_send_status();
+		} else {
+			promote_state.role = PROMOTE_ROLE_INITIATOR;
+			promote_state.is_role_committed = true;
+			/*
+			 * If an instance sent 'begin' then it was
+			 * not a master at the moment of sending.
+			 * Recover this status.
+			 */
+			box_set_ro(true);
+			box_expose_ro();
+			promote_comment("promotion is started, my promotion "\
+					"role is %s",
+					promote_role_strs[promote_state.role]);
+		}
+		break;
+
+	case PROMOTE_MSG_STATUS:
+		if (promote_state.role == PROMOTE_ROLE_UNDEFINED &&
+		    promote_msg_is_mine(msg)) {
+			/*
+			 * An instance can restore its role ONLY
+			 * by its own status messages and only on
+			 * commit, even if it has just sent the
+			 * status a moment earlier. The 'status'
+			 * message is also used to recover read_only.
+			 */
+			if (! box_is_ro())
+				promote_state.role = PROMOTE_ROLE_OLD_MASTER;
+			else
+				promote_state.role = PROMOTE_ROLE_WATCHER;
+			promote_state.is_role_committed = true;
+			box_set_ro(! msg->status.is_master);
+			box_expose_ro();
+			promote_comment("promotion is started, my promotion "\
+					"role is %s",
+					promote_role_strs[promote_state.role]);
+		}
+		if (msg->status.is_master) {
+			if (! promote_is_master_known()) {
+				promote_state.old_master_uuid =
+					msg->source_uuid;
+				if (promote_state.role !=
+				    PROMOTE_ROLE_OLD_MASTER)
+					break;
+				if (promote_msg_is_mine(msg)) {
+					/* Synced with self. */
+					promote_state.sync_count++;
+					promote_send_sync();
+					break;
+				}
+			}
+			const char *r, *m1, *m2;
+			r = tt_uuid_str(&msg->round_uuid);
+			m1 = tt_uuid_str(&msg->source_uuid);
+			m2 = tt_uuid_str(&promote_state.old_master_uuid);
+			/*
+			 * Sort master UUIDs to stabilize the
+			 * error message. Mostly for tests.
+			 */
+			if (strcmp(m1, m2) > 0)
+				SWAP(m1, m2);
+			diag_set(ClientError, ER_PROMOTE, r,
+				 tt_sprintf("two masters exist: '%s' and '%s'",
+					    m1, m2));
+			promote_send_error();
+			break;
+		}
+		++promote_state.watcher_count;
+		if (promote_state.role != PROMOTE_ROLE_INITIATOR ||
+		    !promote_is_cluster_readonly())
+			break;
+		/*
+		 * The cluster is readonly and 100% available.
+		 * So the promotion is safely allowed, but the
+		 * initiator plays the role of an old master.
+		 */
+		promote_comment("the cluster is completely readonly, the "\
+				"initiator acts on behalf of an old master");
+		/* Synced with self. */
+		promote_state.sync_count++;
+		promote_send_sync();
+		break;
+
+	case PROMOTE_MSG_SYNC:
+		if (promote_msg_is_mine(msg)) {
+			if (promote_state.role == PROMOTE_ROLE_OLD_MASTER) {
+				promote_comment("old master entered readonly "\
+						"mode to sync with slaves");
+				box_set_ro(true);
+				box_expose_ro();
+			} else {
+				assert(promote_state.role ==
+				       PROMOTE_ROLE_INITIATOR);
+				assert(promote_is_cluster_readonly());
+			}
+		} else {
+			if (promote_state.role == PROMOTE_ROLE_UNDEFINED) {
+				promote_state.is_role_committed = true;
+				promote_state.role = PROMOTE_ROLE_WATCHER;
+				promote_comment("promotion is started, 'sync' "\
+						"is received before my status "\
+						"was committed so I am not a "\
+						"master and not an initiator, "\
+						"but watcher");
+				box_set_ro(true);
+				box_expose_ro();
+			}
+			promote_send_success();
+		}
+		break;
+
+	case PROMOTE_MSG_SUCCESS:
+		switch (promote_state.role) {
+		case PROMOTE_ROLE_OLD_MASTER:
+			/*
+			 * The old master sends 'success' to
+			 * notify the initiator about the round
+			 * successful finish.
+			 */
+			if (promote_msg_is_mine(msg)) {
+				promote_state.phase = PROMOTE_PHASE_SUCCESS;
+				promote_comment("the old master is demoted "\
+						"completely");
+			} else if (++promote_state.sync_count ==
+				   promote_state.quorum) {
+				/*
+				 * On commit the code above is
+				 * called. But do nothing until
+				 * the commit.
+				 */
+				promote_send_success();
+			}
+			break;
+		case PROMOTE_ROLE_INITIATOR:
+			/*
+				 * The round is finished successfully in
+				 * two cases: the old master has sent
+				 * 'success' or the cluster is read-only
+				 * and each replica has sent 'success'.
+			 */
+			if (tt_uuid_is_equal(&msg->source_uuid,
+					     &promote_state.old_master_uuid) ||
+			    (promote_is_cluster_readonly() &&
+			     ++promote_state.sync_count ==
+			     promote_state.quorum)) {
+				promote_comment("the new master is promoted");
+				promote_state.phase = PROMOTE_PHASE_SUCCESS;
+				box_set_ro(false);
+				box_expose_ro();
+			}
+			break;
+		case PROMOTE_ROLE_WATCHER:
+			if (promote_msg_is_mine(msg)) {
+				promote_state.phase = PROMOTE_PHASE_SUCCESS;
+				promote_comment("the watcher has voted and "\
+						"left the round");
+			}
+			break;
+		default:
+			assert(promote_state.role == PROMOTE_ROLE_UNDEFINED);
+			assert(! promote_msg_is_mine(msg));
+			break;
+		}
+		break;
+
+	case PROMOTE_MSG_ERROR:
+		if (promote_state.role == PROMOTE_ROLE_OLD_MASTER &&
+		    promote_state.phase == PROMOTE_PHASE_IN_PROGRESS &&
+		    box_is_ro()) {
+			promote_comment("the old master is back in read-write "\
+					"mode due to the error: %s",
+					msg->error.message);
+			box_set_ro(false);
+			box_expose_ro();
+		} else {
+			promote_comment("the round failed due to the error: %s",
+					msg->error.message);
+		}
+		promote_state.phase = PROMOTE_PHASE_ERROR;
+		box_error_raise(msg->error.code, "%s", msg->error.message);
+		diag_move(diag_get(), &promote_state.diag);
+		break;
+	default:
+		break;
+	}
+	promote_state.round_id = MAX(promote_state.round_id, msg->round_id);
+	fiber_cond_broadcast(&promote_state.on_change);
+}
+
+int
+box_ctl_promote_init(void)
+{
+	memset(&promote_state, 0, sizeof(promote_state));
+	fiber_cond_create(&promote_state.on_change);
+	return 0;
+}
diff --git a/src/box/promote.h b/src/box/promote.h
new file mode 100644
index 000000000..e66140ade
--- /dev/null
+++ b/src/box/promote.h
@@ -0,0 +1,170 @@
+#ifndef INCLUDES_TARANTOOL_BOX_PROMOTE_H
+#define INCLUDES_TARANTOOL_BOX_PROMOTE_H
+/*
+ * Copyright 2010-2018, Tarantool AUTHORS, please see AUTHORS file.
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above
+ *    copyright notice, this list of conditions and the
+ *    following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above
+ *    copyright notice, this list of conditions and the following
+ *    disclaimer in the documentation and/or other materials
+ *    provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY <COPYRIGHT HOLDER> ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+ * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * <COPYRIGHT HOLDER> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
+ * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#include "tt_uuid.h"
+#include "diag.h"
+#include "fiber_cond.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif /* __cplusplus */
+
+struct info_handler;
+
+enum promote_msg_type {
+	PROMOTE_MSG_BEGIN = 0,
+	PROMOTE_MSG_STATUS,
+	PROMOTE_MSG_SYNC,
+	PROMOTE_MSG_SUCCESS,
+	PROMOTE_MSG_ERROR,
+	promote_msg_type_MAX,
+};
+
+/**
+ * Promotion message. The unit of communication between an
+ * initiator, an old master and watchers.
+ */
+struct promote_msg {
+	/**
+	 * Round ID. Together with round UUID composes a unique
+	 * round identifier. For details see promotion_state.
+	 */
+	int round_id;
+	/** Promotion round UUID, generated by the initiator. */
+	struct tt_uuid round_uuid;
+	/** UUID of the message sender. */
+	struct tt_uuid source_uuid;
+	/**
+	 * Time when the message was sent, by the sender's clock.
+	 * A debug-only attribute, but it is persisted.
+	 */
+	double ts;
+	/** Promotion message type. */
+	enum promote_msg_type type;
+	/** Step of the round on which the message was sent. */
+	int step;
+	/**
+	 * Depending on the message type, different attributes
+	 * are available in the message.
+	 */
+	union {
+		struct {
+			/**
+			 * 'Begin' promotion message carries
+			 * quorum and timeout of the new round
+			 * among other common things above.
+			 */
+			int quorum;
+			double timeout;
+		} begin;
+		struct {
+			/**
+			 * 'Status' message carries the sender
+			 * role.
+			 */
+			bool is_master;
+		} status;
+		struct {
+			/**
+			 * 'Error' message carries the error code
+			 * and message to be set in diag.
+			 */
+			int code;
+			const char *message;
+		} error;
+	};
+};
+
+/**
+ * Decode the MessagePack encoded promotion message into @a msg.
+ * @param data MessagePack data to decode. Tuple from _promotion.
+ * @param[out] msg Object to fill up.
+ *
+ * @retval -1 Error during decoding.
+ * @retval 0 Success.
+ */
+int
+promote_msg_decode(const char *data, struct promote_msg *msg);
+
+/**
+ * Process the promotion message, update the promotion state. The
+ * processing is executed on commit of @a msg.
+ * @param msg Message to process.
+ */
+void
+promote_process(const struct promote_msg *msg);
+
+/**
+ * Promote the current instance to be a master in the fullmesh
+ * master-slave cluster. The old master, if it exists, is demoted.
+ * Once a promotion attempt is made anywhere, manual change of
+ * the read_only flag is disabled.
+ * @param timeout Timeout during which the promotion should be
+ *        finished.
+ * @param quorum The promotion quorum of instances who should
+ *        approve the promotion and sync with the old master
+ *        before demotion. The quorum should be at least half of
+ *        the cluster size + 1 and include the old master. If an
+ *        old master does not exist, then the quorum is ignored
+ *        and the promotion waits for 100% of the cluster
+ *        members.
+ *
+ * @retval -1 Error.
+ * @retval 0 Success.
+ */
+int
+box_ctl_promote(double timeout, int quorum);
+
+/**
+ * Show status of the current active promotion round or the last
+ * finished one.
+ * @param info Info handler to collect the info into.
+ */
+void
+box_ctl_promote_info(struct info_handler *info);
+
+/**
+ * Remove all the promotion rounds from the history. That allows
+ * read_only to be changed manually again.
+ */
+int
+box_ctl_promote_reset(void);
+
+/** Initialize the promotion subsystem. */
+int
+box_ctl_promote_init(void);
+
+#ifdef __cplusplus
+}
+#endif /* __cplusplus */
+
+#endif /* INCLUDES_TARANTOOL_BOX_PROMOTE_H */
diff --git a/src/cfg.c b/src/cfg.c
index 7c7d6e793..4d02a315e 100644
--- a/src/cfg.c
+++ b/src/cfg.c
@@ -153,3 +153,14 @@ cfg_getarr_elem(const char *name, int i)
 	lua_pop(tarantool_L, 2);
 	return val;
 }
+
+void
+cfg_rawsetb(const char *name, bool b)
+{
+	lua_getfield(tarantool_L, LUA_GLOBALSINDEX, "box");
+	lua_getfield(tarantool_L, -1, "cfg");
+	lua_pushstring(tarantool_L, name);
+	lua_pushboolean(tarantool_L, b);
+	lua_rawset(tarantool_L, -3);
+	lua_pop(tarantool_L, 2);
+}
diff --git a/src/cfg.h b/src/cfg.h
index 8499388b8..a7e400fe5 100644
--- a/src/cfg.h
+++ b/src/cfg.h
@@ -61,6 +61,9 @@ cfg_getarr_size(const char *name);
 const char *
 cfg_getarr_elem(const char *name, int i);
 
+void
+cfg_rawsetb(const char *name, bool b);
+
 #if defined(__cplusplus)
 } /* extern "C" */
 #endif /* defined(__cplusplus) */
diff --git a/src/main.cc b/src/main.cc
index a36a2b0d0..e6ab28b92 100644
--- a/src/main.cc
+++ b/src/main.cc
@@ -514,6 +514,7 @@ load_cfg()
 	 */
 	say_crit("%s %s", tarantool_package(), tarantool_version());
 	say_crit("log level %i", cfg_geti("log_level"));
+	box_set_ro(cfg_geti("read_only") != 0);
 
 	if (pid_file_handle != NULL) {
 		if (pidfile_write(pid_file_handle) == -1)
diff --git a/test/box/misc.result b/test/box/misc.result
index 4895a78a2..14183829c 100644
--- a/test/box/misc.result
+++ b/test/box/misc.result
@@ -397,8 +397,11 @@ t;
   - 'box.error.FUNCTION_EXISTS : 52'
   - 'box.error.UPDATE_ARG_TYPE : 26'
   - 'box.error.CROSS_ENGINE_TRANSACTION : 81'
-  - 'box.error.FORMAT_MISMATCH_INDEX_PART : 27'
   - 'box.error.injection : table: <address>
+  - 'box.error.INVALID_XLOG_TYPE : 125'
+  - 'box.error.PROTOCOL : 104'
+  - 'box.error.FORMAT_MISMATCH_INDEX_PART : 27'
+  - 'box.error.PROMOTE : 156'
   - 'box.error.FUNCTION_TX_ACTIVE : 30'
   - 'box.error.ITERATOR_TYPE : 72'
   - 'box.error.TRANSACTION_YIELD : 154'
@@ -472,8 +475,8 @@ t;
   - 'box.error.UNSUPPORTED_PRIV : 98'
   - 'box.error.WRONG_SCHEMA_VERSION : 109'
   - 'box.error.ROLLBACK_IN_SUB_STMT : 123'
-  - 'box.error.PROTOCOL : 104'
-  - 'box.error.INVALID_XLOG_TYPE : 125'
+  - 'box.error.WRONG_PROMOTION_RECORD : 157'
+  - 'box.error.PROMOTE_EXISTS : 158'
   - 'box.error.INDEX_PART_TYPE_MISMATCH : 24'
   - 'box.error.UNSUPPORTED_INDEX_FEATURE : 112'
 ...
diff --git a/test/promote/basic.result b/test/promote/basic.result
new file mode 100644
index 000000000..f70659963
--- /dev/null
+++ b/test/promote/basic.result
@@ -0,0 +1,472 @@
+test_run = require('test_run').new()
+---
+...
+test_run:create_cluster(CLUSTER, 'promote')
+---
+...
+test_run:wait_fullmesh(CLUSTER)
+---
+...
+--
+-- Check the promote actually allows to switch the master.
+--
+_ = test_run:switch('box1')
+---
+...
+-- Box1 is a master.
+box.cfg.read_only
+---
+- false
+...
+_ = test_run:switch('box2')
+---
+...
+-- Box2 is a slave.
+box.cfg.read_only
+---
+- true
+...
+-- And can not do DDL/DML.
+box.schema.create_space('test') -- Fail.
+---
+- error: Can't modify data because this instance is in read-only mode.
+...
+box.ctl.promote()
+---
+- true
+...
+promote_info()
+---
+- quorum: 4
+  timeout: 3153600000
+  initiator_uuid: box2
+  old_master_uuid: box1
+  role: initiator
+  round_id: 1
+  comment: the new master is promoted
+  phase: success
+  round_uuid: round_1
+...
+-- Now the slave has become a master.
+box.cfg.read_only
+---
+- false
+...
+-- And can do DDL/DML.
+s = box.schema.create_space('test')
+---
+...
+s:drop()
+---
+...
+_ = test_run:switch('box1')
+---
+...
+-- In turn, the old master is a slave now.
+box.cfg.read_only
+---
+- true
+...
+promote_info()
+---
+- quorum: 4
+  timeout: 3153600000
+  initiator_uuid: box2
+  old_master_uuid: box1
+  role: old master
+  round_id: 1
+  comment: the old master is demoted completely
+  phase: success
+  round_uuid: round_1
+...
+-- For him any DDL/DML is forbidden.
+box.schema.create_space('test2')
+---
+- error: Can't modify data because this instance is in read-only mode.
+...
+-- Check a watcher state.
+_ = test_run:switch('box3')
+---
+...
+promote_info()
+---
+- quorum: 4
+  timeout: 3153600000
+  initiator_uuid: box2
+  old_master_uuid: box1
+  role: watcher
+  round_id: 1
+  comment: the watcher has voted and left the round
+  phase: success
+  round_uuid: round_1
+...
+--
+-- Clear the basic successfull test and try different errors.
+--
+_ = test_run:switch('box2')
+---
+...
+box.ctl.promote_reset()
+---
+- true
+...
+promotion_history()
+---
+- []
+...
+prom = box.space._promotion
+---
+...
+-- Invalid UUIDs.
+prom:insert{1, 'invalid', 1, box.info.uuid, 1, 't'}
+---
+- error: 'Wrong record in _promotion (field 1): invalid UUID'
+...
+prom:insert{1, box.info.uuid, 1, 'invalid', 1, 't'}
+---
+- error: 'Wrong record in _promotion (field 3): invalid UUID'
+...
+-- Invalid ts.
+prom:insert{1, box.info.uuid, 1, box.info.uuid, -1, 't'}
+---
+- error: 'Wrong record in _promotion (field 5): wrong ts'
+...
+-- Invalid type.
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'invalid'}
+---
+- error: 'Wrong record in _promotion (field 6): wrong type'
+...
+-- Invalid type-specific options.
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'begin', {quorum = 1}}
+---
+- error: 'Wrong record in _promotion (field 7): expected 2 keys but got 1'
+...
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'begin', {quorum = 'invalid', timeout = 1}}
+---
+- error: 'Wrong record in _promotion (field 7): ''quorum'' must be unsigned'
+...
+map = setmetatable({}, {__serialize = 'map'})
+---
+...
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'status', map}
+---
+- error: 'Wrong record in _promotion (field 7): expected 1 keys but got 0'
+...
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'status', {is_master = 'invalid'}}
+---
+- error: 'Wrong record in _promotion (field 7): ''is_master'' must be boolean'
+...
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'error', map}
+---
+- error: 'Wrong record in _promotion (field 7): expected 2 keys but got 0'
+...
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'error', {code = 'code', message = 'msg'}}
+---
+- error: 'Wrong record in _promotion (field 7): ''code'' must be unsigned'
+...
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'sync', map}
+---
+- error: 'Wrong record in _promotion (field 7): ''sync'' has to have value nil'
+...
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'success', map}
+---
+- error: 'Wrong record in _promotion (field 7): ''success'' has to have value nil'
+...
+--
+-- Test simple invalid scenarios.
+--
+-- Already master.
+box.ctl.promote()
+---
+- null
+- 'Error during promotion with round UUID ''non-initialized'': the initiator is already
+  master'
+...
+_ = test_run:switch('box1')
+---
+...
+-- Small quorum.
+box.ctl.promote({quorum = 2})
+---
+- null
+- 'Error during promotion with round UUID ''non-initialized'': too small quorum, expected
+  > 2, but got 2'
+...
+-- Two masters.
+box.cfg{read_only = false}
+---
+...
+_ = test_run:switch('box3')
+---
+...
+promote_check_error()
+---
+- null
+- 'Error during promotion with round UUID ''round_2'': two masters exist: ''box1''
+  and ''box2'''
+...
+promotion_history_find_masters()
+---
+- - {'step': 2, 'value': {'is_master': true}, 'id': 2, 'type': 'status', 'source_uuid': 'box1',
+    'round_uuid': 'round_2'}
+  - {'step': 2, 'value': {'is_master': true}, 'id': 2, 'type': 'status', 'source_uuid': 'box2',
+    'round_uuid': 'round_2'}
+...
+box.cfg.read_only
+---
+- true
+...
+_ = test_run:switch('box1')
+---
+...
+box.cfg.read_only
+---
+- false
+...
+_ = test_run:switch('box2')
+---
+...
+box.cfg.read_only
+---
+- false
+...
+_ = test_run:switch('box4')
+---
+...
+box.cfg.read_only
+---
+- true
+...
+-- Box.cfg.read_only became immutable when promote had been
+-- called.
+box.cfg{read_only = false}
+---
+- error: 'Incorrect value for option ''read_only'': can not change the option when
+    box.ctl.promote() was used'
+...
+--
+-- Test recovery after failed promotion.
+--
+_ = test_run:cmd('restart server box2')
+---
+...
+_ = test_run:cmd('restart server box3')
+---
+...
+_ = test_run:switch('box2')
+---
+...
+info = promote_info()
+---
+...
+info.old_master_uuid == 'box1' or info.old_master_uuid == 'box2'
+---
+- true
+...
+info.old_master_uuid = nil
+---
+...
+info.comment = info.comment:match('two masters exist')
+---
+...
+info
+---
+- quorum: 4
+  timeout: 3153600000
+  initiator_uuid: box3
+  role: old master
+  round_id: 2
+  comment: two masters exist
+  phase: error
+  round_uuid: round_2
+...
+_ = test_run:switch('box3')
+---
+...
+info = promote_info()
+---
+...
+info.old_master_uuid == 'box1' or info.old_master_uuid == 'box2'
+---
+- true
+...
+info.old_master_uuid = nil
+---
+...
+info
+---
+- quorum: 4
+  timeout: 3153600000
+  initiator_uuid: box3
+  role: initiator
+  round_id: 2
+  comment: 'the round failed due to the error: Error during promotion with round UUID
+    ''round_2'': two masters exist: ''box1'' and ''box2'''
+  phase: error
+  round_uuid: round_2
+...
+--
+-- Test timeout.
+--
+_ = test_run:switch('box1')
+---
+...
+box.ctl.promote_reset()
+---
+- true
+...
+box.cfg{read_only = true}
+---
+...
+-- Now box2 is a single master.
+_ = test_run:switch('box3')
+---
+...
+promote_check_error({timeout = 0.00001})
+---
+- null
+- Timeout exceeded
+...
+promote_info()
+---
+- quorum: 4
+  initiator_uuid: box3
+  phase: error
+  role: initiator
+  round_id: 3
+  comment: 'the round failed due to the error: Timeout exceeded'
+  timeout: 1e-05
+  round_uuid: round_3
+...
+--
+-- Test the case when the cluster is not read-only, but a single
+-- master is not available now. In such a case the promote()
+-- should fail regardless of quorum.
+--
+_ = test_run:cmd('stop server box2')
+---
+...
+box.ctl.promote_reset()
+---
+- true
+...
+-- Quorum is 3 to test that the quorum must contain an old master.
+promote_check_error({timeout = 0.5, quorum = 3})
+---
+- null
+- Timeout exceeded
+...
+promote_info()
+---
+- quorum: 3
+  initiator_uuid: box3
+  phase: error
+  role: initiator
+  round_id: 4
+  comment: 'the round failed due to the error: Timeout exceeded'
+  timeout: 0.5
+  round_uuid: round_4
+...
+_ = test_run:switch('box1')
+---
+...
+_ = test_run:cmd('stop server box3')
+---
+...
+_ = test_run:cmd('start server box2')
+---
+...
+_ = test_run:switch('box2')
+---
+...
+info = promote_info({'round_id', 'comment', 'phase', 'round_uuid'})
+---
+...
+info.comment = info.comment:match('Timeout exceeded')
+---
+...
+info
+---
+- round_id: 4
+  comment: Timeout exceeded
+  phase: error
+  round_uuid: round_4
+...
+_ = test_run:cmd('start server box3')
+---
+...
+_ = test_run:switch('box3')
+---
+...
+promote_info({'round_id', 'comment', 'phase', 'round_uuid', 'role'})
+---
+- phase: error
+  role: initiator
+  round_id: 4
+  comment: 'the round failed due to the error: Timeout exceeded'
+  round_uuid: round_4
+...
+--
+-- Test promotion in a completely read-only cluster.
+--
+_ = test_run:switch('box2')
+---
+...
+box.ctl.promote_reset()
+---
+- true
+...
+box.cfg{read_only = true}
+---
+...
+box.ctl.promote()
+---
+- true
+...
+promote_info()
+---
+- quorum: 4
+  initiator_uuid: box2
+  phase: success
+  role: initiator
+  round_id: 5
+  comment: the new master is promoted
+  timeout: 3153600000
+  round_uuid: round_5
+...
+--
+-- Test promotion reset of several rounds.
+--
+_ = test_run:switch('box3')
+---
+...
+box.ctl.promote()
+---
+- true
+...
+promote_info()
+---
+- quorum: 4
+  timeout: 3153600000
+  initiator_uuid: box3
+  old_master_uuid: box2
+  role: initiator
+  round_id: 6
+  comment: the new master is promoted
+  phase: success
+  round_uuid: round_6
+...
+box.ctl.promote_reset()
+---
+- true
+...
+promotion_history()
+---
+- []
+...
+_ = test_run:switch('default')
+---
+...
+test_run:drop_cluster(CLUSTER)
+---
+...
diff --git a/test/promote/basic.test.lua b/test/promote/basic.test.lua
new file mode 100644
index 000000000..4138745b5
--- /dev/null
+++ b/test/promote/basic.test.lua
@@ -0,0 +1,160 @@
+test_run = require('test_run').new()
+test_run:create_cluster(CLUSTER, 'promote')
+test_run:wait_fullmesh(CLUSTER)
+--
+-- Check the promote actually allows to switch the master.
+--
+_ = test_run:switch('box1')
+-- Box1 is a master.
+box.cfg.read_only
+
+_ = test_run:switch('box2')
+-- Box2 is a slave.
+box.cfg.read_only
+-- And can not do DDL/DML.
+box.schema.create_space('test') -- Fail.
+
+box.ctl.promote()
+promote_info()
+-- Now the slave has become a master.
+box.cfg.read_only
+-- And can do DDL/DML.
+s = box.schema.create_space('test')
+s:drop()
+
+_ = test_run:switch('box1')
+-- In turn, the old master is a slave now.
+box.cfg.read_only
+promote_info()
+-- For him any DDL/DML is forbidden.
+box.schema.create_space('test2')
+
+-- Check a watcher state.
+_ = test_run:switch('box3')
+promote_info()
+
+--
+-- Clear the basic successfull test and try different errors.
+--
+_ = test_run:switch('box2')
+box.ctl.promote_reset()
+promotion_history()
+
+prom = box.space._promotion
+
+-- Invalid UUIDs.
+prom:insert{1, 'invalid', 1, box.info.uuid, 1, 't'}
+prom:insert{1, box.info.uuid, 1, 'invalid', 1, 't'}
+-- Invalid ts.
+prom:insert{1, box.info.uuid, 1, box.info.uuid, -1, 't'}
+-- Invalid type.
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'invalid'}
+-- Invalid type-specific options.
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'begin', {quorum = 1}}
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'begin', {quorum = 'invalid', timeout = 1}}
+
+map = setmetatable({}, {__serialize = 'map'})
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'status', map}
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'status', {is_master = 'invalid'}}
+
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'error', map}
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'error', {code = 'code', message = 'msg'}}
+
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'sync', map}
+prom:insert{1, box.info.uuid, 1, box.info.uuid, 1, 'success', map}
+
+--
+-- Test simple invalid scenarios.
+--
+
+-- Already master.
+box.ctl.promote()
+_ = test_run:switch('box1')
+-- Small quorum.
+box.ctl.promote({quorum = 2})
+-- Two masters.
+box.cfg{read_only = false}
+_ = test_run:switch('box3')
+promote_check_error()
+promotion_history_find_masters()
+box.cfg.read_only
+_ = test_run:switch('box1')
+box.cfg.read_only
+_ = test_run:switch('box2')
+box.cfg.read_only
+_ = test_run:switch('box4')
+box.cfg.read_only
+-- Box.cfg.read_only became immutable when promote had been
+-- called.
+box.cfg{read_only = false}
+
+--
+-- Test recovery after failed promotion.
+--
+_ = test_run:cmd('restart server box2')
+_ = test_run:cmd('restart server box3')
+_ = test_run:switch('box2')
+info = promote_info()
+info.old_master_uuid == 'box1' or info.old_master_uuid == 'box2'
+info.old_master_uuid = nil
+info.comment = info.comment:match('two masters exist')
+info
+_ = test_run:switch('box3')
+info = promote_info()
+info.old_master_uuid == 'box1' or info.old_master_uuid == 'box2'
+info.old_master_uuid = nil
+info
+
+--
+-- Test timeout.
+--
+_ = test_run:switch('box1')
+box.ctl.promote_reset()
+box.cfg{read_only = true}
+-- Now box2 is a single master.
+_ = test_run:switch('box3')
+promote_check_error({timeout = 0.00001})
+promote_info()
+
+--
+-- Test the case when the cluster is not read-only, but a single
+-- master is not available now. In such a case the promote()
+-- should fail regardless of quorum.
+--
+_ = test_run:cmd('stop server box2')
+box.ctl.promote_reset()
+-- Quorum is 3 to test that the quorum must contain an old master.
+promote_check_error({timeout = 0.5, quorum = 3})
+promote_info()
+_ = test_run:switch('box1')
+_ = test_run:cmd('stop server box3')
+_ = test_run:cmd('start server box2')
+_ = test_run:switch('box2')
+info = promote_info({'round_id', 'comment', 'phase', 'round_uuid'})
+info.comment = info.comment:match('Timeout exceeded')
+info
+
+_ = test_run:cmd('start server box3')
+_ = test_run:switch('box3')
+promote_info({'round_id', 'comment', 'phase', 'round_uuid', 'role'})
+
+--
+-- Test promotion in a completely read-only cluster.
+--
+_ = test_run:switch('box2')
+box.ctl.promote_reset()
+box.cfg{read_only = true}
+box.ctl.promote()
+promote_info()
+
+--
+-- Test promotion reset of several rounds.
+--
+_ = test_run:switch('box3')
+box.ctl.promote()
+promote_info()
+box.ctl.promote_reset()
+promotion_history()
+
+_ = test_run:switch('default')
+test_run:drop_cluster(CLUSTER)
diff --git a/test/promote/box.lua b/test/promote/box.lua
new file mode 100644
index 000000000..97d952cae
--- /dev/null
+++ b/test/promote/box.lua
@@ -0,0 +1,8 @@
+#!/usr/bin/env tarantool
+os = require('os')
+
+box.cfg{ listen = os.getenv("LISTEN") }
+
+CLUSTER = { 'box1', 'box2', 'box3', 'box4' }
+
+require('console').listen(os.getenv('ADMIN'))
diff --git a/test/promote/box1.lua b/test/promote/box1.lua
new file mode 100644
index 000000000..eca667f96
--- /dev/null
+++ b/test/promote/box1.lua
@@ -0,0 +1,112 @@
+#!/usr/bin/env tarantool
+
+local INSTANCE_ID = string.match(arg[0], "%d")
+local SOCKET_DIR = require('fio').cwd()
+local read_only = INSTANCE_ID ~= '1'
+local function instance_uri(instance_id)
+    return SOCKET_DIR..'/promote'..instance_id..'.sock';
+end
+local uuid_prefix = '4d71c17c-8c50-11e8-9eb6-529269fb145'
+local uuid_to_name = {}
+for i = 1, 4 do
+    local uuid = uuid_prefix..tostring(i)
+    uuid_to_name[uuid] = 'box'..tostring(i)
+end
+require('console').listen(os.getenv('ADMIN'))
+
+fiber = require('fiber')
+errinj = box.error.injection
+
+box.cfg({
+    listen = instance_uri(INSTANCE_ID),
+    replication = {instance_uri(1), instance_uri(2),
+                   instance_uri(3), instance_uri(4)},
+    read_only = read_only,
+    replication_connect_timeout = 0.1,
+    replication_timeout = 0.1,
+    instance_uuid = uuid_prefix..tostring(INSTANCE_ID),
+})
+
+local round_uuid_to_id = {}
+
+function uuid_free_str(str)
+    for uuid, id in pairs(round_uuid_to_id) do
+        local template = string.gsub(uuid, '%-', '%%-')
+        str = string.gsub(str, template, 'round_'..tostring(id))
+    end
+    for uuid, name in pairs(uuid_to_name) do
+        local template = string.gsub(uuid, '%-', '%%-')
+        str = string.gsub(str, template, name)
+    end
+    return str
+end
+
+function promotion_history()
+    local ret = {}
+    local prev_round_uuid
+    for i, t in box.space._promotion:pairs() do
+        t = setmetatable(t:tomap({names_only = true}), {__serialize = 'map'})
+        round_uuid_to_id[t.round_uuid] = t.id
+        t.round_uuid = 'round_'..tostring(t.id)
+        t.source_uuid = uuid_to_name[t.source_uuid]
+        t.ts = nil
+        if t.value == box.NULL then
+            t.value = nil
+        end
+        if t.type == 'error' then
+            t.value.message = uuid_free_str(t.value.message)
+        end
+        table.insert(ret, t)
+    end
+    return ret
+end
+
+-- For recovery rescan round_uuids.
+promotion_history()
+
+function promote_check_error(...)
+    local ok, err = box.ctl.promote(...)
+    if not ok then
+        promotion_history()
+        err = uuid_free_str(err:unpack().message)
+    end
+    return ok, err
+end
+
+function promotion_history_find_masters()
+    local res = {}
+    for _, record in pairs(promotion_history()) do
+        if record.type == 'status' and record.value.is_master then
+            table.insert(res, record)
+        end
+    end
+    return res
+end
+
+function promote_info(fields)
+    local info = box.ctl.promote_info()
+    if fields then
+        local tmp = {}
+        for _, k in pairs(fields) do
+            tmp[k] = info[k]
+        end
+        info = tmp
+    end
+    if info.old_master_uuid then
+        info.old_master_uuid = uuid_free_str(info.old_master_uuid)
+    end
+    if info.round_uuid then
+        info.round_uuid = 'round_'..tostring(info.round_id)
+    end
+    if info.initiator_uuid then
+        info.initiator_uuid = uuid_free_str(info.initiator_uuid)
+    end
+    if info.comment then
+        info.comment = uuid_free_str(info.comment)
+    end
+    return info
+end
+
+box.once("bootstrap", function()
+    box.schema.user.grant('guest', 'read,write,execute', 'universe')
+end)
diff --git a/test/promote/box2.lua b/test/promote/box2.lua
new file mode 120000
index 000000000..77f1e2aab
--- /dev/null
+++ b/test/promote/box2.lua
@@ -0,0 +1 @@
+box1.lua
\ No newline at end of file
diff --git a/test/promote/box3.lua b/test/promote/box3.lua
new file mode 120000
index 000000000..77f1e2aab
--- /dev/null
+++ b/test/promote/box3.lua
@@ -0,0 +1 @@
+box1.lua
\ No newline at end of file
diff --git a/test/promote/box4.lua b/test/promote/box4.lua
new file mode 120000
index 000000000..77f1e2aab
--- /dev/null
+++ b/test/promote/box4.lua
@@ -0,0 +1 @@
+box1.lua
\ No newline at end of file
diff --git a/test/promote/errinj.result b/test/promote/errinj.result
new file mode 100644
index 000000000..fe837239e
--- /dev/null
+++ b/test/promote/errinj.result
@@ -0,0 +1,222 @@
+test_run = require('test_run').new()
+---
+...
+test_run:create_cluster(CLUSTER, 'promote')
+---
+...
+test_run:wait_fullmesh(CLUSTER)
+---
+...
+--
+-- Test the case when two different promotions are started at the
+-- same time. Here the initiators are box2 and box3 while box1 is
+-- an old master and box4 is a watcher.
+--
+_ = test_run:switch('box1')
+---
+...
+errinj.set("ERRINJ_WAL_DELAY", true)
+---
+- ok
+...
+_ = test_run:switch('box2')
+---
+...
+errinj.set("ERRINJ_WAL_DELAY", true)
+---
+- ok
+...
+_ = test_run:switch('box3')
+---
+...
+errinj.set("ERRINJ_WAL_DELAY", true)
+---
+- ok
+...
+_ = test_run:switch('box2')
+---
+...
+err = nil
+---
+...
+ok = nil
+---
+...
+_ = fiber.create(function() ok, err = promote_check_error() end)
+---
+...
+_ = test_run:switch('box3')
+---
+...
+err = nil
+---
+...
+ok = nil
+---
+...
+f = fiber.create(function() ok, err = promote_check_error() end)
+---
+...
+while f:status() ~= 'suspended' do fiber.sleep(0.01) end
+---
+...
+errinj.set("ERRINJ_WAL_DELAY", false)
+---
+- ok
+...
+_ = test_run:switch('box2')
+---
+...
+errinj.set("ERRINJ_WAL_DELAY", false)
+---
+- ok
+...
+while not err do fiber.sleep(0.01) end
+---
+...
+ok, err
+---
+- null
+- 'Error during promotion with round UUID ''round_1'': unexpected message'
+...
+_ = test_run:switch('box1')
+---
+...
+errinj.set("ERRINJ_WAL_DELAY", false)
+---
+- ok
+...
+while promote_info().phase ~= 'error' do fiber.sleep(0.01) end
+---
+...
+info = promote_info()
+---
+...
+info.comment = info.comment:match('unexpected message')
+---
+...
+info
+---
+- quorum: 4
+  timeout: 3153600000
+  initiator_uuid: box3
+  old_master_uuid: box1
+  role: old master
+  round_id: 1
+  comment: unexpected message
+  phase: error
+  round_uuid: round_1
+...
+_ = test_run:switch('box3')
+---
+...
+while not err do fiber.sleep(0.01) end
+---
+...
+ok, err
+---
+- null
+- 'Error during promotion with round UUID ''round_1'': unexpected message'
+...
+--
+-- Test that after all a new promotion works.
+--
+box.ctl.promote()
+---
+- true
+...
+promote_info()
+---
+- quorum: 4
+  timeout: 3153600000
+  initiator_uuid: box3
+  old_master_uuid: box1
+  role: initiator
+  round_id: 2
+  comment: the new master is promoted
+  phase: success
+  round_uuid: round_2
+...
+--
+-- Test the case when during a promotion round an initiator is
+-- restarted after sending 'begin' and the round had been failed
+-- on timeout. On recovery the initiator has to detect by 'begin'
+-- that it was read only and make 'read_only' option be immutable
+-- for a user despite the fact that 'status' is never sent by this
+-- instance.
+--
+-- The test plan: disable watchers, start a promotion round, turn
+-- the initiator off, wait until the round is failed due to
+-- timeout, turn the initiator on. It should catch its own
+-- begin + error and went to read only mode, even if box.cfg was
+-- called with read_only = false.
+--
+_ = test_run:cmd('stop server box2')
+---
+...
+_ = test_run:cmd('stop server box4')
+---
+...
+-- Box1 is an initiator, box3 is an old master.
+_ = test_run:switch('box1')
+---
+...
+-- Do reset and snapshot to do not replay the previous round on
+-- restart.
+box.ctl.promote_reset()
+---
+- true
+...
+box.snapshot()
+---
+- ok
+...
+_ = fiber.create(function() box.ctl.promote({timeout = 0.1}) end)
+---
+...
+_ = test_run:switch('box3')
+---
+...
+while box.space._promotion:count() == 0 do fiber.sleep(0.01) end
+---
+...
+_ = test_run:cmd('stop server box1')
+---
+...
+while box.ctl.promote_info().phase ~= 'error' do fiber.sleep(0.01) end
+---
+...
+_ = test_run:cmd('start server box1')
+---
+...
+_ = test_run:switch('box1')
+---
+...
+promote_info()
+---
+- quorum: 4
+  timeout: 0.1
+  initiator_uuid: box1
+  old_master_uuid: box3
+  role: initiator
+  round_id: 3
+  comment: 'the round failed due to the error: Timeout exceeded'
+  phase: error
+  round_uuid: round_3
+...
+box.cfg.read_only
+---
+- true
+...
+_ = test_run:cmd('start server box2')
+---
+...
+_ = test_run:cmd('start server box4')
+---
+...
+_ = test_run:switch('default')
+---
+...
+test_run:drop_cluster(CLUSTER)
+---
+...
diff --git a/test/promote/errinj.test.lua b/test/promote/errinj.test.lua
new file mode 100644
index 000000000..63cb5e59e
--- /dev/null
+++ b/test/promote/errinj.test.lua
@@ -0,0 +1,87 @@
+test_run = require('test_run').new()
+test_run:create_cluster(CLUSTER, 'promote')
+test_run:wait_fullmesh(CLUSTER)
+--
+-- Test the case when two different promotions are started at the
+-- same time. Here the initiators are box2 and box3 while box1 is
+-- an old master and box4 is a watcher.
+--
+_ = test_run:switch('box1')
+errinj.set("ERRINJ_WAL_DELAY", true)
+
+_ = test_run:switch('box2')
+errinj.set("ERRINJ_WAL_DELAY", true)
+
+_ = test_run:switch('box3')
+errinj.set("ERRINJ_WAL_DELAY", true)
+
+_ = test_run:switch('box2')
+err = nil
+ok = nil
+_ = fiber.create(function() ok, err = promote_check_error() end)
+
+_ = test_run:switch('box3')
+err = nil
+ok = nil
+f = fiber.create(function() ok, err = promote_check_error() end)
+while f:status() ~= 'suspended' do fiber.sleep(0.01) end
+errinj.set("ERRINJ_WAL_DELAY", false)
+
+_ = test_run:switch('box2')
+errinj.set("ERRINJ_WAL_DELAY", false)
+while not err do fiber.sleep(0.01) end
+ok, err
+
+_ = test_run:switch('box1')
+errinj.set("ERRINJ_WAL_DELAY", false)
+while promote_info().phase ~= 'error' do fiber.sleep(0.01) end
+info = promote_info()
+info.comment = info.comment:match('unexpected message')
+info
+
+_ = test_run:switch('box3')
+while not err do fiber.sleep(0.01) end
+ok, err
+
+--
+-- Test that after all a new promotion works.
+--
+box.ctl.promote()
+promote_info()
+
+--
+-- Test the case when during a promotion round an initiator is
+-- restarted after sending 'begin' and the round had been failed
+-- on timeout. On recovery the initiator has to detect by 'begin'
+-- that it was read only and make 'read_only' option be immutable
+-- for a user despite the fact that 'status' is never sent by this
+-- instance.
+--
+-- The test plan: disable watchers, start a promotion round, turn
+-- the initiator off, wait until the round is failed due to
+-- timeout, turn the initiator on. It should catch its own
+-- begin + error and went to read only mode, even if box.cfg was
+-- called with read_only = false.
+--
+_ = test_run:cmd('stop server box2')
+_ = test_run:cmd('stop server box4')
+-- Box1 is an initiator, box3 is an old master.
+_ = test_run:switch('box1')
+-- Do reset and snapshot to do not replay the previous round on
+-- restart.
+box.ctl.promote_reset()
+box.snapshot()
+_ = fiber.create(function() box.ctl.promote({timeout = 0.1}) end)
+_ = test_run:switch('box3')
+while box.space._promotion:count() == 0 do fiber.sleep(0.01) end
+_ = test_run:cmd('stop server box1')
+while box.ctl.promote_info().phase ~= 'error' do fiber.sleep(0.01) end
+_ = test_run:cmd('start server box1')
+_ = test_run:switch('box1')
+promote_info()
+box.cfg.read_only
+_ = test_run:cmd('start server box2')
+_ = test_run:cmd('start server box4')
+
+_ = test_run:switch('default')
+test_run:drop_cluster(CLUSTER)
diff --git a/test/promote/suite.ini b/test/promote/suite.ini
new file mode 100644
index 000000000..9c94cb465
--- /dev/null
+++ b/test/promote/suite.ini
@@ -0,0 +1,6 @@
+[default]
+core = tarantool
+description = Promotion tests
+script = box.lua
+release_disabled = errinj.test.lua
+is_parallel = True
-- 
2.15.2 (Apple Git-101.1)

* [PATCH 8/8] box: introduce promotion GC
  2018-08-07 22:03 [PATCH 0/8] box.ctl.promote Vladislav Shpilevoy
                   ` (6 preceding siblings ...)
  2018-08-07 22:03 ` [PATCH 7/8] box: introduce box.ctl.promote Vladislav Shpilevoy
@ 2018-08-07 22:03 ` Vladislav Shpilevoy
  7 siblings, 0 replies; 13+ messages in thread
From: Vladislav Shpilevoy @ 2018-08-07 22:03 UTC (permalink / raw)
  To: tarantool-patches; +Cc: vdavydov.dev

In the previous commit a promotion protocol was introduced. Each
promotion round produces messages stored in the _promotion space.
After a number of promotions the space can become too big. But
there is no need to keep rounds that were followed by another
successful promotion.

This patch introduces garbage collection, so after each
successful promotion old rounds are purged.

Follow up #3055
---
 src/box/promote.c           | 15 +++++++++++++--
 test/promote/basic.result   | 23 +++++++++++++++++++++++
 test/promote/basic.test.lua | 11 +++++++++++
 3 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/src/box/promote.c b/src/box/promote.c
index dcc39b5bd..eab348f70 100644
--- a/src/box/promote.c
+++ b/src/box/promote.c
@@ -695,7 +695,7 @@ rollback:
 }
 
 int
-box_ctl_promote_reset(void)
+promote_reset_until(uint32_t until)
 {
 	uint32_t id, next_id = 0;
 	struct index *pk = space_index(space_by_id(BOX_PROMOTION_ID), 0);
@@ -703,7 +703,15 @@ box_ctl_promote_reset(void)
 		id = next_id;
 		if (promote_clean_round(id, &next_id, pk) != 0)
 			return -1;
-	} while (id != next_id);
+	} while (id != next_id && next_id < until);
+	return 0;
+}
+
+int
+box_ctl_promote_reset(void)
+{
+	if (promote_reset_until(UINT32_MAX) != 0)
+		return -1;
 	promote_state.phase = PROMOTE_PHASE_NON_ACTIVE;
 	promote_state.is_role_committed = false;
 	return 0;
@@ -739,6 +747,9 @@ stop:
 	say_info("promotion timer is stopped");
 	assert(fiber() == promote_state.timer);
 	promote_state.timer = NULL;
+	if (promote_state.role == PROMOTE_ROLE_INITIATOR &&
+	    promote_state.phase == PROMOTE_PHASE_SUCCESS)
+		promote_reset_until(promote_state.round_id);
 	return 0;
 }
 
diff --git a/test/promote/basic.result b/test/promote/basic.result
index f70659963..47c92e257 100644
--- a/test/promote/basic.result
+++ b/test/promote/basic.result
@@ -464,6 +464,29 @@ promotion_history()
 ---
 - []
 ...
+--
+-- Test promotion GC.
+--
+_ = test_run:switch('box2')
+---
+...
+box.ctl.promote()
+---
+- true
+...
+_ = test_run:switch('box1')
+---
+...
+box.ctl.promote()
+---
+- true
+...
+-- Each successfull round for 4 instance cluster produces 9
+-- records.
+#promotion_history() < 10
+---
+- true
+...
 _ = test_run:switch('default')
 ---
 ...
diff --git a/test/promote/basic.test.lua b/test/promote/basic.test.lua
index 4138745b5..835208f06 100644
--- a/test/promote/basic.test.lua
+++ b/test/promote/basic.test.lua
@@ -156,5 +156,16 @@ promote_info()
 box.ctl.promote_reset()
 promotion_history()
 
+--
+-- Test promotion GC.
+--
+_ = test_run:switch('box2')
+box.ctl.promote()
+_ = test_run:switch('box1')
+box.ctl.promote()
+-- Each successfull round for 4 instance cluster produces 9
+-- records.
+#promotion_history() < 10
+
 _ = test_run:switch('default')
 test_run:drop_cluster(CLUSTER)
-- 
2.15.2 (Apple Git-101.1)

* Re: [PATCH 2/8] box: rename process_rw to process_dml
  2018-08-07 22:03 ` [PATCH 2/8] box: rename process_rw to process_dml Vladislav Shpilevoy
@ 2018-08-13  8:20   ` Vladimir Davydov
  0 siblings, 0 replies; 13+ messages in thread
From: Vladimir Davydov @ 2018-08-13  8:20 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

On Wed, Aug 08, 2018 at 01:03:45AM +0300, Vladislav Shpilevoy wrote:
> diff --git a/src/box/iproto.cc b/src/box/iproto.cc
> index bb7d2b868..f8b419c26 100644
> --- a/src/box/iproto.cc
> +++ b/src/box/iproto.cc
> @@ -1368,7 +1368,7 @@ tx_process1(struct cmsg *m)
>  	struct obuf_svp svp;
>  	struct obuf *out;
>  	tx_inject_delay();
> -	if (box_process1(&msg->dml, &tuple) != 0)
> +	if (box_process_dml(&msg->dml, &tuple) != 0)
>  		goto error;
>  	out = msg->connection->tx.p_obuf;
>  	if (iproto_prepare_select(out, &svp) != 0)

Now tx_process1 calls box_process_dml instead of box_process1, as it
used to. IMO this doesn't look any better than what we presently have.

* Re: [PATCH 3/8] Add 'exact_field_count' parameter to options decoder
  2018-08-07 22:03 ` [PATCH 3/8] Add 'exact_field_count' parameter to options decoder Vladislav Shpilevoy
@ 2018-08-13  8:30   ` Vladimir Davydov
  0 siblings, 0 replies; 13+ messages in thread
From: Vladimir Davydov @ 2018-08-13  8:30 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

On Wed, Aug 08, 2018 at 01:03:46AM +0300, Vladislav Shpilevoy wrote:
> Needed for promotion. Promotion uses system space
> _promotion, into which a user can write tuples directly
> with not API usage (and we can not do anything with it),
> so _promotion should do severe validation of each field
> of each tuple since it affects the cluster state.
> 
> For this a new parameter of options decoder is introduced,
> that checks for exact field count.

TBH I don't think it's really necessary, because if the user writes to
this table, promotion logic may break anyways AFAIU. So why don't you
just use default parameters if some fields are omitted?

Anyway, passing exact_field_count to a function decoding options from a
map looks kinda weird. And if you decide to extend the options one day,
it will become useless, because you'll have to handle options generated
by older versions which don't have some parameters.
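
The "default parameters if some fields are omitted" approach could look
roughly like the sketch below. It is only an illustration: the struct,
the key/value stand-in for a decoded MessagePack map, the default values
and the function name are all assumptions and do not come from the
patch. The field names are borrowed from the box.ctl.promote()
arguments purely as an example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical option set; not the structure used by the patch. */
struct promote_opts {
	double timeout;
	int quorum;
};

/* Assumed defaults, chosen only for this illustration. */
static const struct promote_opts promote_opts_default = {
	.timeout = 10.0,
	.quorum = -1,
};

/* One decoded map entry; stands in for a MessagePack key/value pair. */
struct opt_kv {
	const char *key;
	double value;
};

/*
 * Start from the defaults and overwrite only the keys that are
 * present, so a map written by an older version that lacks some
 * keys still decodes.
 */
void
promote_opts_decode(struct promote_opts *opts, const struct opt_kv *kv,
		    size_t count)
{
	*opts = promote_opts_default;
	for (size_t i = 0; i < count; i++) {
		if (strcmp(kv[i].key, "timeout") == 0)
			opts->timeout = kv[i].value;
		else if (strcmp(kv[i].key, "quorum") == 0)
			opts->quorum = (int) kv[i].value;
		/* Unknown keys are skipped in this sketch. */
	}
}
```

With this shape a record that carries only 'timeout' gets the default
quorum, and no exact field count has to be enforced by the decoder.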

* Re: [PATCH 4/8] box: remove orphan check from box_is_ro()
  2018-08-07 22:03 ` [PATCH 4/8] box: remove orphan check from box_is_ro() Vladislav Shpilevoy
@ 2018-08-13  8:34   ` Vladimir Davydov
  0 siblings, 0 replies; 13+ messages in thread
From: Vladimir Davydov @ 2018-08-13  8:34 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

On Wed, Aug 08, 2018 at 01:03:47AM +0300, Vladislav Shpilevoy wrote:
> Box_is_ro now checks both for 'read_only' and 'orphan' modes, but
> in promotion only 'read_only' is needed. And now there is no a
> method to get the current 'read_only' value. After replacing
> box_is_ro with box_is_writable it is possible to reimplement
> box_is_ro as a getter for 'read_only' option.
> ---
>  src/box/box.cc     | 10 ++++++++--
>  src/box/box.h      |  3 +++
>  src/box/lua/info.c |  2 +-
>  3 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 6eb358442..d8fbc6252 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -216,17 +216,23 @@ box_set_ro(bool ro)
>  	fiber_cond_broadcast(&ro_cond);
>  }
>  
> +bool
> +box_is_writable(void)
> +{
> +	return !is_ro && !is_orphan;
> +}
> +
>  bool
>  box_is_ro(void)
>  {
> -	return is_ro || is_orphan;
> +	return is_ro;
>  }
>  
>  int
>  box_wait_ro(bool ro, double timeout)
>  {
>  	double deadline = ev_monotonic_now(loop()) + timeout;
> -	while (box_is_ro() != ro) {
> +	while (!box_is_writable() != ro) {
>  		if (fiber_cond_wait_deadline(&ro_cond, deadline) != 0)
>  			return -1;
>  		if (fiber_is_cancelled()) {

So now we have box_wait_ro() that checks box_is_writable() !=
box.cfg.read_only and we have box_is_ro() that returns the value
of box.cfg.read_only. Looks ugly.

I think that the promotion algorithm shouldn't flip box.cfg.read_only.
Instead it should use its own flag and box_is_ro() should be defined as

  bool box_is_ro() { return is_ro || is_orphan || is_slave; }

That would be consistent with the orphan state.
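
A minimal standalone sketch of that idea follows. The flag names mirror
the ones used above; promote_set_slave() and box_cfg_read_only() are
hypothetical helpers invented for this example, not functions from the
tree.

```c
#include <stdbool.h>

/* Value of box.cfg.read_only; changed only by the user. */
static bool is_ro;
/* Set while the instance has not synced with enough replicas. */
static bool is_orphan;
/* Assumed flag owned by the promotion code; never touches box.cfg. */
static bool is_slave;

/* Hypothetical setter the promotion protocol would call. */
void
promote_set_slave(bool slave)
{
	is_slave = slave;
}

/* Hypothetical getter for the user-visible config option. */
bool
box_cfg_read_only(void)
{
	return is_ro;
}

/* Effective read-only state: any of the three reasons is enough. */
bool
box_is_ro(void)
{
	return is_ro || is_orphan || is_slave;
}
```

In this sketch a demoted instance reports box_is_ro() == true while
the user-set option stays whatever the user configured.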

* Re: [PATCH 7/8] box: introduce box.ctl.promote
  2018-08-07 22:03 ` [PATCH 7/8] box: introduce box.ctl.promote Vladislav Shpilevoy
@ 2018-08-13  8:58   ` Vladimir Davydov
  0 siblings, 0 replies; 13+ messages in thread
From: Vladimir Davydov @ 2018-08-13  8:58 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

On Wed, Aug 08, 2018 at 01:03:50AM +0300, Vladislav Shpilevoy wrote:
> Replicaset master promotion is a procedure of atomic making one
> slave be a new master, and an old master be a slave in a fullmesh
> master-slave replicaset.
> 
> The promotion follows the protocol described in details in the
> corresponding RFC. Shortly, the protocol collects a quorum of
> instances who approves the promotion, syncs data with the old
> master and demotes it.
> 
> The protocol is intended to work with a single master cluster and
> with at least 50% + 1 quorum mandatory including an old master.
> It is tolerant to messages reordering from different sources, to
> errors like multiple masters, timeouts, restarts of any promotion
> participant. Also the promote protocol supports promotion in a
> completely read-only cluster. It is useful, for example, when
> after one of rare cases of a promotion fail the cluster is left
> in a read-only state with no masters. Then the promotion can just
> be called again to fix it. Such read-only promotion has only one
> restriction - all of the instances have to be safe and sound.
> 
> Once a promotion is executed, it makes box.cfg.read_only
> attribute be immutable. It is because actually the promotion
> protocol persists this attribute as a part of one of messages and
> sends it to other instances. So a user can not both use the
> promotion and manually change box.cfg.read_only.
> 
> The promotion has several API methods:
> 
> * box.ctl.promote({timeout = ..., quorum = ...}).
>   This function is meant to be called on a slave to demote the
>   old master if exists and promote the current instance.
> 
> * box.ctl.promote_info().
>   This function shows info about the latest promotion (finished
>   or running now - does not matter, just the latest).
> 
> * box.ctl.promote_reset().
>   This function clears the promotion history so a user would be
>   able to re-assign master/slave roles in a cluster manually.

IMO the code is rather convoluted. At least, I couldn't wrap my brain
around it after having looked at it for an hour. I think the automaton
you described in the RFC should be defined more clearly in the code.

First, there should be a single fiber handling all transitions in the
promotion table. The fiber should wait on an event channel. The
on_commit trigger installed on the promotion space should simply push
events to that channel, without starting any fibers or implementing any
logic, just push an event without waiting for a result (yielding in an
on_commit trigger is unsettling). Then you wouldn't need a separate
fiber for handling timeouts - the main promotion fiber would handle the
timeout just like any other promotion event.
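
To make the shape of this concrete, here is a small standalone sketch.
It does not use the real fiber or channel API: a fixed-size ring buffer
stands in for the event channel, and an empty queue is reported as a
timeout event, where real code would block with a deadline. All type
and function names are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

enum promote_event_type {
	PROMOTE_EV_BEGIN,
	PROMOTE_EV_STATUS,
	PROMOTE_EV_SYNC,
	PROMOTE_EV_SUCCESS,
	PROMOTE_EV_ERROR,
	/* Produced by the consumer side, never pushed by a trigger. */
	PROMOTE_EV_TIMEOUT,
};

struct promote_event {
	enum promote_event_type type;
};

enum { PROMOTE_QUEUE_CAP = 64 };

struct promote_queue {
	struct promote_event buf[PROMOTE_QUEUE_CAP];
	size_t head;
	size_t len;
};

/*
 * Producer side, meant for the on_commit trigger: it only records
 * the event and never waits. Returns false if the queue is full.
 */
bool
promote_queue_push(struct promote_queue *q, struct promote_event ev)
{
	if (q->len == PROMOTE_QUEUE_CAP)
		return false;
	q->buf[(q->head + q->len) % PROMOTE_QUEUE_CAP] = ev;
	q->len++;
	return true;
}

/*
 * Consumer side, meant for the single promotion fiber. An empty
 * queue is turned into a timeout event, so the caller handles
 * timeouts through the same path as any other event.
 */
struct promote_event
promote_queue_pop_or_timeout(struct promote_queue *q)
{
	struct promote_event ev = { PROMOTE_EV_TIMEOUT };
	if (q->len == 0)
		return ev;
	ev = q->buf[q->head];
	q->head = (q->head + 1) % PROMOTE_QUEUE_CAP;
	q->len--;
	return ev;
}
```

The point of the split is that all protocol logic lives on the consumer
side, and the trigger stays trivial.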

Second, I think that each state of the automaton should be represented
by an object. The class used for creating the state objects should have
a vtab with methods handling each transition, including timeouts. This
would clearly define all states of the automaton in the code making it
much easier for understanding IMO.
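
A rough sketch of such state objects is below. The state names follow
the phases in the patch, but the transitions are deliberately
simplified and are an assumption of this example, not the automaton
from the RFC. All identifiers are invented for the illustration.

```c
enum promote_msg_type {
	PROMOTE_MSG_BEGIN,
	PROMOTE_MSG_STATUS,
	PROMOTE_MSG_SYNC,
	PROMOTE_MSG_SUCCESS,
	PROMOTE_MSG_ERROR,
};

struct promote_state;

/* One method per kind of input; each returns the next state. */
struct promote_state_vtab {
	const struct promote_state *(*on_msg)(const struct promote_state *self,
					      enum promote_msg_type type);
	const struct promote_state *(*on_timeout)(const struct promote_state *self);
};

struct promote_state {
	const char *name;
	const struct promote_state_vtab *vtab;
};

extern const struct promote_state state_inactive;
extern const struct promote_state state_in_progress;
extern const struct promote_state state_success;
extern const struct promote_state state_error;

/* No round is running: only 'begin' starts one. */
static const struct promote_state *
idle_on_msg(const struct promote_state *self, enum promote_msg_type type)
{
	return type == PROMOTE_MSG_BEGIN ? &state_in_progress : self;
}

static const struct promote_state *
idle_on_timeout(const struct promote_state *self)
{
	return self;
}

/* A round is running: it ends on 'success', 'error' or a timeout. */
static const struct promote_state *
in_progress_on_msg(const struct promote_state *self,
		   enum promote_msg_type type)
{
	switch (type) {
	case PROMOTE_MSG_SUCCESS:
		return &state_success;
	case PROMOTE_MSG_ERROR:
		return &state_error;
	default:
		return self;
	}
}

static const struct promote_state *
in_progress_on_timeout(const struct promote_state *self)
{
	(void) self;
	return &state_error;
}

static const struct promote_state_vtab idle_vtab = {
	idle_on_msg, idle_on_timeout,
};
static const struct promote_state_vtab in_progress_vtab = {
	in_progress_on_msg, in_progress_on_timeout,
};

const struct promote_state state_inactive = { "inactive", &idle_vtab };
const struct promote_state state_in_progress = { "in progress", &in_progress_vtab };
/* Finished rounds behave like idle ones until the next 'begin'. */
const struct promote_state state_success = { "success", &idle_vtab };
const struct promote_state state_error = { "error", &idle_vtab };
```

The driver then reduces to `state = state->vtab->on_msg(state, type)`
for messages and `state = state->vtab->on_timeout(state)` for timeouts.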

I haven't reviewed the code carefully yet, because first we need to
decide whether we need to rework it as per above. Here are just a few
comments regarding some things that I couldn't help noticing:

> +/**
> + * Check that the promotion space is empty and reset for this
> + * case the state. Manual reset here is used by replicas when on
> + * one of them box.ctl.promote_reset() is called. Then on the
> + * source replica the promotion state is dropped but on other
> + * replicas this action should be done under the hood. This is the
> + * only possible place to do it.
> + */
> +static void
> +on_commit_check_promotion_reset(struct trigger *trigger, void *event)
>  {
>  	(void) trigger;
>  	(void) event;
> +	if (index_count(space_index(space_by_id(BOX_PROMOTION_ID), 0), ITER_ALL,
> +			NULL, 0) == 0)
> +		box_ctl_promote_reset();

That is, you process deletion of a tuple from the promotion space only
if the space is empty? How does it work?

> +}
> +
> +static void
> +on_replace_dd_promotion(struct trigger *trigger, void *event)
> +{
> +	struct txn *txn = (struct txn *) event;
> +	struct txn_stmt *stmt = txn_current_stmt(txn);
> +	if (stmt->new_tuple == NULL && stmt->old_tuple != NULL) {
> +		trigger = txn_alter_trigger_new(on_commit_check_promotion_reset,
> +						NULL);
> +		txn_on_commit(txn, trigger);
> +		return;
> +	}
> +	assert(stmt->new_tuple != NULL);
> +	if (stmt->old_tuple != NULL) {
> +		tnt_raise(ClientError, ER_UNSUPPORTED, "Promotion",
> +			  "history edit");
> +	}
> +	/*
> +	 * Forbid multistatement only for non-DELETE since the
> +	 * later is used for promotion reset in batches - the
> +	 * whole round per one transaction is dropped.
> +	 */
> +	txn_check_singlestatement_xc(txn, "Space _promotion");
> +	struct promote_msg *msg =
> +		region_alloc_object_xc(&fiber()->gc, struct promote_msg);
> +	/*
> +	 * Decode the message before the commit to do message's
> +	 * sanity check.
> +	 */
> +	if (promote_msg_decode(tuple_data(stmt->new_tuple), msg) != 0)
> +		diag_raise();
> +	trigger = txn_alter_trigger_new(on_commit_process_promote_msg, msg);
> +	txn_on_commit(txn, trigger);
>  }
>  
>  /* }}} cluster configuration */
> diff --git a/src/box/box.cc b/src/box/box.cc
> index d8fbc6252..8bbd0d424 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -73,6 +73,7 @@
>  #include "call.h"
>  #include "func.h"
>  #include "sequence.h"
> +#include "promote.h"
>  
>  static char status[64] = "unknown";
>  
> @@ -216,6 +217,12 @@ box_set_ro(bool ro)
>  	fiber_cond_broadcast(&ro_cond);
>  }
>  
> +void
> +box_expose_ro()
> +{
> +	cfg_rawsetb("read_only", is_ro);
> +}
> +

I really don't think that flipping box.cfg.read_only is a good idea.
We never modify config options set by the user and IMO doing that will
be rather unexpected from the user's POV. We have box.info.ro for
reflecting the read_only state - why touch box.cfg.read_only?

>  bool
>  box_is_writable(void)
>  {
> @@ -970,6 +977,15 @@ box_index_id_by_name(uint32_t space_id, const char *name, uint32_t len)
>  }
>  /** \endcond public */
>  
> +int
> +box_process_sys_dml(struct request *request)
> +{
> +	struct space *space = space_cache_find(request->space_id);
> +	assert(space != NULL);
> +	assert(space_is_system(space));
> +	return process_dml(request, space, NULL);
> +}
> +

We have boxk() for modifying system spaces without checking access
rights and read_only state. Why don't you reuse it instead of
introducing a new function?

> diff --git a/src/box/promote.c b/src/box/promote.c
> new file mode 100644
> index 000000000..dcc39b5bd
> --- /dev/null
> +++ b/src/box/promote.c
> @@ -0,0 +1,1075 @@
> +/*
> + * Copyright 2010-2018, Tarantool AUTHORS, please see AUTHORS file.
> + *
> + * Redistribution and use in source and binary forms, with or
> + * without modification, are permitted provided that the following
> + * conditions are met:
> + *
> + * 1. Redistributions of source code must retain the above
> + *    copyright notice, this list of conditions and the
> + *    following disclaimer.
> + *
> + * 2. Redistributions in binary form must reproduce the above
> + *    copyright notice, this list of conditions and the following
> + *    disclaimer in the documentation and/or other materials
> + *    provided with the distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY <COPYRIGHT HOLDER> ``AS IS'' AND
> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
> + * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
> + * <COPYRIGHT HOLDER> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
> + * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
> + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
> + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
> + * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
> + * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + */
> +#include "box.h"
> +#include "replication.h"
> +#include "promote.h"
> +#include "error.h"
> +#include "msgpuck.h"
> +#include "xrow.h"
> +#include "space.h"
> +#include "schema.h"
> +#include "schema_def.h"
> +#include "txn.h"
> +#include "tuple.h"
> +#include "iproto_constants.h"
> +#include "opt_def.h"
> +#include "info.h"
> +
> +static const char *promote_msg_type_strs[] = {
> +	"begin",
> +	"status",
> +	"sync",
> +	"success",
> +	"error",
> +};
> +
> +/** True, if @a msg is created by the current instance. */
> +static inline bool
> +promote_msg_is_mine(const struct promote_msg *msg)
> +{
> +	return tt_uuid_is_equal(&msg->source_uuid, &INSTANCE_UUID);
> +}
> +
> +enum promote_role {
> +	PROMOTE_ROLE_UNDEFINED = 0,
> +	PROMOTE_ROLE_INITIATOR,
> +	PROMOTE_ROLE_OLD_MASTER,

PROMOTE_ROLE_MASTER ?

Please write a brief comment for each role.

> +	PROMOTE_ROLE_WATCHER
> +};
> +
> +static const char *promote_role_strs[] = {
> +	"undefined",
> +	"initiator",
> +	"old master",
> +	"watcher",
> +};
> +
> +enum promote_phase {
> +	PROMOTE_PHASE_NON_ACTIVE = 0,

INACTIVE

Please write a brief comment for each phase.

> +	PROMOTE_PHASE_ERROR,
> +	PROMOTE_PHASE_SUCCESS,
> +	PROMOTE_PHASE_IN_PROGRESS,
> +};
