[PATCH 7/8] box: introduce box.ctl.promote

Vladimir Davydov vdavydov.dev at gmail.com
Mon Aug 13 11:58:59 MSK 2018


On Wed, Aug 08, 2018 at 01:03:50AM +0300, Vladislav Shpilevoy wrote:
> Replicaset master promotion is a procedure of atomic making one
> slave be a new master, and an old master be a slave in a fullmesh
> master-slave replicaset.
> 
> The promotion follows the protocol described in details in the
> corresponding RFC. Shortly, the protocol collects a quorum of
> instances who approves the promotion, syncs data with the old
> master and demotes it.
> 
> The protocol is intended to work with a single master cluster and
> with at least 50% + 1 quorum mandatory including an old master.
> It is tolerant to messages reordering from different sources, to
> errors like multiple masters, timeouts, restarts of any promotion
> participant. Also the promote protocol supports promotion in a
> completely read-only cluster. It is useful, for example, when
> after one of rare cases of a promotion fail the cluster is left
> in a read-only state with no masters. Then the promotion can just
> be called again to fix it. Such read-only promotion has only one
> restriction - all of the instances have to be safe and sound.
> 
> Once a promotion is executed, it makes box.cfg.read_only
> attribute be immutable. It is because actually the promotion
> protocol persists this attribute as a part of one of messages and
> sends it to other instances. So a user can not both use the
> promotion and manually change box.cfg.read_only.
> 
> The promotion has several API methods:
> 
> * box.ctl.promote({timeout = ..., quorum = ...}).
>   This function is meant to be called on a slave to demote the
>   old master if exists and promote the current instance.
> 
> * box.ctl.promote_info().
>   This function shows info about the latest promotion (finished
>   or running now - does not matter, just the latest).
> 
> * box.ctl.promote_reset().
>   This function clears the promotion history so a user would be
>   able to re-assign master/slave roles in a cluster manually.

IMO the code is rather convoluted. At least, I couldn't wrap my brain
around it after having looked at it for an hour. I think the automaton
you described in the RFC should be defined more clearly in the code.

First, there should be a single fiber handling all transitions in the
promotion table. The fiber should wait on an event channel. The
on_commit trigger installed on the promotion space should simply push
events to that channel, without starting any fibers or implementing any
logic, just push an event without waiting for a result (yielding in an
on_commit trigger is unsettling). Then you wouldn't need a separate
fiber for handling timeouts - the main promotion fiber would handle the
timeout just like any other promotion event.

Second, I think that each state of the automaton should be represented
by an object. The class used for creating the state objects should have
a vtab with methods handling each transition, including timeouts. This
would clearly define all states of the automaton in the code making it
much easier for understanding IMO.

I haven't reviewed the code carefully yet, because first we need to
decide whether we need to rework it as per above. Here's just a few
comments regarding some things that I couldn't help noticing:

> +/**
> + * Check that the promotion space is empty and reset for this
> + * case the state. Manual reset here is used by replicas when on
> + * one of them box.ctl.promote_reset() is called. Then on the
> + * source replica the promotion state is dropped but on other
> + * replicas this action should be done under the hood. This is the
> + * only possible place to do it.
> + */
> +static void
> +on_commit_check_promotion_reset(struct trigger *trigger, void *event)
>  {
>  	(void) trigger;
>  	(void) event;
> +	if (index_count(space_index(space_by_id(BOX_PROMOTION_ID), 0), ITER_ALL,
> +			NULL, 0) == 0)
> +		box_ctl_promote_reset();

That is you process deletion of a tuple from the promotion space only if
the space is empty? How does it work?

> +}
> +
> +static void
> +on_replace_dd_promotion(struct trigger *trigger, void *event)
> +{
> +	struct txn *txn = (struct txn *) event;
> +	struct txn_stmt *stmt = txn_current_stmt(txn);
> +	if (stmt->new_tuple == NULL && stmt->old_tuple != NULL) {
> +		trigger = txn_alter_trigger_new(on_commit_check_promotion_reset,
> +						NULL);
> +		txn_on_commit(txn, trigger);
> +		return;
> +	}
> +	assert(stmt->new_tuple != NULL);
> +	if (stmt->old_tuple != NULL) {
> +		tnt_raise(ClientError, ER_UNSUPPORTED, "Promotion",
> +			  "history edit");
> +	}
> +	/*
> +	 * Forbid multistatement only for non-DELETE since the
> +	 * later is used for promotion reset in batches - the
> +	 * whole round per one transaction is dropped.
> +	 */
> +	txn_check_singlestatement_xc(txn, "Space _promotion");
> +	struct promote_msg *msg =
> +		region_alloc_object_xc(&fiber()->gc, struct promote_msg);
> +	/*
> +	 * Decode the message before the commit to do message's
> +	 * sanity check.
> +	 */
> +	if (promote_msg_decode(tuple_data(stmt->new_tuple), msg) != 0)
> +		diag_raise();
> +	trigger = txn_alter_trigger_new(on_commit_process_promote_msg, msg);
> +	txn_on_commit(txn, trigger);
>  }
>  
>  /* }}} cluster configuration */
> diff --git a/src/box/box.cc b/src/box/box.cc
> index d8fbc6252..8bbd0d424 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -73,6 +73,7 @@
>  #include "call.h"
>  #include "func.h"
>  #include "sequence.h"
> +#include "promote.h"
>  
>  static char status[64] = "unknown";
>  
> @@ -216,6 +217,12 @@ box_set_ro(bool ro)
>  	fiber_cond_broadcast(&ro_cond);
>  }
>  
> +void
> +box_expose_ro()
> +{
> +	cfg_rawsetb("read_only", is_ro);
> +}
> +

I really don't think that flipping box.cfg.read_only is a good idea.
We never modify config options set by the user and IMO doing that will
be rather unexpected from user pov. We have box.info.ro for reflecting
the read_only state - why touch box.cfg.read_only?

>  bool
>  box_is_writable(void)
>  {
> @@ -970,6 +977,15 @@ box_index_id_by_name(uint32_t space_id, const char *name, uint32_t len)
>  }
>  /** \endcond public */
>  
> +int
> +box_process_sys_dml(struct request *request)
> +{
> +	struct space *space = space_cache_find(request->space_id);
> +	assert(space != NULL);
> +	assert(space_is_system(space));
> +	return process_dml(request, space, NULL);
> +}
> +

We have boxk() for modifying system spaces without checking access
rights and read_only state. Why don't you reuse it instead of
introducing a new function?

> diff --git a/src/box/promote.c b/src/box/promote.c
> new file mode 100644
> index 000000000..dcc39b5bd
> --- /dev/null
> +++ b/src/box/promote.c
> @@ -0,0 +1,1075 @@
> +/*
> + * Copyright 2010-2018, Tarantool AUTHORS, please see AUTHORS file.
> + *
> + * Redistribution and use in source and binary forms, with or
> + * without modification, are permitted provided that the following
> + * conditions are met:
> + *
> + * 1. Redistributions of source code must retain the above
> + *    copyright notice, this list of conditions and the
> + *    following disclaimer.
> + *
> + * 2. Redistributions in binary form must reproduce the above
> + *    copyright notice, this list of conditions and the following
> + *    disclaimer in the documentation and/or other materials
> + *    provided with the distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY <COPYRIGHT HOLDER> ``AS IS'' AND
> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
> + * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
> + * <COPYRIGHT HOLDER> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
> + * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
> + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
> + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
> + * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
> + * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + */
> +#include "box.h"
> +#include "replication.h"
> +#include "promote.h"
> +#include "error.h"
> +#include "msgpuck.h"
> +#include "xrow.h"
> +#include "space.h"
> +#include "schema.h"
> +#include "schema_def.h"
> +#include "txn.h"
> +#include "tuple.h"
> +#include "iproto_constants.h"
> +#include "opt_def.h"
> +#include "info.h"
> +
> +static const char *promote_msg_type_strs[] = {
> +	"begin",
> +	"status",
> +	"sync",
> +	"success",
> +	"error",
> +};
> +
> +/** True, if @a msg is created by the current instance. */
> +static inline bool
> +promote_msg_is_mine(const struct promote_msg *msg)
> +{
> +	return tt_uuid_is_equal(&msg->source_uuid, &INSTANCE_UUID);
> +}
> +
> +enum promote_role {
> +	PROMOTE_ROLE_UNDEFINED = 0,
> +	PROMOTE_ROLE_INITIATOR,
> +	PROMOTE_ROLE_OLD_MASTER,

PROMOTE_ROLE_MASTER ?

Please write a brief comment to each role.

> +	PROMOTE_ROLE_WATCHER
> +};
> +
> +static const char *promote_role_strs[] = {
> +	"undefined",
> +	"initiator",
> +	"old master",
> +	"watcher",
> +};
> +
> +enum promote_phase {
> +	PROMOTE_PHASE_NON_ACTIVE = 0,

INACTIVE

Please write a brief comment to each phase.

> +	PROMOTE_PHASE_ERROR,
> +	PROMOTE_PHASE_SUCCESS,
> +	PROMOTE_PHASE_IN_PROGRESS,
> +};



More information about the Tarantool-patches mailing list