From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 13 Aug 2018 11:58:59 +0300 From: Vladimir Davydov Subject: Re: [PATCH 7/8] box: introduce box.ctl.promote Message-ID: <20180813085859.kn6rzobydzsbcwv3@esperanza> References: <4d6561fab26913f881290fbdbd2759098816b6d8.1533679264.git.v.shpilevoy@tarantool.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4d6561fab26913f881290fbdbd2759098816b6d8.1533679264.git.v.shpilevoy@tarantool.org> To: Vladislav Shpilevoy Cc: tarantool-patches@freelists.org List-ID: On Wed, Aug 08, 2018 at 01:03:50AM +0300, Vladislav Shpilevoy wrote: > Replicaset master promotion is a procedure of atomic making one > slave be a new master, and an old master be a slave in a fullmesh > master-slave replicaset. > > The promotion follows the protocol described in details in the > corresponding RFC. Shortly, the protocol collects a quorum of > instances who approves the promotion, syncs data with the old > master and demotes it. > > The protocol is intended to work with a single master cluster and > with at least 50% + 1 quorum mandatory including an old master. > It is tolerant to messages reordering from different sources, to > errors like multiple masters, timeouts, restarts of any promotion > participant. Also the promote protocol supports promotion in a > completely read-only cluster. It is useful, for example, when > after one of rare cases of a promotion fail the cluster is left > in a read-only state with no masters. Then the promotion can just > be called again to fix it. Such read-only promotion has only one > restriction - all of the instances have to be safe and sound. > > Once a promotion is executed, it makes box.cfg.read_only > attribute be immutable. It is because actually the promotion > protocol persists this attribute as a part of one of messages and > sends it to other instances. So a user can not both use the > promotion and manually change box.cfg.read_only. > > The promotion has several API methods: > > * box.ctl.promote({timeout = ..., quorum = ...}). > This function is meant to be called on a slave to demote the > old master if exists and promote the current instance. > > * box.ctl.promote_info(). > This function shows info about the latest promotion (finished > or running now - does not matter, just the latest). > > * box.ctl.promote_reset(). > This function clears the promotion history so a user would be > able to re-assign master/slave roles in a cluster manually. IMO the code is rather convoluted. At least, I couldn't wrap my brain around it after having looked at it for an hour. I think the automaton you described in the RFC should be defined more clearly in the code. First, there should be a single fiber handling all transitions in the promotion table. The fiber should wait on an event channel. The on_commit trigger installed on the promotion space should simply push events to that channel, without starting any fibers or implementing any logic, just push an event without waiting for a result (yielding in an on_commit trigger is unsettling). Then you wouldn't need a separate fiber for handling timeouts - the main promotion fiber would handle the timeout just like any other promotion event. Second, I think that each state of the automaton should be represented by an object. The class used for creating the state objects should have a vtab with methods handling each transition, including timeouts. This would clearly define all states of the automaton in the code making it much easier for understanding IMO. I haven't reviewed the code carefully yet, because first we need to decide whether we need to rework it as per above. Here's just a few comments regarding some things that I couldn't help noticing: > +/** > + * Check that the promotion space is empty and reset for this > + * case the state. Manual reset here is used by replicas when on > + * one of them box.ctl.promote_reset() is called. Then on the > + * source replica the promotion state is dropped but on other > + * replicas this action should be done under the hood. This is the > + * only possible place to do it. > + */ > +static void > +on_commit_check_promotion_reset(struct trigger *trigger, void *event) > { > (void) trigger; > (void) event; > + if (index_count(space_index(space_by_id(BOX_PROMOTION_ID), 0), ITER_ALL, > + NULL, 0) == 0) > + box_ctl_promote_reset(); That is you process deletion of a tuple from the promotion space only if the space is empty? How does it work? > +} > + > +static void > +on_replace_dd_promotion(struct trigger *trigger, void *event) > +{ > + struct txn *txn = (struct txn *) event; > + struct txn_stmt *stmt = txn_current_stmt(txn); > + if (stmt->new_tuple == NULL && stmt->old_tuple != NULL) { > + trigger = txn_alter_trigger_new(on_commit_check_promotion_reset, > + NULL); > + txn_on_commit(txn, trigger); > + return; > + } > + assert(stmt->new_tuple != NULL); > + if (stmt->old_tuple != NULL) { > + tnt_raise(ClientError, ER_UNSUPPORTED, "Promotion", > + "history edit"); > + } > + /* > + * Forbid multistatement only for non-DELETE since the > + * later is used for promotion reset in batches - the > + * whole round per one transaction is dropped. > + */ > + txn_check_singlestatement_xc(txn, "Space _promotion"); > + struct promote_msg *msg = > + region_alloc_object_xc(&fiber()->gc, struct promote_msg); > + /* > + * Decode the message before the commit to do message's > + * sanity check. > + */ > + if (promote_msg_decode(tuple_data(stmt->new_tuple), msg) != 0) > + diag_raise(); > + trigger = txn_alter_trigger_new(on_commit_process_promote_msg, msg); > + txn_on_commit(txn, trigger); > } > > /* }}} cluster configuration */ > diff --git a/src/box/box.cc b/src/box/box.cc > index d8fbc6252..8bbd0d424 100644 > --- a/src/box/box.cc > +++ b/src/box/box.cc > @@ -73,6 +73,7 @@ > #include "call.h" > #include "func.h" > #include "sequence.h" > +#include "promote.h" > > static char status[64] = "unknown"; > > @@ -216,6 +217,12 @@ box_set_ro(bool ro) > fiber_cond_broadcast(&ro_cond); > } > > +void > +box_expose_ro() > +{ > + cfg_rawsetb("read_only", is_ro); > +} > + I really don't think that flipping box.cfg.read_only is a good idea. We never modify config options set by the user and IMO doing that will be rather unexpected from user pov. We have box.info.ro for reflecting the read_only state - why touch box.cfg.read_only? > bool > box_is_writable(void) > { > @@ -970,6 +977,15 @@ box_index_id_by_name(uint32_t space_id, const char *name, uint32_t len) > } > /** \endcond public */ > > +int > +box_process_sys_dml(struct request *request) > +{ > + struct space *space = space_cache_find(request->space_id); > + assert(space != NULL); > + assert(space_is_system(space)); > + return process_dml(request, space, NULL); > +} > + We have boxk() for modifying system spaces without checking access rights and read_only state. Why don't you reuse it instead of introducing a new function? > diff --git a/src/box/promote.c b/src/box/promote.c > new file mode 100644 > index 000000000..dcc39b5bd > --- /dev/null > +++ b/src/box/promote.c > @@ -0,0 +1,1075 @@ > +/* > + * Copyright 2010-2018, Tarantool AUTHORS, please see AUTHORS file. > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * 1. Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the > + * following disclaimer. > + * > + * 2. Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THIS SOFTWARE IS PROVIDED BY ``AS IS'' AND > + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED > + * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR > + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL > + * OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, > + * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF > + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR > + * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF > + * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT > + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF > + * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF > + * SUCH DAMAGE. > + */ > +#include "box.h" > +#include "replication.h" > +#include "promote.h" > +#include "error.h" > +#include "msgpuck.h" > +#include "xrow.h" > +#include "space.h" > +#include "schema.h" > +#include "schema_def.h" > +#include "txn.h" > +#include "tuple.h" > +#include "iproto_constants.h" > +#include "opt_def.h" > +#include "info.h" > + > +static const char *promote_msg_type_strs[] = { > + "begin", > + "status", > + "sync", > + "success", > + "error", > +}; > + > +/** True, if @a msg is created by the current instance. */ > +static inline bool > +promote_msg_is_mine(const struct promote_msg *msg) > +{ > + return tt_uuid_is_equal(&msg->source_uuid, &INSTANCE_UUID); > +} > + > +enum promote_role { > + PROMOTE_ROLE_UNDEFINED = 0, > + PROMOTE_ROLE_INITIATOR, > + PROMOTE_ROLE_OLD_MASTER, PROMOTE_ROLE_MASTER ? Please write a brief comment to each role. > + PROMOTE_ROLE_WATCHER > +}; > + > +static const char *promote_role_strs[] = { > + "undefined", > + "initiator", > + "old master", > + "watcher", > +}; > + > +enum promote_phase { > + PROMOTE_PHASE_NON_ACTIVE = 0, INACTIVE Please write a brief comment to each phase. > + PROMOTE_PHASE_ERROR, > + PROMOTE_PHASE_SUCCESS, > + PROMOTE_PHASE_IN_PROGRESS, > +};