From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Thu, 20 Sep 2018 20:13:29 +0300 From: Vladimir Davydov Subject: box.ctl.promote review Message-ID: <20180920171329.utwpq3dea5purrvk@esperanza> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline To: Vladislav Shpilevoy Cc: Konstantin Osipov , Georgy Kirichenko , tarantool-patches@freelists.org List-ID: Hello, The code this review is for is here: https://github.com/tarantool/tarantool/tree/gerold103/gh-3055-box-ctl-promote First, I don't understand why the finite state automaton has to be so complex: https://github.com/tarantool/tarantool/blob/gerold103/gh-3055-box-ctl-promote/doc/rfc/3055-box_ctl_promote.md Why can't we simply use plain and simple 2PC? That is: 1. When box.ctl.promote() is called on Slave, broadcast Begin and switch to Promotee state. 2. Upon receiving Begin: - On Master: switch to Demotee state, enter read-only mode, and broadcast Ack. - On Slave: switch to Watcher state and broadcast Ack. - On Promotee, Demotee, or Watcher: broadcast Nak. 3. Upon gathering the quorum of Acks on Promotee, switch to Master state, enter read-write mode, and broadcast Commit. 4. Upon receiving Commit on Demotee or Watcher, switch to Slave. I understand that 2PC may block in case Promotee dies before sending Commit message, but the proposed design suffers from the very same problem: "Promotion initiator" may fail to send "Promote", in which case there will be no master in the cluster and a new election won't be possible. I suggest to resolve this problem by adding "force" flag to box.ctl.promote() options that would make Demotee and Watcher answer with Ack instead of Nak - the whole promotion procedure is manual so I don't think such a manual intervention in case of promotion failure would be a problem. Anyway, even though the proposed design isn't resilient against network failures, it still looks more complex than 3PC, which implements a way of recovery from some kinds of failures. AFAIU this is because - there's 'sync' phase, which is needed to synchronize all slaves with the old master; - the decision to promote a slave is made by the current master; - there's a timeout-based rollback from certain states. Regarding the 'sync' phase. I don't understand why we need to sync all slaves with the old master before demoting it. I do understand that we need to sync the promotee with the current master, but this is done anyway. The RFC doesn't answer this question AFAICS. May be, it is needed to make sure that in case both old and new masters die right after promotion, remaining replicas will have more-or-less recent data. However, this seems to be pointless, because the current master may die at any moment, even if there's no promotion in progress, potentially leaving replicas outdated. Regarding decision making on the master's behalf. If I understand correctly, the goal of this is to make sure no new master is elected in case the current master happens to be down when box.ctl.promote() is called, even if a quorum is formed. However, this could be resolved in the basic 2PC algorithm I presented above: one just needs to do is define a quorum as at a half of slaves *and* the current master (the current master should be known to all participating replicas, because it is stored in the promotion table). Regarding timeouts. Even though they do allow rollback in certain cases, the algorithm still remains vulnerable to network failures, as I pointed out above. Why do we need to bother then? That being said, I don't see any benefit of the proposed design over plain, simple, and well-known 2PC. I admit that I may be missing something crucial so I'd be happy to meet and discuss the algorithm in person. Anyway, despite the set of properties expected from the algorithm, I think that we should try to use some well-known algorithm rather than implementing a new one. A few notes regarding the code implementing the algorithm: 1. The algorithm has a few states that are not reflected in the code, namely "Waiting for promote", "Waiting for quorum success", "Slave", "Master". I don't like this discrepancy. I think that each and every state of the finite automaton must be present in the code. 2. I don't like the way you define the state transition table, because it's impossible to say what happens when a particular event occurs without diving deep in the code: static struct promote_state_vtab initiator_vtab = { /* .begin = */ promote_msg_unreachable, /* .new_replica = */ promote_initiator_new_participant, /* .new_master = */ promote_initiator_new_participant, /* .sync = */ promote_initiator_sync, /* .success = */ promote_msg_skip, /* .demote = */ promote_initiator_demote, /* .promote = */ promote_initiator_promote, /* .error = */ promote_default_error, }; I'd try to explicitly mention the destination state and all the handlers that have to be invoked on transition, e.g. // source state destination state action callbacks [PROMOTE_STATE_MASTER] = { PROMOTE_STATE_DEMOTEE, {enter_ro, send_ack, NULL} } 3. Probably, this is the problem of the algorithm rather than the implementation, but IMO the fact that the master instance may be promoted (rw) or demoted (ro) looks weird: struct { /** Sender role. */ bool is_master; /** * True if the sender is a currently * active master being able to accept * write requests. If false, then the * master is demoted, but no new one is * elected. */ bool is_promoted; } status; I think that a master in ro and rw mode should be assigned different states, rather than using a flag. For example, in case of 2PC those states would be Master and Demotee. 4. Sleeping in an on_commit trigger looks unsettling: void promote_msg_push(struct promote_msg *msg) { fiber_channel_put_msg_timeout(promote_state.channel, &msg->base, TIMEOUT_INFINITY); } void on_commit_process_promote_msg(struct trigger *trigger, void *event) { (void) event; promote_msg_push((struct promote_msg *) trigger->data); } I think we'd better implement a message queue with a plain stailq or rlist rather than using fiber_channel. 5. This looks very confusing, because 'ro' is an abbreviation for 'read-only', i.e. the function names are identical: /** * Return box.cfg.read_only reduced to boolean value. It is true * only and only when box.cfg.read_only == true. When read_only * is false or automatic, false is returned. */ bool box_is_read_only(void); /** * Check if this instance is in read only mode. It is accumulated * from orphan, box.cfg.read_only and promotion status. */ bool box_is_ro(void); I also hate the idea of tri-state box.cfg.read_only, which entails the need of this function as well as box_is_promotable(), but I guess we'll have to live with it as Kostja is quite convinced about it. 6. The following change looks suspicious to me: void box_process_vote(struct ballot *ballot) { - ballot->is_ro = cfg_geti("read_only") != 0; + ballot->is_ro = box_is_read_only(); The thing is we used the actual configuration value of 'read_only' option when bootstrapping a new instance before this change while now we will always send is_ro = true in this case, which might not always be correct. Also, when a fresh instance is bootstrapped it is supposed to select an rw instance as a master to bootstrap from while after this change it assumes that all instances are rw in case promotion is used. 7. IMO we'd better replicate resetting of the promotion state via the _truncate system space instead of using the code below: static void on_commit_check_promotion_reset(struct trigger *trigger, void *event) { (void) trigger; (void) event; if (index_count(space_index(space_by_id(BOX_PROMOTION_ID), 0), ITER_ALL, NULL, 0) == 0) box_ctl_promote_reset(); } 8. There's txn_is_first_statement() helper, which could probably be used instead of trigger_add_unique in the code below: static void on_replace_dd_promotion(struct trigger *trigger, void *event) { (void) trigger; struct txn *txn = (struct txn *) event; struct txn_stmt *stmt = txn_current_stmt(txn); struct trigger *on_commit, *on_rollback; if (stmt->new_tuple == NULL && stmt->old_tuple != NULL) { on_commit = txn_alter_trigger_new(on_commit_check_promotion_reset, NULL); txn_init_triggers(txn); /* * Promotion messages are deleted in batches by * whole rounds. It is enough to check reset once * for a transaction and use an unique trigger. */ trigger_add_unique(&txn->on_commit, on_commit); return; }