From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <vdavydov.dev@gmail.com>
Date: Thu, 20 Sep 2018 20:13:29 +0300
From: Vladimir Davydov <vdavydov.dev@gmail.com>
Subject: box.ctl.promote review
Message-ID: <20180920171329.utwpq3dea5purrvk@esperanza>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Cc: Konstantin Osipov <kostja@tarantool.org>, Georgy Kirichenko <georgy@tarantool.org>, tarantool-patches@freelists.org
List-ID: <tarantool-patches.dev.tarantool.org>

Hello,

The code this review is for is here:

  https://github.com/tarantool/tarantool/tree/gerold103/gh-3055-box-ctl-promote

First, I don't understand why the finite state automaton has to be so
complex:

  https://github.com/tarantool/tarantool/blob/gerold103/gh-3055-box-ctl-promote/doc/rfc/3055-box_ctl_promote.md

Why can't we simply use plain and simple 2PC? That is:

 1. When box.ctl.promote() is called on Slave, broadcast Begin and
    switch to Promotee state.
 2. Upon receiving Begin:
    - On Master: switch to Demotee state, enter read-only mode,
      and broadcast Ack.
    - On Slave: switch to Watcher state and broadcast Ack.
    - On Promotee, Demotee, or Watcher: broadcast Nak.
 3. Upon gathering the quorum of Acks on Promotee, switch to Master
    state, enter read-write mode, and broadcast Commit.
 4. Upon receiving Commit on Demotee or Watcher, switch to Slave.

I understand that 2PC may block in case Promotee dies before sending
Commit message, but the proposed design suffers from the very same
problem: "Promotion initiator" may fail to send "Promote", in which case
there will be no master in the cluster and a new election won't be
possible. I suggest to resolve this problem by adding "force" flag to
box.ctl.promote() options that would make Demotee and Watcher answer
with Ack instead of Nak - the whole promotion procedure is manual so I
don't think such a manual intervention in case of promotion failure
would be a problem.

Anyway, even though the proposed design isn't resilient against network
failures, it still looks more complex than 3PC, which implements a way
of recovery from some kinds of failures. AFAIU this is because
 - there's 'sync' phase, which is needed to synchronize all slaves with
   the old master;
 - the decision to promote a slave is made by the current master;
 - there's a timeout-based rollback from certain states.

Regarding the 'sync' phase. I don't understand why we need to sync all
slaves with the old master before demoting it. I do understand that we
need to sync the promotee with the current master, but this is done
anyway. The RFC doesn't answer this question AFAICS. May be, it is
needed to make sure that in case both old and new masters die right
after promotion, remaining replicas will have more-or-less recent data.
However, this seems to be pointless, because the current master may die
at any moment, even if there's no promotion in progress, potentially
leaving replicas outdated.

Regarding decision making on the master's behalf. If I understand
correctly, the goal of this is to make sure no new master is elected in
case the current master happens to be down when box.ctl.promote() is
called, even if a quorum is formed. However, this could be resolved in
the basic 2PC algorithm I presented above: one just needs to do is
define a quorum as at a half of slaves *and* the current master (the
current master should be known to all participating replicas, because it
is stored in the promotion table).

Regarding timeouts. Even though they do allow rollback in certain cases,
the algorithm still remains vulnerable to network failures, as I pointed
out above. Why do we need to bother then?

That being said, I don't see any benefit of the proposed design over
plain, simple, and well-known 2PC. I admit that I may be missing
something crucial so I'd be happy to meet and discuss the algorithm in
person. Anyway, despite the set of properties expected from the
algorithm, I think that we should try to use some well-known algorithm
rather than implementing a new one.

A few notes regarding the code implementing the algorithm:

 1. The algorithm has a few states that are not reflected in the code,
    namely "Waiting for promote", "Waiting for quorum success", "Slave",
    "Master". I don't like this discrepancy. I think that each and every
    state of the finite automaton must be present in the code.

 2. I don't like the way you define the state transition table, because
    it's impossible to say what happens when a particular event occurs
    without diving deep in the code:

    static struct promote_state_vtab initiator_vtab = {
            /* .begin = */ promote_msg_unreachable,
            /* .new_replica = */ promote_initiator_new_participant,
            /* .new_master = */ promote_initiator_new_participant,
            /* .sync = */ promote_initiator_sync,
            /* .success = */ promote_msg_skip,
            /* .demote = */ promote_initiator_demote,
            /* .promote = */ promote_initiator_promote,
            /* .error = */ promote_default_error,
    };

    I'd try to explicitly mention the destination state and all the
    handlers that have to be invoked on transition, e.g.

    //   source state             destination state        action callbacks
    [PROMOTE_STATE_MASTER] = { PROMOTE_STATE_DEMOTEE, {enter_ro, send_ack, NULL} }

 3. Probably, this is the problem of the algorithm rather than the
    implementation, but IMO the fact that the master instance may be
    promoted (rw) or demoted (ro) looks weird:

    struct {
            /** Sender role. */
            bool is_master;
            /**
             * True if the sender is a currently
             * active master being able to accept
             * write requests. If false, then the
             * master is demoted, but no new one is
             * elected.
             */
            bool is_promoted;
    } status;

    I think that a master in ro and rw mode should be assigned different
    states, rather than using a flag. For example, in case of 2PC those
    states would be Master and Demotee.

 4. Sleeping in an on_commit trigger looks unsettling:

    void
    promote_msg_push(struct promote_msg *msg)
    {
            fiber_channel_put_msg_timeout(promote_state.channel, &msg->base,
                                          TIMEOUT_INFINITY);
    }
    
    void
    on_commit_process_promote_msg(struct trigger *trigger, void *event)
    {
            (void) event;
            promote_msg_push((struct promote_msg *) trigger->data);
    }

    I think we'd better implement a message queue with a plain stailq or
    rlist rather than using fiber_channel.

 5. This looks very confusing, because 'ro' is an abbreviation for
    'read-only', i.e. the function names are identical:

    /**
     * Return box.cfg.read_only reduced to boolean value. It is true
     * only and only when box.cfg.read_only == true. When read_only
     * is false or automatic, false is returned.
     */
    bool
    box_is_read_only(void);

    /**
     * Check if this instance is in read only mode. It is accumulated
     * from orphan, box.cfg.read_only and promotion status.
     */
    bool
    box_is_ro(void);

    I also hate the idea of tri-state box.cfg.read_only, which entails
    the need of this function as well as box_is_promotable(), but I
    guess we'll have to live with it as Kostja is quite convinced about
    it.

 6. The following change looks suspicious to me:

     void
     box_process_vote(struct ballot *ballot)
     {
    -       ballot->is_ro = cfg_geti("read_only") != 0;
    +       ballot->is_ro = box_is_read_only();

    The thing is we used the actual configuration value of 'read_only'
    option when bootstrapping a new instance before this change while
    now we will always send is_ro = true in this case, which might not
    always be correct.

    Also, when a fresh instance is bootstrapped it is supposed to select
    an rw instance as a master to bootstrap from while after this change
    it assumes that all instances are rw in case promotion is used.

 7. IMO we'd better replicate resetting of the promotion state via the
    _truncate system space instead of using the code below:

    static void
    on_commit_check_promotion_reset(struct trigger *trigger, void *event)
    {
            (void) trigger;
            (void) event;
            if (index_count(space_index(space_by_id(BOX_PROMOTION_ID), 0), ITER_ALL,
                            NULL, 0) == 0)
                    box_ctl_promote_reset();
    }

 8. There's txn_is_first_statement() helper, which could probably be
    used instead of trigger_add_unique in the code below:

    static void
    on_replace_dd_promotion(struct trigger *trigger, void *event)
    {
            (void) trigger;
            struct txn *txn = (struct txn *) event;
            struct txn_stmt *stmt = txn_current_stmt(txn);
            struct trigger *on_commit, *on_rollback;
            if (stmt->new_tuple == NULL && stmt->old_tuple != NULL) {
                    on_commit =
                            txn_alter_trigger_new(on_commit_check_promotion_reset,
                                                  NULL);
                    txn_init_triggers(txn);
                    /*
                     * Promotion messages are deleted in batches by
                     * whole rounds. It is enough to check reset once
                     * for a transaction and use an unique trigger.
                     */
                    trigger_add_unique(&txn->on_commit, on_commit);
                    return;
            }