Changes in v2: - Moved 'manual' election mode fix into a separate commit; - Replaced raft_start/stop_candidate() with raft_promote/restore(); - Renamed ballot.can_be_leader to ballot.can_lead. Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6018-boot-voter Issue: https://github.com/tarantool/tarantool/issues/6018 Vladislav Shpilevoy (5): replication: introduce ballot.can_lead box: save box_raft() into a variable raft: replace raft_start_candidate with _promote election: during bootstrap prefer candidates election: promote 'manual' bootstrap master .../unreleased/gh-6018-election-boot-voter.md | 4 + src/box/box.cc | 74 ++++++----- src/box/errcode.h | 2 +- src/box/iproto_constants.h | 1 + src/box/raft.c | 58 +++++++-- src/box/raft.h | 4 +- src/box/replication.cc | 11 +- src/box/xrow.c | 14 ++- src/box/xrow.h | 2 + src/lib/raft/raft.c | 98 ++++++++------- src/lib/raft/raft.h | 13 +- test/box/error.result | 2 +- .../gh-6018-election-boot-voter.result | 116 ++++++++++++++++++ .../gh-6018-election-boot-voter.test.lua | 59 +++++++++ test/replication/gh-6018-master.lua | 17 +++ test/replication/gh-6018-replica.lua | 15 +++ test/replication/suite.cfg | 1 + test/unit/raft.c | 41 ++++--- test/unit/raft.result | 22 ++-- test/unit/raft_test_utils.c | 17 +-- test/unit/raft_test_utils.h | 16 +-- 21 files changed, 438 insertions(+), 149 deletions(-) create mode 100644 changelogs/unreleased/gh-6018-election-boot-voter.md create mode 100644 test/replication/gh-6018-election-boot-voter.result create mode 100644 test/replication/gh-6018-election-boot-voter.test.lua create mode 100644 test/replication/gh-6018-master.lua create mode 100644 test/replication/gh-6018-replica.lua -- 2.24.3 (Apple Git-128)
The flag tells whether the sender has election mode 'candidate' or 'manual'. The new field during bootstrap will help to avoid selecting a 'voter' as a master. Voters can't write, they are unable to boot themselves nor register others. @TarantoolBot document Title: New field - IPROTO_BALLOT_CAN_LEAD It is sent as a part of `IPROTO_BALLOT (0x29)`. The field is a boolean flag which is true if the sender has `election_mode` in its config `'manual'` or `'candidate'`. During bootstrap the nodes able to become a leader are preferred over the nodes configured as `'voter'`. The field key is `0x07`. --- src/box/box.cc | 3 +++ src/box/iproto_constants.h | 1 + src/box/xrow.c | 14 ++++++++++++-- src/box/xrow.h | 2 ++ 4 files changed, 18 insertions(+), 2 deletions(-) diff --git a/src/box/box.cc b/src/box/box.cc index eeb57b04e..800f4d76f 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -2881,6 +2881,9 @@ void box_process_vote(struct ballot *ballot) { ballot->is_ro_cfg = cfg_geti("read_only") != 0; + enum election_mode mode = box_check_election_mode(); + ballot->can_lead = mode == ELECTION_MODE_CANDIDATE || + mode == ELECTION_MODE_MANUAL; ballot->is_anon = replication_anon; ballot->is_ro = is_ro_summary; ballot->is_booted = is_box_configured; diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h index 137bee9da..ea7290da6 100644 --- a/src/box/iproto_constants.h +++ b/src/box/iproto_constants.h @@ -168,6 +168,7 @@ enum iproto_ballot_key { IPROTO_BALLOT_IS_RO = 0x04, IPROTO_BALLOT_IS_ANON = 0x05, IPROTO_BALLOT_IS_BOOTED = 0x06, + IPROTO_BALLOT_CAN_LEAD = 0x07, }; #define bit(c) (1ULL<<IPROTO_##c) diff --git a/src/box/xrow.c b/src/box/xrow.c index 16cb2484c..cffaee0a5 100644 --- a/src/box/xrow.c +++ b/src/box/xrow.c @@ -459,7 +459,9 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, mp_sizeof_uint(UINT32_MAX) + mp_sizeof_vclock_ignore0(&ballot->vclock) + mp_sizeof_uint(UINT32_MAX) + - mp_sizeof_vclock_ignore0(&ballot->gc_vclock); + mp_sizeof_vclock_ignore0(&ballot->gc_vclock) + + mp_sizeof_uint(IPROTO_BALLOT_CAN_LEAD) + + mp_sizeof_bool(ballot->can_lead); char *buf = obuf_reserve(out, max_size); if (buf == NULL) { @@ -471,7 +473,7 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, char *data = buf + IPROTO_HEADER_LEN; data = mp_encode_map(data, 1); data = mp_encode_uint(data, IPROTO_BALLOT); - data = mp_encode_map(data, 6); + data = mp_encode_map(data, 7); data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO_CFG); data = mp_encode_bool(data, ballot->is_ro_cfg); data = mp_encode_uint(data, IPROTO_BALLOT_IS_RO); @@ -484,6 +486,8 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot, data = mp_encode_vclock_ignore0(data, &ballot->vclock); data = mp_encode_uint(data, IPROTO_BALLOT_GC_VCLOCK); data = mp_encode_vclock_ignore0(data, &ballot->gc_vclock); + data = mp_encode_uint(data, IPROTO_BALLOT_CAN_LEAD); + data = mp_encode_bool(data, ballot->can_lead); size_t size = data - buf; assert(size <= max_size); @@ -1362,6 +1366,7 @@ int xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot) { ballot->is_ro_cfg = false; + ballot->can_lead = false; ballot->is_ro = false; ballot->is_anon = false; ballot->is_booted = true; @@ -1434,6 +1439,11 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot) goto err; ballot->is_booted = mp_decode_bool(&data); break; + case IPROTO_BALLOT_CAN_LEAD: + if (mp_typeof(*data) != MP_BOOL) + goto err; + ballot->can_lead = mp_decode_bool(&data); + break; default: mp_next(&data); } diff --git a/src/box/xrow.h b/src/box/xrow.h index 9f61bfd47..000cd6b88 100644 --- a/src/box/xrow.h +++ b/src/box/xrow.h @@ -368,6 +368,8 @@ xrow_encode_auth(struct xrow_header *row, const char *salt, size_t salt_len, struct ballot { /** Set if the instance is configured in read-only mode. */ bool is_ro_cfg; + /** Set if the instance can become a Raft leader. */ + bool can_lead; /** * A flag whether the instance is anonymous, not having an * ID, and not going to request it. -- 2.24.3 (Apple Git-128)
It is shorter. Also in the next patches it is going to be used more, so better stop calling box_raft() constantly. --- src/box/box.cc | 35 ++++++++++++++++++----------------- 1 file changed, 18 insertions(+), 17 deletions(-) diff --git a/src/box/box.cc b/src/box/box.cc index 800f4d76f..673f6bed6 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -1536,7 +1536,7 @@ box_promote(void) "simultaneous invocations"); return -1; } - + struct raft *raft = box_raft(); /* * Do nothing when box isn't configured and when PROMOTE was already * written for this term (synchronous replication and leader election @@ -1552,12 +1552,12 @@ box_promote(void) try_wait = true; break; case ELECTION_MODE_VOTER: - assert(box_raft()->state == RAFT_STATE_FOLLOWER); + assert(raft->state == RAFT_STATE_FOLLOWER); diag_set(ClientError, ER_UNSUPPORTED, "election_mode='voter'", "manual elections"); return -1; case ELECTION_MODE_MANUAL: - if (box_raft()->state == RAFT_STATE_LEADER) + if (raft->state == RAFT_STATE_LEADER) return 0; run_elections = true; break; @@ -1567,13 +1567,13 @@ box_promote(void) * promote only if it's already an elected leader. No manual * elections. */ - if (box_raft()->state != RAFT_STATE_LEADER) { + if (raft->state != RAFT_STATE_LEADER) { diag_set(ClientError, ER_UNSUPPORTED, "election_mode=" "'candidate'", "manual elections"); return -1; } if (txn_limbo_replica_term(&txn_limbo, instance_id) == - box_raft()->term) + raft->term) return 0; break; @@ -1595,28 +1595,28 @@ box_promote(void) * Make this instance a candidate and run until some leader, not * necessarily this instance, emerges. */ - raft_start_candidate(box_raft()); + raft_start_candidate(raft); /* * Trigger new elections without waiting for an old leader to * disappear. */ - raft_new_term(box_raft()); + raft_new_term(raft); rc = box_raft_wait_leader_found(); /* * Do not reset raft mode if it was changed while running the * elections. */ if (box_election_mode == ELECTION_MODE_MANUAL) - raft_stop_candidate(box_raft(), false); + raft_stop_candidate(raft, false); if (rc != 0) return -1; - if (!box_raft()->is_enabled) { + if (!raft->is_enabled) { diag_set(ClientError, ER_RAFT_DISABLED); return -1; } - if (box_raft()->state != RAFT_STATE_LEADER) { + if (raft->state != RAFT_STATE_LEADER) { diag_set(ClientError, ER_INTERFERING_PROMOTE, - box_raft()->leader); + raft->leader); return -1; } } @@ -1682,15 +1682,15 @@ box_promote(void) } else { promote: /* We cannot possibly get here in a volatile state. */ - assert(box_raft()->volatile_term == box_raft()->term); + assert(raft->volatile_term == raft->term); txn_limbo_write_promote(&txn_limbo, wait_lsn, - box_raft()->term); + raft->term); struct synchro_request req = { .type = IPROTO_PROMOTE, .replica_id = former_leader_id, .origin_id = instance_id, .lsn = wait_lsn, - .term = box_raft()->term, + .term = raft->term, }; txn_limbo_process(&txn_limbo, &req); assert(txn_limbo_is_empty(&txn_limbo)); @@ -3493,9 +3493,10 @@ box_cfg_xc(void) * new records into WAL. Another reason - before recovery is done, * instance_id is not known, so Raft simply can't work. */ + struct raft *raft = box_raft(); if (!replication_anon) - raft_cfg_instance_id(box_raft(), instance_id); - raft_cfg_vclock(box_raft(), &replicaset.vclock); + raft_cfg_instance_id(raft, instance_id); + raft_cfg_vclock(raft, &replicaset.vclock); if (box_set_election_timeout() != 0) diag_raise(); @@ -3519,7 +3520,7 @@ box_cfg_xc(void) * should take the control over the situation and start a new * term immediately. */ - raft_new_term(box_raft()); + raft_new_term(raft); } /* box.cfg.read_only is not read yet. */ -- 2.24.3 (Apple Git-128)
raft_state_candiate() made the instance a candidate until it was reconfigured or raft_stop_candidate() was called. It was used during promotion of a 'manual' instance. This however was relatively hard to use - raft_stop_candidate() had a flag 'do_demote' which was passed as true only by Raft internally, and it also wasn't very consistent with box.promote() which works for one term only. One of the next patches is going to need Raft promotion again but for just one term. In order not to use raft_start/stop_leader() again, this patch replaces them with raft_promote() and raft_restore(). raft_promote() works very similar to box.promote() - it bumps the term and makes the instance a candidate for this term. If it wins, it stays a leader until next term happens. Otherwise either a new term is started and the node stops being a leader, or another node becomes a leader. In the former case raft_promote() might be called again to retry. Replacement couldn't be done independently from box/ changes, because raft_start/stop_candidate couldn't be kept - any term bump runs raft_restore() which is the same as raft_stop_candidate(do_demote=true), and that makes the old functions useless. Hence the patch also includes box/ changes. Part of #6018 --- src/box/box.cc | 27 +++------- src/box/errcode.h | 2 +- src/box/raft.c | 58 ++++++++++++++++++---- src/box/raft.h | 4 +- src/lib/raft/raft.c | 98 ++++++++++++++++++++----------------- src/lib/raft/raft.h | 13 +++-- test/box/error.result | 2 +- test/unit/raft.c | 41 +++++++++------- test/unit/raft.result | 22 +++++---- test/unit/raft_test_utils.c | 17 +++---- test/unit/raft_test_utils.h | 16 ++---- 11 files changed, 169 insertions(+), 131 deletions(-) diff --git a/src/box/box.cc b/src/box/box.cc index 673f6bed6..b828d3a31 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -1595,25 +1595,14 @@ box_promote(void) * Make this instance a candidate and run until some leader, not * necessarily this instance, emerges. */ - raft_start_candidate(raft); - /* - * Trigger new elections without waiting for an old leader to - * disappear. - */ - raft_new_term(raft); - rc = box_raft_wait_leader_found(); - /* - * Do not reset raft mode if it was changed while running the - * elections. - */ - if (box_election_mode == ELECTION_MODE_MANUAL) - raft_stop_candidate(raft, false); - if (rc != 0) - return -1; - if (!raft->is_enabled) { - diag_set(ClientError, ER_RAFT_DISABLED); - return -1; - } + do { + raft_promote(raft); + rc = box_raft_wait_term_outcome(); + if (rc != 0) { + raft_restore(raft); + return -1; + } + } while (raft->leader == 0); if (raft->state != RAFT_STATE_LEADER) { diag_set(ClientError, ER_INTERFERING_PROMOTE, raft->leader); diff --git a/src/box/errcode.h b/src/box/errcode.h index 49aec4bf6..d2854677f 100644 --- a/src/box/errcode.h +++ b/src/box/errcode.h @@ -276,7 +276,7 @@ struct errcode_record { /*221 */_(ER_SQL_CANT_ADD_AUTOINC, "Can't add AUTOINCREMENT: space %s can't feature more than one AUTOINCREMENT field") \ /*222 */_(ER_QUORUM_WAIT, "Couldn't wait for quorum %d: %s") \ /*223 */_(ER_INTERFERING_PROMOTE, "Instance with replica id %u was promoted first") \ - /*224 */_(ER_RAFT_DISABLED, "Elections were turned off while running box.ctl.promote()")\ + /*224 */_(ER_ELECTION_DISABLED, "Elections were turned off")\ /*225 */_(ER_TXN_ROLLBACK, "Transaction was rolled back") \ /* diff --git a/src/box/raft.c b/src/box/raft.c index 7f787c0c5..eb62e9630 100644 --- a/src/box/raft.c +++ b/src/box/raft.c @@ -327,30 +327,70 @@ fail: panic("Could not write a raft request to WAL\n"); } +/** + * Context of waiting for a Raft term outcome. Which is either a leader is + * elected, or a new term starts, or Raft is disabled. + */ +struct box_raft_watch_ctx { + bool is_done; + uint64_t term; + struct fiber *owner; +}; + static int -box_raft_wait_leader_found_f(struct trigger *trig, void *event) +box_raft_wait_term_outcome_f(struct trigger *trig, void *event) { struct raft *raft = event; assert(raft == box_raft()); - struct fiber *waiter = trig->data; - if (raft->leader != REPLICA_ID_NIL || !raft->is_enabled) - fiber_wakeup(waiter); + struct box_raft_watch_ctx *ctx = trig->data; + /* + * Term ended with nothing, probably split vote which led to a next + * term. + */ + if (raft->volatile_term > ctx->term) + goto done; + /* Instance does not participate in terms anymore. */ + if (!raft->is_enabled) + goto done; + /* The term ended with a leader being found. */ + if (raft->leader != REPLICA_ID_NIL) + goto done; + /* The term still continues with no resolution. */ + return 0; +done: + ctx->is_done = true; + fiber_wakeup(ctx->owner); return 0; } int -box_raft_wait_leader_found(void) +box_raft_wait_term_outcome(void) { + struct raft *raft = box_raft(); struct trigger trig; - trigger_create(&trig, box_raft_wait_leader_found_f, fiber(), NULL); - raft_on_update(box_raft(), &trig); - fiber_yield(); + struct box_raft_watch_ctx ctx = { + .is_done = false, + .term = raft->volatile_term, + .owner = fiber(), + }; + trigger_create(&trig, box_raft_wait_term_outcome_f, &ctx, NULL); + raft_on_update(raft, &trig); + /* + * XXX: it is not a good idea not to have a timeout here. If all nodes + * are voters, the term might never end with any result nor bump to a + * new value. + */ + while (!fiber_is_cancelled() && !ctx.is_done) + fiber_yield(); trigger_clear(&trig); if (fiber_is_cancelled()) { diag_set(FiberIsCancelled); return -1; } - assert(box_raft()->leader != REPLICA_ID_NIL || !box_raft()->is_enabled); + if (!raft->is_enabled) { + diag_set(ClientError, ER_ELECTION_DISABLED); + return -1; + } return 0; } diff --git a/src/box/raft.h b/src/box/raft.h index 6b6136510..d55e90d2f 100644 --- a/src/box/raft.h +++ b/src/box/raft.h @@ -97,9 +97,9 @@ box_raft_checkpoint_remote(struct raft_request *req); int box_raft_process(struct raft_request *req, uint32_t source); -/** Block this fiber until Raft leader is known. */ +/** Block this fiber until the current term's outcome is known. */ int -box_raft_wait_leader_found(); +box_raft_wait_term_outcome(void); void box_raft_init(void); diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c index ef11ef89f..24bf89dda 100644 --- a/src/lib/raft/raft.c +++ b/src/lib/raft/raft.c @@ -663,6 +663,11 @@ raft_sm_schedule_new_term(struct raft *raft, uint64_t new_term) raft->volatile_vote = 0; raft->leader = 0; raft->state = RAFT_STATE_FOLLOWER; + /* + * The instance could be promoted for the previous term. But promotion + * has no effect on following terms. + */ + raft_restore(raft); raft_sm_pause_and_dump(raft); /* * State is visible and it is changed - broadcast. Term is also visible, @@ -851,30 +856,15 @@ raft_cfg_is_enabled(struct raft *raft, bool is_enabled) raft_sm_start(raft); } -void -raft_cfg_is_candidate(struct raft *raft, bool is_candidate) -{ - raft->is_cfg_candidate = is_candidate; - is_candidate = is_candidate && raft->is_enabled; - if (is_candidate) - raft_start_candidate(raft); - else - raft_stop_candidate(raft, true); -} - -void +/** Make the instance a candidate. */ +static void raft_start_candidate(struct raft *raft) { + assert(raft->is_enabled); if (raft->is_candidate) return; + assert(raft->state == RAFT_STATE_FOLLOWER); raft->is_candidate = true; - assert(raft->state != RAFT_STATE_CANDIDATE); - /* - * May still be the leader after raft_stop_candidate - * with do_demote = false. - */ - if (raft->state == RAFT_STATE_LEADER) - return; if (raft->is_write_in_progress) { /* * If there is an on-going WAL write, it means there was @@ -889,35 +879,53 @@ raft_start_candidate(struct raft *raft) } } -void -raft_stop_candidate(struct raft *raft, bool do_demote) +/** + * Make the instance stop taking part in new elections and demote if it was a + * leader. + */ +static void +raft_stop_candidate(struct raft *raft) { - /* - * May still be the leader after raft_stop_candidate - * with do_demote = false. - */ - if (!raft->is_candidate && raft->state != RAFT_STATE_LEADER) + if (!raft->is_candidate) return; raft->is_candidate = false; - if (raft->state != RAFT_STATE_LEADER) { - /* Do not wait for anything while being a voter. */ - raft_ev_timer_stop(raft_loop(), &raft->timer); - } - if (raft->state != RAFT_STATE_FOLLOWER) { - if (raft->state == RAFT_STATE_LEADER) { - if (!do_demote) { - /* - * Remain leader until someone - * triggers new elections. - */ - return; - } - raft->leader = 0; - } - raft->state = RAFT_STATE_FOLLOWER; - /* State is visible and changed - broadcast. */ - raft_schedule_broadcast(raft); - } + /* Do not wait for anything while not being a candidate. */ + raft_ev_timer_stop(raft_loop(), &raft->timer); + if (raft->state == RAFT_STATE_LEADER) + raft->leader = 0; + raft->state = RAFT_STATE_FOLLOWER; + raft_schedule_broadcast(raft); +} + +static inline void +raft_set_candidate(struct raft *raft, bool is_candidate) +{ + if (is_candidate) + raft_start_candidate(raft); + else + raft_stop_candidate(raft); +} + +void +raft_cfg_is_candidate(struct raft *raft, bool is_candidate) +{ + raft->is_cfg_candidate = is_candidate; + raft_restore(raft); +} + +void +raft_promote(struct raft *raft) +{ + if (!raft->is_enabled) + return; + raft_sm_schedule_new_term(raft, raft->volatile_term + 1); + raft_start_candidate(raft); +} + +void +raft_restore(struct raft *raft) +{ + raft_set_candidate(raft, raft->is_cfg_candidate && raft->is_enabled); } void diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h index 223e0509f..53e2c58a8 100644 --- a/src/lib/raft/raft.h +++ b/src/lib/raft/raft.h @@ -275,17 +275,20 @@ void raft_cfg_is_candidate(struct raft *raft, bool is_candidate); /** - * Make the instance a candidate. + * Bump the term and become a candidate for it regardless of the config. In case + * of another term bump the node's role is restored according to its config + * automatically. */ void -raft_start_candidate(struct raft *raft); +raft_promote(struct raft *raft); /** - * Make the instance stop taking part in new elections. - * @param do_demote whether to stop being a leader immediately or not. + * Restore the instance role according to its config. In particular, if it was + * promoted and elected in the current term despite its config, restoration + * makes it a follower. */ void -raft_stop_candidate(struct raft *raft, bool do_demote); +raft_restore(struct raft *raft); /** Configure Raft leader election timeout. */ void diff --git a/test/box/error.result b/test/box/error.result index 062a90399..25def757e 100644 --- a/test/box/error.result +++ b/test/box/error.result @@ -442,7 +442,7 @@ t; | 221: box.error.SQL_CANT_ADD_AUTOINC | 222: box.error.QUORUM_WAIT | 223: box.error.INTERFERING_PROMOTE - | 224: box.error.RAFT_DISABLED + | 224: box.error.ELECTION_DISABLED | 225: box.error.TXN_ROLLBACK | ... diff --git a/test/unit/raft.c b/test/unit/raft.c index b9bc49b78..de7ada159 100644 --- a/test/unit/raft.c +++ b/test/unit/raft.c @@ -1354,38 +1354,31 @@ raft_test_too_long_wal_write(void) } static void -raft_test_start_stop_candidate(void) +raft_test_promote_restore(void) { - raft_start_test(8); + raft_start_test(12); struct raft_node node; raft_node_create(&node); raft_node_cfg_is_candidate(&node, false); raft_node_cfg_election_quorum(&node, 1); - raft_node_start_candidate(&node); + raft_node_promote(&node); raft_run_next_event(); - is(node.raft.state, RAFT_STATE_LEADER, "became leader after " - "start candidate"); + is(node.raft.state, RAFT_STATE_LEADER, "became leader after promotion"); - raft_node_stop_candidate(&node); - raft_run_for(node.cfg_death_timeout); - is(node.raft.state, RAFT_STATE_LEADER, "remain leader after " - "stop candidate"); - - raft_node_demote_candidate(&node); - is(node.raft.state, RAFT_STATE_FOLLOWER, "demote drops a non-candidate " - "leader to a follower"); + raft_node_restore(&node); + is(node.raft.state, RAFT_STATE_FOLLOWER, "restore drops a " + "non-candidate leader to a follower"); /* * Ensure the non-candidate leader is demoted when sees a new term, and * does not try election again. */ - raft_node_start_candidate(&node); + raft_node_promote(&node); raft_run_next_event(); - raft_node_stop_candidate(&node); - is(node.raft.state, RAFT_STATE_LEADER, "non-candidate but still " - "leader"); + is(node.raft.state, RAFT_STATE_LEADER, "became leader after promotion"); + ok(node.raft.is_candidate, "is a candidate"); is(raft_node_send_vote_request(&node, 4 /* Term. */, @@ -1394,11 +1387,23 @@ raft_test_start_stop_candidate(void) ), 0, "vote request from 2"); is(node.raft.state, RAFT_STATE_FOLLOWER, "demote once new election " "starts"); + ok(!node.raft.is_candidate, "is not a candidate after term bump"); raft_run_for(node.cfg_election_timeout * 2); is(node.raft.state, RAFT_STATE_FOLLOWER, "still follower"); is(node.raft.term, 4, "still the same term"); + /* Promote does not do anything on a disabled node. */ + raft_node_cfg_is_candidate(&node, true); + raft_node_cfg_is_enabled(&node, false); + raft_node_promote(&node); + is(node.raft.term, 4, "still old term"); + ok(!node.raft.is_candidate, "not a candidate"); + + /* Restore takes into account if Raft is enabled. */ + raft_node_restore(&node); + ok(!node.raft.is_candidate, "not a candidate"); + raft_node_destroy(&node); raft_finish_test(); } @@ -1424,7 +1429,7 @@ main_f(va_list ap) raft_test_death_timeout(); raft_test_enable_disable(); raft_test_too_long_wal_write(); - raft_test_start_stop_candidate(); + raft_test_promote_restore(); fakeev_free(); diff --git a/test/unit/raft.result b/test/unit/raft.result index f89cd1658..d2a9ac882 100644 --- a/test/unit/raft.result +++ b/test/unit/raft.result @@ -230,16 +230,20 @@ ok 12 - subtests ok 8 - became candidate ok 13 - subtests *** raft_test_too_long_wal_write: done *** - *** raft_test_start_stop_candidate *** - 1..8 - ok 1 - became leader after start candidate - ok 2 - remain leader after stop candidate - ok 3 - demote drops a non-candidate leader to a follower - ok 4 - non-candidate but still leader + *** raft_test_promote_restore *** + 1..12 + ok 1 - became leader after promotion + ok 2 - restore drops a non-candidate leader to a follower + ok 3 - became leader after promotion + ok 4 - is a candidate ok 5 - vote request from 2 ok 6 - demote once new election starts - ok 7 - still follower - ok 8 - still the same term + ok 7 - is not a candidate after term bump + ok 8 - still follower + ok 9 - still the same term + ok 10 - still old term + ok 11 - not a candidate + ok 12 - not a candidate ok 14 - subtests - *** raft_test_start_stop_candidate: done *** + *** raft_test_promote_restore: done *** *** main_f: done *** diff --git a/test/unit/raft_test_utils.c b/test/unit/raft_test_utils.c index 452c05c81..b5754cf78 100644 --- a/test/unit/raft_test_utils.c +++ b/test/unit/raft_test_utils.c @@ -388,24 +388,19 @@ raft_node_unblock(struct raft_node *node) } void -raft_node_start_candidate(struct raft_node *node) +raft_node_promote(struct raft_node *node) { assert(raft_node_is_started(node)); - raft_start_candidate(&node->raft); -} - -void -raft_node_stop_candidate(struct raft_node *node) -{ - assert(raft_node_is_started(node)); - raft_stop_candidate(&node->raft, false); + raft_promote(&node->raft); + raft_run_async_work(); } void -raft_node_demote_candidate(struct raft_node *node) +raft_node_restore(struct raft_node *node) { assert(raft_node_is_started(node)); - raft_stop_candidate(&node->raft, true); + raft_restore(&node->raft); + raft_run_async_work(); } void diff --git a/test/unit/raft_test_utils.h b/test/unit/raft_test_utils.h index 5f8618716..c68dc3b22 100644 --- a/test/unit/raft_test_utils.h +++ b/test/unit/raft_test_utils.h @@ -208,22 +208,16 @@ raft_node_block(struct raft_node *node); void raft_node_unblock(struct raft_node *node); -/** - * Make the node candidate, and maybe start election if a leader is not known. - */ +/** Promote the node. It bumps the term and tries to become a leader. */ void -raft_node_start_candidate(struct raft_node *node); +raft_node_promote(struct raft_node *node); /** - * Make the node non-candidate for next elections, but if it is a leader right - * now, it will stay a leader. + * Restore the node's state back to its is_candidate config. It is demoted if + * was a leader. */ void -raft_node_stop_candidate(struct raft_node *node); - -/** Stop the candidate and remove its leader role if present. */ -void -raft_node_demote_candidate(struct raft_node *node); +raft_node_restore(struct raft_node *node); /** Configuration methods. */ -- 2.24.3 (Apple Git-128)
During cluster bootstrap the boot master election algorithm didn't take into account election modes of the instances. It could be that all nodes have box.cfg.read_only = false, none is booted, all are read-only now. Then the node with the smallest UUID was chosen even if it was a box.cfg.election_mode='voter' node. It could neither boot nor register other nodes and the cluster couldn't start. The patch makes the boot master election prefer the instances which can become a Raft leader. If all the other parameters didn't help. Part #6018 --- src/box/replication.cc | 11 ++- .../gh-6018-election-boot-voter.result | 70 +++++++++++++++++++ .../gh-6018-election-boot-voter.test.lua | 38 ++++++++++ test/replication/gh-6018-master.lua | 17 +++++ test/replication/gh-6018-replica.lua | 15 ++++ test/replication/suite.cfg | 1 + 6 files changed, 150 insertions(+), 2 deletions(-) create mode 100644 test/replication/gh-6018-election-boot-voter.result create mode 100644 test/replication/gh-6018-election-boot-voter.test.lua create mode 100644 test/replication/gh-6018-master.lua create mode 100644 test/replication/gh-6018-replica.lua diff --git a/src/box/replication.cc b/src/box/replication.cc index a0b3e0186..45ad03dfd 100644 --- a/src/box/replication.cc +++ b/src/box/replication.cc @@ -978,12 +978,19 @@ replicaset_find_join_master(void) * config is stronger because if it is configured as read-only, * it is in read-only state for sure, until the config is * changed. + * + * In a cluster with leader election enabled all instances might + * look equal by the scores above. Then must prefer the ones + * which can be elected as a leader, because only they would be + * able to boot themselves and register the others. */ if (ballot->is_booted) - score += 10; + score += 1000; if (!ballot->is_ro_cfg) - score += 5; + score += 100; if (!ballot->is_ro) + score += 10; + if (ballot->can_lead) score += 1; if (leader_score < score) goto elect; diff --git a/test/replication/gh-6018-election-boot-voter.result b/test/replication/gh-6018-election-boot-voter.result new file mode 100644 index 000000000..6b05f0825 --- /dev/null +++ b/test/replication/gh-6018-election-boot-voter.result @@ -0,0 +1,70 @@ +-- test-run result file version 2 +-- +-- gh-6018: in an auto-election cluster nodes with voter state could be selected +-- as bootstrap leaders. They should not, because a voter can't be ever writable +-- and it can neither boot itself nor register other nodes. +-- +test_run = require('test_run').new() + | --- + | ... + +function boot_with_master_election_mode(mode) \ + test_run:cmd('create server master with '.. \ + 'script="replication/gh-6018-master.lua"') \ + test_run:cmd('start server master with wait=False, args="'..mode..'"') \ + test_run:cmd('create server replica with '.. \ + 'script="replication/gh-6018-replica.lua"') \ + test_run:cmd('start server replica') \ +end + | --- + | ... + +function stop_cluster() \ + test_run:cmd('stop server replica') \ + test_run:cmd('stop server master') \ + test_run:cmd('delete server replica') \ + test_run:cmd('delete server master') \ +end + | --- + | ... + +-- +-- Candidate leader. +-- +boot_with_master_election_mode('candidate') + | --- + | ... + +test_run:switch('master') + | --- + | - true + | ... +test_run:wait_cond(function() return not box.info.ro end) + | --- + | - true + | ... +assert(box.info.election.state == 'leader') + | --- + | - true + | ... + +test_run:switch('replica') + | --- + | - true + | ... +assert(box.info.ro) + | --- + | - true + | ... +assert(box.info.election.state == 'follower') + | --- + | - true + | ... + +test_run:switch('default') + | --- + | - true + | ... +stop_cluster() + | --- + | ... diff --git a/test/replication/gh-6018-election-boot-voter.test.lua b/test/replication/gh-6018-election-boot-voter.test.lua new file mode 100644 index 000000000..7222beb19 --- /dev/null +++ b/test/replication/gh-6018-election-boot-voter.test.lua @@ -0,0 +1,38 @@ +-- +-- gh-6018: in an auto-election cluster nodes with voter state could be selected +-- as bootstrap leaders. They should not, because a voter can't be ever writable +-- and it can neither boot itself nor register other nodes. +-- +test_run = require('test_run').new() + +function boot_with_master_election_mode(mode) \ + test_run:cmd('create server master with '.. \ + 'script="replication/gh-6018-master.lua"') \ + test_run:cmd('start server master with wait=False, args="'..mode..'"') \ + test_run:cmd('create server replica with '.. \ + 'script="replication/gh-6018-replica.lua"') \ + test_run:cmd('start server replica') \ +end + +function stop_cluster() \ + test_run:cmd('stop server replica') \ + test_run:cmd('stop server master') \ + test_run:cmd('delete server replica') \ + test_run:cmd('delete server master') \ +end + +-- +-- Candidate leader. +-- +boot_with_master_election_mode('candidate') + +test_run:switch('master') +test_run:wait_cond(function() return not box.info.ro end) +assert(box.info.election.state == 'leader') + +test_run:switch('replica') +assert(box.info.ro) +assert(box.info.election.state == 'follower') + +test_run:switch('default') +stop_cluster() diff --git a/test/replication/gh-6018-master.lua b/test/replication/gh-6018-master.lua new file mode 100644 index 000000000..1192204ff --- /dev/null +++ b/test/replication/gh-6018-master.lua @@ -0,0 +1,17 @@ +#!/usr/bin/env tarantool + +require('console').listen(os.getenv('ADMIN')) + +box.cfg({ + listen = 'unix/:./gh-6018-master.sock', + replication = { + 'unix/:./gh-6018-master.sock', + 'unix/:./gh-6018-replica.sock', + }, + election_mode = arg[1], + instance_uuid = 'cbf06940-0790-498b-948d-042b62cf3d29', + replication_timeout = 0.1, +}) + +box.ctl.wait_rw() +box.schema.user.grant('guest', 'super') diff --git a/test/replication/gh-6018-replica.lua b/test/replication/gh-6018-replica.lua new file mode 100644 index 000000000..71e669141 --- /dev/null +++ b/test/replication/gh-6018-replica.lua @@ -0,0 +1,15 @@ +#!/usr/bin/env tarantool + +require('console').listen(os.getenv('ADMIN')) + +box.cfg({ + listen = 'unix/:./gh-6018-replica.sock', + replication = { + 'unix/:./gh-6018-master.sock', + 'unix/:./gh-6018-replica.sock', + }, + election_mode = 'voter', + -- Smaller than master UUID. + instance_uuid = 'cbf06940-0790-498b-948d-042b62cf3d28', + replication_timeout = 0.1, +}) diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg index 69f2f3511..2bfc3b845 100644 --- a/test/replication/suite.cfg +++ b/test/replication/suite.cfg @@ -45,6 +45,7 @@ "gh-5536-wal-limit.test.lua": {}, "gh-5566-final-join-synchro.test.lua": {}, "gh-5613-bootstrap-prefer-booted.test.lua": {}, + "gh-6018-election-boot-voter.test.lua": {}, "gh-6027-applier-error-show.test.lua": {}, "gh-6032-promote-wal-write.test.lua": {}, "gh-6057-qsync-confirm-async-no-wal.test.lua": {}, -- 2.24.3 (Apple Git-128)
A cluster may consist of only voters and manual nodes. This means on bootstrap nobody would be elected as a Raft leader automatically to create the first snapshot and register the others. After the previous commit the manual nodes were preferred to be bootstrap masters, but they couldn't do anything. This patch makes 'manual' bootstrap master promote itself for one term so as it could boot the cluster. Closes #6018 --- .../unreleased/gh-6018-election-boot-voter.md | 4 ++ src/box/box.cc | 19 +++++++- .../gh-6018-election-boot-voter.result | 46 +++++++++++++++++++ .../gh-6018-election-boot-voter.test.lua | 21 +++++++++ 4 files changed, 88 insertions(+), 2 deletions(-) create mode 100644 changelogs/unreleased/gh-6018-election-boot-voter.md diff --git a/changelogs/unreleased/gh-6018-election-boot-voter.md b/changelogs/unreleased/gh-6018-election-boot-voter.md new file mode 100644 index 000000000..080484bbe --- /dev/null +++ b/changelogs/unreleased/gh-6018-election-boot-voter.md @@ -0,0 +1,4 @@ +## bugfix/replication + +* Fixed a cluster sometimes being unable to bootstrap if it contains nodes with + `election_mode` `manual` or `voter` (gh-6018). diff --git a/src/box/box.cc b/src/box/box.cc index b828d3a31..8c10a99dd 100644 --- a/src/box/box.cc +++ b/src/box/box.cc @@ -3501,7 +3501,8 @@ box_cfg_xc(void) if (!is_bootstrap_leader) { replicaset_sync(); - } else { + } else if (box_election_mode == ELECTION_MODE_CANDIDATE || + box_election_mode == ELECTION_MODE_MANUAL) { /* * When the cluster is just bootstrapped and this instance is a * leader, it makes no sense to wait for a leader appearance. @@ -3509,7 +3510,21 @@ box_cfg_xc(void) * should take the control over the situation and start a new * term immediately. */ - raft_new_term(raft); + raft_promote(raft); + int rc = box_raft_wait_term_outcome(); + if (rc == 0 && raft->leader != instance_id) { + /* + * It was promoted and is a single registered node - + * there can't be another leader or a new term bump. + */ + panic("Bootstrap master couldn't elect self as a " + "leader. Leader is %u, term is %llu", + raft->leader, (long long)raft->volatile_term); + } + if (rc != 0) { + raft_restore(raft); + diag_raise(); + } } /* box.cfg.read_only is not read yet. */ diff --git a/test/replication/gh-6018-election-boot-voter.result b/test/replication/gh-6018-election-boot-voter.result index 6b05f0825..1b7949bb9 100644 --- a/test/replication/gh-6018-election-boot-voter.result +++ b/test/replication/gh-6018-election-boot-voter.result @@ -4,6 +4,11 @@ -- as bootstrap leaders. They should not, because a voter can't be ever writable -- and it can neither boot itself nor register other nodes. -- +-- Similar situation was with the manual election. All instances might have +-- manual election mode. Such a cluster wouldn't be able to boot if their +-- bootstrap master wouldn't become an elected leader automatically at least +-- once. +-- test_run = require('test_run').new() | --- | ... @@ -68,3 +73,44 @@ test_run:switch('default') stop_cluster() | --- | ... + +-- +-- Manual leader. +-- +boot_with_master_election_mode('manual') + | --- + | ... + +test_run:switch('master') + | --- + | - true + | ... +test_run:wait_cond(function() return not box.info.ro end) + | --- + | - true + | ... +assert(box.info.election.state == 'leader') + | --- + | - true + | ... + +test_run:switch('replica') + | --- + | - true + | ... +assert(box.info.ro) + | --- + | - true + | ... +assert(box.info.election.state == 'follower') + | --- + | - true + | ... + +test_run:switch('default') + | --- + | - true + | ... +stop_cluster() + | --- + | ... diff --git a/test/replication/gh-6018-election-boot-voter.test.lua b/test/replication/gh-6018-election-boot-voter.test.lua index 7222beb19..fb08e2bc8 100644 --- a/test/replication/gh-6018-election-boot-voter.test.lua +++ b/test/replication/gh-6018-election-boot-voter.test.lua @@ -3,6 +3,11 @@ -- as bootstrap leaders. They should not, because a voter can't be ever writable -- and it can neither boot itself nor register other nodes. -- +-- Similar situation was with the manual election. All instances might have +-- manual election mode. Such a cluster wouldn't be able to boot if their +-- bootstrap master wouldn't become an elected leader automatically at least +-- once. +-- test_run = require('test_run').new() function boot_with_master_election_mode(mode) \ @@ -36,3 +41,19 @@ assert(box.info.election.state == 'follower') test_run:switch('default') stop_cluster() + +-- +-- Manual leader. +-- +boot_with_master_election_mode('manual') + +test_run:switch('master') +test_run:wait_cond(function() return not box.info.ro end) +assert(box.info.election.state == 'leader') + +test_run:switch('replica') +assert(box.info.ro) +assert(box.info.election.state == 'follower') + +test_run:switch('default') +stop_cluster() -- 2.24.3 (Apple Git-128)
18.07.2021 18:53, Vladislav Shpilevoy пишет:
> Changes in v2:
> - Moved 'manual' election mode fix into a separate commit;
> - Replaced raft_start/stop_candidate() with raft_promote/restore();
> - Renamed ballot.can_be_leader to ballot.can_lead.
>
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6018-boot-voter
> Issue: https://github.com/tarantool/tarantool/issues/6018
>
> Vladislav Shpilevoy (5):
> replication: introduce ballot.can_lead
> box: save box_raft() into a variable
> raft: replace raft_start_candidate with _promote
> election: during bootstrap prefer candidates
> election: promote 'manual' bootstrap master
>
> .../unreleased/gh-6018-election-boot-voter.md | 4 +
> src/box/box.cc | 74 ++++++-----
> src/box/errcode.h | 2 +-
> src/box/iproto_constants.h | 1 +
> src/box/raft.c | 58 +++++++--
> src/box/raft.h | 4 +-
> src/box/replication.cc | 11 +-
> src/box/xrow.c | 14 ++-
> src/box/xrow.h | 2 +
> src/lib/raft/raft.c | 98 ++++++++-------
> src/lib/raft/raft.h | 13 +-
> test/box/error.result | 2 +-
> .../gh-6018-election-boot-voter.result | 116 ++++++++++++++++++
> .../gh-6018-election-boot-voter.test.lua | 59 +++++++++
> test/replication/gh-6018-master.lua | 17 +++
> test/replication/gh-6018-replica.lua | 15 +++
> test/replication/suite.cfg | 1 +
> test/unit/raft.c | 41 ++++---
> test/unit/raft.result | 22 ++--
> test/unit/raft_test_utils.c | 17 +--
> test/unit/raft_test_utils.h | 16 +--
> 21 files changed, 438 insertions(+), 149 deletions(-)
> create mode 100644 changelogs/unreleased/gh-6018-election-boot-voter.md
> create mode 100644 test/replication/gh-6018-election-boot-voter.result
> create mode 100644 test/replication/gh-6018-election-boot-voter.test.lua
> create mode 100644 test/replication/gh-6018-master.lua
> create mode 100644 test/replication/gh-6018-replica.lua
>
Thanks for the fixes! LGTM.
On Sun, Jul 18, 2021 at 06:53:24PM +0200, Vladislav Shpilevoy wrote:
> Changes in v2:
> - Moved 'manual' election mode fix into a separate commit;
> - Replaced raft_start/stop_candidate() with raft_promote/restore();
> - Renamed ballot.can_be_leader to ballot.can_lead.
>
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6018-boot-voter
> Issue: https://github.com/tarantool/tarantool/issues/6018
>
> Vladislav Shpilevoy (5):
> replication: introduce ballot.can_lead
> box: save box_raft() into a variable
> raft: replace raft_start_candidate with _promote
> election: during bootstrap prefer candidates
> election: promote 'manual' bootstrap master
Looks ok to me, Ack!
Pushed to master, 2.8, 2.7.